Zhongpu Chen

PhD

Fall 2020 (Distributed System and Big Data Management)

Lecture Location and Time

Room 109, Tongbo Building (通博楼). 18:30-21:15, every Tuesday from Week 10 to Week 19.

Syllabus

  • Week 10: Course intro + introduction to distributed computing and big data
  • Week 11, 12: Hadoop + HDFS
  • Week 13: Spark + Spark SQL
  • Week 14: Students’ presentations on their chosen topics in distributed systems and big data
  • Week 15: Spark + Spark SQL
  • Week 16: Hive + HBase + ZooKeeper
  • Week 17: Lab
  • Week 18: Big streaming data processing + Students’ presentations about their final projects
  • Week 19: Students’ presentations about their final projects

Possible Topics for Week 14’s Presentations

  • Book review of Big Data: A Revolution That Will Transform How We Live, Work, and Think. The book was published nearly 10 years ago, so some of its concepts or ideas may seem unreasonable now. Please summarize the pros and cons of this book through critical thinking.
  • Data storage: Parquet as a use case.
  • Data serialization: Avro as a use case.
  • Introduction to Sqoop.
  • Introduction to Open MPI.
  • Introduction to CAP theorem.
  • Recent research papers on big data indexing.
  • Recent research papers on big data security and privacy.
  • Recent research papers on big data visualization.

If you are not interested in any of the topics above, please feel free to propose another one to me. Remember: you must obtain my confirmation before you start preparing the presentation.

Note: 15-minute presentation + 5-minute Q&A for each group.

Final Project

Analyze air quality data using a distributed system, and then submit a paper in English or Chinese.

Data Format

The data (available for download on Piazza) was collected hourly from 2018-08-01 (0:00) to 2019-06-10 (23:00) in several cities, and contains nearly 400k lines. Each line consists of comma-separated fields in the order (station_id, longitude, latitude, PM25, PM10, NO2, SO3, O3-1, O3-8h, CO, AQI, level, year, month, date, hour, city).

The following is a set of samples:

99000,115.49,38.88,43,68,21,20,104,104,0.6,60,2,2018,8,1,0,北京
99001,115.51,38.88,38,58,26,20,120,120,0.6,54,2,2018,8,1,0,北京
99002,115.47,38.91,50,72,22,17,113,113,0.7,69,2,2018,8,1,0,北京
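
To make the record layout concrete, the following is a minimal plain-Python sketch that parses one line into a dictionary. The lowercase field names, the choice of which fields are integers and which are floats, and the helper name `parse_line` are assumptions made for illustration; they are not part of the data set. A Hive table or Spark schema for the project would follow the same field order.

```python
import csv

# Field order follows the data format description above.
# The lowercase names are illustrative choices, not official column names.
FIELDS = [
    "station_id", "longitude", "latitude", "pm25", "pm10", "no2", "so3",
    "o3_1", "o3_8h", "co", "aqi", "level", "year", "month", "date", "hour",
    "city",
]

# Assumed types: calendar fields and level as integers, measurements as floats.
INT_FIELDS = ["level", "year", "month", "date", "hour"]
FLOAT_FIELDS = [
    "longitude", "latitude", "pm25", "pm10", "no2", "so3",
    "o3_1", "o3_8h", "co", "aqi",
]


def parse_line(line):
    """Split one comma-separated line and convert the numeric fields."""
    values = next(csv.reader([line]))
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    for name in FLOAT_FIELDS:
        record[name] = float(record[name])
    return record


record = parse_line("99000,115.49,38.88,43,68,21,20,104,104,0.6,60,2,2018,8,1,0,北京")
print(record["city"], record["pm25"], record["aqi"])
```

The sketch assumes every field is present and well formed; the full data set may contain missing or malformed values that need separate handling.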

Problems

  • Which city has the highest PM25 index, and which city has the lowest? You may choose the specific metric yourself, but you must justify your choice.
  • Please report the air quality distribution of 北京 (Beijing), 上海 (Shanghai) and 成都 (Chengdu) throughout February 2019. Specifically, how many days were Good, Moderate, Unhealthy, Very Unhealthy, and Hazardous?
  • Ask a non-trivial question of your own, and then answer it.
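
Both of the first two problems require reducing hourly station records to one value per group (per city, or per city and day). The following is a minimal plain-Python sketch of that grouping step, run on the three sample lines only. The helper name `mean_by_key` and the `COL` mapping are hypothetical, and the arithmetic mean is just one possible metric. The thresholds that map a daily value to Good, Moderate, and so on are not given here, so that step is left out. The actual project must run on Hive or Spark over HDFS; this only illustrates the logic.

```python
import csv
from collections import defaultdict

# 0-based column positions, following the field order in the Data Format section.
COL = {"PM25": 3, "AQI": 10, "year": 12, "month": 13, "date": 14, "city": 16}


def mean_by_key(lines, key_cols, value_col):
    """Average one numeric column over all rows that share the same key columns."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.reader(lines):
        key = tuple(row[COL[c]] for c in key_cols)
        sums[key] += float(row[COL[value_col]])
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# The three sample lines shown in the Data Format section.
SAMPLE = [
    "99000,115.49,38.88,43,68,21,20,104,104,0.6,60,2,2018,8,1,0,北京",
    "99001,115.51,38.88,38,58,26,20,120,120,0.6,54,2,2018,8,1,0,北京",
    "99002,115.47,38.91,50,72,22,17,113,113,0.7,69,2,2018,8,1,0,北京",
]

# Problem 1 (one candidate metric): mean PM25 per city, then highest and lowest.
pm25_by_city = mean_by_key(SAMPLE, ["city"], "PM25")
highest = max(pm25_by_city, key=pm25_by_city.get)
lowest = min(pm25_by_city, key=pm25_by_city.get)

# Problem 2 (first step only): mean AQI per city and calendar day.
daily_aqi = mean_by_key(SAMPLE, ["city", "year", "month", "date"], "AQI")

print(pm25_by_city, highest, lowest)
print(daily_aqi)
```

Because the sample contains a single city and a single hour, the highest and lowest city coincide here; on the full data set the same grouping would be expressed as a GROUP BY in Hive or Spark SQL.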

Requirements

  • The data must be stored in HDFS, and you can use either Hive or Spark.
  • The source code must be publicly available on GitHub.
  • Deadline: 23:59, 2021-01-20.
  • The project must be finished independently.

Course Forum

[email protected]