COMP6210 – Big Data Assignment help
COMP6210 – Big Data Assignment 1
MapReduce
Semester 2, 2024
School of Computing, Macquarie University
Dataset:
The Olympic historical dataset “Olympic_Athletes.zip” is available on iLearn. This dataset contains information about athletes who participated in the Summer and Winter Olympic Games from 1896 to 2022.
Programming Environment:
• MongoDB & Studio 3T: Used for creating databases and importing datasets into collections. • Pymongo: Used for connecting to MongoDB and extracting information from documents within collections.
• Mrjob: Used for implementing MapReduce programs.
Task 1: Data Curation (20 marks)
• Task 1.1 – Data Extraction (10 marks): Extract information about medal-winning athletes in the Summer Olympics from the 1980 edition to the 2020 edition. Note that you need to verify whether the athlete participated in the Summer Olympics and whether they won a medal (gold, silver, or bronze). Then, extract the following values:
For each qualified athlete, create an entry in the format: <id, country, year, event, medal>. Store these entries in a text file named “athletes.txt”. Refer to the following screenshot for formatting (note that the screenshot is for reference purpose only; actual results may vary).
You can also use other delimiters, such as commas, semicolons, underscores, or quotation marks, to separate each line’s values. The “athletes.txt” text file will then serve as the input for the subsequent MapReduce programs.
• Task 1.2 – Data Organization (10 marks): Using the generated “athletes.txt” file as input, implement a MapReduce program to sort the data in ascending order based on the athlete id. The partial results of Task 1.2 are similar to the following screenshot (note that the screenshot is for formatting reference only; actual results may vary).
Note that for records with the same athlete ID, there is no specific requirement regarding their order.
Task 2: Data Analysis with MapReduce (60 marks)
Using the generated “athletes.txt” file as input, implement three MapReduce programs to complete the following analysis tasks.
• Task 2.1 (20 marks) Find the top three athletes who won the most number of medals in each category (gold, silver, and bronze) in 1980-2020. Firstly, you need to calculate the total number of medals each athlete has earned in gold, silver, and bronze categories, respectively. Next, sort the athletes in descending order based on their medal counts for each category. Finally, for each medal category, output the top three athletes along with their respective medal counts.
Note that there is no specific requirement regarding the order of medal categories. The partial results of Task 2.1 are similar to the following screenshot (the screenshot is for formatting reference only; actual results may vary).
• Task 2.2 (20 marks) Find the top three countries with the most number of gold medals in 1980-2020. First, you need to count the total number of each medal type (gold, silver, and bronze) for each country. Then, sort the countries in descending order based on their gold medal count. Finally, output the top three countries along with their medal counts for all medal types (gold, silver, and bronze).
The partial results of Task 2.2 are similar to the following screenshot (note that the screenshot is for formatting reference only; actual results may vary).
• Task 2.3 (20 marks) Find the top three events with the highest medal counts for each decade in 1980- 2020. Firstly, for each decade (e.g., 2010-2020, 1980-1989, etc.), you need to find the event with the greatest number of medals for each country, i.e., calculate the total medal count by summing gold, silver, and bronze medals. Then, within each decade, sort the events by their total medal count in descending order, and output the medal counts of the top 3 events (Note: The decades should be listed in descending order, and the medal counts for the top 3 events within each decade should be also sorted in descending order).
The partial results of Task 2.3 are similar to the following screenshot (note that the screenshot is for formatting reference only; actual results may vary).
Task 3: MapReduce Flowcharts (20 marks)
For the three MapReduce programs in Task 2, create a 3-4 page Word or PDF document that includes flowcharts to illustrate the process of each MapReduce program. You can use specific data examples to clarify the processes. Note that your flowcharts should be consistent with your code. You can refer to the following example provided in the Week 5 lecture notes as an example.
To create your diagrams, we recommend using the online diagramming tool. If you choose to use online tools, you can include the diagram link in your document.
Programming Environment:
You are required to use Python to complete all code-related parts of this assignment. The use of other programming languages is strictly prohibited and will result in a loss of all marks.
For Task 1.2 and Tasks 2.1-2.3, you are required to use only the MapReduce model to complete the corresponding tasks. The use of any additional Python programs is not allowed. Even if correct results are obtained, failing to use MapReduce will result in 0 marks.
Submission:
Submit a zip file named ‘FirstName_LastName_Assignment1.zip’ via iLearn. The submission should include the following items:
• Source code for Task 1: ‘task1_1.py’ and ‘task1_2.py’.
• Source code for Task 2: ‘task2_1.py’, ‘task2_2.py’, and ‘task2_3.py’.
• Output files for Task 1: ‘athletes.txt’ and ‘output1_2.txt’.
• Output files for Task 2: ‘output2_1.txt’, ‘output2_2.txt’, and ‘output2_3.txt’.