Department of Computing Technologies 

COS60008 Introduction to Data Science  

Semester 1 2024 – Assignment 1  

Due: 23:59, Friday 15 March 2024 

Introduction 

This is an individual assignment worth 15% of your final grade. It is intended to evaluate your understanding of, and practical skills in, the first few steps of a typical data science process.

In this assignment, you are provided with three data files, i.e., “data1.csv”, “data2.csv” and “data3.csv”, which form a dataset created from a higher education institution and relate to students enrolled in different undergraduate degrees. The files “data1.csv” and “data2.csv” contain the same set of students but distinct sets of attributes for describing each student, where each student has a unique ID. The file “data3.csv” contains a different set of students, with each student described by all attributes from both “data1.csv” and “data2.csv”.

You are asked to carry out data acquisition, preparation and exploration steps based on the three data sources according to the given instructions. For example, you need to develop and implement appropriate steps to load and merge the data from the three data files, perform data cleaning, carry out exploratory data analysis, and report your findings.

A discussion forum for the assignment will be available in Canvas. If required, further announcements about the  assignment will be posted in Canvas. You are responsible for checking Canvas on a regular basis to stay informed  with regards to any updates about the assignment.  

 

 

Academic Integrity  

The submitted assignment must be your own work, and any parts that are not created by yourself must be properly  referenced. Plagiarism is treated very seriously at Swinburne. It includes submitting the code and/or text copied  from other students, the Internet or other resources without proper reference. Allowing others to copy your work is  also plagiarism. Please note that you should always create your own assignment even if you have very similar ideas  with other students.  

Plagiarism detection software will be used to check your submissions. Severe penalties (e.g., zero mark) will be  applied in cases of plagiarism. For further information, please refer to the relevant section in the Unit Outline under  the menu “Syllabus” in Canvas and the Academic Integrity information at   

General Requirements  

This section contains the general requirements which must be met by your submitted assignment.  Marks will be deducted if you fail to meet any of the following general requirements.  

Use Python 3 & Notebook: You must complete Tasks 1 and 2 using the Jupyter Notebook format with a Python 3 kernel. 

Use a single notebook file: All code for Tasks 1 and 2 must be written inside a SINGLE notebook file (assignment1.ipynb). 

Include the Markdown header section: At the start of the notebook file, include and complete the following header section as a cell (for the correct semester). Replace the ( ) placeholder text with your details.

Include Task Headings: Before each task, include an appropriate Markdown cell with the task label as a level 2 (##) heading. For example: ## Task 1 – Data Acquisition & Preparation

Use cells for sub-tasks: Create appropriate cells for sub-tasks within Tasks 1 and 2. 

• Don’t have a single cell with too much code that combines different sub-tasks. 

• Don’t have a single cell for every single line of python code. 

• It is your job to communicate effectively. 

Code Comments: You must include code-level comments in your assignment1.ipynb file to explain the key parts of your code. 

• If you do not have code comments that support your code answer, your mark will be reduced even if the code is correct. (Note that this is for KEY parts of your work, not every part of it.) 

• It is valuable to make your code comments unique so that your work does not resemble other students’ work when assessed. Put things in your own words!

• You do NOT have to explain every single line of code or things that are very easy for another programmer to understand. 

Graphs are Clear and Labelled: All your plots should have appropriate titles and axis labels. They need to be presented clearly so that they can be easily understood. 

Follow Task Instructions: You must follow the instructions exactly as given in each task and complete them.

Create a Report: You must create the report for Task 3 exactly as instructed.

• Submit the report as a PDF file named “assignment1.pdf”. 

• You must include the headings and details as specified in the Task 3 instructions. 

Submit Correctly: You must follow the details specified in the “Submission Requirements” section to make your final submission.

Task 1 – Data Acquisition & Preparation (30%) 

Firstly, you need to acquire three data files “data1.csv”, “data2.csv”, and “data3.csv”, which are included in a single  .zip file named “assignment1_data.zip”, under the menu “Assignments” > “Assignment 1” in Canvas. Put these files  into your working folder for the assignment in Jupyter Lab ready to use.  

These data files are adapted from the “Student Drop out and Academic Success” data set in the UCI repository, and contain many student records, with each record describing a student in terms of various attributes.

The files “data1.csv” and “data2.csv” contain the same set of students but two distinct sets of attributes for  describing a student. In contrast, the file “data3.csv” contains a different set of students, where each record of the  student consists of all attributes from both “data1.csv” and “data2.csv”.  

The set of 38 possible attributes for a student record and their corresponding value ranges is shown in Table 2 in the  Appendix section of this document.  

As a data scientist, you have been asked to analyse the data from the three data files. However, before doing that you know that you need to carry out some data preparation operations, e.g., merging and cleaning the data.  

In this task, you are asked to utilise the Python package “Pandas” to do the following steps:  

1.1. Load the data from the three data files into three Pandas DataFrame entities and check whether each loaded data set is equivalent to the data contained in the raw data file.

1.2. Merge the three data frames into a single one that contains all students, where each student has a unique ID  and is described by all the 38 attributes listed (see Table 2).  

1.3. Clean the data by using the knowledge you have learned.  

• You need to deal with the issues existing in the data, e.g., missing values, duplicates, impossible values and extra whitespaces. However, you must NOT modify any parts of data that do not suffer from issues. Failing to do so will lead to mark reduction. 

• When dealing with missing values (if any), you can remove an entire row or column ONLY IF more than 50% of its elements are missing. Otherwise, you must find other appropriate cleaning methods to handle missing values. 

• You must be able to explain how you detect each data issue and why you choose a specific cleaning method to deal with it. 
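
The steps in 1.1 to 1.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed solution: the three in-memory CSV snippets stand in for the real files, the split of columns between “data1.csv” and “data2.csv” is assumed, all values are made up, and median imputation is only one of several possible ways to handle missing values.

```python
import io
import pandas as pd

# Tiny stand-ins for the three files (hypothetical values). With the real
# data you would call pd.read_csv("data1.csv") and so on instead.
csv1 = io.StringIO("ID,Age at enrolment\n1,19\n2,21\n3,\n")
csv2 = io.StringIO("ID,Admission grade,Target\n1,120.5, Graduate\n2,133.0,Dropout \n3,101.2,Enrolled\n")
csv3 = io.StringIO(
    "ID,Age at enrolment,Admission grade,Target\n"
    "4,25,140.0,Graduate\n4,25,140.0,Graduate\n5,18,99.9,Dropout\n"
)
df1, df2, df3 = (pd.read_csv(f) for f in (csv1, csv2, csv3))

# 1.1: basic checks that each load matches the raw file (row/column counts, types)
for name, frame in {"data1": df1, "data2": df2, "data3": df3}.items():
    print(name, frame.shape, list(frame.columns))

# 1.2: data1 and data2 describe the same students, so join them column-wise on ID;
# data3 holds different students with all attributes, so stack it underneath.
same_students = df1.merge(df2, on="ID", how="outer", validate="one_to_one")
merged = pd.concat([same_students, df3], ignore_index=True)

# 1.3a: detect and remove fully duplicated rows
n_duplicates = int(merged.duplicated().sum())
cleaned = merged.drop_duplicates().reset_index(drop=True)

# 1.3b: strip extra whitespace from a text column
cleaned["Target"] = cleaned["Target"].str.strip()

# 1.3c: drop a column only if more than 50% of its values are missing
missing_share = cleaned.isna().mean()
cleaned = cleaned.drop(columns=missing_share[missing_share > 0.5].index)

# 1.3d: otherwise fill the gaps; the median is used here as one possible choice
age = "Age at enrolment"
cleaned[age] = cleaned[age].fillna(cleaned[age].median())

print(n_duplicates, cleaned.shape)
```

The `validate="one_to_one"` argument makes the merge raise an error if an ID appears more than once in either frame, which is a cheap way to catch duplicate IDs before they multiply rows.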

  Task 2 – Data Exploration (25%)  

At this point you should have finished Task 1 and obtained a single DataFrame containing the merged and cleaned  data. You can now start to explore your data by carrying out the following steps:  

2.1. Choose one column with categorical values and one column with numerical values. Visualise the data of each column type in an appropriate way. Note that you need to explore and identify potentially important columns, and be able to justify your choice. Don’t just make a random choice. Explore and then decide.

2.2. Choose three pairs of columns and explore the relationship within the column pairs using appropriate  descriptive statistics and visualisation tools. Like Task 2.1 don’t just make a random choice. Explore the data and  then decide. Your choice of column pairs should be done to address a “plausible hypothesis” on the data.  

2.3. Choose six (6) numerical columns and build a scatter matrix. State why you selected the columns.  
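
For the column pairs in 2.2, the kind of descriptive statistic depends on the types of the two columns. The sketch below shows one common option for each combination. The data frame and its values are hypothetical, and the chosen pairs are placeholders rather than recommended hypotheses.

```python
import pandas as pd

# Hypothetical cleaned data; the values are made up for illustration.
df = pd.DataFrame({
    "Admission grade": [120.0, 135.5, 101.0, 150.0, 128.0, 110.0],
    "Curricular units 1st sem (grade)": [12.0, 14.5, 10.0, 16.0, 13.0, 11.0],
    "Scholarship holder": [0, 1, 0, 1, 0, 0],
    "Target": ["Graduate", "Graduate", "Dropout", "Graduate", "Enrolled", "Dropout"],
})

# Numerical vs numerical: Pearson correlation coefficient
r = df["Admission grade"].corr(df["Curricular units 1st sem (grade)"])

# Categorical vs numerical: summary statistics of the number within each category
grade_by_target = df.groupby("Target")["Admission grade"].describe()

# Categorical vs categorical: contingency table, shown as proportions per row
table = pd.crosstab(df["Scholarship holder"], df["Target"], normalize="index")

print(round(r, 3))
print(grade_by_target[["count", "mean", "std"]])
print(table)
```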

Note: Graphs (plots) must contain appropriate titles, axis labels, etc. so that they are self-explanatory. Graphs should be clear enough for readers to read and understand (size and information).

Titles and labels on your graphs also help you to not be confused about what graph you are looking at! You will be  creating many graphs so it is worth doing this properly from the start.  

For your graphs, you will probably need to investigate what appropriate axes text labels to use, as the data set does  not have the text description of the numerical columns. 
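
The labelling requirements above can be met with a few explicit calls on each plot. The following is a sketch under assumptions: the data are randomly generated stand-ins, the six numerical columns are picked only to show the mechanics, and the title and label texts are examples. It covers a bar chart for a categorical column, a histogram for a numerical column, and a scatter matrix.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Randomly generated stand-in data for 30 hypothetical students.
rng = np.random.default_rng(0)
n = 30
df = pd.DataFrame({
    "Target": rng.choice(["Dropout", "Enrolled", "Graduate"], size=n),
    "Admission grade": rng.uniform(95, 190, size=n),
    "Previous qualification (grade)": rng.uniform(95, 190, size=n),
    "Age at enrolment": rng.integers(17, 71, size=n),
    "Curricular units 1st sem (grade)": rng.uniform(0, 20, size=n),
    "Curricular units 2nd sem (grade)": rng.uniform(0, 20, size=n),
    "Unemployment rate": rng.uniform(5, 20, size=n),
})

# Categorical column: bar chart of category counts
fig_bar, ax_bar = plt.subplots(figsize=(6, 4))
df["Target"].value_counts().plot(kind="bar", ax=ax_bar)
ax_bar.set_title("Number of students by outcome")
ax_bar.set_xlabel("Outcome (Target)")
ax_bar.set_ylabel("Number of students")

# Numerical column: histogram of the value distribution
fig_hist, ax_hist = plt.subplots(figsize=(6, 4))
df["Admission grade"].plot(kind="hist", bins=10, ax=ax_hist)
ax_hist.set_title("Distribution of admission grades")
ax_hist.set_xlabel("Admission grade")
ax_hist.set_ylabel("Number of students")

# Six numerical columns: scatter matrix with histograms on the diagonal
numeric_cols = [c for c in df.columns if c != "Target"]
axes = scatter_matrix(df[numeric_cols], figsize=(10, 10), diagonal="hist")
fig_matrix = axes[0, 0].get_figure()
fig_matrix.suptitle("Scatter matrix of six numerical attributes")
```

With long column names the scatter matrix labels can overlap, so renaming the columns to shorter display names before plotting is often worthwhile.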

  Task 3 – Report (45%)  

In this task, you are asked to write a report to elaborate on your analyses and findings from Tasks 1 and 2.

NOTE: In the report you will be explaining things.  

Do NOT include Python code in this report as that is already in your notebook file.  

When you are asked to explain how you did something, focus on the concepts or principles, not the code used. You can refer to the cell where the code is if you want, but it is usually not needed.

We DO want to see your clear communication in words, supported by graphs.

 

Start by giving your report an appropriate title (the exact title is up to you), and include your name, student ID and  student email address, and the date of the report. This is a good professional communication standard to have for  all your reports. Make sure these details match exactly the details in your notebook file (assignment1.ipynb). 

Also include the Unit Code and Title, and the Year and Semester. The exact layout and formatting of the report is up  to you, but a simple template has been provided. You do NOT have to use the template. 

You should then:  

3.1. Create a sub-heading titled “Introduction” 

• In one paragraph (with approximately 3-4 sentences) clearly state the purpose of this report. Tip: Explain what the data source is, why you have written the report (past tense, “To communicate the findings of Task 1 and Task 2 of the assignment”), and what key findings (if any) you found (as a one-sentence summary). This will set you up for the next two sub-sections.

3.2. Create a sub-heading titled “Task 1: Data Acquisition & Preparation” in your report under which you should:

• Briefly describe how you addressed this task.

• Describe how you merged the data from the three data files.

• Describe each of the data issues you detected in data cleaning, explain how you detected it, and justify why you chose a specific data cleaning method to deal with it. 

• Discuss any problems you encountered when undertaking this task and how you solved them. 

3.3. Create a sub-heading named “Task 2: Data Exploration” in your report under which you need to:

• Create a sub-section with an appropriate title for each of the three sub-tasks in Task 2.

• In the sub-section for sub-task 2.1, for each selected column, include the graph(s) created for that column, and provide a brief explanation on why you chose that column and a specific visualisation method to explore it. 

• In the sub-section for sub-task 2.2, briefly explain why you chose each of the three pairs of columns (e.g., stating the hypotheses that you intended to address), include the descriptive statistics and graph(s) for each of the three selected pairs, followed by a brief discussion on any interesting findings about the presence or lack of relationship between the two involved columns. 

• In the sub-section for sub-task 2.3, include the plot of the scatter matrix, state why you selected the six columns (i.e. were you hoping to see a particular relationship?), and report your findings from the plot.

3.4. Create a sub-heading titled “Conclusions” 

• In one paragraph (with approximately 1-2 sentences) restate the key outcomes of this report. Don’t say anything that hasn’t already been stated before. Make sure it matches with the Introduction and the purpose you described there.

 

Note: You must give each graph a figure number and a brief caption (e.g. “Figure 1. Relationship shown between …”) and you must refer to each figure in the text of your report. Don’t use words like “above” or “below” (e.g. don’t write “In Figure 2 below it shows …”, just use “In Figure 2 it shows …”). That way, your graphs will always make sense to the reader and it does not matter if they are moved around!

Tip: It is okay to add very clear sub-sub-headings to address each of the requirements above. Avoid large sentences  and large paragraphs. Try to be direct and concise with your words. We do not mark based on how many words you  write, but on the quality of the points you make. We do take marks away if there are too many unrelated words or if  the words don’t add a valuable point that has been asked for.

 

The report must be saved in the PDF format and named “assignment1.pdf” for submission. 

Your final report file MUST  

• be named “assignment1.pdf”, 

• be written in a single column format, with 

• font size between 10 and 12 points, and 

• have no more than 7 pages (including tables, graphs and/or references). 

Penalties (mark deductions) will apply if the report does not satisfy these requirements. Moreover, the quality of the  report will be considered when marking, e.g. organisation, clarity, and grammatical mistakes.  

Please remember to cite any sources (as “References”) which you have referred to when doing your work! Sometimes a “footnote” will be appropriate. Remember that citing sources is a way to show what you know and understand, and should not be avoided. References and footnotes are evidence all good students will have in their reports.

  Submission Requirements  

The assignment is due at: 23:59, Friday 15 March 2024 

Assignments submitted after this time are subject to late submission penalties. For detailed information, refer to the relevant section in the Unit Outline under the menu “Syllabus” in Canvas.

You need to prepare the following three files:  

1. A notebook file named assignment1.ipynb which contains markdown headings, and all your code and code-level comments for Tasks 1 and 2. 

2. An HTML version of the notebook file, with output, named assignment1.html.

3. A report file named assignment1.pdf which must strictly follow the format requirements detailed in Task 3. 

Note: Please make sure to clean the code before submitting to remove all unnecessary code. Ensure that all the data are printed and all the graphs are displayed as expected in your file.

 

To submit, you must upload these THREE files in Canvas under: “Assignments” > “Assignment 1” 

1. assignment1.ipynb 

2. assignment1.html 

3. assignment1.pdf 

Please do NOT submit any other unnecessary files. Marks will be deducted if you do. 
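
One way to confirm that a folder holds exactly the three required files before uploading is a short check like the one below. It is only a sketch: the helper name `check_submission` and the temporary folder used in the demonstration are hypothetical, and the three file names are those listed above.

```python
import tempfile
from pathlib import Path

# The three file names required for submission.
REQUIRED = {"assignment1.ipynb", "assignment1.html", "assignment1.pdf"}

def check_submission(folder):
    """Return (missing, extra): required files not found, and files that should not be there."""
    present = {p.name for p in Path(folder).iterdir() if p.is_file()}
    return sorted(REQUIRED - present), sorted(present - REQUIRED)

# Demonstration on a temporary folder with one file missing and one extra file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("assignment1.ipynb", "assignment1.pdf", "data1.csv"):
        (Path(tmp) / name).touch()
    missing, extra = check_submission(tmp)

print("Missing:", missing)
print("Extra:", extra)
```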

Extensions will only be permitted in exceptional circumstances. You should always backup your code and other  assignment-related documents frequently to avoid potential loss of progress. Note that any accidental loss of  progress, working while studying, and/or a heavy load of assignments will not be accepted as the exceptional  circumstances for an extension. For detailed information, please refer to the relevant section in the Unit Outline  under the menu “Syllabus” in Canvas. 

  Assessment Criteria  

Table 1 shows task number, summary details and points awarded when your work is assessed. A detailed  rubric is available in the Canvas unit website under “Assignments” > “Assignment 1”. See there for  complete details.  

Note that the total for Task 1 is 30 points, Task 2 is 25 points, and Task 3 is 45 points, for a total of 100  points. Deductions will occur if General Requirements, as stated earlier, are not followed.  

Table 1: Assessment Task, Summary Details and Points.  

| Task | Summary | Details | Points |
|---|---|---|---|
| 1.1 | Loading the data | Loading the data from the three data files into three Pandas DataFrame entities and checking whether the loaded data are equivalent to the data contained in the raw data files. | |
| 1.2 | Merging | Merging the obtained three DataFrame entities into a single one that contains all records, where each record has a unique ID and all the listed attributes. | |
| 1.3 | Cleaning | Cleaning the data by using the knowledge you have acquired. | |
| 2.1 | Visualising categorical and numerical values | Choosing two columns with categorical and numerical values, respectively, and visualising each of them in an appropriate way. | |
| 2.2 | Exploring relationships | Choosing three pairs of columns and exploring the relationship between the two columns involved in each pair via appropriate descriptive statistics and visualisation tools. | |
| 2.3 | Scatter matrix | Choosing six (6) numerical columns of interest and generating a scatter matrix. | |
| 3.1 | Introduction | Purpose of report clearly stated and suitable preparation information for the next two sections of the report given. Content is consistent with the rest of the report. | |
| 3.2 | Report Section – Task 1 | Report on “Task 1: Data Acquisition & Preparation”. | |
| 3.3 | Report Section – Task 2 | Report on “Task 2: Data Exploration”. | |
| 3.4 | Conclusion | Clear restating of the key outcomes presented in the report. | |
| | Total Points | | 100 |

 

Appendix: Data Attribute Details 

Table 2: The set of 38 possible attributes for a student record with corresponding value ranges. 

| Feature | Type | Min | Max |
|---|---|---|---|
| ID | Discrete | | 5000 |
| Marital status | Discrete | | |
| Application mode | Discrete | | 60 |
| Application order | Discrete | | |
| Course | Discrete | 10 | 10000 |
| Daytime/evening attendance | Discrete | | |
| Previous qualification | Discrete | | 45 |
| Previous qualification (grade) | Continuous | 95 | 190 |
| Nationality | Discrete | | 109 |
| Mother’s qualification | Discrete | | 45 |
| Father’s qualification | Discrete | | 45 |
| Mother’s occupation | Discrete | | 200 |
| Father’s occupation | Discrete | | 200 |
| Admission grade | Continuous | 95 | 190 |
| Displaced | Discrete | | |
| Educational special needs | Discrete | | |
| Debtor | Discrete | | |
| Tuition fees up to date | Discrete | | |
| Gender | Discrete | | |
| Scholarship holder | Discrete | | |
| Age at enrolment | Discrete | 17 | 70 |
| International | Discrete | | |
| Curricular units 1st sem (credited) | Discrete | | 50 |
| Curricular units 1st sem (enrolled) | Discrete | | 50 |
| Curricular units 1st sem (evaluations) | Discrete | | 50 |
| Curricular units 1st sem (approved) | Discrete | | 50 |
| Curricular units 1st sem (grade) | Continuous | | 20.00 |
| Curricular units 1st sem (without evaluations) | Discrete | | 50 |
| Curricular units 2nd sem (credited) | Discrete | | 50 |
| Curricular units 2nd sem (enrolled) | Discrete | | 50 |
| Curricular units 2nd sem (evaluations) | Discrete | | 50 |
| Curricular units 2nd sem (approved) | Discrete | | 50 |
| Curricular units 2nd sem (grade) | Continuous | | 20.00 |
| Curricular units 2nd sem (without evaluations) | Discrete | | 12 |
| Unemployment rate | Continuous | -100 | 100 |
| Inflation rate | Continuous | -100 | 100 |
| GDP | Continuous | -100 | 100 |
| Target (Dropout, Enrolled, Graduate) | | | |
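
The value ranges in Table 2 can be used to detect impossible values during cleaning. The sketch below assumes a small made-up data frame and uses only three attributes for which Table 2 lists both a minimum and a maximum. The helper name `out_of_range` is hypothetical.

```python
import pandas as pd

# (min, max) pairs taken from Table 2 for three of the attributes.
VALID_RANGES = {
    "Age at enrolment": (17, 70),
    "Admission grade": (95, 190),
    "Previous qualification (grade)": (95, 190),
}

# Made-up records; two of them contain a value outside the allowed range.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Age at enrolment": [19, 7, 45, 30],
    "Admission grade": [120.0, 130.0, 250.0, 95.0],
    "Previous qualification (grade)": [100.0, 150.0, 160.0, 190.0],
})

def out_of_range(frame, ranges):
    """Return a boolean DataFrame that is True where a value lies outside its range.

    Missing values are not flagged here, since they are a separate issue.
    """
    flags = pd.DataFrame(False, index=frame.index, columns=list(ranges))
    for col, (low, high) in ranges.items():
        # between() is inclusive at both ends by default
        flags[col] = ~frame[col].between(low, high) & frame[col].notna()
    return flags

flags = out_of_range(df, VALID_RANGES)
bad_ids = df.loc[flags.any(axis=1), "ID"].tolist()

print(flags.sum())   # number of impossible values per column
print(bad_ids)       # IDs of the affected records
```

How a flagged value is then handled (for example, treating it as missing) is a separate decision that should be justified in the report.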
