Assignment Task
Detailed Coursework Specification
The task is to implement ML solutions and produce a written report, working individually. Two datasets are provided (see below): choose one of them and carry out all the associated tasks. Whichever option you choose, you must use the dataset downloaded from the link listed in this coursework specification, because the datasets provided have been modified to be better tailored to this module. While they are in principle available for download in their original form,
Dataset
Option 1: Predicting the severity of road accidents in the UK.
Emergency services in the UK (non-commercial entities) are looking into developing a system to provide a more effective response to road accidents. They want to know whether it is possible to predict the severity of an accident from variables that can potentially be gathered at the scene of the accident. They provide you with a dataset of all the accidents that occurred in the UK in 2019, which contains their variable of interest. They want you to use machine learning techniques to predict the accident severity in detail (that is, whether an accident is “fatal”, “serious” or “slight” – 3 classes in total) from all the other features in the dataset. For the avoidance of doubt, the information about accident severity is contained in the column titled “accident_severity”. They want to compare traditional machine learning algorithms with neural networks to see if the latter offer significantly higher performance. They want you to write up the results of your analysis and implementation in a report. More details about what to include in the report are provided below.
Option 2: Predicting the topic of customers’ banking questions.
You are consulting for a bank (a commercial entity) on how to make their online customer service more effective. They want to trial an automated first level of filtering for questions asked by customers in an online chat. Specifically, they want to know whether it is possible to identify the topic a question relates to, for some specific topics of interest. You are provided with a dataset of sample questions and their associated topics. They want you to use machine learning to predict the question topic in detail (that is, whether a question is about “card queries or issues”, “needs troubleshooting”, “top up queries or issues” or “other” – 4 classes in total). For the avoidance of doubt, the question topic is contained in the column titled “label”. They want to compare traditional machine learning algorithms with neural networks to see if the latter offer significantly higher performance. They want you to write up the results of your analysis and implementation in a report. More details about what to include in the report are provided below.
1. Executive summary
Briefly summarize what the report contains. That is: the task you are solving and why it is important; the outline of the ML methods you implemented and any experiments performed; the summary of your results and your conclusions.
2. Exploratory data analysis
Describe the exploratory data analysis performed and comment on what its implications are for the machine learning task. As part of the exploratory data analysis, you should use dimensionality reduction techniques to show the dataset (including the target labels) in a 2-dimensional plot.
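As an illustration only, the sketch below shows one possible dimensionality reduction technique, principal component analysis (PCA), computed by power iteration using only the Python standard library. It is not a required method: the function name `pca_2d` and the toy data are hypothetical, the features are assumed to be already numeric (text would first need a numeric representation), and the data are assumed to vary in at least two directions. In practice a library implementation would normally be used, and the two output columns would be passed to a scatter plot coloured by the target label.

```python
import math
import random


def pca_2d(rows, iterations=200, seed=0):
    """Project numeric rows onto their top two principal components.

    Hypothetical helper: uses power iteration on the covariance matrix
    and assumes the data have non-zero variance in at least two directions.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            # Remove any part of v that lies along components already found.
            for u in components:
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            # One power-iteration step, followed by normalisation.
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        components.append(v)

    # Coordinates of every centred row along the two components.
    return [[sum(c[j] * u[j] for j in range(d)) for u in components]
            for c in centred]


# Toy data (made up for illustration): three numeric features per row.
toy_rows = [[float(i), 2.0 * i + 0.1 * (i % 3), 0.5 * (i % 2)] for i in range(20)]
toy_projection = pca_2d(toy_rows)  # list of [pc1, pc2] pairs, one per row
```

PCA is linear, so classes that overlap in this plot may still be separable by a non-linear model; that limitation is worth mentioning when commenting on the plot.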
3. Data preprocessing
Describe the steps performed for data cleaning, splitting (training/validation/test) and preprocessing (where appropriate: normalization/standardization, imputation of missing values, feature encoding, over/under-sampling, text processing). Provide justifications, based on theory and/or experiments, for your design choices.
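The splitting and standardization steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed pipeline: the helper names (`stratified_split`, `fit_standardizer`, `standardize`), the 70/15/15 fractions and the seed are all hypothetical choices. The point it shows is that the split is made per class, so rare classes appear in every partition, and that standardization statistics are computed on the training rows only and then reused for validation and test rows, which avoids leaking information from held-out data.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions.

    The fractions and seed are illustrative defaults, not required values.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains for this class
    return train, val, test


def fit_standardizer(train_rows):
    """Compute per-feature mean and standard deviation on training rows only."""
    n, d = len(train_rows), len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(d)]
    # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in train_rows) / n) or 1.0
            for j in range(d)]
    return means, stds


def standardize(rows, means, stds):
    """Apply previously fitted statistics to any partition."""
    return [[(r[j] - means[j]) / stds[j] for j in range(len(r))] for r in rows]
```

With very small classes, rounding can leave a partition without any example of a class, so the per-class counts in each partition are worth checking and reporting.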
4. Classification using traditional machine learning
Describe your solution to the classification task (accident severity for option 1 and question topic for option 2) using traditional machine learning techniques. You should describe the final model hyper-parameters in detail, ideally in a table, and give a brief explanation of how the algorithm works. Describe the experiments you did to optimize your model (hyper-parameter optimization and comparison with other models) – these experiments should be rigorous and follow best practice. Provide justifications, based on theory and/or experiments, for your design choices. Evaluate the model performance using a) a confusion matrix; b) two performance metrics (explain what each metric computes, why it is an appropriate metric to use and what the implications of the results are for the task you are solving); c) a comparison with one “trivial” baseline (for example, random guess or majority class).
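To make the three evaluation items concrete, the sketch below computes a confusion matrix, two example metrics (accuracy and macro-averaged F1) and a majority-class baseline using only the Python standard library. The function names and the toy labels are hypothetical, and accuracy and macro-F1 are only one possible pair of metrics; you are expected to choose and justify your own.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of all examples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[k][k] for k in range(len(matrix))) / total


def macro_f1(matrix):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fp = sum(row[k] for row in matrix) - tp   # predicted k, true class differs
        fn = sum(matrix[k]) - tp                  # true k, predicted class differs
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def majority_baseline(y_train, n_predictions):
    """Trivial baseline: always predict the most frequent training class."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * n_predictions


# Toy, made-up labels with a class imbalance.
classes = ["fatal", "serious", "slight"]
y_train = ["slight"] * 8 + ["serious"] + ["fatal"]
y_test = ["slight"] * 8 + ["serious"] + ["fatal"]

baseline_pred = majority_baseline(y_train, len(y_test))
baseline_matrix = confusion_matrix(y_test, baseline_pred, classes)
baseline_accuracy = accuracy(baseline_matrix)
baseline_macro_f1 = macro_f1(baseline_matrix)
```

On imbalanced data such as this toy example, the majority baseline scores well on accuracy but poorly on macro-F1, which is one reason to report more than one metric and to compare against a trivial baseline.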
5. Classification using neural networks
Describe your solution to the classification task (accident severity for option 1 and question topic for option 2) using neural networks. You should describe the final model hyper-parameters in detail, ideally in a table, and give a brief explanation of how the algorithm works.
6. Ethical discussion
Identify and discuss some of the social and ethical implications of your chosen task, from data collection and processing to the ML prediction. It is highly recommended that you structure the discussion using either Data Hazard Labels or the Ethical OS Toolkit. The discussion should take into account communities and people that may be affected by the ML system.
7. Recommendations
You should provide three bullet points detailing the following:
- Which of your machine learning models is the best candidate for the task and why.
- Whether the final model is good enough to be used in practice and why (or why not).
- Your top suggestion for future improvements and why.
8. Retrospective
The last section in the report is a reflection on the work you have done for this coursework. You should write a maximum of words answering the following question: if you were to start the coursework all over again, what aspect of it would you want to investigate more in depth and why?