DS6503 – Data Mining Tools and Techniques

Assignment Task

Application of Data Mining Techniques

Learning Outcomes covered:

Define the data requirements for a range of analytical problems

Identify and explain the basic application of a variety of commonly used data mining techniques

Perform an introductory analytical investigation using the data science process and a statistical programming tool

Assignment Deliverables

You may work on this assignment individually or in a group of no more than three students. Only one submission per group is required. Your submission should include a Word report and an R Studio script. These deliverables are described in more detail below.

1. R Studio Script

A script created in R Studio that performs the processing described in the scenarios below. Your script should include comments that briefly describe the processing performed to complete each task in the scenarios. Load all libraries required by the functions within your script, but omit statements that install these libraries.

2. Analysis Report

A report that describes the analysis performed in more detail and addresses the specific questions raised in each scenario. Include code snippets from your script to illustrate the discussion of the analysis performed and the visualisations that were generated.

Zip a folder that contains both the R Studio script file and the Word report, and submit it via Moodle by the due date using the filename YourStudentID_DS6503.Zip.

Extensions

Extensions of time will be granted to students who have an acceptable, documented reason for not completing the assessment by the specified due date.

Scenario 1: Predicting Software Reselling Profits

Tayko Software is a software catalogue firm that sells games and educational software. It started out as a software manufacturer and then added third-party titles to its offerings. It recently revised its collection of items in a new catalogue, which it mailed out to its customers. This mailing yielded 2000 purchases. Based on this data, Tayko wants to devise a model for predicting the spending amount that a purchasing customer will yield. The file Tayko.csv contains information on 2000 purchases. Table 6.10 describes the variables to be used in the problem (the Excel file contains additional variables).

Explore the relationship between spending and two predictors by creating two scatterplots (Spending vs. Freq, and Spending vs. last_update_days_ago). Does there seem to be a linear relationship?
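
The exploration described above can be sketched in base R as follows. This is only an illustration: Tayko.csv is not reproduced here, so the data frame `tayko` is a synthetic stand-in with made-up values, and only the column names Spending, Freq and last_update_days_ago come from the task description. The shape of the plots for the real data may look quite different.

```r
# Synthetic stand-in for Tayko.csv (values are made up for illustration only).
# With the real file one would instead use: tayko <- read.csv("Tayko.csv")
set.seed(1)
tayko <- data.frame(
  Freq = rpois(200, 2),
  last_update_days_ago = sample(1:4000, 200, replace = TRUE)
)
tayko$Spending <- 20 + 90 * tayko$Freq + rnorm(200, sd = 60)

# Two scatterplots side by side, each with a fitted straight line as a visual guide.
op <- par(mfrow = c(1, 2))
plot(Spending ~ Freq, data = tayko, main = "Spending vs. Freq")
abline(lm(Spending ~ Freq, data = tayko), col = "red")
plot(Spending ~ last_update_days_ago, data = tayko,
     main = "Spending vs. last_update_days_ago")
abline(lm(Spending ~ last_update_days_ago, data = tayko), col = "red")
par(op)

# Pearson correlations put a number on the strength of each linear relationship.
cors <- c(
  Freq = cor(tayko$Spending, tayko$Freq),
  last_update_days_ago = cor(tayko$Spending, tayko$last_update_days_ago)
)
print(round(cors, 3))
```

A correlation close to zero, or a point cloud that fans out or bends away from the fitted line, would suggest that a straight-line relationship is a poor description.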

To fit a predictive model for Spending:

i. Partition the 2000 records into training and validation sets (60:40 split).

ii. Run a multiple linear regression model for Spending vs. all six predictors. Using the summary details of the generated regression model, what equation has this model estimated for predicting Spending?

iii. Based on this model, what type of purchaser is most likely to spend a large amount of money?

iv. Based on the summary details of this model, which predictor should be dropped first if we wished to reduce the number of predictors used in the regression model?

v. Using the regression equation generated from the model and the column values for the first purchase record in the validation set, predict the value for Spending. Compare this with the actual value for Spending to determine the prediction error.

vi. Evaluate the predictive accuracy of the model by examining its performance on the validation set.

vii. Create a histogram of the model residuals. Do they appear to follow a normal distribution? How does this affect the predictive performance of the model?
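
The modelling steps above (partition, fit, predict one record by hand, measure validation accuracy, plot residuals) might be sketched in base R as shown here. Several things are assumptions made for illustration: the data frame `purchases` is synthetic, the column `is_web_placeholder` is a hypothetical stand-in for the predictors listed in Table 6.10, the error is taken as actual minus predicted, and the histogram uses validation residuals. None of the numbers it prints relate to the real Tayko data.

```r
# Synthetic stand-in for Tayko.csv; all values are made up for illustration.
set.seed(42)
n <- 2000
purchases <- data.frame(
  Freq = rpois(n, 2),
  last_update_days_ago = sample(1:4000, n, replace = TRUE),
  is_web_placeholder = rbinom(n, 1, 0.4)   # hypothetical binary predictor
)
purchases$Spending <- 10 + 90 * purchases$Freq -
  0.01 * purchases$last_update_days_ago +
  15 * purchases$is_web_placeholder + rnorm(n, sd = 50)

# 60:40 partition: sample 60% of the row indices for training.
train_idx <- sample(seq_len(n), size = round(0.6 * n))
train <- purchases[train_idx, ]
valid <- purchases[-train_idx, ]

# Multiple linear regression on every predictor in the data frame.
fit <- lm(Spending ~ ., data = train)
print(summary(fit))   # coefficients, p-values, R-squared
b <- coef(fit)        # estimated equation: Spending = b[1] + sum(b[-1] * x)

# Apply the estimated equation by hand to the first validation record.
first <- valid[1, ]
x <- unlist(first[names(b)[-1]])
manual_pred <- unname(b[1] + sum(b[-1] * x))
pred_error <- first$Spending - manual_pred   # actual minus predicted

# Validation-set accuracy: mean error (ME) and root mean squared error (RMSE).
valid_pred <- predict(fit, newdata = valid)
valid_err <- valid$Spending - valid_pred
me <- mean(valid_err)
rmse <- sqrt(mean(valid_err^2))
cat("ME:", round(me, 2), " RMSE:", round(rmse, 2), "\n")

# Histogram of the validation residuals.
hist(valid_err, breaks = 30,
     main = "Validation residuals", xlab = "Actual - predicted")
```

The hand calculation should agree with `predict(fit, newdata = valid)[1]`, which is a quick check that the equation read off the summary has been applied correctly.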

Scenario 2: Predicting Airfare on New Routes

The following problem takes place in the United States in the late 1990s, when many major US cities were facing issues with airport congestion, partly as a result of the 1978 deregulation of airlines. Both fares and routes were freed from regulation, and low-fare carriers such as Southwest (SW) began competing on existing routes and starting nonstop service on routes that previously lacked it. Building completely new airports is generally not feasible, but sometimes decommissioned military bases or smaller municipal airports can be reconfigured as regional or larger commercial airports. There are numerous players and interests involved in the issue (airlines, city, state and federal authorities, civic groups, the military, airport operators), and an aviation consulting firm is seeking advisory contracts with these players. The firm needs predictive models to support its consulting service. One detail the firm might want to be able to predict is fares, in the event a new airport is brought into service. The firm starts its analysis using data within Airfares.csv, which contains real data that were collected between Q3-1996 and Q2-1997. The variables in these data are listed in Table 6.11, and are believed to be important in predicting FARE. Some airport-to-airport data are available, but most data are at the city-to-city level. One question that will be of interest in the analysis is the effect that the presence or absence of Southwest has on FARE.

Explore the numerical predictors and response (FARE) by creating a correlation table and examining some scatterplots between FARE and those predictors. Which column seems to be the best single predictor of FARE?
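
A correlation table of this kind might be produced in base R along the following lines. The data frame `airfares` is a synthetic stand-in containing only a few of the numeric columns named in the task (DISTANCE, COUPON, PAX, FARE), and its values are constructed so that one predictor dominates. It says nothing about which column is the best predictor in the real Airfares.csv.

```r
# Synthetic stand-in for Airfares.csv (values are made up for illustration only).
# With the real file one would instead use: airfares <- read.csv("Airfares.csv")
set.seed(7)
n <- 300
airfares <- data.frame(
  DISTANCE = runif(n, 100, 2700),
  COUPON   = runif(n, 1, 1.9),
  PAX      = runif(n, 1500, 70000)
)
airfares$FARE <- 50 + 0.08 * airfares$DISTANCE + rnorm(n, sd = 30)

# Correlation table over the numeric columns only.
numeric_cols <- sapply(airfares, is.numeric)
cor_table <- round(cor(airfares[, numeric_cols]), 2)
print(cor_table)

# Rank the predictors by the absolute value of their correlation with FARE.
fare_cor <- cor_table["FARE", colnames(cor_table) != "FARE"]
ranked <- sort(abs(fare_cor), decreasing = TRUE)
print(ranked)
best <- names(ranked)[1]

# Scatterplot of FARE against the top-ranked predictor.
plot(airfares[[best]], airfares$FARE, xlab = best, ylab = "FARE",
     main = paste("FARE vs.", best))
```

Ranking by absolute correlation only captures linear association, so the scatterplots remain worth inspecting for curved or clustered patterns.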

Find a model for predicting the average fare on a new route:

i. Convert the following categorical variables into dummy variables: VACATION, SW, SLOT and GATE. Perform the conversion as follows: VACATION from Yes/No to 1/0, SW from Yes/No to 1/0, SLOT from Controlled/Free to 1/0 and GATE from Constrained/Free to 1/0. Then partition the data into training and validation sets using a 60/40 split.

ii. Fit a regression model to the training data. You can ignore the first four predictors (S_CODE, S_CITY, E_CODE, E_CITY).

iii. Repeat (ii) using exhaustive search to reduce the number of predictors in the model. Compare the resulting best model to the one you obtained in (ii) in terms of the predictors used within these models.

iv. Compare the predictive accuracy of both models (ii) and (iii) using the measures RMSE and average error (ME).

v. Using model (iii), predict the average fare on a route with the following characteristics: COUPON = 1.202, NEW = 3, VACATION = No (0), SW = No (0), HI = 4442.141, S_INCOME = $28,760, E_INCOME = $27,664, S_POP = 4,557,004, E_POP = 3,195,503, SLOT = Free (0), GATE = Free (0), PAX = 12,782, DISTANCE = 1976 miles.

vi. Predict the reduction in average fare on the route in (v) if Southwest decides to cover this route, i.e. SW = Yes (1), using model (iii).

vii. In reality, which of the columns in the data set will not be available for predicting the average fare from a new airport (i.e., before flights start operating on those routes)?

viii. Select a model that includes only columns that are available before flights begin to operate on the new route. Use an exhaustive search to find such a model.

ix. Use the model in (viii) to predict the average fare on a route with characteristics COUPON = 1.202, NEW = 3, VACATION = No (0), SW = No (0), HI = 4442.141, S_INCOME = $28,760, E_INCOME = $27,664, S_POP = 4,557,004, E_POP = 3,195,503, SLOT = Free (0), GATE = Free (0), PAX = 12,782, DISTANCE = 1976 miles. Use only those columns selected by that model.

x. Compare the predictive accuracy of this model with model (iii). Is this model good enough, or is it worthwhile re-evaluating the model once flights begin on the new route?
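
The dummy-variable conversion, exhaustive search and accuracy comparison described above could be sketched in base R as follows. The sketch rests on several assumptions: the data frame `routes` is synthetic with made-up values and only a subset of the real columns, the search keeps the subset with the highest adjusted R-squared (other criteria are equally defensible), and the route used for the Southwest comparison is simply the first validation row rather than the route specified in the task. In practice a package function such as `regsubsets()` from the leaps package is often used for the search. The explicit loop here only makes the idea visible.

```r
# Synthetic stand-in for Airfares.csv; values are made up for illustration.
set.seed(123)
n <- 500
routes <- data.frame(
  VACATION = sample(c("Yes", "No"), n, replace = TRUE),
  SW       = sample(c("Yes", "No"), n, replace = TRUE),
  SLOT     = sample(c("Controlled", "Free"), n, replace = TRUE),
  GATE     = sample(c("Constrained", "Free"), n, replace = TRUE),
  DISTANCE = runif(n, 100, 2700),
  PAX      = runif(n, 1500, 70000)
)

# Recode the four categorical columns to 1/0 dummies.
routes$VACATION <- ifelse(routes$VACATION == "Yes", 1, 0)
routes$SW       <- ifelse(routes$SW == "Yes", 1, 0)
routes$SLOT     <- ifelse(routes$SLOT == "Controlled", 1, 0)
routes$GATE     <- ifelse(routes$GATE == "Constrained", 1, 0)
routes$FARE <- 80 + 0.07 * routes$DISTANCE - 40 * routes$SW +
  20 * routes$GATE + rnorm(n, sd = 25)

# 60/40 partition into training and validation sets.
train_idx <- sample(seq_len(n), size = round(0.6 * n))
train <- routes[train_idx, ]
valid <- routes[-train_idx, ]

# Full model using every predictor.
full_fit <- lm(FARE ~ ., data = train)

# Exhaustive search: fit every non-empty subset of predictors and keep the
# one with the highest adjusted R-squared (assumed selection criterion).
predictors <- setdiff(names(train), "FARE")
best_fit <- NULL
best_adj <- -Inf
for (k in seq_along(predictors)) {
  for (vars in combn(predictors, k, simplify = FALSE)) {
    fit <- lm(reformulate(vars, response = "FARE"), data = train)
    adj <- summary(fit)$adj.r.squared
    if (adj > best_adj) {
      best_adj <- adj
      best_fit <- fit
    }
  }
}
best_vars <- attr(terms(best_fit), "term.labels")
print(best_vars)

# ME and RMSE on the validation set, with error = actual minus predicted.
accuracy <- function(fit, data) {
  err <- data$FARE - predict(fit, newdata = data)
  c(ME = mean(err), RMSE = sqrt(mean(err^2)))
}
print(rbind(full = accuracy(full_fit, valid), best = accuracy(best_fit, valid)))

# Predicted fare for one route without and with Southwest present.
new_route <- valid[1, predictors]
new_route$SW <- 0
without_sw <- predict(best_fit, newdata = new_route)
new_route$SW <- 1
with_sw <- predict(best_fit, newdata = new_route)
reduction <- unname(without_sw - with_sw)
cat("Predicted fare reduction with SW:", round(reduction, 2), "\n")
```

Because the model is linear, the predicted reduction equals the negative of the SW coefficient whenever SW is among the selected predictors, regardless of the other route characteristics.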

  • Uploaded By : Brett

  • Posted on : December 23rd, 2019

Reference no: EM132069492
