FIT2086 Studio 9: Supervised Machine Learning Methods

1 Introduction

This Studio introduces you to several supervised machine learning techniques; in particular, you will look at using decision trees for regression and classification, as well as k-nearest neighbours methods. To complete this Studio, you will need to install three R packages: rpart, randomForest and kknn.

During your Studio session, your demonstrator will go through the answers with you, both on the board and on the projector as appropriate. Any questions you do not complete during the session should be completed out of class before the next Studio. Complete solutions will be released on the Friday after your Studio.

2 Decision Trees

In the first part of this Studio we will look at how to learn a basic decision tree from data, how to visualise and interpret the tree, and how to make predictions. We will look at both continuous targets (regression trees) and categorical targets (classification trees). Begin by ensuring that the rpart package is loaded.

1. Load the diabetes.train.csv and diabetes.test.csv data into R. Use summary() to inspect your training data; you will see that it has 10 predictors: AGE, SEX, BMI, BP (blood pressure) and six blood serum measurements, S1 through S6. It also has a target variable Y, which is a measure of diabetes progression over a fixed period of time; the higher this value, the worse the diabetes progression.

2. Let us fit a decision tree to our training data. To do this, use

    tree.diabetes = rpart(Y ~ ., diabetes.train)

which fits a decision tree to the data, using some basic heuristics to decide when to stop growing the tree.

3. We can explore the relationships between the diabetes progression outcome variable (Y) and the predictor variables used by the decision tree package. We will first do this by examining the decision tree in the console:

    tree.diabetes

This displays the tree in text form. The asterisks "*" denote the terminal (leaf) nodes of the tree, and the nodes without asterisks are split nodes; the output shows which variables are split on, what the split thresholds are, and, for each leaf node, the predicted value of Y. How many leaf nodes are there in this tree? Which variables has the tree used to predict diabetes progression?

4. The output of the above command contains all the information about the tree we have learned, but it can be hard to read. It is easier to visualise the decision tree by using the plot() function to get a graphical representation of the relationships:

    plot(tree.diabetes)
    text(tree.diabetes, digits=3)

This displays the tree, along with the decision rule at each split node and the predicted value of Y at each leaf. You may need to click the "Zoom" button above the plot in RStudio to see the picture clearly. Using this information, answer the following questions (the code sketch after this list can help you check your answers):

- What is the estimated average diabetes progression for individuals with BMI = 28.0, blood pressure (BP) = 96 and S6 = 110?
- What is the estimated average diabetes progression for individuals with BMI = 20.1, S5 = 4.7 and S3 = 38?
- Find the characteristics of the individuals with the worst (highest) predicted average diabetes progression.
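Rather than tracing the plotted tree by hand, you can check your answers with predict(). The following is a minimal sketch, not part of the Studio code, and it assumes all columns in diabetes.train are numeric: it builds a one-row data frame of training-set medians and overwrites the values given in the question. Variables the tree never consults on the relevant decision path do not affect the prediction, but note that if the tree does split on a variable the question leaves unspecified, the median placeholder will decide that branch.

    # Build a one-row data frame from the training medians, then set the
    # values specified in the first question; predict() drops the new
    # observation down the fitted tree and returns that leaf's value of Y.
    new.obs = as.data.frame(lapply(diabetes.train, median))
    new.obs$BMI = 28.0
    new.obs$BP = 96
    new.obs$S6 = 110
    predict(tree.diabetes, new.obs)

Repeating this with BMI = 20.1, S5 = 4.7 and S3 = 38 addresses the second question.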
5. The rpart package provides a measure of importance for each variable. To access this, type

    tree.diabetes$variable.importance

This reports the variables in order of importance, where importance is defined by the amount by which a variable increases the goodness-of-fit of the tree to the data. Larger scores are better, but the numbers are defined in terms of an arbitrary unit, so it can be more useful to compute

    tree.diabetes$variable.importance / max(tree.diabetes$variable.importance)

which normalises the importance scores so that they are relative to the importance of the most important predictor. Which three predictors are the most important?

6. We can now test how well this tree predicts on future data. We can use the predict() function to get predictions for new data, and then calculate the root mean squared error (RMSE):

    sqrt(mean((predict(tree.diabetes, diabetes.test) - diabetes.test$Y)^2))

How can we interpret this score?
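One way to interpret the RMSE is that it is in the same units as Y and measures the typical size of a prediction error on new data. As a point of reference, the sketch below (an illustration, not part of the Studio code) compares the tree's test RMSE against a trivial baseline that always predicts the training-set mean of Y; a useful model should score noticeably lower than this baseline.

    # Test RMSE of the fitted tree
    rmse.tree = sqrt(mean((predict(tree.diabetes, diabetes.test) - diabetes.test$Y)^2))
    # Test RMSE of a baseline that always predicts the training mean of Y
    rmse.base = sqrt(mean((mean(diabetes.train$Y) - diabetes.test$Y)^2))
    c(tree = rmse.tree, baseline = rmse.base)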
7. The rpart package provides the ability to use cross-validation to "prune" the tree, removing extra predictors and simplifying the tree without damaging the predictions too much (and potentially improving them). The code is a little involved, so I have included a wrapper function in the file wrappers.R; source this file to load the wrapper functions into memory. Then, to perform cross-validation, use

    cv = learn.tree.cv(Y ~ ., data=diabetes.train, nfolds=10, m=1000)

The nfolds parameter tells the code how many folds to divide the data into, and a value of 10 is usually fine. The m parameter tells the code how many times to repeat the cross-validation process (randomly dividing the data up, training on some of the data and testing on the remainder). The higher this number, the less the trees found by CV will vary from run to run, as repetition reduces the random variability in the cross-validation scores, but the longer the training will take, as we are doing more cross-validation tests.

The cv object returned by learn.tree.cv() contains three entries. The cv$cv.stats entry contains the statistics of the cross-validation, which we can visualise using

    plot.tree.cv(cv)

This shows the cross-validation score (y-axis) against the tree size, in terms of number of leaf nodes (x-axis); the optimum number of leaf nodes is plotted in red. The cv$best.cp entry is the best value of the complexity parameter for our dataset, as estimated by CV, and can be passed to prune.rpart() to prune our tree down. The cv$best.tree entry contains the pruned tree, obtained using cv$best.cp as the pruning complexity parameter. Plot this tree and compare it to the previous tree, tree.diabetes.

- How do they compare? Has cross-validation removed any predictor variables from the original tree tree.diabetes?
- What are the characteristics that predict the worst diabetes progression in this new tree?
- What is the RMSE for this new tree cv$best.tree on the test data?

8. We can now compare the performance of our decision tree to a standard linear model. Use the glmnet package to fit a linear model using the lasso, and calculate the RMSE for the fitted model:

    lasso.fit = cv.glmnet.f(Y ~ ., data=diabetes.train)
    glmnet.tidy.coef(lasso.fit)
    sqrt(mean((predict.glmnet.f(lasso.fit, diabetes.test) - diabetes.test$Y)^2))

How does the linear model compare with the tree selected by CV, in terms of which predictors it has chosen to use? (A sketch for collecting the test RMSEs side by side follows below.)
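To make the comparisons in the last two questions concrete, you can collect the test RMSEs of the three models in one place. This sketch assumes the objects from the previous steps (tree.diabetes, cv and lasso.fit) and the functions from wrappers.R are already in memory.

    # Helper: root mean squared error of predictions y.hat against targets y
    rmse = function(y.hat, y) { sqrt(mean((y.hat - y)^2)) }
    # Test RMSEs for the unpruned tree, the CV-pruned tree and the lasso
    c(full.tree = rmse(predict(tree.diabetes, diabetes.test), diabetes.test$Y),
      pruned.tree = rmse(predict(cv$best.tree, diabetes.test), diabetes.test$Y),
      lasso = rmse(predict.glmnet.f(lasso.fit, diabetes.test), diabetes.test$Y))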
3 Random Forests

A random forest is a collection of classification or regression trees that are grown by controlled, random splitting. Once grown, all the trees in the forest are used to make predictions and to determine which predictors are associated with the outcome variable. However, the relationship between the predictors and the target is much more opaque than for a single decision tree or a linear model. To use random forests in R you must install and load the randomForest package.

First, use R to learn a random forest from the diabetes.train data set:

    rf.diabetes = randomForest(Y ~ ., data=diabetes.train)

This trains a forest of decision trees on our data. Unlike a single decision tree, a random forest is difficult to visualise and interpret, as it consists of many hundreds or thousands of trees. After learning the random forest from the data, we can inspect the model by typing

    rf.diabetes

This returns some basic information about the model, such as the percentage of variance explained by the forest (roughly equivalent to 100 × R^2).

1. To see how well our random forest predicts on our testing data, we can use the predict() function and calculate the RMSE:

    sqrt(mean((predict(rf.diabetes, diabetes.test) - diabetes.test$Y)^2))

We can see that the random forest performs quite a bit better than our single best decision tree, and is basically the same as the linear model in this case.

2. So far we have run the random forest package using the default settings for all parameters. Although the randomForest package has many interesting user-settable options (see the help for more details), the following options are among the most useful in common use:

- ntree: the number of trees to grow. This should not be set too small, to ensure that every input row gets predicted at least a few times (default: 500).
- importance: should the importance of the predictors be computed? (default: FALSE)

Let's explore how we can use these options when analysing our diabetes data set:

    rf.diabetes = randomForest(Y ~ ., data=diabetes.train, importance=TRUE, ntree=5000)

The number of trees in this example is set to 5,000. In general, using more trees leads to improvements in prediction error; however, the computational complexity of the algorithm grows with the number of trees, which means large forests can take a long time to learn and to use for prediction. Calculate the RMSE on the test data for this new random forest.

3. The importance option tells the random forest package that we wish to rank our predictor variables in terms of their strength of association with the outcome variable.
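With importance=TRUE set, the standard randomForest functions importance() and varImpPlot() can be used to inspect these rankings; a short sketch:

    # Importance scores; for a regression forest, %IncMSE is the increase in
    # mean squared error when a predictor's values are randomly permuted
    importance(rf.diabetes)
    # Dotchart of the importance measures, one point per predictor
    varImpPlot(rf.diabetes)

Predictors with larger %IncMSE values are more strongly associated with Y. Compare this ranking with the variables chosen by the pruned decision tree and the lasso.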
