FIT2086 Studio 9: Supervised Machine Learning Methods

1 Introduction

This Studio introduces you to several supervised machine learning techniques; in particular, you will look at using decision trees for regression and classification, as well as k-nearest neighbours methods. To complete this Studio, you will need to install three R packages: rpart, randomForest and kknn.

During your Studio session, your demonstrator will go through the answers with you, both on the board and on the projector as appropriate. Any questions you do not complete during the session should be completed out of class before the next Studio. Complete solutions will be released on the Friday after your Studio.

2 Decision Trees

In the first part of this Studio we will look at how to learn a basic decision tree from data, how to visualise and interpret the tree, and how to make predictions. We will look at both continuous targets (regression trees) and categorical targets (classification trees). Begin by ensuring that the rpart package is loaded.

1. Load the diabetes.train.csv and diabetes.test.csv data into R. Use summary() to inspect your training data; you will see that it has 10 predictors: AGE, SEX, BMI, BP (blood pressure) and six blood serum measurements, S1 through S6. It also has a target variable Y, which is a measure of diabetes progression over a fixed period of time; the higher this value, the worse the diabetes progression.

2. Let us fit a decision tree to our training data. To do this, use

    tree.diabetes = rpart(Y ~ ., diabetes.train)

which fits a decision tree to the data, using some basic heuristics to decide when to stop growing the tree.

3. We can explore the relationships between the diabetes progression outcome variable (Y) and the predictor variables used by the decision tree package. We will first do this by examining the decision tree in the console:

    tree.diabetes

This displays the tree in text form. The asterisks "*" denote the terminal (leaf) nodes of the tree, and the nodes without asterisks are split nodes; the output shows which variables are split on, what the split thresholds are, and, for each leaf node, the predicted value of Y. How many leaf nodes are there in this tree? Which variables has the tree used to predict diabetes progression?

4. The output of the above command contains all the information about the tree we have learned, but it can be hard to read. It is easier to visualise the decision tree by using the plot() function to get a graphical representation of the relationships:

    plot(tree.diabetes)
    text(tree.diabetes, digits=3)

This displays the tree, along with the decision rule at each split node and the predicted value of Y at each leaf. You may need to click the "Zoom" button above the plot in RStudio to see the picture clearly. Using this information, answer the following questions (the code sketch after this list can help you check your answers):

- What is the estimated average diabetes progression for individuals with BMI = 28.0, blood pressure (BP) = 96 and S6 = 110?
- What is the estimated average diabetes progression for individuals with BMI = 20.1, S5 = 4.7 and S3 = 38?
- Find the characteristics of the individuals with the worst (highest) predicted average diabetes progression.
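Rather than tracing the plotted tree by hand, you can check your answers with predict(). The following is a minimal sketch, not part of the Studio code, and it assumes all columns in diabetes.train are numeric: it builds a one-row data frame of training-set medians and overwrites the values given in the question. Variables the tree never consults on the relevant decision path do not affect the prediction, but note that if the tree does split on a variable the question leaves unspecified, the median placeholder will decide that branch.

    # Build a one-row data frame from the training medians, then set the
    # values specified in the first question; predict() drops the new
    # observation down the fitted tree and returns that leaf's value of Y.
    new.obs = as.data.frame(lapply(diabetes.train, median))
    new.obs$BMI = 28.0
    new.obs$BP = 96
    new.obs$S6 = 110
    predict(tree.diabetes, new.obs)

Repeating this with BMI = 20.1, S5 = 4.7 and S3 = 38 addresses the second question.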
5. The rpart package provides a measure of importance for each variable. To access this, type

    tree.diabetes$variable.importance

This reports the variables in order of importance, where importance is defined by the amount by which a variable increases the goodness-of-fit of the tree to the data. Larger scores are better, but the numbers are defined in terms of an arbitrary unit, so it can be more useful to compute

    tree.diabetes$variable.importance / max(tree.diabetes$variable.importance)

which normalises the importance scores so that they are relative to the importance of the most important predictor. Which three predictors are the most important?

6. We can now test how well this tree predicts on future data. We can use the predict() function to get predictions for new data, and then calculate the root mean squared error (RMSE):

    sqrt(mean((predict(tree.diabetes, diabetes.test) - diabetes.test$Y)^2))

How can we interpret this score?
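One way to interpret the RMSE is that it is in the same units as Y and measures the typical size of a prediction error on new data. As a point of reference, the sketch below (an illustration, not part of the Studio code) compares the tree's test RMSE against a trivial baseline that always predicts the training-set mean of Y; a useful model should score noticeably lower than this baseline.

    # Test RMSE of the fitted tree
    rmse.tree = sqrt(mean((predict(tree.diabetes, diabetes.test) - diabetes.test$Y)^2))
    # Test RMSE of a baseline that always predicts the training mean of Y
    rmse.base = sqrt(mean((mean(diabetes.train$Y) - diabetes.test$Y)^2))
    c(tree = rmse.tree, baseline = rmse.base)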
7. The rpart package provides the ability to use cross-validation to "prune" the tree, removing extra predictors and simplifying the tree without damaging the predictions too much (and potentially improving them). The code is a little involved, so I have included a wrapper function in the file wrappers.R; source this file to load the wrapper functions into memory. Then, to perform cross-validation, use

    cv = learn.tree.cv(Y ~ ., data=diabetes.train, nfolds=10, m=1000)

The nfolds parameter tells the code how many folds to divide the data into, and a value of 10 is usually fine. The m parameter tells the code how many times to repeat the cross-validation process (randomly dividing the data up, training on some of the data and testing on the remainder). The higher this number, the less the trees found by CV will vary from run to run, as repetition reduces the random variability in the cross-validation scores, but the longer the training will take, as we are doing more cross-validation tests.

The cv object returned by learn.tree.cv() contains three entries. The cv$cv.stats entry contains the statistics of the cross-validation, which we can visualise using

    plot.tree.cv(cv)

This shows the cross-validation score (y-axis) against the tree size, in terms of number of leaf nodes (x-axis); the optimum number of leaf nodes is plotted in red. The cv$best.cp entry is the best value of the complexity parameter for our dataset, as estimated by CV, and can be passed to prune.rpart() to prune our tree down. The cv$best.tree entry contains the pruned tree, obtained using cv$best.cp as the pruning complexity parameter. Plot this tree and compare it to the previous tree, tree.diabetes.

- How do they compare? Has cross-validation removed any predictor variables from the original tree tree.diabetes?
- What are the characteristics that predict the worst diabetes progression in this new tree?
- What is the RMSE for this new tree cv$best.tree on the test data?

8. We can now compare the performance of our decision tree to a standard linear model. Use the glmnet package to fit a linear model using the lasso, and calculate the RMSE for the fitted model:

    lasso.fit = cv.glmnet.f(Y ~ ., data=diabetes.train)
    glmnet.tidy.coef(lasso.fit)
    sqrt(mean((predict.glmnet.f(lasso.fit, diabetes.test) - diabetes.test$Y)^2))

How does the linear model compare with the tree selected by CV, in terms of which predictors it has chosen to use? (A sketch for collecting the test RMSEs side by side follows below.)
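To make the comparisons in the last two questions concrete, you can collect the test RMSEs of the three models in one place. This sketch assumes the objects from the previous steps (tree.diabetes, cv and lasso.fit) and the functions from wrappers.R are already in memory.

    # Helper: root mean squared error of predictions y.hat against targets y
    rmse = function(y.hat, y) { sqrt(mean((y.hat - y)^2)) }
    # Test RMSEs for the unpruned tree, the CV-pruned tree and the lasso
    c(full.tree = rmse(predict(tree.diabetes, diabetes.test), diabetes.test$Y),
      pruned.tree = rmse(predict(cv$best.tree, diabetes.test), diabetes.test$Y),
      lasso = rmse(predict.glmnet.f(lasso.fit, diabetes.test), diabetes.test$Y))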
3 Random Forests

A random forest is a collection of classification or regression trees that are grown by controlled, random splitting. Once grown, all the trees in the forest are used to make predictions and to determine which predictors are associated with the outcome variable. However, the relationship between the predictors and the target is much more opaque than for a single decision tree or a linear model. To use random forests in R you must install and load the randomForest package.

First, use R to learn a random forest from the diabetes.train data set:

    rf.diabetes = randomForest(Y ~ ., data=diabetes.train)

This trains a forest of decision trees on our data. Unlike a single decision tree, a random forest is difficult to visualise and interpret, as it consists of many hundreds or thousands of trees. After learning the random forest from the data, we can inspect the model by typing

    rf.diabetes

This returns some basic information about the model, such as the percentage of variance explained by the forest (roughly equivalent to 100 × R^2).

1. To see how well our random forest predicts on our testing data, we can use the predict() function and calculate the RMSE:

    sqrt(mean((predict(rf.diabetes, diabetes.test) - diabetes.test$Y)^2))

We can see that the random forest performs quite a bit better than our single best decision tree, and is basically the same as the linear model in this case.

2. So far we have run the random forest package using the default settings for all parameters. Although the randomForest package has many interesting user-settable options (see the help for more details), the following options are among the most useful in common use:

- ntree: the number of trees to grow. This should not be set too small, to ensure that every input row gets predicted at least a few times (default: 500).
- importance: should the importance of the predictors be computed? (default: FALSE)

Let's explore how we can use these options when analysing our diabetes data set:

    rf.diabetes = randomForest(Y ~ ., data=diabetes.train, importance=TRUE, ntree=5000)

The number of trees in this example is set to 5,000. In general, using more trees leads to improvements in prediction error; however, the computational complexity of the algorithm grows with the number of trees, which means large forests can take a long time to learn and to use for prediction. Calculate the RMSE on the test data for this new random forest.

3. The importance option tells the random forest package that we wish to rank our predictor variables in terms of their strength of association with the outcome variable.
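With importance=TRUE set, the standard randomForest functions importance() and varImpPlot() can be used to inspect these rankings; a short sketch:

    # Importance scores; for a regression forest, %IncMSE is the increase in
    # mean squared error when a predictor's values are randomly permuted
    importance(rf.diabetes)
    # Dotchart of the importance measures, one point per predictor
    varImpPlot(rf.diabetes)

Predictors with larger %IncMSE values are more strongly associated with Y. Compare this ranking with the variables chosen by the pruned decision tree and the lasso.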
