# Assignment 4

Data Science for Business Due 04/25/2021

Part 1: Regression

Correlation and regression analysis are related in the sense that both deal with relationships among variables. The correlation coefficient is a measure of the linear association between two variables. Its value is always between -1 and +1, inclusive.
In the following Linear Regression applet, 10 points are plotted in the coordinate plane. The line in the graph is the best-fit line for these 10 points. The correlation coefficient is denoted by r.
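
As a point of reference for what the applet displays, the sketch below computes r and a least-squares best-fit line by hand. It is a minimal illustration and not the applet's own code: the function names `pearson_r` and `best_fit_line` and the three small datasets are made up for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def best_fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical datasets chosen to hit the three cases in the questions.
xs = [1, 2, 3, 4, 5]
ys_up = [2 * x + 1 for x in xs]      # points on a rising line
ys_down = [-3 * x + 10 for x in xs]  # points on a falling line
ys_none = [2, 1, 0, 1, 2]            # symmetric "V" shape, no linear trend

for label, ys in [("rising", ys_up), ("falling", ys_down), ("no trend", ys_none)]:
    print(label, round(pearson_r(xs, ys), 6), best_fit_line(xs, ys))
```

Note that the third dataset has a clear pattern (a V shape) even though r is essentially zero: r measures only linear association.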

https://www.geogebra.org/m/rJj6yr6C#material/nFJp7McJ

Interact with this applet by repositioning the points (drag them with the mouse) before you start answering the following questions:
1. Reposition the points so that the correlation coefficient (r) equals 1. What does it mean to have r = 1?
2. Reposition the points so that the correlation coefficient (r) equals -1. What does it mean to have r = -1?
3. Reposition the points so that the correlation coefficient (r) is 0 or very close to 0. What does it mean to have r = 0?
Include a screenshot for each part and compare the three scenarios in terms of the correlation between the two variables. Discuss your results.

Part 2: K-Means Clustering

The following link contains a visualization of the K-Means Clustering algorithm.

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Read the article and try out the visualization before you start answering the following questions.
In the following questions, use the same dataset to compare the three strategies for initializing the centroids: (1) you choose the centroids yourself, (2) the centroids are chosen randomly, or (3) the centroids are chosen by the farthest-point method.
1. Use the first strategy and initialize the centroids by choosing them yourself. Include screenshots of the steps. How many iterations did the algorithm take to find the best clusters?
2. Use the second strategy and let the centroids be chosen randomly. How many iterations did the algorithm take to find the best clusters?
3. Use the third strategy and initialize the centroids with the farthest-point method. How many iterations did the algorithm take to find the best clusters?

Discuss your conclusions about the three strategies. Add any interesting facts or observations you came across while trying the visualization.
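
To make the comparison concrete, here is a rough sketch of K-Means with the three initialization strategies and an iteration counter. It is not the code behind the visualization. The helper names, the toy dataset, and the way iterations are counted (the number of assign-and-update rounds performed before the assignments stop changing) are all assumptions, so the counts may not match the step counts shown in the visualization. The farthest-point variant here assumes the first centroid is a random data point and each later one is the point farthest from the centroids chosen so far.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def init_random(points, k, rng):
    """Strategy 2 (assumed form): pick k distinct data points at random."""
    return rng.sample(points, k)


def init_farthest(points, k, rng):
    """Strategy 3 (assumed form): random first point, then repeatedly the
    point whose nearest chosen centroid is farthest away."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, centroids, max_iter=100):
    """Run K-Means from the given centroids.

    Returns (centroids, assignment, iterations), where iterations counts
    the assign-and-update rounds done before assignments stopped changing.
    """
    centroids = list(centroids)
    assignment = None
    for iteration in range(1, max_iter + 1):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            return centroids, assignment, iteration - 1
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignment, max_iter


# Hypothetical dataset: two well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
rng = random.Random(0)

strategies = {
    "chosen by hand": [(0, 0), (10, 10)],  # Strategy 1: user-supplied
    "random": init_random(points, 2, rng),
    "farthest point": init_farthest(points, 2, rng),
}
for name, start in strategies.items():
    final, _, iterations = kmeans(points, start)
    print(name, "start:", start, "iterations:", iterations)
```

Running the same dataset through all three starting configurations, as the questions ask, is what makes the iteration counts comparable.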

