
IN R Follow the steps given in Machine Learning With R, Chapter 3 section "Diag


Question

IN R

Follow the steps given in Machine Learning With R, Chapter 3 section "Diagnosing Breast Cancer with the kNN Algorithm."

https://mbsdirect.vitalsource.com/books/9781784394523/epubcfi/6/52[;vnd.vst.idref=id286762181]!/4[page]/2/18/6/24/2@0:75.7

Follow the steps given in Machine Learning With R, Chapter 9 section "Finding Teen Market Segments Using k-Means Clustering."

https://mbsdirect.vitalsource.com/books/9781784394523/epubcfi/6/100[;vnd.vst.idref=id286799594]!/4[page]/2/18/10@0:0

Prepare a brief that reports the execution steps and outcomes of the k-NN and k-means lab. In addition, the following questions will be addressed in the brief:

Use the caret package to automatically tune the k parameter for the k-NN algorithm. Were you able to identify a k parameter that increased the accuracy from previous attempts? Show your work and the final result.
Train the k-means model again using k=3 and then k=10. How did this affect the cluster distribution for mean age and proportion of females?

Please submit screenshots in work document

Explanation / Answer

Answer:

Execution Steps for Diagnosing Breast Cancer with the kNN Algorithm:

wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

3. The str(wbcd) command is then used to confirm that the data file contains 569 sample records and 32 features (columns).

4. In the next step, wbcd <- wbcd[-1] drops the id column, which is the first column, from the data frame.

5. The diagnosis variable is the outcome we are trying to predict. The command table(wbcd$diagnosis) shows how many samples are benign (B) and how many are malignant (M). The table() output indicates that 357 masses are benign while 212 are malignant:

  B   M
357 212

6. Many R machine learning classifiers require the target feature to be coded as a factor, so diagnosis is recoded as:

wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))

where B and M are given the labels Benign and Malignant.
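As a minimal sketch of this recoding step, the snippet below applies the same factor() call to a small made-up vector. The vector codes is hypothetical and is not data from wisc_bc_data.csv.

```r
# Hypothetical stand-in for wbcd$diagnosis; not real data from the file
codes <- c("B", "M", "B", "B", "M")

# Recode the raw letters as a factor with descriptive labels
diagnosis <- factor(codes, levels = c("B", "M"),
                    labels = c("Benign", "Malignant"))

levels(diagnosis)   # class labels, in the order given by `levels`
table(diagnosis)    # number of samples per class
```

Because levels and labels are matched by position, "B" becomes Benign and "M" becomes Malignant.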

7. Now round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1) gives the following output:

   Benign Malignant
     62.7      37.3

These values are the percentages of benign and malignant masses.
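The percentages can be checked directly from the counts reported in step 5. This is only a sketch: the named vector counts is built by hand from those two numbers rather than read from the data file.

```r
# Class counts as reported by table(wbcd$diagnosis) in step 5
counts <- c(Benign = 357, Malignant = 212)

# prop.table() divides each count by the total; multiply by 100 for percent
pct <- round(prop.table(counts) * 100, digits = 1)
pct
```

The two values sum to 100, which is a quick sanity check that no class was dropped during recoding.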

8. Now the command:

summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])

summarizes the measurements of three of the features:

  radius_mean       area_mean       smoothness_mean
 Min.   : 6.981   Min.   : 143.5   Min.   :0.05263
 1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637
 Median :13.370   Median : 551.1   Median :0.09587
 Mean   :14.127   Mean   : 654.9   Mean   :0.09636
 3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530
 Max.   :28.110   Max.   :2501.0   Max.   :0.16340

9. The k-NN algorithm classifies a sample by the distance between its feature values and those of other samples, so features with large ranges would dominate the calculation. The next step therefore normalizes each feature to a standard range:

normalize <- function(x) { return((x - min(x)) / (max(x) - min(x))) }

After normalization, the values of every feature lie on the same scale, from 0 to 1.
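To see why the rescaling matters, the sketch below compares the per-feature contributions to a squared Euclidean distance before and after min-max scaling. The two samples a and b are hypothetical. The minimum and maximum values are taken from the summary() output in step 8.

```r
# Feature ranges taken from the summary() output in step 8
mins <- c(radius = 6.981, area = 143.5)
maxs <- c(radius = 28.110, area = 2501.0)

# Two hypothetical samples: radius spans almost its full range,
# while area changes only slightly relative to its range
a <- c(radius = 7,  area = 1000)
b <- c(radius = 28, area = 1100)

# Contribution of each feature to the squared Euclidean distance
raw_sq    <- (a - b)^2                    # on the original scale
scaled_sq <- ((a - b) / (maxs - mins))^2  # after min-max scaling

raw_sq
scaled_sq
```

On the raw scale the area difference contributes far more than the radius difference. After scaling, the radius difference dominates, which matches how large each change is relative to that feature's range.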

normalize() works on a single feature at a time, so lapply() is used to apply it to every column:

wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

This normalizes columns 2 to 31 of the wbcd data frame and stores the result in wbcd_n.
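The normalize-then-lapply pattern can be tried on a small example first. In this sketch the data frame toy and its values are made up for illustration. Only normalize() follows the definition given in step 9.

```r
# Min-max normalization: rescales a numeric vector to the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical toy data with two features on very different scales
toy <- data.frame(radius = c(7, 14, 28), area = c(150, 650, 2500))

# lapply() applies normalize() to each column and returns a list;
# as.data.frame() turns that list back into a data frame
toy_n <- as.data.frame(lapply(toy, normalize))
toy_n
```

Note that this definition divides by max(x) - min(x), so a column whose values are all identical would produce NaN. That case does not arise in the steps above, but it is worth checking for on other data.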

10. To check the normalized values, call summary() on one of the features, for example summary(wbcd_n$area_mean):

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1174  0.1729  0.2169  0.2711  1.0000

11. Now we split the wbcd_n data frame into the wbcd_train and wbcd_test data frames:

wbcd_train <- wbcd_n[1:469, ]

wbcd_test <- wbcd_n[470:569, ]

The rows are extracted with the [row, column] syntax. Leaving the column index blank keeps all columns, so wbcd_train holds rows 1 to 469 and wbcd_test holds rows 470 to 569.

12. The data frames created above do not contain the diagnosis variable, so the class labels are stored separately:

wbcd_train_labels <- wbcd[1:469, 1]

wbcd_test_labels <- wbcd[470:569, 1]

where diagnosis is column 1 of the wbcd data frame.
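The same row-index split can be sketched on a small made-up data set. All names here (toy, features, train, test, and the label vectors) are hypothetical and only mirror the structure of steps 11 and 12.

```r
# Hypothetical 10-row data set: a label column followed by two numeric features
set.seed(42)
toy <- data.frame(label = rep(c("B", "M"), 5),
                  f1 = runif(10), f2 = runif(10))

features <- toy[-1]            # drop the label column, keep only the features

train <- features[1:7, ]       # rows 1 to 7, all columns
test  <- features[8:10, ]      # rows 8 to 10, all columns

train_labels <- toy[1:7, 1]    # labels for the same rows, taken from column 1
test_labels  <- toy[8:10, 1]

nrow(train)                    # should equal length(train_labels)
length(train_labels)
```

Taking consecutive rows like this assumes the records are already in random order. If they were sorted by class, the rows would need to be shuffled before splitting.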

13. To implement k-NN, install the class package if it is not already installed:

install.packages("class")

14. Load the package with library(class).

15. Now use knn() to classify the test data. It returns a vector with one predicted label for each row of the test data frame:

wbcd_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 3)

where k specifies the number of nearest neighbours to include in the classification.
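The whole workflow can be run end to end without the CSV file. This is a sketch only: it uses R's built-in iris data set as a stand-in for the breast cancer data, and the 100/50 split and all variable names are assumptions chosen for illustration.

```r
library(class)  # provides knn()

# Min-max normalization, as in step 9
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# iris stands in for the breast cancer data (assumption for illustration)
set.seed(123)
idx    <- sample(nrow(iris))   # shuffle rows, since iris is sorted by species
feat   <- as.data.frame(lapply(iris[idx, 1:4], normalize))
labels <- iris$Species[idx]

train <- feat[1:100, ];   train_labels <- labels[1:100]
test  <- feat[101:150, ]; test_labels  <- labels[101:150]

# One predicted label per test row, based on the 3 nearest training rows
pred <- knn(train = train, test = test, cl = train_labels, k = 3)

table(actual = test_labels, predicted = pred)  # confusion matrix
mean(pred == test_labels)                      # proportion classified correctly
```

Rerunning the knn() call with different values of k and comparing the accuracy is a simple way to see how the choice of k affects the result.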