IN R Follow the steps given in Machine Learning With R, Chapter 3 section "Diag
Question
IN R
Follow the steps given in Machine Learning With R, Chapter 3 section "Diagnosing Breast Cancer with the kNN Algorithm."
https://mbsdirect.vitalsource.com/books/9781784394523/epubcfi/6/52[;vnd.vst.idref=id286762181]!/4[page]/2/18/6/24/2@0:75.7
Follow the steps given in Machine Learning With R, Chapter 9 section "Finding Teen Market Segments Using k-Means Clustering."
https://mbsdirect.vitalsource.com/books/9781784394523/epubcfi/6/100[;vnd.vst.idref=id286799594]!/4[page]/2/18/10@0:0
Prepare a brief that reports the execution steps and outcomes of the k-NN and k-means lab. In addition, the following questions will be addressed in the brief:
Use the caret package to automatically tune the k parameter for the k-NN algorithm. Were you able to identify a k parameter that increased the accuracy from previous attempts? Show your work and the final result.
Train the k-means model again using k=3 and then k=10. How did this affect the cluster distribution for mean age and proportion of females?
Please submit screenshots in work document
Explanation / Answer
Answer:
Execution Steps for Diagnosing Breast Cancer with the kNN Algorithm:
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
3. The str(wbcd) command is then used to confirm that the data file contains 569 sample records and 32 variables.
4. Next, wbcd <- wbcd[-1] excludes the id column, which is the first column, from the data frame.
5. The diagnosis variable is the outcome we are trying to predict. The command table(wbcd$diagnosis) shows how many samples are benign (B) and how many are malignant (M). The table() output indicates that 357 masses are benign while 212 are malignant:
B M
357 212
6. R machine learning classifiers require the target feature to be coded as a factor, so diagnosis is recoded as:
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
where B and M are given the labels Benign and Malignant.
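As a minimal sketch of this recoding step, assuming a small made-up vector of codes (toy_diagnosis is hypothetical and is not the wbcd data), factor() maps each level to its label:

```r
# Hypothetical toy vector standing in for wbcd$diagnosis
toy_diagnosis <- c("B", "M", "B", "B", "M")

# Recode the character codes as a factor with descriptive labels
toy_factor <- factor(toy_diagnosis,
                     levels = c("B", "M"),
                     labels = c("Benign", "Malignant"))

print(levels(toy_factor))  # the two labels, in the order given
print(table(toy_factor))   # counts per label
```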
7. Next, round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1) gives the output:
Benign Malignant
62.7 37.3
which shows each diagnosis as a percentage of all masses.
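The percentages can be checked directly from the counts reported in step 5. This is only a sketch of the arithmetic; the named vector counts is introduced here for illustration and is not part of the book's code:

```r
# Counts taken from the table() output in step 5
counts <- c(Benign = 357, Malignant = 212)

# prop.table() divides each count by the total; multiply by 100 for percent
pct <- round(prop.table(counts) * 100, digits = 1)
print(pct)
```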
8. Next, the command:
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
summarizes three of the features, with their measurements shown below:
radius_mean area_mean smoothness_mean
Min. : 6.981 Min. : 143.5 Min. :0.05263
1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
Median :13.370 Median : 551.1 Median :0.09587
Mean :14.127 Mean : 654.9 Mean :0.09636
3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
Max. :28.110 Max. :2501.0 Max. :0.16340
9. k-NN classifies a sample by the distance between its feature values and those of its neighbours, so features measured on larger scales (such as area_mean) would dominate the calculation. The next step therefore applies min-max normalization to rescale every feature to a standard range of 0 to 1:
normalize <- function(x) { return((x - min(x)) / (max(x) - min(x))) }
After normalization, all features are on the same scale.
normalize() works on a single feature (vector) at a time, so to apply it to every feature column use lapply():
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
This normalizes columns 2 to 31 of the wbcd data frame and stores the result in wbcd_n.
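The normalization step can be sketched on a small made-up data frame. The data frame toy and its values are hypothetical; only normalize() and the lapply() pattern come from the steps above:

```r
# Min-max normalization: rescales a numeric vector to the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical toy data frame with features on very different scales
toy <- data.frame(radius = c(7, 14, 28),
                  area   = c(150, 650, 2500))

# lapply() applies normalize() to each column; as.data.frame() rebuilds the frame
toy_n <- as.data.frame(lapply(toy, normalize))

print(normalize(c(1, 2, 3, 4, 5)))
print(toy_n)
```

After this, every column has a minimum of 0 and a maximum of 1, whatever its original units.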
10. To check the output of the normalization, summarize one of the normalized features, for example with summary(wbcd_n$area_mean):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.1174 0.1729 0.2169 0.2711 1.0000
11. Next, the wbcd_n data frame is split into the wbcd_train and wbcd_test data frames:
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
The data is extracted using [row, column] syntax; leaving the column value blank selects all columns. Rows 1 to 469 are saved in wbcd_train and rows 470 to 569 are saved in wbcd_test.
12. The data frames created above do not contain the diagnosis variable, so the class labels are stored separately:
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
where diagnosis is column 1 of the wbcd data frame.
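The split in steps 11 and 12 can be sketched on a five-row made-up data frame. All names and values here (toy, features, f1, f2) are hypothetical and chosen only to show the indexing:

```r
# Hypothetical toy data: a label column followed by two numeric features
toy <- data.frame(
  label = factor(c("Benign", "Malignant", "Benign", "Benign", "Malignant")),
  f1 = c(0.1, 0.9, 0.2, 0.3, 0.8),
  f2 = c(0.2, 0.7, 0.1, 0.4, 0.9)
)

features <- toy[-1]           # drop the label column, keep the features

train <- features[1:3, ]      # rows 1 to 3, all columns
test  <- features[4:5, ]      # rows 4 to 5, all columns

train_labels <- toy[1:3, 1]   # labels for the training rows (column 1)
test_labels  <- toy[4:5, 1]   # labels for the test rows

print(dim(train))
print(test_labels)
```

A sequential split like this is only appropriate when the rows are already in random order; otherwise the row indices would need to be sampled first.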
13. To run k-NN, install the class package if it is not already installed:
install.packages("class")
14. Load the package with library(class).
15. Now use knn() to classify the test data; it returns a vector containing the predicted label for each row of the test data frame:
wbcd_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 3)
where k specifies the number of nearest neighbours whose labels are used to classify each test record.
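As a rough illustration of the knn() call, the sketch below uses a tiny made-up training set instead of the wbcd data. It assumes the class package is installed; the data frames train and test and their values are hypothetical:

```r
library(class)  # provides knn(); assumed to be installed

# Hypothetical, already-normalized toy features (not the wbcd data)
train <- data.frame(f1 = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.90),
                    f2 = c(0.10, 0.20, 0.15, 0.90, 0.80, 0.85))
train_labels <- factor(c("Benign", "Benign", "Benign",
                         "Malignant", "Malignant", "Malignant"))

test <- data.frame(f1 = c(0.12, 0.88),
                   f2 = c(0.18, 0.82))

# Each test row gets the majority label among its k = 3 nearest training rows
pred <- knn(train = train, test = test, cl = train_labels, k = 3)
print(pred)
```

With the real data, comparing wbcd_pred against wbcd_test_labels, for instance with mean(wbcd_pred == wbcd_test_labels), gives the proportion of test records classified correctly.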