Question
Please use the following to create a data frame. Loan_Status indicates the approval of each loan application: Y for approved and N for declined.

    > # Setting the random seed
    > set.seed(100)
    > # Loading the hackathon dataset
    > data_loanapp <- read.csv(url('https://datahack-prod.s3.ap-south-1.amazonaws.com/train_file/train_u6lujuX_CVtuZ9i.csv'))
    > # Let's check the data structure of the loaded dataset
    > str(data_loanapp)
    'data.frame': 614 obs. of 13 variables:
     $ Loan_ID          : Factor w/ 614 levels "LP001002","LP001003",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ Gender           : Factor w/ 3 levels "","Female","Male": 3 3 3 3 3 3 3 3 3 ...
     $ Married          : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 3 3 3 ...
     $ Dependents       : Factor w/ 5 levels "", ...: 2 3 2 2 2 4 2 5 4 3 ...
     $ Education        : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 2 1 1 1 ...
     $ Self_Employed    : Factor w/ 3 levels "","No","Yes": 2 2 3 2 2 3 2 2 2 2 ...
     $ ApplicantIncome  : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
     $ CoapplicantIncome: num 0 1508 0 2358 0 ...
     $ LoanAmount       : int NA 128 66 120 141 267 95 158 168 349 ...
     $ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
     $ Property_Area    : Factor w/ 3 levels "Rural","Semiurban",..: 3 1 3 3 3 3 3 2 3 2 ...
     $ Loan_Status      : Factor w/ 2 levels "N","Y": 2 1 2 2 2 2 2 1 2 1 ...

Explore the data frame, identify and report the missing data. How will you deal with the missing data?
Create the variable "aggregatedincome" for each loan application. Define and create your own three categories of "aggregatedincome": high, medium, and low.
In each of your defined categories of "aggregatedincome", what percentage of applications received their loan approvals?
Comparing with loan sizes, will you conclude any insight?
Explanation / Answer
NOTE: All the code will be indented to the right. The explanation will be un-indented.
I have used the R statistical programming language to answer this question (as is implied by the question).
After running the initial commands given in the question, I used the following code to explore the data and report its missing values:
    # Summary of the data.
    summary(data_loanapp)
    # Dimensions of the data.
    dim(data_loanapp)
    # Total number of missing (NA) values: 86 in the dataset.
    sum(is.na(data_loanapp))
    # Missing (NA) values per column: 3 columns contain NAs.
    colSums(is.na(data_loanapp))
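One caveat on these counts: is.na() only sees true NA values. The str() output in the question shows an empty-string level ("") in factors such as Gender and Married, so blank cells in those columns are not included. Here is a minimal sketch of how blanks could be counted as well. It uses a small made-up data frame; 'toy' and its values are hypothetical and are not rows from the real dataset.

```r
# Hypothetical toy data; not rows from the real loan dataset.
toy <- data.frame(
  Gender     = factor(c("Male", "", "Female", "Male")),
  LoanAmount = c(NA, 128, 66, NA)
)

# Count NA values per column (what colSums(is.na(...)) reports).
na_counts <- colSums(is.na(toy))

# Count blank strings per column; as.character() lets factors be compared to "".
blank_counts <- sapply(toy, function(col) sum(as.character(col) == "", na.rm = TRUE))

na_counts
blank_counts
```

Another option is to pass na.strings = c("", "NA") to read.csv(), so that blank cells are read as NA in the first place.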
Below is the code for 4 techniques for handling missing values. Which technique is appropriate depends on the problem being solved.
1. Removing rows with missing values.
2. Imputing missing values with 0.
3. Central imputation.
4. KNN imputation.
    # centralImputation() and knnImputation() come from the DMwR package.
    library(DMwR)

    # Technique 1: complete.cases
    # Removing the rows with missing values; the 'complete.cases' function in R is used
    data_complete_cases= data_loanapp
    data_complete_cases= data_complete_cases[complete.cases(data_complete_cases),]
    sum(is.na(data_complete_cases))

    # Technique 2: Imputation with 0
    # Imputing missing values with 0
    data_zero_imputation= data_loanapp
    data_zero_imputation[is.na(data_zero_imputation)]= 0
    sum(is.na(data_zero_imputation))

    # Technique 3: Central imputation
    # Imputing missing values with the median (numeric columns) or mode (factor columns)
    data_central= data_loanapp
    data_central= centralImputation(data_central)
    sum(is.na(data_central))

    # Technique 4: KNN imputation
    # Imputing missing values using nearest neighbours
    data_knn= data_loanapp
    data_knn= knnImputation(data_knn)
    sum(is.na(data_knn))
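To make clear what central imputation does, here is a rough base R sketch of the same idea that needs no extra package. It is not the package's implementation; 'impute_central' and the toy data are hypothetical names and values used only for illustration.

```r
# Hypothetical toy data; not rows from the real loan dataset.
toy <- data.frame(
  LoanAmount    = c(NA, 128, 66, 120),
  Property_Area = factor(c("Urban", NA, "Urban", "Rural"))
)

# Hypothetical helper: fill NAs with the median (numeric) or the mode (otherwise).
impute_central <- function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
  } else {
    counts <- table(col)  # table() ignores NA by default
    col[is.na(col)] <- names(counts)[which.max(counts)]
  }
  col
}

# Apply the helper column by column while keeping the data frame structure.
toy_imputed <- toy
toy_imputed[] <- lapply(toy, impute_central)
sum(is.na(toy_imputed))
```

In this toy example the missing LoanAmount becomes the median of the observed values and the missing Property_Area becomes the most frequent level.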
I have used the dataset with missing values imputed by 'central imputation' to answer the remaining questions.
I have created a new column 'aggregatedIncome' by adding the values of applicant income and coapplicant income.
    data_central$aggregatedIncome= data_central$ApplicantIncome + data_central$CoapplicantIncome
I have then converted this aggregated income to 3 levels: 'Low', 'Medium', 'High'.
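    summary(data_central)
    # Below 3000 is 'Low', 3000 up to (but not including) 10000 is 'Medium', 10000 and above is 'High'.
    data_central$aggregatedIncome= ifelse(data_central$aggregatedIncome<3000,"Low",
                                          ifelse(data_central$aggregatedIncome<10000,"Medium","High"))
    table(data_central$aggregatedIncome)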
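The same banding can also be written with cut(), which avoids nested ifelse() calls. If the result is stored in a separate variable, the numeric income stays available for later use. This is a sketch on made-up incomes; 'income' and 'income_band' are hypothetical names, and the thresholds are the ones chosen above.

```r
# Hypothetical incomes; not values from the real loan dataset.
income <- c(2500, 3000, 9999, 10000, 15000)

# right = FALSE closes each interval on the left:
# below 3000, 3000 to below 10000, 10000 and above.
income_band <- cut(income,
                   breaks = c(-Inf, 3000, 10000, Inf),
                   labels = c("Low", "Medium", "High"),
                   right  = FALSE)

table(income_band)
```

With right = FALSE, an income of exactly 3000 falls in 'Medium' and exactly 10000 falls in 'High', matching the '<' comparisons in the ifelse() version.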
The following code gives the percentage of applications which received loan approvals, for each of the categories:
    # Cross-tabulate income category against loan status once and reuse it.
    approval_table= table(data_central$aggregatedIncome, data_central$Loan_Status)
    # Percentage of approvals for the 'High' category. 65% approval rate.
    approval_table["High","Y"]/sum(approval_table["High",])*100
    # Percentage of approvals for the 'Low' category. 56% approval rate.
    approval_table["Low","Y"]/sum(approval_table["Low",])*100
    # Percentage of approvals for the 'Medium' category. 70% approval rate.
    approval_table["Medium","Y"]/sum(approval_table["Medium",])*100
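All three approval percentages can also be obtained in one step with prop.table(). Here is a minimal sketch on made-up data; 'toy', 'band' and 'status' are hypothetical names, and the resulting numbers are not the figures for the real dataset.

```r
# Hypothetical toy data; not rows from the real loan dataset.
toy <- data.frame(
  band   = c("High", "High", "Low", "Low", "Low",
             "Medium", "Medium", "Medium", "Medium"),
  status = c("Y", "N", "Y", "N", "N", "Y", "Y", "Y", "N")
)

counts <- table(toy$band, toy$status)

# margin = 1 divides each row by its row total, giving shares within each band.
approval_pct <- round(prop.table(counts, margin = 1)[, "Y"] * 100, 1)
approval_pct
```

Selecting the "Y" column by name avoids relying on the position of a cell inside the table.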
Now, we must compare loan sizes and see if we can come up with any insights.
The following code produces a 2x2 table which compares loan sizes with loan approval:
    table(data_central$LoanAmount>175,data_central$Loan_Status)
From the table, loans larger than 175 (in the units of the LoanAmount column) have a lower approval rate (about 1.8 approvals per rejection) than loans of 175 or less (about 2.31 approvals per rejection).
Thus, we can conclude that smaller loans have a better chance of approval than bigger loans, ceteris paribus.
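The approvals-per-rejection figures come from dividing the 'Y' count by the 'N' count in each row of the 2x2 table. Here is a small sketch of that arithmetic on made-up data; 'toy' and 'size_table' are hypothetical names, and the values are not from the real dataset.

```r
# Hypothetical toy data; not rows from the real loan dataset.
toy <- data.frame(
  LoanAmount  = c(100, 120, 150, 200, 250, 300),
  Loan_Status = c("Y", "Y", "N", "Y", "N", "N")
)

# Rows: FALSE = loan of 175 or less, TRUE = loan above 175.
size_table <- table(large = toy$LoanAmount > 175, status = toy$Loan_Status)

# Approvals per rejection within each row.
approvals_per_rejection <- size_table[, "Y"] / size_table[, "N"]
approvals_per_rejection
```

A ratio above 1 means approvals outnumber rejections in that group; comparing the two rows shows which loan size fares better.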
If you have any other doubts related to the question, or if you think I have missed out the answer to any of the parts, please feel free to comment in the comments section below, and I will resolve them. Happy learning!