Question
Assignment 2 : R Programming
Upload R code into Week 2 Dropbox.
Download insurance.csv file from Doc Sharing. Read into a data frame named insuranceData using data.table() with the following options (check Week 1 lecture notes and Chapter 2 from textbook.)
header=T,stringsAsFactors=F
Data has 7 columns and 1338 rows. The data contains information for health insurance charges based on the age, sex, bmi, number of children, smoking, and the region of the country where the family lives.
A) Print the name of the columns.
B) Print the number of rows and columns.
C) Count the number of males and females in the data.
Hint: Lecture notes have samples on counting items in vectors, e.g., the table() function.
D) Find mean, median, standard deviation, and variance of age and bmi.
The R functions to be used are mean(), median(), sd(), var().
E) Find maximum and minimum values of age, bmi, and children.
F) Use summary() function to print information about the distribution of the insurance data. What are the min and max values printed by the summary() function for the age, bmi, children, and charges?
G) Use summary() function to print distribution information of the age column.
Check textbook page 34 for a sample.
H) Use unique() function to print the name of distinct regions.
I) Extract the subset of insurance data that has three children.
Hint: Use subset() function. Check lecture notes and textbook for samples.
J) Extract the subset of insurance data with charges more than 30000.
K) Extract the subset of insurance data for females living in southwest region.
L) Extract the subset of insurance data for males living in northwest region with more than 2 children.
M) Use class() function to print the type of R object for each column of the insurance data frame.
Hint: Textbook chapter 2.
N) Use class() function to print the type of the smoker column. Convert smoker column to a factor type. How many levels are created when you convert the smoker column to factor type? What would be the reason you want to convert the smoker column type from character to a factor type?
Hint: You can get information about the levels by just printing the smoker column after conversion. Check lecture notes.
O) Use summary() function to print the summary statistics for the smoker column. What is the result of using summary() function on a data type of factor?
To see the differences of using the summary() function on different data types print the result of summary for the region, age, and smoker. What are the differences?
This is an example to show that summary() function reports different statistics for numeric and categorical data (i.e., factors).
Explanation / Answer
library(data.table)
# fread() returns a data.table, so convert it to a plain data frame afterwards
insurance_data <- fread("insurance.csv", header = TRUE, stringsAsFactors = FALSE)
insurance_data <- as.data.frame(insurance_data)
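#Sketch: a base-R alternative for the reading step, assuming read.csv() is acceptable for the assignment.
#The three rows written to the temp file are made-up sample values, and demo_data and csv_path are
#hypothetical names; with the real file you would pass "insurance.csv" and expect 1338 rows and 7 columns.

```r
# Made-up three-row stand-in for insurance.csv, written to a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "age,sex,bmi,children,smoker,region,charges",
  "19,female,27.9,0,yes,southwest,16884.92",
  "18,male,33.77,1,no,southeast,1725.55",
  "28,male,33.0,3,no,southeast,4449.46"
), csv_path)

# read.csv() returns a plain data frame, so no conversion step is needed
demo_data <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Check the shape right after reading, before running any other step
stopifnot(ncol(demo_data) == 7)
dim(demo_data)
```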
#A) Print the name of the columns.
colnames(insurance_data)
#B) Print the number of rows and columns.
dim(insurance_data)
#C) Count the number of males and females in the data.
table(insurance_data$sex)
#D) Find mean, median, standard deviation, and variance of age and bmi.
mean(insurance_data$age)
mean(insurance_data$bmi)
median(insurance_data$age)
median(insurance_data$bmi)
sd(insurance_data$age)
sd(insurance_data$bmi)
var(insurance_data$age)
var(insurance_data$bmi)
#Note : age and bmi must be integer or numeric. If a column contains NAs, these functions return NA unless you pass na.rm = TRUE
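#Sketch: how the type and NA checks mentioned in the note could look.
#bmi_demo is a hypothetical vector with one missing value, not taken from the real data.

```r
# Hypothetical values; the NA stands in for a missing bmi entry
bmi_demo <- c(27.9, 33.77, NA, 22.705)

is.numeric(bmi_demo)   # TRUE: mean(), sd(), etc. need a numeric vector
anyNA(bmi_demo)        # TRUE: at least one value is missing

mean(bmi_demo)                 # NA, because the missing value propagates
mean(bmi_demo, na.rm = TRUE)   # mean of the three observed values
sd(bmi_demo, na.rm = TRUE)     # standard deviation of the observed values
```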
#E) Find maximum and minimum values of age, bmi, and children.
max(insurance_data$age)
max(insurance_data$bmi)
max(insurance_data$children)
min(insurance_data$age)
min(insurance_data$bmi)
min(insurance_data$children)
#F) Use summary() function to print information about the distribution of the insurance data.
summary(insurance_data)
##What are the min and max values printed by the summary() function for the age, bmi, children, and charges?
#Solution : In the summary() output, read the "Min." and "Max." entries listed under each of those four columns.
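#Sketch: the same min and max values can also be pulled out programmatically instead of read off by eye.
#demo_data holds made-up values and min_max is a hypothetical name; swap in insurance_data for the real answer.

```r
# Made-up stand-in for the numeric columns of the insurance data
demo_data <- data.frame(
  age      = c(19, 18, 28, 33),
  bmi      = c(27.9, 33.77, 33.0, 22.705),
  children = c(0, 1, 3, 0),
  charges  = c(16884.92, 1725.55, 4449.46, 21984.47)
)

# range() returns c(min, max); sapply() applies it to every selected column
min_max <- sapply(demo_data[c("age", "bmi", "children", "charges")], range)
rownames(min_max) <- c("min", "max")
min_max
```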
#G) Use summary() function to print distribution information of the age column.
summary(insurance_data$age)
#H) Use unique() function to print the name of distinct regions.
unique(insurance_data$region)
#Note: the name after the dollar symbol ("region" here) must match the column name in your data set
#to see the column names, use the command : colnames(insurance_data)
#I) Extract the subset of insurance data that has three children.
insurance_children = subset(insurance_data, children == 3)
insurance_children # to see the subsetted data frame
#J) Extract the subset of insurance data with charges more than 30000.
insurance_charges = subset(insurance_data, charges > 30000)
#K) Extract the subset of insurance data for females living in southwest region.
#Note: the sex column is assumed to hold "female" and "male"; confirm with table(insurance_data$sex) and adjust the labels if your file differs
insurance_region_sw = subset(insurance_data, sex == "female" & region == "southwest")
#L) Extract the subset of insurance data for males living in northwest region with more than 2 children.
insurance_region_nw = subset(insurance_data, sex == "male" & region == "northwest" & children > 2)
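#Sketch: subset() compares strings exactly, so a label that does not occur in the column silently returns zero rows.
#demo_data is a made-up four-row data frame; checking nrow() on the result is one way to catch a mismatched label.

```r
# Made-up stand-in for the sex, region, and children columns
demo_data <- data.frame(
  sex      = c("female", "male", "female", "male"),
  region   = c("southwest", "northwest", "southwest", "northwest"),
  children = c(0, 3, 2, 1),
  stringsAsFactors = FALSE
)

# A label that does not occur in the column matches nothing
nrow(subset(demo_data, sex == "females" & region == "southwest"))   # 0

# The exact label matches the intended rows
nrow(subset(demo_data, sex == "female" & region == "southwest"))    # 2

# All three conditions must hold for a row to be kept
subset(demo_data, sex == "male" & region == "northwest" & children > 2)
```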
#M) Use class() function to print the type of R object for each column of the insurance data frame.
sapply(insurance_data, class) # lists the data type of every column in one go
# or use the class function separately for each column
class(insurance_data$age)
class(insurance_data$sex) # repeat for any other column whose data type you want to check
#N) Use class() function to print the type of the smoker column. Convert smoker column to a factor type.
class(insurance_data$smoker)
#if the data type is character then convert to factor
insurance_data$smoker <- as.factor(insurance_data$smoker)
#How many levels are created when you convert the smoker column to factor type?
nlevels(insurance_data$smoker)
##What would be the reason you want to convert the smoker column type from character to a factor type?
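#Solution : smoker holds a small set of repeated categories. A factor stores each distinct value only once, as a level,
#and stores the data itself as a vector of integer codes, so functions such as summary() and table() treat the column as categorical.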
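#Sketch: what the levels and integer codes of a factor look like.
#smoker_demo is a made-up five-element vector, not the real smoker column.

```r
# Made-up character values standing in for the smoker column
smoker_demo <- c("yes", "no", "no", "yes", "no")
smoker_factor <- as.factor(smoker_demo)

levels(smoker_factor)       # "no" "yes" (each distinct value stored once, sorted)
nlevels(smoker_factor)      # 2
as.integer(smoker_factor)   # 2 1 1 2 1 (the integer codes behind the labels)
table(smoker_factor)        # count of rows in each level
```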
#O) Use summary() function to print the summary statistics for the smoker column.
summary(insurance_data$smoker)
##What is the result of using summary() function on a data type of factor?
#Solution: for a factor, summary() prints each level together with the number of rows in that level.
#To see the differences of using the summary() function on different data types
#print the result of summary for the region, age, and smoker. What are the differences?
summary(insurance_data$region)
summary(insurance_data$age)
summary(insurance_data$smoker)
##Solution :
#If region is character, the summary shows only Length, Class, and Mode
#If age is integer, the summary shows min, max, mean, median, and the quartiles
#If smoker is a factor, the summary shows each level name with its count
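#Sketch: the three cases side by side on made-up vectors.
#region_demo, age_demo, and smoker_demo are hypothetical names and values, not the real columns.

```r
# Made-up vectors, one per data type
region_demo <- c("southwest", "southeast", "southeast", "northwest")  # character
age_demo    <- c(19L, 18L, 28L, 33L)                                  # integer
smoker_demo <- factor(c("yes", "no", "no", "no"))                     # factor

summary(region_demo)   # Length, Class, Mode
summary(age_demo)      # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
summary(smoker_demo)   # count per level: no = 3, yes = 1
```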
Notes:
=> In the above code the data frame is named "insurance_data"; you can give it any meaningful name.
=> Use the column names exactly as they appear in the original data set, and change the code wherever they differ.
I don't have access to your data set, which is why I have used the column names from your question statement, such as age, sex, smoker, etc.
I hope the above code runs without trouble. If you have any problem running it, or a concept is not clear, please contact me.
If all goes well and you are satisfied, please give feedback too :-).
Sincerely,
Subodh Kumar