Question
Please explain each answer. Thanks!
Overfitting

Data scientists in your department have presented proposals for two models to predict customer defections. Model A uses all 34 variables available on customers (including attributes on demographics, location, purchase history, etc.). Model B is simpler and uses only three of the 34 variables: 1) the tenure of the customer with your firm, 2) the total dollar volume of their purchases and 3) an indicator of whether the customer has had a recent service complaint. The analysts fit the two models to a training data set of 1500 customers. They then tested the models on an independent test set of 500 customers. The predictive performance (total error rate) for each model was then measured on both the training and test sets.

Answer true (T) or false (F) to the following:
1. Model A is likely to have a lower error rate than Model B on the training data.
2. Model A is likely to have a lower error rate on the training data than on the test data.
3. Model B is likely to have a higher error rate on the training data than on the test data.
4. Suppose the error rate of Model B is lower than the error rate of Model A on the test data, then this indicates Model A is over-fit to the training data.
5. Because the training data set has more observations, the most accurate estimate of the actual performance of the two models is the error rate measured on the training data.

Explanation / Answer
(1) True
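Model A uses all 34 variables, and Model B's three variables are a subset of them, so Model A has more flexibility to fit the training data. In a regression setting this is why R2 on the training data never decreases when variables are added. For the same reason, the error rate of Model A on the training data is likely to be lower than that of Model B.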
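The point about R2 can be checked with a small sketch. It assumes ordinary least squares on synthetic data (the question does not say which modelling technique the analysts used), and the helper name training_mse and the data-generating rule are hypothetical. With nested least-squares fits, the 34-variable model can never have a larger training error than the 3-variable one; for a misclassification rate the ordering is only likely, not guaranteed.

```python
import numpy as np

def training_mse(X, y):
    """Fit ordinary least squares with an intercept and return the training MSE."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(np.mean(residuals ** 2))

rng = np.random.default_rng(0)      # fixed seed; the data are synthetic
X = rng.normal(size=(1500, 34))     # 1500 training customers, 34 variables
# Hypothetical truth: only the first three variables carry any signal.
y = X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=1500)

mse_a = training_mse(X, y)          # Model A: all 34 variables
mse_b = training_mse(X[:, :3], y)   # Model B: the first 3 variables only

print(f"training MSE, Model A: {mse_a:.4f}")
print(f"training MSE, Model B: {mse_b:.4f}")
```

Model A's training MSE comes out no larger than Model B's even though 31 of its variables are pure noise here, which is exactly why a low training error says little about how the model will do on new customers.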
(2) True
Model A has many variables, so it can fit the training data closely, including noise that is specific to those 1500 customers. That noise does not repeat in the test set, so the error rate on the new data is likely to be higher than on the training data.
(3) False
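Any model is fitted to the training data, so its training error is usually lower than, or at best about equal to, its test error. Because Model B has only 3 variables it over-fits very little, so its two error rates should be close, and the test error could come out slightly lower by chance. That is possible but not likely, so the statement is false.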
(4) True
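Model A is more flexible than Model B, so it is expected to fit the training data at least as well (see (1)). If it nevertheless has a higher error rate than the simpler Model B on the test data, its extra variables are capturing noise in the training set rather than patterns that carry over to new customers, which is what over-fitting means.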
(5) False
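The performance of a model is judged by how it does on new data that was not used to fit it. The training error is optimistically biased because the model was tuned to those very observations, so having more observations in the training set does not make it the better estimate. The error rate on the independent test set is the more accurate estimate of actual performance, and the model with the lower test error is the better model.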
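The whole comparison can be sketched in a few lines. Everything here is an assumption made for illustration: the data are simulated, defection is made to depend on only the first three variables, and the classifier is a simple least-squares fit of the 0/1 labels with a 0.5 cut-off (the helper names fit_linear_classifier and error_rate are hypothetical, not the analysts' method). The sketch only shows where each of the four error rates comes from.

```python
import numpy as np

def fit_linear_classifier(X, y):
    """Least-squares fit of 0/1 labels on the variables plus an intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y.astype(float), rcond=None)
    return coef

def error_rate(coef, X, y):
    """Share of customers misclassified when predicting 1 if the score exceeds 0.5."""
    design = np.column_stack([np.ones(len(X)), X])
    predictions = (design @ coef > 0.5).astype(int)
    return float(np.mean(predictions != y))

rng = np.random.default_rng(42)     # fixed seed; the data are synthetic
X = rng.normal(size=(2000, 34))     # 2000 customers, 34 variables
# Hypothetical truth: defection depends on the first three variables only.
logits = X[:, 0] - X[:, 1] + 0.8 * X[:, 2]
y = (rng.random(2000) < 1 / (1 + np.exp(-logits))).astype(int)

# 1500 customers for training, 500 held out as the independent test set.
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

results = {}
for name, cols in [("A", slice(0, 34)), ("B", slice(0, 3))]:
    coef = fit_linear_classifier(X_train[:, cols], y_train)
    results[name] = {
        "train": error_rate(coef, X_train[:, cols], y_train),
        "test": error_rate(coef, X_test[:, cols], y_test),
    }

for name, rates in results.items():
    print(f"Model {name}: train error {rates['train']:.3f}, test error {rates['test']:.3f}")
```

When reading the output, compare each model's train and test columns for statements (2) and (3), the two train figures for (1), and the two test figures for (4). The test column is the one to trust for (5).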