Question

Hi - I have a similar question for my Statistics class. The question asks us to apply hierarchical clustering with Ward's method. After I generate the Hierarchical Clustering report in JMP Pro, I am having trouble interpreting the dendrogram and scree plot to determine the "best" number of clusters to use in this scenario. Any suggestions?

The question I am working on specifically is from the textbook Data Mining for Business Analytics: Concepts, Techniques, and Applications with JMP Pro (Galit Shmueli). Below is the question I am working on and my responses so far to give you a background of where I am having problems arriving at a solution.

14.3 Marketing to Frequent Fliers. The file EastWestAirlinesCluster.jmp contains information on 3999 passengers who belong to an airline’s frequent flier program. For each passenger the data include information on their mileage history and on different ways they accrued or spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics for the purpose of targeting different segments for different types of mileage offers.

a. Apply hierarchical clustering and Ward’s method. Make sure to standardize the data. Use the dendrogram and the scree plot, along with practical considerations, to identify the "best" number of clusters. Use the Color Clusters and Two-Way Clustering option to help interpret the clusters. How many clusters would you select? Why?

The variables in this problem are as follows: (highlighted in green are what I am assuming are the relevant quantitative variables to use for the Clustering Analysis in this case)

In JMP, using the file EastWestAirlinesCluster.jmp, screen shots below show my steps and results for my Hierarchical Clustering Report - Ward Method:

Accessing the Clustering menu in JMP I set the options up as shown below.

Next I assign the relevant quantitative variables to the 'Y, Columns' (at least what I believe to be the relevant variables in this case).

Then I select OK to generate the Hierarchical Cluster Analysis Report

RESULTS:

Dendrogram/ Scree Plot (shown below):

From the Dendrogram above, we can see that the optimal number of clusters may be around 20. If we zoom in on the scree plot, this will further give us reason to think that around 20 clusters may be the optimal number to use.

Scree Plot - taking a closer look:

To interpret the dendrogram a little better, we can apply Color Clusters found by selecting the red triangle to the left of Hierarchical Clustering.

Applying Color Clusters in JMP:

Applying Two Way Clustering in JMP - Color theme used: white (low) to Green (high):

Other Information to Consider:

QUESTIONS:

1.) Should I be including any other variables in my analysis?

2.) Using the information above, what is the best way to determine the "Best Number" of clusters to use? I'm having a hard time interpreting the Dendrogram as well as the Cluster Summary and Column Summary Reports.

3.) Is there any other analysis that I should be performing to arrive at my answer (find the "Best Number" of clusters)?

Any help you can provide would be greatly appreciated. Thank you.

Description of the Variables in the EastWestAirlines Data Set

Variable: Description
ID#: Passenger ID # (of the 3,999 passengers sampled)
Balance: Number of miles eligible for award travel
Qual_miles: Number of miles counted as qualifying for Topflight status
cc1_miles: Number of miles earned with frequent flyer credit card in the past 12 months: 1 = under 5,000; 2 = 5,000-10,000; 3 = 10,001-25,000; 4 = 25,001-50,000; 5 = over 50,000
cc2_miles: Number of miles earned with Rewards credit card in the past 12 months: 1 = under 5,000; 2 = 5,000-10,000; 3 = 10,001-25,000; 4 = 25,001-50,000; 5 = over 50,000
cc3_miles: Number of miles earned with Small Business credit card in the past 12 months: 1 = under 5,000; 2 = 5,000-10,000; 3 = 10,001-25,000; 4 = 25,001-50,000; 5 = over 50,000
Bonus_miles: Number of miles earned from non-flight bonus transactions in the past 12 months
Bonus_trans: Number of non-flight bonus transactions in the past 12 months
Flight_miles_12mo: Number of flight miles in the past 12 months
Flight_trans_12: Number of flight transactions in the past 12 months
Days_since_enroll: Number of days since Enroll date
Award?: Dummy variable for Last award (1 = not null, 0 = null)

Explanation / Answer

Sol:

Yes, include the cc1_miles, cc2_miles, and cc3_miles variables. Since they are coded categories (1 to 5) rather than actual mile counts, create dummy variables for them.
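If it helps to see what the dummy coding means outside of JMP, here is a minimal Python sketch. The three sample rows and the helper name add_dummies are made up for illustration; the real values would come from EastWestAirlinesCluster.jmp, and JMP has its own way of creating indicator columns.

```python
# Hypothetical sample rows; the real data live in EastWestAirlinesCluster.jmp.
rows = [
    {"ID#": 1, "cc1_miles": 1},
    {"ID#": 2, "cc1_miles": 3},
    {"ID#": 3, "cc1_miles": 5},
]


def add_dummies(rows, column, levels):
    """Return new rows where `column` is replaced by one 0/1 indicator per level."""
    out = []
    for row in rows:
        # Copy every field except the coded column itself.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        out.append(new_row)
    return out


encoded = add_dummies(rows, "cc1_miles", levels=[1, 2, 3, 4, 5])
for row in encoded:
    print(row)
```

Each passenger ends up with exactly one indicator set to 1 per coded variable. The same helper could be applied to cc2_miles and cc3_miles.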

1. Apply hierarchical clustering with Euclidean distance and Ward’s method. Make sure to standardize the data first.
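For reading the scree plot, one common heuristic is to look for the largest jump in joining distance between consecutive merges and stop just before it. Below is a rough Python sketch of that idea using SciPy rather than JMP. The data are synthetic (three made-up groups in two columns on very different scales), so the result says nothing about the airline file, and the largest-jump rule is only one reading of a scree plot; practical considerations still apply.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the airline data: three well-separated groups
# in two made-up columns whose scales differ by orders of magnitude.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [50_000.0, 5.0], [100_000.0, 0.0]])
X = np.vstack([c + rng.normal(scale=[2_000.0, 0.3], size=(30, 2)) for c in centers])

# Standardize each column to mean 0 and standard deviation 1.
Z_scores = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Ward linkage on Euclidean distances; column 2 holds the joining distances.
merges = linkage(Z_scores, method="ward")
heights = merges[:, 2]

# gaps[i] is the jump in joining distance from merge i to merge i + 1.
gaps = np.diff(heights)
biggest = int(np.argmax(gaps))

# After merge i (0-based) there are n - (i + 1) clusters left, so stopping
# just before the expensive merge leaves this many clusters.
n = len(Z_scores)
k = n - (biggest + 1)

# Cut the tree into k clusters and report the cluster sizes.
labels = fcluster(merges, t=k, criterion="maxclust")
print("suggested number of clusters:", k)
print("cluster sizes:", np.bincount(labels)[1:])
```

On real data the jumps are rarely this clean, so it is worth comparing a few candidate values of k and checking whether the resulting segments make sense for marketing.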

1.5. Use k-means clustering with the number of clusters that you found above.
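As a sketch of how the k-means step can follow on from the hierarchical result, the Python example below uses SciPy on synthetic data. Starting k-means from the Ward cluster centroids is my own choice to make the run reproducible, not something the textbook question requires, and k = 3 is an assumed value standing in for whatever number you pick from the dendrogram and scree plot.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

# Synthetic data with three made-up groups; not the airline file.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 4.0], [8.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

# Standardize each column to mean 0 and standard deviation 1.
Z_scores = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

k = 3  # assumed: the number chosen from the dendrogram and scree plot

# Ward clusters supply the starting centroids for k-means.
ward_labels = fcluster(linkage(Z_scores, method="ward"), t=k, criterion="maxclust")
start = np.array([Z_scores[ward_labels == c].mean(axis=0) for c in range(1, k + 1)])

# With minit="matrix", the second argument is read as the initial centroids.
centroids, km_labels = kmeans2(Z_scores, start, minit="matrix")

# Cluster profile: size and mean of each original (unstandardized) column.
for c in range(k):
    members = X[km_labels == c]
    print(c, len(members), members.mean(axis=0).round(2))
```

Comparing the per-cluster means on the original scale is what makes the segments interpretable, for example which group flies often versus which group earns miles mostly through credit cards.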

Hierarchical Cluster Aggregate Metrics with Mean