


Question

Can someone explain this question? I know the 4-bin equal-width technique. And I know how to sort the range.
However, I do not understand how to use discretisation preprocessing for NMI calculations. Can you tell me how?
NMI is normalised mutual information.
3.2 Mutual Information (4 Marks)

In this question, you will use the provided mutual_info() and my_entropy() functions discussed in workshop-week7 to calculate the mutual information between different attributes and the class label. Since discretisation is an important preprocessing for MI calculations, you will be using the "X" data matrix (i.e. without scaling) in this question. By looking at the "X" data matrix, attributes can be split into 11 discrete and 14 numerical.

Discrete columns are ['is_male', 'grad_school', 'university', 'high_school', 'is_married', 'pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6'] and numerical columns are ['limit_bal', 'age', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6', 'pay_amt1', 'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6'].

First, you need to discretize each of the numerical attributes by using 4-bin equal-width technique. The output for this step should have the following format:

bin# 1: range [a,b)
bin# 2: range [a,b)
bin# 3: range [a,b)
bin# 4: range [a,b)
###### pay_amt6 ######
bin# 1: range [a,b)
bin# 2: range [a,b)
bin# 3: range [a,b)
bin# 4: range [a,b)

where a and b are the calculated range for each bin.

Next, calculate the normalised mutual information (NMI) between each of the columns (i.e. 25 in total) and the class label y. Then, display a bar plot for the calculated NMI values. The plot should contain 25 bars, each bar representing the NMI value between attribute i and label y. The format of the bar plot should be similar to the plot below. Note that the NMI values in this plot are randomly generated and you will get different bars/values for your bar plot.

[Figure: example bar plot titled "Normalized mutual information for credit card users dataset", y-axis up to 1.0]

Explanation / Answer

It is good to see that you want to understand the concepts rather than just get the solution.

Data discretization means, in simple terms, dividing the data into a number of classes and labelling each value with its class. It is a data preprocessing step, typically applied after data cleaning, transformation, and reduction.

As you can see, you have two kinds of data:

1. Discrete columns

2. Numerical columns

As the name suggests, discrete columns already hold data that is discrete in nature; in other words, the data is already divided into classes. For example, is_male is either true or false, and no third value is possible. Likewise, grad_school and high_school take values from a definite, finite set. Even if such a column had many distinct values, they would still be countable, so each value can be treated as its own class. The same applies to the other discrete columns.

On the other hand, numerical columns hold values that do not come from a small fixed set. For example, limit_bal may be 0, or 1000, or any other number, and the same goes for bill_amt1 and so on; i.e., they have no built-in classes or subdivisions.

So, generally, in data preprocessing, after data cleaning, transformation, and reduction, we do one more step called data discretization: we convert numerical data into discrete form to simplify later processing.

Generally, for numerical data, we create bins, and each bin covers a range of values.

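In general, you can fix either the number of bins or the width of each bin. In the equal-width technique with a given number of bins (here 4), the number of bins is fixed and the width is computed from the data.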

For example, suppose we have the values 1, 2, 5, 8, 19, 25, 30, 64, 125, 167, 198, 200.

To calculate the width of each bin, we use r = (max - min)/N,

where r = width (range) of each bin,

N = number of bins, here N = 4 (given).

Hence, r = (200 - 1)/4 = 49.75.

So we have 4 bins that cover 1 to 200 in steps of 49.75. Each bin is half-open, [a,b), except the last one, which also includes the maximum:

bin1 is [1, 50.75) and holds 1, 2, 5, 8, 19, 25, 30

bin2 is [50.75, 100.5) and holds 64

bin3 is [100.5, 150.25) and holds 125

bin4 is [150.25, 200] and holds 167, 198, 200.
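The worked example can be sketched in Python as follows. This is only a minimal illustration on the 12 example values, not the assignment's required code; the function names equal_width_bins and assign_bin are made up for this sketch, and it assumes the last bin is closed on the right so that the maximum value has a bin to fall into.

```python
def equal_width_bins(values, n_bins=4):
    """Return n_bins + 1 edges that split [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins)]
    edges.append(float(hi))  # use the exact maximum to avoid floating-point drift
    return edges


def assign_bin(value, edges):
    """Return the 0-based bin index of value.

    Bins are [a, b), except the last one, which also holds the maximum
    (an assumption of this sketch).
    """
    n_bins = len(edges) - 1
    for i in range(n_bins - 1):
        if value < edges[i + 1]:
            return i
    return n_bins - 1


data = [1, 2, 5, 8, 19, 25, 30, 64, 125, 167, 198, 200]
edges = equal_width_bins(data)
bins = [assign_bin(v, edges) for v in data]

# Print each bin in a format close to the one the question asks for.
for i in range(len(edges) - 1):
    members = [v for v, b in zip(data, bins) if b == i]
    print(f"bin# {i + 1}: range [{edges[i]}, {edges[i + 1]}) -> {members}")
```

For the real data you would run the same two steps once per numerical column and replace each value with its bin number.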

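This is called discretization, and it is an important concept in data preprocessing for ML. Once each numerical column has been replaced by its bin number (1 to 4), it is a discrete column like the other 11, so its mutual information with the class label y can be computed from value counts in the same way and then normalised to get the NMI.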
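To connect the binning to the NMI step, here is a rough sketch of how NMI can be computed between one discretised column and the class label. The functions entropy, mutual_information and normalised_mi are hypothetical stand-ins written for this sketch; they are not the workshop's provided functions, whose exact behaviour is not shown in the question. The normalisation used, I(X;Y) / sqrt(H(X) * H(Y)), is one common choice and is an assumption here, so check which one your workshop uses. The toy data is invented.

```python
from collections import Counter
from math import log2, sqrt


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two equal-length label sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def normalised_mi(x, y):
    """NMI = I(X;Y) / sqrt(H(X) * H(Y)); the normalisation is an assumption."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        return 0.0  # a constant column carries no information
    return mutual_information(x, y) / sqrt(hx * hy)


# Hypothetical toy data: bin indices of one discretised column and a binary label.
binned = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 3, 3]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(normalised_mi(binned, y))
```

Without binning, a numerical column with mostly unique values would put almost every row in its own class, and the counts behind the entropy estimates would be meaningless. That is why the question asks for discretisation first. Repeating the call for all 25 columns gives the 25 values for the bar plot.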