Question

I have two non-overlapping sets of items, with feature counts for each. What standard algorithms can I use to extract the most statistically distinct features of each set?

For example:

Items served at American restaurants (5 restaurants surveyed):
bread: 4
burgers: 2
cheese: 1
cronuts: 2
pasta: 2

Items served at Italian restaurants (10 restaurants surveyed):
bread: 7
pasta: 10
cheese: 8

I want to be able to know that cronuts and burgers are distinctly associated with American restaurant menus, and cheese and pasta are distinctly associated with Italian restaurant menus.

Explanation / Answer

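This looks like a standard supervised classification problem: treat each restaurant as an example, its cuisine (American or Italian) as the class label, and the presence of each menu item as a binary feature. Most machine learning classifiers will work here. You might start with Naive Bayes.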

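If you want to evaluate a single feature on its own, you could use information gain (how much knowing whether an item is served reduces the uncertainty about the cuisine) or BIC (the Bayesian information criterion).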
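As a rough illustration of the information gain idea, here is a minimal Python sketch using the counts from the question. It assumes each count is the number of surveyed restaurants serving the item, so every item is a present/absent feature. The function names (`entropy`, `information_gain`, `rank_features`) are made up for this example, and the "leans" column is an extra comparison of serving rates, since information gain alone does not say which set a feature points to.

```python
from math import log2

# Counts from the question: number of restaurants in each set serving the item.
N_AMERICAN, N_ITALIAN = 5, 10
american = {"bread": 4, "burgers": 2, "cheese": 1, "cronuts": 2, "pasta": 2}
italian = {"bread": 7, "pasta": 10, "cheese": 8}

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(a, i):
    """Bits of information about the cuisine gained by knowing whether
    an item is served (a = American count, i = Italian count)."""
    n = N_AMERICAN + N_ITALIAN
    present = [a, i]
    absent = [N_AMERICAN - a, N_ITALIAN - i]
    # Conditional entropy: weight each group by its share of restaurants.
    h_cond = sum(sum(g) / n * entropy(g) for g in (present, absent) if sum(g) > 0)
    return entropy([N_AMERICAN, N_ITALIAN]) - h_cond

def rank_features():
    """Return (item, gain, side) tuples, highest gain first."""
    rows = []
    for item in sorted(set(american) | set(italian)):
        a, i = american.get(item, 0), italian.get(item, 0)
        # Which set serves the item at the higher rate (an added heuristic).
        side = "American" if a / N_AMERICAN > i / N_ITALIAN else "Italian"
        rows.append((item, information_gain(a, i), side))
    return sorted(rows, key=lambda r: r[1], reverse=True)

for item, gain, side in rank_features():
    print(f"{item:8s} gain={gain:.3f} bits  leans {side}")
```

With samples this small (5 and 10 restaurants) the ranking is only suggestive. On these counts bread ends up with the lowest gain, which matches the intuition that it is common to both sets.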

To use all the features together, train a machine learning classifier. I would suggest trying Naive Bayes first. If you need something more powerful, there are many other classifiers, such as random forests, SVMs, and k-nearest neighbors. A textbook on machine learning covers these in detail.
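To make the Naive Bayes suggestion concrete, here is a small hand-rolled sketch of a Bernoulli Naive Bayes model built directly from the aggregated counts in the question. It is one possible setup, not the only one: it assumes items are present/absent features, uses add-one (Laplace) smoothing so that items never seen in a set (such as burgers at Italian restaurants) do not get probability zero, and the helper names (`p_item`, `log_score`, `classify`) are invented for this example.

```python
from math import log

# Aggregated counts from the question.
N = {"American": 5, "Italian": 10}
COUNTS = {
    "American": {"bread": 4, "burgers": 2, "cheese": 1, "cronuts": 2, "pasta": 2},
    "Italian": {"bread": 7, "pasta": 10, "cheese": 8},
}
VOCAB = sorted(set(COUNTS["American"]) | set(COUNTS["Italian"]))

def p_item(item, cuisine):
    """Smoothed estimate of P(item is served | cuisine)."""
    return (COUNTS[cuisine].get(item, 0) + 1) / (N[cuisine] + 2)

def log_score(menu, cuisine):
    """Log prior plus the log likelihood of every item being present or absent."""
    score = log(N[cuisine] / sum(N.values()))
    for item in VOCAB:
        p = p_item(item, cuisine)
        score += log(p) if item in menu else log(1 - p)
    return score

def classify(menu):
    """Return the cuisine with the highest Naive Bayes score for a set of items."""
    return max(N, key=lambda cuisine: log_score(menu, cuisine))

print(classify({"bread", "burgers"}))
print(classify({"bread", "pasta", "cheese"}))
```

In practice you would probably fit a library implementation on per-restaurant data rather than on aggregated counts, but the arithmetic is the same idea.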