Question
I have two non-overlapping sets of items, with feature counts for each. What standard algorithms can I use to extract the most statistically distinct features of each set?
For example:
Items served at American restaurants (5 restaurants surveyed):
bread: 4
burgers: 2
cheese: 1
cronuts: 2
pasta: 2
Items served at Italian restaurants (10 restaurants surveyed):
bread: 7
pasta: 10
cheese: 8
I want to be able to know that cronuts and burgers are distinctly associated with American restaurant menus, and cheese and pasta are distinctly associated with Italian restaurant menus.
Explanation / Answer
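This looks like a standard supervised classification problem: the set an item list comes from (American or Italian) is the class label, and the menu items are the features. Most machine learning classifiers would work here. Naive Bayes is a reasonable place to start.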
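If you want to score a single feature on its own, you could use information gain, which measures how much knowing whether an item is served reduces your uncertainty about which set a restaurant belongs to, or BIC (the Bayesian information criterion).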
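As a rough sketch of the information gain idea, the snippet below scores each menu item from the question by how much its presence or absence tells you about the restaurant type. It assumes each count is the number of surveyed restaurants serving the item, and that an item missing from a list has a count of zero. The names `entropy`, `information_gain`, and `leans` are made up for this illustration.

```python
from math import log2

# Counts from the question: number of surveyed restaurants serving each item.
american = {"bread": 4, "burgers": 2, "cheese": 1, "cronuts": 2, "pasta": 2}
italian = {"bread": 7, "pasta": 10, "cheese": 8}
n_american, n_italian = 5, 10


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(present_a, n_a, present_b, n_b):
    """Reduction in class entropy from knowing whether the item is served."""
    total = n_a + n_b
    present = [present_a, present_b]
    absent = [n_a - present_a, n_b - present_b]
    # Weighted entropy of the class within the "served" and "not served" groups.
    conditional = sum(
        sum(group) / total * entropy(group)
        for group in (present, absent)
        if sum(group) > 0
    )
    return entropy([n_a, n_b]) - conditional


features = sorted(set(american) | set(italian))
ig = {
    f: information_gain(american.get(f, 0), n_american, italian.get(f, 0), n_italian)
    for f in features
}
# Direction of the association: which set serves the item at a higher rate.
leans = {
    f: "American"
    if american.get(f, 0) / n_american > italian.get(f, 0) / n_italian
    else "Italian"
    for f in features
}

ranking = sorted(ig.items(), key=lambda item: item[1], reverse=True)
for feature, gain in ranking:
    print(f"{feature:8s} gain={gain:.3f} leans {leans[feature]}")
```

On this data pasta comes out as the most informative item and bread as the least, with cronuts and burgers leaning American and cheese and pasta leaning Italian. Information gain alone does not say which set a feature favors, which is why the sketch compares serving rates separately. With only 15 restaurants the scores are noisy, so treat the ranking as indicative.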
To use all the features together, train a classifier. As I mentioned, I would suggest trying Naive Bayes first. If you need something more powerful, there are many other classifiers, such as random forests, SVMs, and k-nearest neighbors. A machine learning textbook will cover these in more depth.
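To make the Naive Bayes suggestion concrete, here is a minimal sketch of a Bernoulli Naive Bayes classifier built directly from the counts in the question. It assumes each item is a present/absent feature per restaurant, that items are conditionally independent given the restaurant type, and it uses add-one (Laplace) smoothing so that unseen items do not produce zero probabilities. The helper names `p_feature`, `log_posterior`, and `classify` are invented for this example.

```python
from math import log

# Counts from the question: number of surveyed restaurants serving each item.
counts = {
    "American": {"bread": 4, "burgers": 2, "cheese": 1, "cronuts": 2, "pasta": 2},
    "Italian": {"bread": 7, "pasta": 10, "cheese": 8},
}
surveyed = {"American": 5, "Italian": 10}
features = sorted(set(counts["American"]) | set(counts["Italian"]))


def p_feature(cls, feature):
    """Smoothed estimate of P(item is served | restaurant type)."""
    return (counts[cls].get(feature, 0) + 1) / (surveyed[cls] + 2)


def log_posterior(cls, menu):
    """Unnormalized log P(cls | menu) under the Naive Bayes assumption."""
    score = log(surveyed[cls] / sum(surveyed.values()))  # class prior
    for f in features:
        p = p_feature(cls, f)
        score += log(p) if f in menu else log(1 - p)
    return score


def classify(menu):
    """Return the restaurant type with the highest posterior for this menu."""
    return max(surveyed, key=lambda cls: log_posterior(cls, menu))


# Per-feature log-likelihood ratio: positive leans American, negative Italian.
log_ratio = {
    f: log(p_feature("American", f) / p_feature("Italian", f)) for f in features
}

print(classify({"bread", "burgers", "cronuts"}))
print(classify({"bread", "pasta", "cheese"}))
```

The `log_ratio` values double as a per-feature distinctiveness score, since they show how strongly the fitted model associates each item with one set over the other. A library implementation (for example a Bernoulli Naive Bayes class in a machine learning package) would be the usual choice once you have per-restaurant data rather than aggregate counts.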