Question

I would like to infer people's interests or hobbies from their Twitter timeline data. A user's timeline is their historical collection of tweets.

The result I am trying to achieve is to automatically label a user with a collection of interests such as:

Finance
Photography
Android Development

I have tried using a topic model, LDA, with an individual's tweets as the corpus, but I am not sure how applicable the results are. LDA gives a collection of terms associated with each topic it finds, and it takes some manual interpretation to work out what the underlying interest is.

For example you might get some results from LDA like the following:

[Twitter, Java, time, work, day]
[code, IPhone, mobile, look, job]
[football, fantasy, team, pick, score]

I can generally pick out the topics people are interested in here, but it's only the key word or topic identifier I am interested in, not the other contextual terms like team, pick or score in the last example. Ideally I would like to transform these term-based results into something like the following interests:

[Twitter, Java]
[IPhone, Mobile Development]
[Fantasy Football]

I have considered how to do this using a supervised approach. Given a fixed set of interest categories, and using a bag-of-words model with a naive Bayes classifier, the task would be to determine which category a topic's terms belong to.
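As a rough illustration of the supervised setup described above, here is a minimal sketch of a bag-of-words multinomial naive Bayes classifier with Laplace smoothing, written with the standard library only. The function names (train_naive_bayes, classify), the two categories and the tiny training set are all made up for the example; a real system would need far more labelled data per category.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Count documents per category and words per category.

    labelled_docs is a list of (tokens, category) pairs.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, category in labelled_docs:
        class_counts[category] += 1
        for tok in tokens:
            tok = tok.lower()
            word_counts[category][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def classify(tokens, model):
    """Return the category with the highest log-probability for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for category in class_counts:
        # Log prior: fraction of training documents in this category.
        score = math.log(class_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for tok in tokens:
            tok = tok.lower()
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothed word likelihood.
            count = word_counts[category][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best


# Hypothetical hand-labelled examples, invented for illustration.
training = [
    (["football", "league", "team", "score"], "Fantasy Football"),
    (["fantasy", "draft", "pick", "team"], "Fantasy Football"),
    (["android", "code", "mobile", "app"], "Mobile Development"),
    (["iphone", "mobile", "code", "sdk"], "Mobile Development"),
]
model = train_naive_bayes(training)

# Classify the terms of one LDA topic from the example above.
topic_terms = ["football", "fantasy", "team", "pick", "score"]
label = classify(topic_terms, model)
print(label)
```

The classifier treats each topic's term list as a short document, so contextual terms such as team or score still contribute evidence even though only the category name is reported.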

This requires labelled data for each category of interest, which could be quite time-consuming to create. Is there an unsupervised learning approach that could work instead?

I may have gone down the wrong path by using topic models. If so, what existing areas of text research could help?

I am not looking for a complete solution to the problem; any suggestions, references or links would be appreciated.

Explanation / Answer

Thanks for your comments. Reformulating the question led me to a different way of thinking about the problem.

Getting a list of interest categories is actually reasonably straightforward for broad hobbies; it is just the labelled training data that may be hard to produce.

I found this paper on unsupervised text classification, which is probably a better fit than topic models in this scenario. It uses a small number of keywords per category (three on average) and segmented sentences to produce training data for a naive Bayes classifier.
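To make the keyword idea concrete, here is a minimal sketch of how a few seed keywords per category could turn unlabelled text into training data. This is my own simplification, not the paper's actual procedure: it splits text into sentences and keeps only those sentences whose words match the keywords of exactly one category. The seed keywords, the sample tweets and the function names are all hypothetical.

```python
import re

# Hypothetical seed keywords, roughly three per category.
SEED_KEYWORDS = {
    "Finance": {"stocks", "market", "investing"},
    "Photography": {"camera", "lens", "photo"},
    "Android Development": {"android", "apk", "gradle"},
}


def split_sentences(text):
    """Naive sentence segmentation on ., ! and ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase a sentence and extract simple word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def seed_label(texts, seed_keywords):
    """Label each sentence that matches exactly one category's keywords.

    Sentences matching no category, or more than one, are discarded
    as uninformative or ambiguous.
    """
    labelled = []
    for text in texts:
        for sentence in split_sentences(text):
            tokens = tokenize(sentence)
            token_set = set(tokens)
            matches = [cat for cat, kws in seed_keywords.items() if kws & token_set]
            if len(matches) == 1:
                labelled.append((tokens, matches[0]))
    return labelled


# Made-up tweets for illustration only.
tweets = [
    "Bought a new lens today. The market looks shaky this week!",
    "Finally fixed my gradle build.",
    "Taking a photo of the stock market ticker.",
    "Nothing much going on.",
]
training_data = seed_label(tweets, SEED_KEYWORDS)
for tokens, category in training_data:
    print(category, tokens)
```

The resulting (tokens, category) pairs could then be fed to a naive Bayes classifier in place of hand-labelled examples. Exact keyword matching is deliberately strict here, so "stock" does not match "stocks"; stemming would be one way to loosen that.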

I think I will combine this with some Twitter-specific text pre-processing and some sentiment classification, to see whether the categories are spoken of in a positive light.
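For the pre-processing step, something along these lines might be a starting point. It is only a sketch under my own assumptions about what counts as Twitter-specific noise (retweet markers, URLs, @mentions and the # in hashtags); the clean_tweet name and the sample tweet are invented.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"^RT\s+", re.IGNORECASE)


def clean_tweet(text):
    """Strip common Twitter markup and return lowercase word tokens."""
    text = RETWEET_RE.sub("", text)   # drop a leading "RT" marker
    text = URL_RE.sub(" ", text)      # drop links
    text = MENTION_RE.sub(" ", text)  # drop @usernames
    text = text.replace("#", " ")     # keep the hashtag word, drop the '#'
    return re.findall(r"[a-z0-9']+", text.lower())


# Invented sample tweet.
tokens = clean_tweet("RT @dev_guy: Loving #Android dev! https://example.com/x")
print(tokens)
```

Keeping the hashtag text while dropping the # symbol seems worthwhile, since hashtags often name the interest directly.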