Question
In this assignment we are going to work with a larger collection of tweets (10,000) that are available here:
http://rasinsrv07.cstcis.cti.depaul.edu/CSC455/Assignment5.txt
A. Using Python, identify the top 5 most frequent terms (words separated by ‘ ‘) that are at
least 4 characters long (i.e. ignore articles such as “a” or “the” and any other short
terms) in the text of the tweets. It is up to you whether you prefer to use the contents of
the loaded database (reading tweets from SQLite, which contains fewer tweets) or the
contents of the original Assignment5.txt file (reading tweets directly from the file).
Explanation / Answer
#!/usr/bin/python
def printMaximum():
    words = []   # unique words of 4 or more characters
    counts = []  # counts[k] is the number of occurrences of words[k]
    with open('Assignment5.txt', 'r') as f:
        for line in f:
            for word in line.split():
                if len(word) >= 4:
                    if word not in words:
                        # first time this word is seen
                        words.append(word)
                        counts.append(1)
                    else:
                        counts[words.index(word)] += 1
    for i in range(min(5, len(words))):  # print the top 5 most frequent words
        index = 0
        highest = 0
        for j in range(len(counts)):
            if counts[j] > highest:
                highest = counts[j]
                index = j
        print(words[index])
        counts[index] = 0  # zero out so the next pass finds the next most frequent word

printMaximum()
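
The parallel-list approach above rescans the word list for every word, which can get slow on 10,000 tweets. A shorter alternative, offered only as a sketch, is to let collections.Counter from the standard library do the counting. The function name top_terms, its parameters n and min_len, and the sample tweets are illustrative assumptions and not part of the original answer.

```python
from collections import Counter

def top_terms(lines, n=5, min_len=4):
    """Return the n most common (term, count) pairs among terms of at least min_len characters."""
    counter = Counter()
    for line in lines:
        # split(' ') follows the assignment's definition of a term:
        # pieces of text separated by a single space
        terms = line.rstrip('\n').split(' ')
        counter.update(t for t in terms if len(t) >= min_len)
    return counter.most_common(n)

# Made-up sample input, standing in for the tweet text
sample = ["good morning twitter", "good night twitter", "good luck"]
print(top_terms(sample, n=2))
```

To run it on the real data, pass the open file instead of the sample list, for example top_terms(open('Assignment5.txt')). Like the answer above, this counts raw line text; if each line of the file holds more than the tweet text, the text would need to be extracted first.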