Three files: script01.txt, script02.txt, stopwords.csv. script01 and script02 ar
Question
Three files: script01.txt, script02.txt, stopwords.csv.
script01 and script02 are two scripts.
Eliminate all words in file 1 and file 2 that are in the stopwords file. Find the ten most frequently occurring words, as word-frequency pairs, for the file named script01.txt. Find the counts for those 10 words in the second input file (script02.txt). Your output should be formatted as follows:
You should also have a main file and a module file called module.py. This module file should contain three functions get_text(file_name), process_data(text_data), and print_output(data_dictionary).
PS: they and they're count as different words.
WORD Count 1 Count 2 26 2 3 4 Are Hearsay Stanley Interesting etc 123 121 2

Explanation / Answer
# Program assumes the user has downloaded the stopwords corpus from NLTK
from nltk.corpus import stopwords  # Import the stop word list

print("Welcome! This program will analyze your file to provide a word count and the top 30 words, with stopwords removed.")

# Open the input file and read its contents
with open('Obama 2009.txt', 'r') as f:
    s = f.read()

# Program will count the characters in the text file
num_chars = len(s)

# Program will count the lines in the text file
num_lines = s.count('\n')

# Program will call split with no arguments (splits on any whitespace)
words = s.split()

# Count how many times each word occurs
d = {}
for w in words:
    if w in d:
        d[w] += 1
    else:
        d[w] = 1
num_words = sum(d[w] for w in d)

# Creating a set makes the searching faster
stop_words = set(stopwords.words('english'))

# Build (count, word) pairs, skipping stopwords; the NLTK list is lowercase,
# so each word is lowercased before the lookup
lst = [(d[w], w) for w in d if w.lower() not in stop_words]
lst.sort()
lst.reverse()

# Program will print the results
print('Your input file has characters = ' + str(num_chars))
print('Your input file has lines = ' + str(num_lines))
print('Your input file has the following words = ' + str(num_words))
print('The 30 most frequent words are\n')
i = 1
for count, word in lst[:30]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1

print("Thank You! Goodbye.")
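
The program above analyzes a single file with NLTK's stop word list, whereas the question asks for a module.py with get_text(file_name), process_data(text_data), and print_output(data_dictionary) working on script01.txt, script02.txt, and stopwords.csv. A minimal sketch of such a module might look like the following. The three function names come from the question; everything else is an assumption: the helpers load_stopwords and top_ten_pairs are hypothetical, stopwords.csv is assumed to hold one or more words per row, words are compared in lowercase, and print_output is assumed to receive a dictionary mapping each word to a (count 1, count 2) pair.

```python
# module.py (sketch): function names are from the question; the bodies are assumptions.
import csv
import string
from collections import Counter


def get_text(file_name):
    """Return the full contents of a text file as one string."""
    with open(file_name, "r", encoding="utf-8") as f:
        return f.read()


def load_stopwords(file_name):
    """Hypothetical helper: read every cell of a CSV file into a set of lowercase words."""
    with open(file_name, newline="", encoding="utf-8") as f:
        return {cell.strip().lower()
                for row in csv.reader(f)
                for cell in row
                if cell.strip()}


def process_data(text_data):
    """Count words case-insensitively, stripping punctuation from word edges only.

    Inner apostrophes are kept, so "they" and "they're" stay different words.
    """
    counts = Counter()
    for token in text_data.lower().split():
        word = token.strip(string.punctuation)
        if word:
            counts[word] += 1
    return counts


def top_ten_pairs(counts1, counts2, stop_words):
    """Hypothetical helper: map the ten most frequent non-stopwords of counts1
    to (count in file 1, count in file 2)."""
    kept = Counter({w: c for w, c in counts1.items() if w not in stop_words})
    return {w: (c, counts2.get(w, 0)) for w, c in kept.most_common(10)}


def print_output(data_dictionary):
    """Print a WORD / Count 1 / Count 2 table from a {word: (count1, count2)} dict."""
    print(f"{'WORD':<15}{'Count 1':>10}{'Count 2':>10}")
    for word, (c1, c2) in data_dictionary.items():
        print(f"{word:<15}{c1:>10}{c2:>10}")
```

A main file could then import these functions, call process_data(get_text(...)) once for script01.txt and once for script02.txt, pass both results and load_stopwords("stopwords.csv") to top_ten_pairs, and hand the resulting dictionary to print_output.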