Question
Consider the following document-term table containing raw term frequencies. Answer the following questions, and in each case give the formulas you used to perform the necessary computations.
a. Compute the new weights for all the terms in document DOC4 using the tf x idf approach.
b. Compute the new weights for all the terms in document DOC4 using the signal-to-noise ratio approach.
c. Using the Keyword Discrimination approach, determine if Term4 is a good index term or not (by computing its discriminant). To compute average similarities, use Cosine similarity as your similarity measure. Show your work.
Explanation / Answer
Answer: See the details below:
1. For Doc4, tf-idf table is given below:
Note the following:
1. tf stands for term frequency, which tells the number of times a term occurs in a document.
2. idf stands for inverse document frequency, which is the log of the ratio of the total number of documents to the number of documents in which the concerned term occurs. The IDF values listed with the table use the natural logarithm, e.g. ln(10/7) = 0.356675 for Term1.
3. tf-idf is the product of tf and idf (tf*idf).
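The three notes above can be checked with a short Python sketch. It assumes the natural logarithm for idf (which is what the listed IDF values correspond to) and hard-codes the raw frequencies from the document-term table. The names TF, document_frequencies, idf_values and tfidf_row are illustrative only and not part of the original answer.

```python
import math

# Raw term frequencies from the document-term table:
# rows are Doc1..Doc10, columns are Term1..Term8.
TF = [
    [0, 3, 1, 0, 0, 2, 1, 0],
    [5, 0, 0, 0, 3, 0, 0, 2],
    [3, 0, 4, 3, 4, 0, 0, 5],
    [1, 8, 0, 3, 0, 1, 4, 0],
    [0, 1, 0, 0, 0, 5, 4, 2],
    [2, 0, 2, 0, 0, 4, 0, 1],
    [2, 5, 0, 3, 0, 1, 4, 2],
    [3, 3, 0, 2, 0, 0, 1, 3],
    [0, 0, 3, 3, 3, 0, 0, 0],
    [1, 0, 5, 0, 2, 4, 0, 2],
]


def document_frequencies(tf):
    """Number of documents in which each term occurs at least once."""
    return [sum(1 for row in tf if row[k] > 0) for k in range(len(tf[0]))]


def idf_values(tf):
    """idf = ln(N / df); assumes every term occurs in at least one document."""
    n_docs = len(tf)
    return [math.log(n_docs / df) for df in document_frequencies(tf)]


def tfidf_row(tf, doc_index):
    """tf * idf for every term of one document (doc_index is 0-based)."""
    return [f * w for f, w in zip(tf[doc_index], idf_values(tf))]


doc4_weights = tfidf_row(TF, 3)  # Doc4 is the fourth row
print([round(w, 6) for w in doc4_weights])
# [0.356675, 5.545177, 0.0, 2.079442, 0.0, 0.510826, 2.772589, 0.0]
```

The printed values agree with the Doc4 tf-idf weights listed with the table. Using a different log base would scale every weight by the same constant factor.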
2. For Doc 6, Signal-to-Noise ratio is given below:
Note the following:
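1. Noise for a term is calculated by summing the following quantity over every document in which the term occurs:
(term freq/total freq)*log2(total freq/term freq)
Here term freq is the term's frequency in that document and total freq is the term's total frequency across all documents.
2. The Signal-to-Noise weight of a term in a document is given by:
term freq*(log2(total freq) - term noise)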
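As a rough sketch of the two formulas above, the following Python computes the noise of each term (summed over the documents that contain it), the signal log2(total freq) - noise, and the resulting weights for one document. The function names are hypothetical, and the document index is a parameter, so the same code covers Doc4 (as the question asks) or any other row.

```python
import math

# Raw term frequencies: rows are Doc1..Doc10, columns are Term1..Term8.
TF = [
    [0, 3, 1, 0, 0, 2, 1, 0],
    [5, 0, 0, 0, 3, 0, 0, 2],
    [3, 0, 4, 3, 4, 0, 0, 5],
    [1, 8, 0, 3, 0, 1, 4, 0],
    [0, 1, 0, 0, 0, 5, 4, 2],
    [2, 0, 2, 0, 0, 4, 0, 1],
    [2, 5, 0, 3, 0, 1, 4, 2],
    [3, 3, 0, 2, 0, 0, 1, 3],
    [0, 0, 3, 3, 3, 0, 0, 0],
    [1, 0, 5, 0, 2, 4, 0, 2],
]


def term_noise(tf, k):
    """Sum of (tf/total) * log2(total/tf) over documents containing term k."""
    total = sum(row[k] for row in tf)
    if total == 0:
        return 0.0  # term never occurs; treated as zero noise in this sketch
    return sum(
        (row[k] / total) * math.log2(total / row[k])
        for row in tf
        if row[k] > 0
    )


def term_signal(tf, k):
    """Signal of term k: log2(total frequency) minus its noise."""
    total = sum(row[k] for row in tf)
    if total == 0:
        return 0.0
    return math.log2(total) - term_noise(tf, k)


def signal_weights(tf, doc_index):
    """Weight of each term in one document: tf * signal (0-based doc_index)."""
    return [tf[doc_index][k] * term_signal(tf, k) for k in range(len(tf[0]))]


for name, weight in zip(
    [f"Term{i}" for i in range(1, 9)], signal_weights(TF, 3)  # Doc4
):
    print(name, round(weight, 6))
```

A term concentrated in a single document has zero noise and therefore the highest possible signal, while a term spread evenly over many documents has high noise and a signal close to zero.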
3. Calculation for Term Discriminant: See the table below
Note the following:
Term discriminant is the difference between two averages of pairwise document similarities: the average computed with the concerned term removed from every document, minus the average computed with the term included. If it is greater than zero, removing the term makes the documents look more alike, so the term is a good indexing term. If it is less than zero, the term cannot be regarded as a good term for indexing. If the term discriminant is zero or almost zero, the term is not that good for indexing either.
The term discriminant value for Term4, as calculated in the table above, is 1.733333, which is greater than zero. Hence, Term4 can be a good term for indexing.
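The discriminant computation can be sketched in Python as follows. This is only one possible reading: it assumes the average similarity is taken over all pairs of documents using cosine similarity on raw term frequencies, and it takes the discriminant as (average similarity with the term removed) minus (average similarity with all terms). Some treatments average similarities to a centroid instead, which gives different numbers. All function names are illustrative.

```python
import math
from itertools import combinations

# Raw term frequencies: rows are Doc1..Doc10, columns are Term1..Term8.
TF = [
    [0, 3, 1, 0, 0, 2, 1, 0],
    [5, 0, 0, 0, 3, 0, 0, 2],
    [3, 0, 4, 3, 4, 0, 0, 5],
    [1, 8, 0, 3, 0, 1, 4, 0],
    [0, 1, 0, 0, 0, 5, 4, 2],
    [2, 0, 2, 0, 0, 4, 0, 1],
    [2, 5, 0, 3, 0, 1, 4, 2],
    [3, 3, 0, 2, 0, 0, 1, 3],
    [0, 0, 3, 3, 3, 0, 0, 0],
    [1, 0, 5, 0, 2, 4, 0, 2],
]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_similarity(docs):
    """Mean cosine similarity over all unordered pairs of documents."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def discriminant(tf, k):
    """Average similarity without term k minus average similarity with it."""
    without_k = [row[:k] + row[k + 1:] for row in tf]
    return average_similarity(without_k) - average_similarity(tf)


print("Average similarity, all terms:", average_similarity(TF))
print("Discriminant of Term4:", discriminant(TF, 3))  # Term4 is column index 3
```

Under this convention a positive result means the documents become more similar once the term is dropped, i.e. the term helps tell documents apart.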
Document-term table (raw term frequencies), 10 documents:

          Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
Doc1      0      3      1      0      0      2      1      0
Doc2      5      0      0      0      3      0      0      2
Doc3      3      0      4      3      4      0      0      5
Doc4      1      8      0      3      0      1      4      0
Doc5      0      1      0      0      0      5      4      2
Doc6      2      0      2      0      0      4      0      1
Doc7      2      5      0      3      0      1      4      2
Doc8      3      3      0      2      0      0      1      3
Doc9      0      0      3      3      3      0      0      0
Doc10     1      0      5      0      2      4      0      2

Document frequency (DF), IDF, and Doc4 tf-idf weights per term:

                Term1     Term2     Term3     Term4     Term5     Term6     Term7     Term8
DF              7         5         5         5         4         6         5         7
IDF             0.356675  0.693147  0.693147  0.693147  0.916291  0.510826  0.693147  0.356675
Doc4 tf-idf     0.356675  5.545177  0         2.079442  0         0.510826  2.772589  0