Question
Consider the following document-term table with 10 documents and 8 terms (A through H) containing raw term frequencies. We also have a specified query, Q, with the indicated raw term weights (the bottom row in the table). Answer the following questions, and in each case give the formulas you used to perform the necessary computations. Alternatively, you can write a program to perform the computations. (Please include your work or code in the assignment submission.) [Download the table as an Excel spreadsheet.]

a. Compute the ranking score for each document based on each of the following query-document similarity measures (sort the documents in decreasing order of the ranking score): dot product, Cosine similarity, Dice's coefficient, Jaccard's coefficient.

b. Compare the ranking obtained when binary term weights are used instead to the ranking obtained in part a, where raw term weights were used (do this only with the dot product as the similarity measure). Explain any discrepancy between the two rankings.

c. Construct a table similar to the one above, but instead of raw term frequencies compute the (non-normalized) tf×idf weights for the terms. Then compute the ranking scores using the Cosine similarity. Explain any significant differences between the ranking you obtained here and the Cosine ranking from the previous part.

Explanation / Answer
Vector operations can be used to compare documents with queries.
The vector space model (or term vector model) is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing, and relevancy ranking. Its first use was in the SMART Information Retrieval System.
Documents and queries are represented as vectors.
$d_j = (w_{1,j}, w_{2,j}, \dotsc, w_{t,j})$
$q = (w_{1,q}, w_{2,q}, \dotsc, w_{n,q})$
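As a rough illustration of how such vectors are compared, the Python sketch below computes the dot product, Cosine similarity, Dice's coefficient, and Jaccard's coefficient between one document vector and the query vector. The vectors shown are made-up raw term frequencies over terms A through H, not the values from the assignment's table, and the Dice and Jaccard formulas are the common weighted-vector (extended) forms; substitute the course's definitions if they differ.

import math

def dot(d, q):
    # Dot product of a document vector d and a query vector q.
    return sum(di * qi for di, qi in zip(d, q))

def cosine(d, q):
    # Cosine similarity: dot product divided by the product of the vector norms.
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(x * x for x in q))
    return dot(d, q) / (norm_d * norm_q) if norm_d and norm_q else 0.0

def dice(d, q):
    # Extended Dice's coefficient: 2 * dot(d, q) / (|d|^2 + |q|^2).
    denom = sum(x * x for x in d) + sum(x * x for x in q)
    return 2 * dot(d, q) / denom if denom else 0.0

def jaccard(d, q):
    # Extended Jaccard's coefficient: dot(d, q) / (|d|^2 + |q|^2 - dot(d, q)).
    denom = sum(x * x for x in d) + sum(x * x for x in q) - dot(d, q)
    return dot(d, q) / denom if denom else 0.0

# Hypothetical raw term-frequency vectors over terms A..H (not the assignment's data).
doc = [2, 0, 4, 0, 1, 0, 3, 0]
query = [1, 0, 2, 0, 0, 0, 1, 0]

for name, fn in [("dot", dot), ("cosine", cosine), ("dice", dice), ("jaccard", jaccard)]:
    print(name, round(fn(doc, query), 4))

Applying each measure to all 10 document rows and sorting the scores in decreasing order gives the rankings asked for in part a; replacing the raw frequencies with 0/1 values and re-running the dot product gives the binary-weight ranking for part b.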
Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best known schemes is tf-idf weighting (see the example below).
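The sketch below shows one common way to compute the non-normalized tf×idf weights from a raw frequency table, assuming idf is defined as log(N/df), where N is the number of documents and df is the number of documents containing the term. The table used here is a small made-up example, not the assignment's data, and the log base may differ from the one your course specifies.

import math

def tfidf_matrix(tf_table):
    # Non-normalized tf x idf weights: w = tf * log2(N / df).
    # N = number of documents, df = number of documents containing the term.
    n_docs = len(tf_table)
    n_terms = len(tf_table[0])
    df = [sum(1 for doc in tf_table if doc[t] > 0) for t in range(n_terms)]
    weighted = []
    for doc in tf_table:
        weighted.append([
            doc[t] * math.log2(n_docs / df[t]) if df[t] else 0.0
            for t in range(n_terms)
        ])
    return weighted

# Hypothetical 3-document, 4-term frequency table (illustration only).
tf_table = [
    [2, 0, 1, 3],
    [0, 1, 0, 1],
    [4, 0, 0, 2],
]
for row in tfidf_matrix(tf_table):
    print([round(w, 3) for w in row])

The resulting weighted table can then be fed into the same Cosine computation as before to produce the ranking asked for in part c.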
The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus).
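For completeness, here is a minimal sketch of how raw term-frequency vectors can be built from a corpus when single words are used as terms, so that the vector dimensionality equals the vocabulary size. The toy documents are invented for illustration.

def build_vectors(docs):
    # Build raw term-frequency vectors over the corpus vocabulary.
    # Each dimension corresponds to one distinct word in the corpus.
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        v = [0] * len(vocab)
        for word in doc.split():
            v[index[word]] += 1
        vectors.append(v)
    return vocab, vectors

# Toy corpus (illustrative only).
docs = ["information retrieval with vectors", "vectors represent documents", "query vectors"]
vocab, vectors = build_vectors(docs)
print(vocab)
for v in vectors:
    print(v)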