Question
The t = 6 terms:
T1: bak(e,ing)
T2: recipes
T3: bread
T4: cake
T5: pastr(y,ies)
T6: pie
The d = 5 document titles:
D1: How to Bake Bread Without Recipes
D2: The Classic Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes
The 6 × 5 term-by-document matrix before normalization, where the element âij is the
number of times term i appears in document title j:
      D1  D2  D3  D4  D5
T1     1   0   0   1   0
T2     1   0   1   1   1
T3     1   0   0   1   0
T4     0   0   0   1   0
T5     0   1   0   1   1
T6     0   0   0   1   0
The 6 × 5 term-by-document matrix with unit columns:
      D1      D2      D3      D4      D5
T1    0.5774  0       0       0.4082  0
T2    0.5774  0       1.0000  0.4082  0.7071
T3    0.5774  0       0       0.4082  0
T4    0       0       0       0.4082  0
T5    0       1.0000  0       0.4082  0.7071
T6    0       0       0       0.4082  0
Explanation / Answer
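The two matrices described in the question follow mechanically from the term list and
the document titles, so they can be checked with a short sketch. The version below is
only an illustration in plain Python: the regular expressions (including the choice to
count plural forms such as "Breads" and "Cakes") and the helper names count_matrix and
normalize_columns are assumptions made for this example, not something the question
specifies.

```python
import math
import re

# Term labels from the question, each paired with an assumed regex.
# Counting plurals ("Breads", "Cakes", "Pies") is an assumption of this sketch.
TERM_PATTERNS = [
    ("bak(e,ing)", r"\bbak(?:e|ing)\b"),
    ("recipes", r"\brecipes?\b"),
    ("bread", r"\bbreads?\b"),
    ("cake", r"\bcakes?\b"),
    ("pastr(y,ies)", r"\bpastr(?:y|ies)\b"),
    ("pie", r"\bpies?\b"),
]

TITLES = [
    "How to Bake Bread Without Recipes",
    "The Classic Art of Viennese Pastry",
    "Numerical Recipes: The Art of Scientific Computing",
    "Breads, Pastries, Pies and Cakes: Quantity Baking Recipes",
    "Pastry: A Book of Best French Recipes",
]


def count_matrix(term_patterns, titles):
    """Entry (i, j) is the number of times term i occurs in title j."""
    return [
        [len(re.findall(pattern, title, flags=re.IGNORECASE)) for title in titles]
        for _, pattern in term_patterns
    ]


def normalize_columns(matrix):
    """Scale each column to unit Euclidean length (all-zero columns stay zero)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    norms = [
        math.sqrt(sum(matrix[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    return [
        [matrix[i][j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
        for i in range(n_rows)
    ]


A_hat = count_matrix(TERM_PATTERNS, TITLES)  # raw term counts
A = normalize_columns(A_hat)                 # unit-length columns

for (label, _), row in zip(TERM_PATTERNS, A):
    print(f"{label:<13}", "  ".join(f"{x:.4f}" for x in row))
```

Each column is divided by its Euclidean norm, so a title containing three of the terms
gets entries of 1/sqrt(3), and a title containing all six gets entries of 1/sqrt(6).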
Traditional indexing mechanisms for scientific research papers are constructed from
information such as their
titles, author lists, abstracts, key word lists, and subject classifications. It is not
necessary to read any of those items in order to understand a paper: they exist
primarily to enable researchers to find the paper in a literature search. For example,
the key words and subject classifications listed above enumerate what we consider to
be the major mathematical topics covered in this paper. In particular, the subject
classification 68P20 identifies this paper as one concerned with information retrieval
(IR). Before the advent of modern computing systems, researchers seeking particular
information could only search through the indexing information manually, perhaps [...]
Even when subsets of data can be managed manually, it is difficult to maintain
consistency in human-generated indexes: the extraction of concepts and key words
from documentation can depend on the experiences and opinions of the indexer.
Decisions about important key words and concepts can be based on such attributes as
age, cultural background, education, language, and even political bias. For instance,
while we chose to include only higher-level concepts in this paper’s key word list, a
reader might think that the words "vector" and "matrix" should also have been selected.
Our editor noted that the words "expository" and "application" did not appear in the list
even though they describe the main purpose of this paper. Experiments have shown
that there is a 20% disparity on average in the terms chosen as appropriate to describe
a given document by two different professional indexers [28].
These problems of scale and consistency have fueled the development of automated
IR techniques. When implemented on high-performance computer systems,
such methods can be applied to extremely large databases, and they can, without
prejudice, model the concept–document association patterns that constitute the semantic
structure of a document collection. Nonetheless, while automated systems
are the answer to some concerns of information management, they have their own
problems. Disparities between the vocabulary of the systems’ authors and that of
their users pose difficulties when information is processed without human intervention.
Complexities of language itself present other quandaries. Words can have many
meanings: a "bank" can be a section of computer memory, a financial institution, a steep
slope, a collection of some sort, an airplane maneuver, or even a billiard shot. It can
be hard to distinguish those meanings automatically. Similarly, authors of medical
literature may write about "myocardial infarctions", but the person who has had a minor
"heart attack" may not realize that the two phrases are synonymous when using the
public library’s on-line catalog to search for information on treatments and prognosis.
Formally, polysemy (words having multiple meanings) and synonymy (multiple words
having the same meaning) are two major obstacles to retrieving relevant information
from a database.
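A small sketch can make the two obstacles concrete. The example below uses purely
literal word matching on three made-up titles; the function name literal_match and the
sample documents are hypothetical and chosen only for illustration. Synonymy shows up as
a missed document, and polysemy as an unwanted one.

```python
def literal_match(query, documents):
    """Return the documents that contain every query word verbatim."""
    words = query.lower().split()
    return [
        doc for doc in documents
        if all(word in doc.lower().split() for word in words)
    ]


# Hypothetical titles, invented for this illustration.
DOCS = [
    "treatment of myocardial infarction",
    "opening a bank account",
    "erosion of the river bank",
]

# Synonymy: the relevant medical title uses different words, so nothing matches.
print(literal_match("heart attack", DOCS))

# Polysemy: both senses of "bank" match, whichever one the searcher meant.
print(literal_match("bank", DOCS))
```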