
Question

I need the answer to these two questions:

1. A more-like-this query occurs when the user can click on a particular document in the result list and

tell the search engine to find documents that are similar to this one. Describe which low-level

components are used to answer this type of query and the sequence in which they are used.

2. Document filtering is an application that stores a large number of queries or user profiles and

compares these profiles to every incoming document on a feed. Documents that are sufficiently

similar to the profile are forwarded to that person via email or some other mechanism. Describe the

architecture of a filtering engine and how it may differ from a search engine.

Explanation / Answer

Hi, thanks for asking. Let me first walk through how a search engine works, since that background is needed for both questions.

Search engines match queries against an index that they create. The index consists of the words in each document, plus pointers to their locations within the documents. This is called an inverted file. A search engine has four essential modules:

Document Processor
The document processor prepares, processes, and inputs the documents, pages, or sites that users search against. The document processor performs some or all of the following steps:


Steps 1-3: Preprocessing. While essential and potentially important in affecting the outcome of a search, these first three steps simply standardize the multiple formats encountered when deriving documents from various providers or handling various Web sites. The steps serve to merge all the data into a single consistent data structure that all the downstream processes can handle. The more sophisticated the later steps of document processing are, the more important a well-formed, consistent format becomes. Step two is important because the pointers stored in the inverted file will enable a system to retrieve various sized units — either site, page, document, section, paragraph, or sentence.

Step 4: Identify elements to index. Identifying potential indexable elements in documents dramatically affects the nature and quality of the document representation that the engine will search against. In designing the system, we must define the word "term." Is it the alpha-numeric characters between blank spaces or punctuation? If so, what about non-compositional phrases (phrases in which the separate words do not convey the meaning of the phrase, like "skunk works" or "hot dog"), multi-word proper names, or inter-word symbols such as hyphens or apostrophes that can denote the difference between "small business men" versus "small-business men"? Each search engine depends on a set of rules that its document processor must execute to determine what action is to be taken by the "tokenizer," i.e., the software used to define a term suitable for indexing.
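As a sketch, a tokenizer of the kind described above might look like the following. The regular expression and its rules are illustrative assumptions, not any particular engine's policy; note how it keeps intra-word hyphens and apostrophes, so "small-business" survives as one term:

```python
import re

# Illustrative tokenizer: lower-case the text, keep intra-word
# hyphens and apostrophes, and treat everything else as a boundary.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def tokenize(text: str) -> list[str]:
    """Split text into index-term candidates."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("Small-business men aren't small business men."))
# ['small-business', 'men', "aren't", 'small', 'business', 'men']
```

A real tokenizer would add further rules for phrases, proper names, and numbers, but the core decision — what counts as a term — is encoded in that one pattern.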

Step 5: Deleting stop words. This step helps save system resources by eliminating from further processing, as well as potential matching, those terms that have little value in finding useful documents in response to a customer's query. This step mattered much more before memory became cheap and systems fast, but since stop words may comprise up to 40 percent of the text words in a document, it still has some significance. A stop word list typically consists of those word classes known to convey little substantive meaning, such as articles (a, the), conjunctions (and, but), interjections (oh, but), prepositions (in, over), pronouns (he, it), and forms of the "to be" verb (is, are). To delete stop words, an algorithm compares index term candidates in the documents against a stop word list and eliminates those terms from inclusion in the index for searching.
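The stop-word deletion step can be sketched as a simple set lookup. The stop list below is a tiny illustrative sample drawn from the word classes named above, not a production list:

```python
# Illustrative stop list; real engines use lists of a few hundred words.
STOP_WORDS = {"a", "the", "and", "but", "oh", "in", "over", "he", "it", "is", "are"}

def remove_stop_words(terms: list[str]) -> list[str]:
    """Drop terms that appear on the stop list."""
    return [t for t in terms if t not in STOP_WORDS]

print(remove_stop_words(["the", "index", "is", "a", "data", "structure"]))
# ['index', 'data', 'structure']
```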

Step 6: Term Stemming. Stemming removes word suffixes, perhaps recursively in layer after layer of processing. The process has two goals. In terms of efficiency, stemming reduces the number of unique words in the index, which in turn reduces the storage space required for the index and speeds up the search process. In terms of effectiveness, stemming improves recall by reducing all forms of the word to a base or stemmed form. For example, if a user asks for analyze, they may also want documents which contain analysis, analyzing, analyzer, analyzes, and analyzed. Therefore, the document processor stems document terms to analy- so that documents which include various forms of analy- will have equal likelihood of being retrieved; this would not occur if the engine only indexed variant forms separately and required the user to enter all. Of course, stemming does have a downside. It may negatively affect precision in that all forms of a stem will match, when, in fact, a successful query for the user would have come from matching only the word form actually used in the query.

Systems may implement either a strong stemming algorithm or a weak stemming algorithm. A strong stemming algorithm will strip off both inflectional suffixes (-s, -es, -ed) and derivational suffixes (-able, -aciousness, -ability), while a weak stemming algorithm will strip off only the inflectional suffixes (-s, -es, -ed).
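The contrast between weak and strong stemming can be sketched with a naive suffix stripper. The suffix lists are the ones given in the text; real stemmers, such as Porter's algorithm, apply ordered rewrite rules with more careful length and context checks:

```python
# Naive suffix-stripping sketch, for illustration only.
INFLECTIONAL = ("es", "ed", "s")                   # weak stemming strips only these
DERIVATIONAL = ("aciousness", "ability", "able")   # strong stemming also strips these

def strip_suffixes(word: str, suffixes: tuple[str, ...]) -> str:
    """Remove the first matching suffix, keeping a minimum stem length."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def weak_stem(word: str) -> str:
    return strip_suffixes(word, INFLECTIONAL)

def strong_stem(word: str) -> str:
    return strip_suffixes(strip_suffixes(word, INFLECTIONAL), DERIVATIONAL)

print(weak_stem("analyzed"))      # analyz
print(strong_stem("capability"))  # cap
```

Note the trade-off the text describes: the strong stemmer conflates more word forms (higher recall) at the cost of matching forms the user never intended (lower precision).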

Step 7: Extract index entries. Having completed steps 1 through 6, the document processor extracts the remaining entries from the original document. For example, the following paragraph shows the full text sent to a search engine for processing:

Milosevic's comments, carried by the official news agency Tanjug, cast doubt over the governments at the talks, which the international community has called to try to prevent an all-out war in the Serbian province. "President Milosevic said it was well known that Serbia and Yugoslavia were firmly committed to resolving problems in Kosovo, which is an integral part of Serbia, peacefully in Serbia with the participation of the representatives of all ethnic communities," Tanjug said. Milosevic was speaking during a meeting with British Foreign Secretary Robin Cook, who delivered an ultimatum to attend negotiations in a week's time on an autonomy proposal for Kosovo with ethnic Albanian leaders from the province. Cook earlier told a conference that Milosevic had agreed to study the proposal.

Steps 1 to 6 reduce this text for searching to the following:

Milosevic comm carri offic new agen Tanjug cast doubt govern talk interna commun call try prevent all-out war Serb province President Milosevic said well known Serbia Yugoslavia firm commit resolv problem Kosovo integr part Serbia peace Serbia particip representa ethnic commun Tanjug said Milosevic speak meeti British Foreign Secretary Robin Cook deliver ultimat attend negoti week time autonomy propos Kosovo ethnic Alban lead province Cook earl told conference Milosevic agree study propos.

The output of step 7 is then inserted and stored in an inverted file that lists the index entries and an indication of their position and frequency of occurrence. The specific nature of the index entries, however, will vary based on the decision in Step 4 concerning what constitutes an "indexable term." More sophisticated document processors will have phrase recognizers, as well as Named Entity recognizers and Categorizers, to ensure index entries such as Milosevic are tagged as a Person and entries such as Yugoslavia and Serbia as Countries.

Step 8: Term weight assignment. Weights are assigned to terms in the index file. The simplest search engines just assign a binary weight: 1 for presence and 0 for absence. The more sophisticated the search engine, the more complex the weighting scheme. A more sophisticated scheme weights each term by its frequency of occurrence in the document; length-normalizing those frequencies is more sophisticated still. Extensive experience in information retrieval research over many years has shown that one of the most effective weightings is "tf/idf." This algorithm measures the frequency of occurrence of each term within a document, then compares that frequency against the frequency of occurrence in the entire database.

Not all terms are good "discriminators" — that is, all terms do not single out one document from another very well. A simple example would be the word "the." This word appears in too many documents to help distinguish one from another. A less obvious example would be the word "antibiotic." In a sports database when we compare each document to the database as a whole, the term "antibiotic" would probably be a good discriminator among documents, and therefore would be assigned a high weight. Conversely, in a database devoted to health or medicine, "antibiotic" would probably be a poor discriminator, since it occurs very often. The TF/IDF weighting scheme assigns higher weights to those terms that really distinguish one document from the others.
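The tf/idf idea above can be sketched over a toy corpus. This uses one common formulation, raw term frequency times log(N/df); real engines differ in normalization and smoothing. The toy documents mirror the text's example: in a sports-flavored corpus, "antibiotic" is rare and so discriminates well, while "goal" is common and so scores lower:

```python
import math

# Toy corpus: already tokenized, stop-worded, and stemmed.
docs = [
    ["antibiotic", "cures", "infection"],
    ["match", "score", "goal"],
    ["goal", "keeper", "save", "goal"],
]

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """Raw term frequency times inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)   # document frequency
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# "antibiotic" appears in 1 of 3 docs; "goal" in 2 of 3.
print(tf_idf("antibiotic", docs[0], docs))  # tf=1, idf=log(3/1)
print(tf_idf("goal", docs[2], docs))        # tf=2, idf=log(3/2)
```

Even with tf = 2 in its document, "goal" ends up weighted below the rarer "antibiotic", which is exactly the discriminator effect the text describes.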

Step 9: Create index. The index or inverted file is the internal data structure that stores the index information and that will be searched for each query. Inverted files range from a simple listing of every alpha-numeric sequence in a set of documents/pages being indexed along with the overall identifying numbers of the documents in which the sequence occurs, to a more linguistically complex list of entries, the tf/idf weights, and pointers to where inside each document the term occurs. The more complete the information in the index, the better the search results.
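A minimal inverted file of the kind described can be sketched as a dictionary mapping each term to the documents and positions where it occurs; frequency follows directly from the position lists. The structure is illustrative, not any engine's actual on-disk format:

```python
from collections import defaultdict

def build_index(docs: list[list[str]]) -> dict:
    """Build term -> {doc_id: [positions]} from tokenized documents."""
    index: dict = defaultdict(dict)
    for doc_id, terms in enumerate(docs):
        for pos, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = [
    ["milosevic", "comment", "kosovo"],
    ["kosovo", "talks", "kosovo"],
]
index = build_index(docs)
print(index["kosovo"])   # {0: [2], 1: [0, 2]}
```

A query term is then answered by one dictionary lookup, which is why the inverted file is the central data structure of the engine; richer indexes add tf/idf weights and entity tags alongside the position lists.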

Document Filtering

Filter by Attribute

Filtering by attribute enables you to select results based on the structured data attached to a page.

To filter by attribute, add a more:pagemap:TYPE-NAME:VALUE operator to a search query. This restricts search results to pages that have structured data exactly matching that type, name, and value. (Custom Search will convert up to 200 attributes per page.) Attributes should not be more than 128 characters long. You can generalize this operator by omitting VALUE to match all instances of the named field, or by omitting -NAME:VALUE to match all objects of a given type.

To see how the complete operator is constructed from structured data, recall the example we used earlier:

[halloween more:pagemap:document-author:lisamorton]

Breaking down the more:pagemap:document-author:lisamorton restriction in more detail: the more: operator is what Custom Search uses for refinement labels, the pagemap: part of the refinement tells us to refine results by specific attributes in the indexed PageMaps, and the remaining elements of the operator—document-author and lisamorton—specify the content the restriction drills down into. Recall the PageMap from the example:

<PageMap>
  <DataObject type="document">
    <Attribute name="title">The Five Scariest Traditional Halloween Stories</Attribute>
    <Attribute name="author">lisamorton</Attribute>
  </DataObject>
</PageMap>

The document-author: qualifier of the operator tells us to look for the DataObject with type document with an Attribute named author. This structured data key is followed by the value lisamorton, which must match exactly the value of the Attribute to be returned in a search containing this restriction.

The pagemap: part of the operator can be abbreviated to p:, so the same restriction can also be written as:

more:p:document-author:lisamorton

When filtering by Attribute, you can create more complex filters (and shorter commands) by using a compact query. For instance, you could add the following PageMap for a URL:

<PageMap>
  <DataObject type="document">
    <Attribute name="keywords">horror</Attribute>
    <Attribute name="keywords">fiction</Attribute>
    <Attribute name="keywords">Irish</Attribute>
  </DataObject>
</PageMap>

To retrieve results for the query "Irish AND fiction", use the following:

more:p:document-keywords:irish*fiction

This is equivalent to more:pagemap:document-keywords:Irish more:pagemap:document-keywords:fiction.

To retrieve the results for "Irish AND (fiction OR horror)", use the following:

more:p:document-keywords:irish*fiction,irish*horror
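The boolean semantics of these compact queries—the comma acting as OR across groups and the asterisk as AND within a group—can be sketched as follows. This illustrates the matching logic only; it is not Custom Search's implementation:

```python
def matches(compact_query: str, page_keywords: list[str]) -> bool:
    """Evaluate a compact filter like 'irish*fiction,irish*horror'
    against a page's keyword attributes: OR over comma-separated
    groups, AND over asterisk-joined terms within a group."""
    keywords = {k.lower() for k in page_keywords}
    groups = compact_query.lower().split(",")
    return any(all(term in keywords for term in group.split("*"))
               for group in groups)

print(matches("irish*fiction,irish*horror", ["horror", "fiction", "Irish"]))  # True
print(matches("irish*fiction,irish*horror", ["Irish", "poetry"]))             # False
```

The first page matches because it satisfies the irish AND fiction group; the second matches neither group, so it is filtered out.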

Thanks for asking.