INFORMATION RETRIEVAL Suggest what normalized form should be used for these word
Question
INFORMATION RETRIEVAL
Suggest what normalized form should be used for these words (including the word itself as a possibility)
a. ‘Cos
b. Shi’ite
c. cont’d
d. Hawai’i
e. O’Rourke
The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs, would you argue, should not be conflated? Give a one-sentence reasoning that justifies your response.
a. abandon/abandonment
b. absorbency/absorbent
c. marketing/markets
d. university/universe
e. volume/volumes
A more-like-this query occurs when the user can click on a particular document in the result list and tell the search engine to find documents that are similar to this one. Describe which low-level components are used to answer this type of query and the sequence in which they are used.
Document filtering is an application that stores a large number of queries or user profiles and compares these profiles to every incoming document on a feed. Documents that are sufficiently similar to the profile are forwarded to that person via email or some other mechanism. Describe the architecture of a filtering engine and how it may differ from a search engine.
Why is it better to partition hosts (rather than individual URLs) between the nodes of a distributed crawl system?
if you can answer any of these
Explanation / Answer
It is better to partition hosts (rather than individual URLs) between the nodes of a distributed crawl system because a host address usually corresponds directly to the physical location of the host, whereas a URL may have nothing to do with it.
For several reasons, a URL alone does not tell us where the corresponding physical machine (or cluster) is located:
1. Many international domains are in use throughout the world, e.g. google.com and google.net.
2. Many countries' top-level domain registries allow domains to be sold to residents of other countries. For example, a UK resident can buy a domain in the US zone, such as google.us, without intending to serve mostly US users or to host it in the USA.
3. Even when a domain is bought in the national zone to host a website for the local community, it is sometimes better to keep the server abroad for savings, security, or other reasons.
Thus, if the work is distributed by individual URL, every node ends up crawling servers all over the world, which decreases performance. Partitioning by host has the opposite effect, because each node deals only with its own set of servers.
So it is better to partition hosts rather than individual URLs.
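As a rough illustration of what partitioning by host could look like, here is a minimal Python sketch. It assumes a fixed number of crawler nodes and assigns each URL by hashing only its host name, so every URL on the same host lands on the same node. The function name node_for_url and the choice of SHA-1 are my own assumptions for the example and do not come from any particular crawler.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Pick a crawler node from the URL's host name only (hypothetical helper).

    The path and query string are ignored, so all URLs on one host
    are assigned to the same node.
    """
    host = (urlsplit(url).hostname or "").lower()
    # A stable hash (unlike Python's built-in hash()) gives the same
    # assignment on every machine and in every run.
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


if __name__ == "__main__":
    urls = [
        "http://example.com/a",
        "http://example.com/b?page=2",
        "http://example.org/index.html",
    ]
    for url in urls:
        print(node_for_url(url, 4), url)
```

The two example.com URLs always print the same node number. Which node example.org gets depends on the hash, so it may or may not match.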