Henzinger et al.
Users of web search engines tend to examine only the first page of search results. For commercially oriented web sites whose income depends on their traffic, it is therefore in their interest to be ranked within the top 10 results for queries relevant to the content of the site.
To achieve high rankings, authors use a text-based approach, a link-based approach, a cloaking approach, or a combination thereof.
Traditional research in information retrieval has not had to deal with this problem of malicious content in the corpora.
The web is full of noisy, low-quality, unreliable, and indeed contradictory content. In designing a high-quality search engine, one has to start with the assumption that a typical document cannot be "trusted" in isolation; rather, it is the synthesis of a large number of low-quality documents that provides the best set of results.
Layout information in HTML may seem of limited utility, especially compared to information contained in languages like XML that can be used to tag content, but in fact it is a particularly valuable source of meta-data.
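To make that concrete, here is a minimal sketch of my own (not anything from the paper) of using HTML layout as implicit metadata: terms that occur in a title or heading get a larger indexing weight than terms in ordinary body text. The boost values are invented for illustration.

from collections import Counter
from html.parser import HTMLParser

# Assumed boosts (illustrative numbers, not from the paper): terms that
# appear in more prominent elements count for more.
TAG_BOOST = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5}

class LayoutWeighter(HTMLParser):
    """Accumulate per-term weights, boosted by the enclosing HTML element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # stack of currently open tags
        self.weights = Counter()   # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerant of sloppy nesting).
        if tag in self.open_tags:
            while self.open_tags and self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        boost = max((TAG_BOOST.get(t, 1.0) for t in self.open_tags), default=1.0)
        for term in data.lower().split():
            self.weights[term] += boost

weighter = LayoutWeighter()
weighter.feed("<html><head><title>chess openings</title></head>"
              "<body><h1>the sicilian defence</h1><p>some ordinary body text</p></body></html>")
print(weighter.weights.most_common(3))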
There are two ways to try to improve ranking. One is to concentrate on a small set of keywords and try to improve the perceived relevance for that set of keywords. The other is to try to increase the number of keywords for which the document is perceived as relevant by a search engine.
A common approach is for an author to put a link farm at the bottom of every page in a site, where a link farm is a collection of links that points to every other page in that site, or indeed to any site that the author controls.
Doorway pages are web pages that consist entirely of links. They are not intended to be viewed by humans; rather, they are constructed in a way that makes it very likely that search engines will discover them.
Cloaking involves serving entirely different content to a search engine crawler than to other users.
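One simple way to test for cloaking (a sketch of my own, not a method from the paper): fetch the same URL once with a browser-like User-Agent and once with a crawler-like one, and compare what comes back. The URL and User-Agent strings are illustrative, and a raw hash comparison will misfire on pages with dynamic content.

import hashlib
import urllib.request

def fetch_hash(url, user_agent):
    """Fetch a URL with the given User-Agent and return a hash of the body."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return hashlib.sha1(resp.read()).hexdigest()

def looks_cloaked(url):
    """Flag a URL whose content differs between a crawler UA and a browser UA.

    A hash comparison is crude; a real check would compare shingles or
    significant terms so that ordinary dynamic content is not flagged.
    """
    as_browser = fetch_hash(url, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    as_crawler = fetch_hash(url, "Googlebot/2.1 (+http://www.google.com/bot.html)")
    return as_browser != as_crawler

# Example with a hypothetical URL:
# print(looks_cloaked("http://example.com/"))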
While there has been a great deal of research on determining the relevance of documents, the issue of document quality or accuracy has not received much attention.
Another promising area of research is to combine established link-analysis quality judgments with text-based judgments.
Three assumed web conventions:
1) Anchor text is meant to be descriptive of the destination page.
2) If a web page author includes a link to another page, it is because the author believes that readers of the source page will find the destination page interesting and relevant.
3) META tags are currently the primary way to include metadata within HTML; the content META tag is used to describe the content of the document (see the sketch after this list).
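A small sketch of how a crawler might exploit conventions 1 and 3 (my own illustration using Python's html.parser, not code from the paper): collect the anchor text attached to each outgoing link and the description META tag, both of which can be indexed as descriptions of the destination pages or of the page itself.

from html.parser import HTMLParser

class ConventionExtractor(HTMLParser):
    """Collect anchor text per link target and the description META tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []          # (href, anchor text) pairs
        self.description = None    # content of <meta name="description">
        self._current_href = None
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._current_href = attrs["href"]
            self._current_text = []
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_data(self, data):
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.anchors.append((self._current_href, "".join(self._current_text).strip()))
            self._current_href = None

p = ConventionExtractor()
p.feed('<meta name="description" content="Notes on web search engines">'
       '<a href="/crawling.html">how crawlers work</a>')
print(p.description)   # Notes on web search engines
print(p.anchors)       # [('/crawling.html', 'how crawlers work')]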
Duplicate hosts are the single largest source of duplicate pages on the web, so solving the duplicate hosts problem can result in a significantly improved web crawler.
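One crude way to find duphost candidates (a sketch under my own assumptions, not the paper's method): sample the same few paths from each host and group hosts whose sampled content is byte-for-byte identical. Real systems compare fuzzier fingerprints, but the grouping step looks roughly like this.

import hashlib
from collections import defaultdict

def host_signature(pages):
    """Fingerprint a host from the content of a few sampled pages.

    `pages` maps a sampled path (e.g. "/", "/about") to the page body.
    Hosts returning the same bodies for the same paths share a signature.
    """
    digest = hashlib.sha1()
    for path in sorted(pages):
        digest.update(path.encode())
        digest.update(hashlib.sha1(pages[path].encode()).digest())
    return digest.hexdigest()

def group_duphosts(crawled):
    """Group host names whose sampled pages are identical (duphost candidates)."""
    groups = defaultdict(list)
    for host, pages in crawled.items():
        groups[host_signature(pages)].append(host)
    return [hosts for hosts in groups.values() if len(hosts) > 1]

# Hypothetical crawl sample: the www and bare hostnames serve identical content.
crawled = {
    "example.com":     {"/": "<html>welcome</html>", "/about": "<html>about us</html>"},
    "www.example.com": {"/": "<html>welcome</html>", "/about": "<html>about us</html>"},
    "other.org":       {"/": "<html>something else</html>"},
}
print(group_duphosts(crawled))   # [['example.com', 'www.example.com']]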
Vaguely-structured data: information on these web pages is not structured in a database sense; typically it is much closer to prose than to data, but it does have some structure, often unintentional, exhibited through the use of HTML markup. It is not typically the intent of the web page author to describe the document's semantics.
Hawking, pt. 1
Search engines cannot and should not index every page on the web.
Crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue. When the crawler fetches the page, it scans the contents for links to other URLs and adds each previously unseen URL to the queue.
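The fetch-scan-enqueue loop, boiled down to a single-threaded sketch (real crawlers are massively parallel and far more careful about politeness, URL normalization, and error handling; all names here are my own):

from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: fetch the URL at the head of the queue,
    extract its links, and enqueue every previously unseen URL."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    while queue and len(seen) <= max_pages:   # rough cap on crawl size
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                          # fetch failures are routine; skip
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# Example with a hypothetical seed:
# crawl(["http://example.com/"], max_pages=10)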
Even hundredfold parallelism is not sufficient to achieve the necessary crawling rate.
Crawlers check a site's robots.txt file to determine whether the webmaster has specified that some or all of the site should not be crawled.
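Python's standard library already implements the robots.txt check, so a polite crawler can consult it before every fetch; a minimal sketch with a hypothetical host and user agent:

from urllib import robotparser

# Fetch and parse the site's robots.txt once per host (hypothetical host).
rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# Ask whether our crawler's user agent may fetch a given URL.
if rp.can_fetch("MyCrawler", "http://example.com/private/page.html"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")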
Search engine companies use manual and automated analysis of link patterns and content to identify spam sites that are then included in a blacklist.
Crawlers are highly complex parallel systems, communicating with millions of different web servers, among which can be found every conceivable failure mode, all manner of deliberate and accidental crawler traps, and every variety of noncompliance with published standards.
Hawking, pt. 2
Search engines use an inverted file to rapidly identify the documents that contain a particular indexing term (a word or phrase).
In the first phase, scanning, the indexer scans the text of each input document and writes a temporary file of (term number, document number) entries.
In the second phase, inversion, the indexer sorts the temporary file into term-number order, with the document number as the secondary sort key.
The scale of the inversion problem for a web-sized crawl is enormous.
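An in-memory caricature of the two phases (a sketch of mine; a real indexer streams the postings to temporary files on disk and compresses the result, and the function name is invented):

from itertools import groupby

def build_inverted_file(docs):
    """Toy two-phase indexer: scanning emits (term number, doc number) pairs,
    inversion sorts them into term-number order (doc number secondary)."""
    term_ids = {}

    # Phase 1: scanning. In a real indexer these pairs go to a temporary
    # file on disk; here they simply accumulate in a list.
    postings = []
    for doc_number, text in enumerate(docs):
        for word in text.lower().split():
            term_number = term_ids.setdefault(word, len(term_ids))
            postings.append((term_number, doc_number))

    # Phase 2: inversion. Sorting the pairs puts all postings for a term
    # together, ordered by document number.
    postings.sort()
    number_to_term = {n: t for t, n in term_ids.items()}
    inverted = {}
    for term_number, group in groupby(postings, key=lambda p: p[0]):
        inverted[number_to_term[term_number]] = sorted({d for _, d in group})
    return inverted

index = build_inverted_file([
    "web search engines crawl the web",
    "search engines build an inverted file",
])
print(index["search"])   # [0, 1]
print(index["web"])      # [0]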
There is a strong economic incentive for search engines to use caching to reduce the cost of answering queries.
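At its simplest, result caching is just memoization keyed on the query string; a sketch using Python's built-in LRU cache, where evaluate_query is a hypothetical stand-in for the engine's query processor:

from functools import lru_cache

def evaluate_query(query):
    # Hypothetical stand-in for the expensive evaluation against the inverted file.
    return f"results for {query!r}"

@lru_cache(maxsize=10_000)
def cached_results(query):
    """Return the result page for a query, recomputing only on a cache miss.

    Query traffic is highly repetitive, so even a modest LRU cache can
    answer a large share of queries without touching the index.
    """
    return evaluate_query(query)

print(cached_results("cheap flights"))    # computed
print(cached_results("cheap flights"))    # served from the cache
print(cached_results.cache_info())        # CacheInfo(hits=1, misses=1, ...)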
Lesk, ch. 4