Status Locality on the Web: Implications for Building Focused Collections.

Author List: Pant, Gautam; Srinivasan, Padmini

Information Systems Research, 2013, Volume 24, Issue 3, Pages 802-821.

Topical locality on the Web is the notion that pages tend to link to other topically similar pages and that such similarity decays rapidly with link distance. This supports meaningful Web browsing and searching by information consumers. It also allows topical Web crawlers, programs that fetch pages by following hyperlinks, to harvest topical subsets of the Web for applications such as vertical search and business intelligence. We show that the Web exhibits another property that we call "status locality." It is based on the notion that pages tend to link to other pages of similar status (importance) and that this status similarity also decays rapidly with link distance. Analogous to topical locality, status locality may also be exploited by Web crawlers. Collections built by such crawlers include pages that are both topically relevant and important. This capability is crucial because of the large number of Web pages addressing even niche topics. The challenge in exploiting status locality while crawling is that page importance (or status) is typically recognized through global measures computed by processing link data from billions of pages. In contrast, topical Web crawlers depend on local information based on previously downloaded pages. We solve this problem by using previously developed methods that utilize local characteristics of pages to estimate their global status. This leads to the design of new crawlers, specifically utility-biased crawlers guided by a Cobb-Douglas utility function. Our crawler experiments show that the status and topicality of Web collections present a trade-off. An adaptive version of our utility-biased crawler dynamically modifies the output elasticities of topicality and status to create Web collections that maintain high average topicality while simultaneously achieving significantly higher average status than several benchmarks, including a state-of-the-art topical crawler.
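
The idea of a utility-biased crawler can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard two-input Cobb-Douglas form, U = topicality^alpha * status^beta, with both scores in [0, 1]. The names cobb_douglas_utility, rank_frontier and adapt_elasticities are hypothetical, and so are the adaptation rule, the constraint alpha + beta = 1 and the step size, none of which the abstract specifies.

```python
def cobb_douglas_utility(topicality, status, alpha, beta):
    """Standard Cobb-Douglas form: U = topicality**alpha * status**beta.

    alpha and beta are the output elasticities of topicality and status.
    """
    return (topicality ** alpha) * (status ** beta)


def rank_frontier(candidates, alpha=0.5, beta=0.5):
    """Order candidate pages by estimated utility, best first.

    candidates: list of (url, estimated_topicality, estimated_status),
    where both estimates come from local information only (assumption).
    """
    return sorted(
        candidates,
        key=lambda c: cobb_douglas_utility(c[1], c[2], alpha, beta),
        reverse=True,
    )


def adapt_elasticities(alpha, avg_topicality, target, step=0.1):
    """Hypothetical adaptation rule (not taken from the paper).

    If the collection's average topicality falls below the target, shift
    weight toward topicality; otherwise shift weight toward status.
    Keeps alpha + beta = 1 as a simplifying assumption.
    """
    if avg_topicality < target:
        alpha = min(1.0, alpha + step)
    else:
        alpha = max(0.0, alpha - step)
    return alpha, 1.0 - alpha


if __name__ == "__main__":
    frontier = [
        ("a", 0.9, 0.1),  # highly topical, low status
        ("b", 0.6, 0.6),  # balanced
        ("c", 0.2, 0.9),  # weakly topical, high status
    ]
    print([url for url, _, _ in rank_frontier(frontier)])
    print(adapt_elasticities(0.5, avg_topicality=0.4, target=0.6))
```

With equal elasticities the balanced page ranks first, which reflects the trade-off the abstract describes. Setting beta to 0 reduces the ranking to a purely topical one.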

Keywords: homophily; predictive models; status locality; topical crawlers

List of Topics

#33	0.314	web site sites content usability page status pages metrics browsing design use web-based guidelines results implications portal loyalty navigability addition
#291	0.193	local global link complex view links particularly need thought number supports efforts difficult previously linked achieving simple poor individual rise
#26	0.115	business large organizations using work changing rapidly make today's available designed need increasingly recent manage years activity important allow achieve
#37	0.083	intelligence business discovery framework text knowledge new existing visualization based analyzing mining genetic algorithms related techniques large proposed novel artificial
#215	0.069	data classification statistical regression mining models neural methods using analysis techniques performance predictive networks accuracy method variables prediction problem measure
#262	0.052	impact data effect set propensity potential unique increase matching use selection score results self-selection heterogeneity evidence measure associated estimate leads