Learning URL Patterns for Webpage De-duplication
Our Price
₹2,500.00
10000 in stock
Support
Ready to Ship
Description
Duplicate documents on the World Wide Web adversely affect crawling, indexing and relevance, which are the core building blocks of web search. We use a set of techniques to mine rules from URLs and apply these rules for de-duplication using just the URL strings, without fetching the content explicitly. Our technique mines the crawl logs and uses clusters of similar pages to extract transformation rules, which are then used to normalize the URLs belonging to each cluster. Preserving every mined rule for de-duplication is not efficient because of the large number of such rules. We propose a technique for extracting host-specific delimiters and tokens from URLs. We extend pairwise rule generation to perform source and target URL selection. We also introduce a machine-learning-based generalization technique that improves the precision of the rules. The rule extraction techniques are robust against website-specific URL conventions. Collectively, these techniques form a robust solution to the de-duplication problem.
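To make the idea of mining a normalization rule from a cluster of duplicate URLs more concrete, here is a minimal Java sketch. It is not the project's implementation: the class name UrlRuleSketch, its methods and the example.com URLs are all hypothetical. It also simplifies heavily by assuming fixed delimiters ('?', '&', '=') instead of host-specific ones, by treating a rule as nothing more than a set of query keys to drop, and by omitting the generalization step entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; names and rule format are assumptions.
final class UrlRuleSketch {

    // Split the query string into key/value tokens.
    // Assumes the fixed delimiters '&' and '='; URL fragments are not handled.
    static Map<String, String> queryTokens(String url) {
        Map<String, String> tokens = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return tokens;
        }
        for (String pair : url.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            tokens.put(key, value);
        }
        return tokens;
    }

    // Compare the first URL of a duplicate cluster with every other member.
    // A key whose value differs, or that is missing on one side, cannot
    // determine the page content, so it is recorded as irrelevant.
    static Set<String> mineIrrelevantKeys(List<String> cluster) {
        Set<String> irrelevant = new LinkedHashSet<>();
        if (cluster.isEmpty()) {
            return irrelevant;
        }
        Map<String, String> first = queryTokens(cluster.get(0));
        for (String url : cluster.subList(1, cluster.size())) {
            Map<String, String> other = queryTokens(url);
            Set<String> keys = new LinkedHashSet<>(first.keySet());
            keys.addAll(other.keySet());
            for (String key : keys) {
                String a = first.get(key);
                if (a == null || !a.equals(other.get(key))) {
                    irrelevant.add(key);
                }
            }
        }
        return irrelevant;
    }

    // Apply the mined rule: drop the irrelevant keys and rebuild the URL,
    // keeping the remaining tokens in their original order.
    static String normalize(String url, Set<String> irrelevantKeys) {
        int q = url.indexOf('?');
        String base = q < 0 ? url : url.substring(0, q);
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : queryTokens(url).entrySet()) {
            if (!irrelevantKeys.contains(e.getKey())) {
                kept.add(e.getKey() + "=" + e.getValue());
            }
        }
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }
}
```

For example, if a cluster contains http://example.com/story?id=42&sid=abc and http://example.com/story?id=42&sid=xyz&ref=home, mineIrrelevantKeys marks sid and ref as irrelevant, and normalize then maps an unseen URL such as http://example.com/story?id=42&sid=new&ref=mail to http://example.com/story?id=42 without fetching the page.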
Tags: 2012, Java, Network Projects


