Removing DUST Using Multiple Alignment of Sequences
Search engines are the principal means of retrieving information on the web, but the lists of documents they return often contain a high percentage of duplicate and near-duplicate results, so there is a need to improve the quality of search results. Some current search engines use filtering algorithms that eliminate duplicate and near-duplicate documents to save users' time and effort. Identifying similar or near-duplicate pairs in a large collection is a significant problem with widespread applications: the presence of duplicate documents on the World Wide Web adversely affects crawling, indexing, and relevance ranking, which are the core building blocks of web search.

This system presents a set of techniques to mine rules from URLs and to use these learnt rules for de-duplication from the URL strings alone, without fetching the page content. Existing systems used a very simple alignment heuristic to deal with irrelevant URL components; that heuristic is not publicly available and was not described in enough detail to be reimplemented. This project proposes multiple sequence alignment as a way to avoid the problems of simple pairwise rule extraction. The proposed system, DUSTER, finds and validates rules by splitting the available URLs into training and validation sets. DUSTER learns normalization rules that are highly precise in converting distinct URLs that refer to the same content (DUST: Different URLs with Similar Text) into a common canonical form, making duplicates easy to detect.
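The idea of rule-based URL canonicalization can be sketched as follows. This is a minimal illustration, not DUSTER's actual learned rules: the three rewrite patterns below (dropping session tokens, collapsing default index pages, stripping a leading "www.") are hypothetical examples of the kind of normalization rules such a system might produce, and the helper names `canonicalize` and `group_dust` are our own.

```python
import re
from collections import defaultdict

# Hypothetical normalization rules of the kind a DUST-removal system
# might learn. Each rule rewrites a URL toward a canonical form; these
# patterns are illustrative, not rules actually produced by DUSTER.
RULES = [
    (re.compile(r"[?&](sessionid|sid)=[^&]*"), ""),   # drop session tokens
    (re.compile(r"/index\.html?$"), "/"),             # default page -> directory
    (re.compile(r"^http://www\."), "http://"),        # strip leading "www."
]

def canonicalize(url: str) -> str:
    """Apply every rule in order to produce a canonical URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url

def group_dust(urls):
    """Group URLs that normalize to the same canonical form.

    URLs that end up in the same group are candidate duplicates,
    detected without fetching any page content.
    """
    groups = defaultdict(list)
    for url in urls:
        groups[canonicalize(url)].append(url)
    return groups

urls = [
    "http://www.example.com/index.html",
    "http://example.com/",
    "http://example.com/page?sid=abc123",
    "http://example.com/page",
]
for canon, members in group_dust(urls).items():
    print(canon, "<-", members)
```

In the full approach described above, the rules themselves are induced by aligning many URLs known to point to the same content (via multiple sequence alignment) and then validated on a held-out set, rather than being hand-written as in this sketch.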