A Similarity Measure for Text Classification and Clustering
Description
This paper introduces a measure of similarity between two clusterings of the same dataset produced by two different algorithms, or even by the same algorithm (K-means, for instance, usually produces different results on the same dataset when started from different initializations). We then apply the measure to calculate the similarity between pairs of clusterings, with special interest in comparing various machine clusterings against a human clustering of the same dataset. The similarity measure can thus be used to identify the best clustering algorithm (the one most similar to the human clustering) for a specific problem at hand. Experimental results on the text categorization of a Portuguese corpus (for which a translation-into-English approach is used) are presented, as well as results on a well-known benchmark dataset. The significance and other potential applications of the proposed measure are discussed.

Document similarity measures are crucial components of many text analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: they estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and uses machine learning techniques to learn from human judgments how to combine the two. Experiments show that the new measure produces values for documents that are more consistent with people's judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.
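To make the idea of comparing two clusterings of the same dataset concrete, here is a minimal sketch in Python. The description does not spell out the proposed measure, so the sketch swaps in the classic Rand index as a stand-in: the fraction of item pairs on which two clusterings agree (both put the pair in the same cluster, or both separate it). The function name `rand_index` and the label lists `human`, `kmeans_run`, and `other_run` are hypothetical, made up for illustration only.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand index between two clusterings of the same items.

    NOTE: this is a standard stand-in, not the measure proposed in the
    paper. labels_a[i] and labels_b[i] are the cluster ids assigned to
    item i by the two clusterings being compared.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two items to form a pair")

    agree = 0
    total = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair counts as an agreement when both clusterings group it
        # together, or both keep it apart.
        if same_a == same_b:
            agree += 1
        total += 1
    return agree / total


# Hypothetical toy data: six documents, one human grouping, two machine runs.
human = [0, 0, 1, 1, 2, 2]
kmeans_run = [1, 1, 0, 0, 2, 2]   # same grouping as human, different label names
other_run = [0, 0, 0, 1, 1, 1]    # a coarser, different grouping

print(rand_index(human, kmeans_run))
print(rand_index(human, other_run))
```

Because the score depends only on which items are grouped together, not on the cluster ids themselves, a run that reproduces the human grouping under different label names still gets the maximum score. Ranking several algorithms by their score against the human clustering is one simple way to pick the run that is most similar to the human one, in the spirit of the description above.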
Tags: 2014, Data Mining Projects, Java