Description
E-mail communication is indispensable nowadays, but the e-mail spam problem continues growing drastically. In recent years, the notion of collaborative spam filtering with near-duplicate similarity matching scheme has been widely discussed. The primary idea of the similarity matching scheme for spam detection is to maintain a known spam database, formed by user feedback, to block subsequent near-duplicate spams. On purpose of achieving efficient similarity matching and reducing storage utilization, prior works mainly represent each e-mail by a succinct abstraction derived from e-mail content text. However, these abstractions of e-mails cannot fully catch the evolving nature of spams, and are thus not effective enough in near-duplicate detection. In this project, we propose a novel e-mail abstraction scheme, which considers e-mail layout structure to represent e-mails. We present a procedure to generate the e-mail abstraction using HTML content in e-mail, and this newly devised abstraction can more effectively capture the near-duplicate phenomenon of spams. Moreover, we design a complete spam detection system (standing for COllaborative Spam DEtection System), which possesses an efficient near-duplicate matching scheme and a progressive update scheme. The progressive update scheme enables system to keep the most up-to-date information for near-duplicate detection. We evaluate on a live data set collected from a real e-mail server and show that our system outperforms the prior approaches in detection results and is applicable to the real world.
Only logged in customers who have purchased this product may leave a review.
Reviews
There are no reviews yet.