Combining Tag and Value Similarity for Data Extraction and Alignment
Data extraction is the process of retrieving data from data sources. Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, and spool files. Extracting data from these unstructured sources has grown into a considerable technical challenge. Whereas data extraction historically had to deal with changes in physical hardware formats, most current data extraction deals with unstructured data sources and differing software formats. This paper focuses on the problem of automatically extracting data records that are encoded in the query result pages generated by web databases. The goal of web database data extraction is to remove any irrelevant information from the query result page, extract the query result records (referred to as QRRs in this paper) from the page, and align the extracted QRRs into a table such that the data values belonging to the same attribute are placed into the same table column. We present a novel data extraction and alignment method called CTVS that combines both tag and value similarity. CTVS automatically extracts data from query result pages by first identifying and segmenting the QRRs in the query result pages and then aligning the segmented QRRs into a table in which the data values from the same attribute are put into the same column. We also design a new record alignment algorithm that aligns the attributes in a record, first pairwise and then holistically, by combining the tag and data value similarity information.
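To make the idea of combining tag and value similarity concrete, the following Python sketch aligns two already segmented QRRs pairwise and places their values into table columns. It is an illustration under stated assumptions, not the CTVS algorithm itself: the record representation (a list of tag-path and text-value pairs), the function names, the equal weighting of tag and value similarity, and the 0.6 threshold are all hypothetical choices made for this example, and the holistic alignment step is not shown.

```python
from difflib import SequenceMatcher

# Assumed representation: a record is a list of (tag_path, value) pairs,
# where tag_path is a tuple of HTML tag names from the record root to the leaf.

def tag_similarity(path_a, path_b):
    """Share of tags that agree when both paths are compared from the leaf upward."""
    if not path_a or not path_b:
        return 0.0
    matches = sum(1 for a, b in zip(reversed(path_a), reversed(path_b)) if a == b)
    return matches / max(len(path_a), len(path_b))

def value_similarity(value_a, value_b):
    """String similarity in [0, 1] between two data values."""
    return SequenceMatcher(None, value_a, value_b).ratio()

def combined_similarity(field_a, field_b, tag_weight=0.5):
    """Weighted mix of tag and value similarity (the weight is an arbitrary assumption)."""
    (path_a, value_a), (path_b, value_b) = field_a, field_b
    return (tag_weight * tag_similarity(path_a, path_b)
            + (1 - tag_weight) * value_similarity(value_a, value_b))

def align_pair(rec_a, rec_b, threshold=0.6):
    """Order-preserving pairwise alignment by dynamic programming.

    Returns (i, j) index pairs of fields judged to hold the same attribute.
    """
    n, m = len(rec_a), len(rec_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = combined_similarity(rec_a[i - 1], rec_b[j - 1])
            gain = s if s >= threshold else 0.0
            score[i][j] = max(score[i - 1][j - 1] + gain,
                              score[i - 1][j], score[i][j - 1])
    # Trace back through the score matrix to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = combined_similarity(rec_a[i - 1], rec_b[j - 1])
        if s >= threshold and score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def to_table(rec_a, rec_b, pairs):
    """Lay out two records as table rows; unmatched values get a column with a None gap."""
    row_a, row_b = [], []
    i = j = 0
    for pi, pj in pairs + [(len(rec_a), len(rec_b))]:  # sentinel flushes the tail
        while i < pi:
            row_a.append(rec_a[i][1]); row_b.append(None); i += 1
        while j < pj:
            row_a.append(None); row_b.append(rec_b[j][1]); j += 1
        if pi < len(rec_a):
            row_a.append(rec_a[pi][1]); row_b.append(rec_b[pj][1])
            i, j = pi + 1, pj + 1
    return [row_a, row_b]

# Two made-up QRRs from a hypothetical book search result page.
rec_a = [(("tr", "td", "a"), "Data Mining Concepts"),
         (("tr", "td", "span"), "Jiawei Han"),
         (("tr", "td", "b"), "$59.99")]
rec_b = [(("tr", "td", "a"), "Data Mining Techniques"),
         (("tr", "td", "b"), "$42.50")]

pairs = align_pair(rec_a, rec_b)
table = to_table(rec_a, rec_b, pairs)
for row in table:
    print(row)
```

In this sketch the second record has no author field, so its row receives a gap in that column while the title and price values line up. Neither signal would be reliable alone here: the two prices share few characters but sit under the same tag path, and it is the combined score that aligns them.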