Analysis of feature selection stability on high dimension and small sample data.
Feature selection is an important step when building a classifier on high-dimensional data. When the number of observations is small, feature selection tends to be unstable: two feature subsets obtained from different datasets for the same classification problem commonly show little overlap. Although this is a crucial problem, little work has been done on selection stability.
The histopathological subclassification of lung adenocarcinoma is challenging. In one study, independent lung pathologists agreed on lung adenocarcinoma subclassification in only 41% of cases. However, the favorable prognosis for bronchioloalveolar carcinoma (BAC), a histological subclass of lung adenocarcinoma, argues for refining such distinctions. In addition, metastases of non-lung origin can be difficult to distinguish from lung adenocarcinomas.
The behavior of feature selection is analyzed under various conditions, with a focus on, but not limited to, t-score based feature selection approaches and small-sample data. A reliable set of predictive genes will also contribute to a better understanding of the biological mechanism of metastasis. Several groups have published lists of predictive genes and reported good predictive performance based on them. However, the gene lists obtained by different groups for the same clinical types of patients differed widely and had very few genes in common. This lack of agreement raised doubts about the reliability and robustness of the reported predictive gene lists, and the main source of the problem was shown to be the small number of samples used to generate the lists. Here, we introduce a previously undescribed mathematical method, probably approximately correct (PAC) sorting, for evaluating the robustness of such lists.
The analysis proceeds in three steps: the first is theoretical, using a simple mathematical model; the second is empirical, based on artificial data; the third is based on real data. All three analyses lead to the same results and give a better understanding of the feature selection problem in high-dimensional data.
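
The empirical step on artificial data can be illustrated with a minimal sketch. It is not the method used in this project, only one plausible setup under stated assumptions: two-class Gaussian data in which only the first few features are informative, a Welch t-score as the ranking criterion, and the fraction of shared features between two top-k subsets as the stability measure. All function names and parameter values (`t_scores`, `top_k`, `make_data`, `mean_overlap`, the class shift of 1.0, and so on) are hypothetical.

```python
import numpy as np

def t_scores(X, y):
    """Welch t-score per feature for a two-class problem (labels 0 and 1)."""
    a, b = X[y == 0], X[y == 1]
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return num / den

def top_k(X, y, k):
    """Indices of the k features with the largest absolute t-score."""
    order = np.argsort(-np.abs(t_scores(X, y)))
    return set(order[:k].tolist())

def make_data(rng, n_per_class, n_features, n_informative, shift):
    """Assumed toy model: standard normal features; the first
    n_informative features are shifted by `shift` in class 1."""
    X = rng.standard_normal((2 * n_per_class, n_features))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, :n_informative] += shift
    return X, y

def mean_overlap(n_per_class, n_features=500, n_informative=20,
                 shift=1.0, k=20, n_pairs=5, seed=0):
    """Average fraction of features shared by the top-k subsets selected
    on two independent datasets drawn from the same model."""
    rng = np.random.default_rng(seed)
    overlaps = []
    for _ in range(n_pairs):
        s1 = top_k(*make_data(rng, n_per_class, n_features, n_informative, shift), k)
        s2 = top_k(*make_data(rng, n_per_class, n_features, n_informative, shift), k)
        overlaps.append(len(s1 & s2) / k)
    return float(np.mean(overlaps))

if __name__ == "__main__":
    # Compare selection stability for small and larger sample sizes.
    for n in (5, 20, 100):
        print(f"n_per_class={n:4d}  mean overlap={mean_overlap(n):.2f}")
```

Under these assumptions, the overlap between independently selected subsets should be low when only a few samples per class are available and should rise as the sample size grows, which is the instability described above. Other stability measures could be substituted for the simple overlap fraction.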