Support Vector Machine – Recursive Feature Elimination for Feature Selection on Multi-omics Lung Cancer Data

Authors

  • Nuraina Syaza Azman
  • Azurah A Samah
  • Ji Tong Lin
  • Hairudin Abdul Majid
  • Zuraini Ali Shah
  • Nies Hui Wen
  • Chan Weng Howe

DOI:

https://doi.org/10.36877/pmmb.a0000327

Abstract

Biological data obtained from sequencing technologies is growing exponentially. Multi-omics data is one type of biological data that exhibits high dimensionality and is therefore subject to the curse of dimensionality, which occurs when a dataset contains many features or attributes but significantly fewer samples or observations. This study focuses on mitigating the curse of dimensionality by applying Support Vector Machine – Recursive Feature Elimination (SVM-RFE) as the feature selection method to a lung cancer (LUSC) multi-omics dataset integrated from three single-omics datasets (genomics, transcriptomics, and epigenomics), and on assessing the quality of the selected feature subsets using stacked denoising autoencoder (SDAE) and variational autoencoder (VAE) deep learning classifiers. The LUSC datasets first undergo pre-processing, which includes checking for missing values, normalization, and removal of zero-variance features. The cleaned LUSC datasets are then integrated to form a multi-omics dataset. Feature selection is performed on the LUSC multi-omics data using SVM-RFE to select several feature subsets (FS). The five smallest feature subsets are then used for classification with SDAE and VAE neural networks to assess their quality. The results show that all five VAE models obtain an accuracy and AUC score of 1.000, whereas only two of the five SDAE models (FS 1000 and FS 4000) do so. The other three SDAE models have an AUC score of 0.500, indicating no capability to separate the binary class labels. The study concludes that, for this specific study, a fine-tuned supervised VAE model classifies better than the SDAE models. Additionally, FS 1000 and FS 4000 are the two best feature subsets selected by the SVM-RFE algorithm, since the SDAE and VAE models built with them achieve the best classification results.
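
The pre-processing, integration, and SVM-RFE steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes scikit-learn's RFE wrapped around a linear-kernel SVC as the SVM-RFE stand-in, min-max scaling as the normalization step, and small synthetic arrays in place of the LUSC genomics, transcriptomics, and epigenomics data. The helper names (make_toy_omics, preprocess, svm_rfe) and all sizes are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def make_toy_omics(n_samples=40, n_features=30, seed=0):
    """Synthetic stand-in for one single-omics block (hypothetical data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    y = np.repeat([0, 1], n_samples // 2)  # binary class labels
    X[y == 1, :3] += 3.0                   # first three features carry signal
    X[:, -1] = 0.0                         # one constant (zero-variance) feature
    return X, y


def preprocess(X):
    """Check for missing values, normalize, and drop zero-variance features."""
    if np.isnan(X).any():
        raise ValueError("missing values found; impute or drop them first")
    X = MinMaxScaler().fit_transform(X)    # assumed normalization method
    return VarianceThreshold(threshold=0.0).fit_transform(X)


def svm_rfe(X, y, n_keep, step=1):
    """Recursively eliminate the features with the smallest linear-SVM weights."""
    selector = RFE(SVC(kernel="linear", C=1.0),
                   n_features_to_select=n_keep, step=step)
    selector.fit(X, y)
    return selector.support_               # boolean mask of retained features


# Three toy blocks sharing the same samples, standing in for the three omics types.
blocks = [make_toy_omics(seed=s)[0] for s in (0, 1, 2)]
y = make_toy_omics()[1]

# Integrate by concatenating the cleaned blocks feature-wise.
X_multi = np.hstack([preprocess(b) for b in blocks])

# Keep a subset of 5 features (the study uses far larger subsets, e.g. FS 1000).
mask = svm_rfe(X_multi, y, n_keep=5)
X_selected = X_multi[:, mask]
```

With real data, larger step values are usually needed to keep the recursion tractable when there are tens of thousands of features, and the selected subset would then be passed to a downstream classifier for evaluation.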

Published

2023-04-04

Section

Original Research Articles