Addressing statistical challenges in the analysis of proteomics data with extremely small sample size: a simulation study
BMC Genomics volume 25, Article number: 1086 (2024)
Abstract
Background
One of the most promising approaches for early and more precise disease prediction and diagnosis is through the inclusion of proteomics data augmented with clinical data. Clinical proteomics data is often characterized by its high dimensionality and extremely limited sample size, posing a significant challenge when employing machine learning techniques for extracting only the most relevant information. Although there is a wide array of statistical techniques and numerous analysis pipelines employed in proteomics data analysis, it is unclear which of these methods produce the most efficient, reproducible, and clinically meaningful results.
Results
In this study, we compared 9 unique analysis schemes composed of different machine learning and dimensionality reduction methods for the analysis of simulated proteomics data consisting of 1317 proteins measured in 26 subjects (i.e., 13 controls and 13 cases). In scenarios where the sample size is extremely small (i.e., n < 30), all schemes resulted in exceptionally high performance metrics, indicating potential overfitting. While performance metrics did not exhibit significant differences across schemes, the sets of proteins selected as discriminatory between groups demonstrated a substantial level of heterogeneity. Despite this heterogeneity in the selected proteins, however, their biological pathways and genetic diseases exhibited similarities. A sensitivity analysis conducted using varying sample sizes indicated that, within a scheme, the stability of the set of selected biomarkers improves with larger sample sizes.
Conclusions
When the aim of the study is to identify a statistical model that best distinguishes between cohort groups using proteomics data and to uncover the biological pathways and disorders common among the selected proteins, the majority of widely used analysis pipelines perform similarly. However, if the main objective is to pinpoint a set of selected proteins that wield significant influence in discriminating cohort groups and utilize them for subsequent investigations, meticulous consideration is necessary when opting for statistical models, due to the possibility of heterogeneity in the sets of selected proteins.
Introduction
We live in an era of data overload. Technical and instrumental advances, particularly in many areas of biomedical research, have enabled us to gain more knowledge about the pathological and physiological states of diseases in less time and with less effort. At the same time, technological improvements challenge us to efficiently and effectively extract meaningful information from massive amounts of highly intercorrelated data. Thus, reliable and robust statistical methodologies for dimensionality reduction and feature selection are now more critical than they have ever been before.
Proteins are considered appealing biomarkers because they are convenient indicators of diseases and, as a result, are frequently targets of therapeutic interventions. Panels of proteins can be used to devise cost-effective and non-invasive testing tools for accurate diagnosis. Among a multitude of proteins, identifying those that are differentially expressed (in terms of expression levels) relative to the normal (or disease-free) biological state of cells is of the utmost importance in elucidating the etiology and treatment of a disease. For this reason, statisticians frequently seek a solution to the following classifier-associated question: which set of biomarkers best discriminates between subjects' disease statuses (e.g., presence or absence of a disease, stages of disease progression)?
The majority of biomedical studies have predominantly employed enzyme-linked immunosorbent assay (ELISA) technology to quantify proteins. The main mechanism of ELISA is the antigen-antibody reaction. Its advantages include the simplicity of the procedure, high specificity, and the availability of commercial antibodies [1]. Large-scale investigations using ELISA are often difficult to conduct due to limits on the number of samples that can be analyzed at the same time or due to technical constraints in quantifying low-abundance proteins [2]. To overcome these drawbacks, an aptamer-based technology, the SOMAScan assay, was developed in 2010 by SomaLogic Inc. (Boulder, CO) [3]. Aptamers are chemically synthesized, single-stranded nucleic acid molecules capable of binding to a specific target molecule. They are smaller and cheaper to produce than antibodies [4,5,6]. The SOMAScan assay can detect at least 1300 (up to nearly 11,000 with the latest version) proteins simultaneously [7, 8]. Hereafter, the data from each assay will be referred to as SOMAScan data and ELISA data, respectively.
Current proteomics statistical analyses typically progress in two sequential phases: (1) discovery and (2) validation [9,10,11,12]. The objective of the discovery phase is to identify and quantify as many (untargeted) proteins as possible. The SOMAScan assay is well suited for this phase due to its ability to generate affordable yet massive proteomics data in a short amount of time. The purpose of the validation phase is to analyze and confirm the utility of the biomarkers selected in the previous phase. An independent cohort of patients with the same condition, a limited number of proteins (fewer than 30), and a patient sample size determined by power analysis generally constitute adequate validation. For the validation phase, data from immunological assays such as ELISA or targeted mass-spectrometry assays are used [13]. Both phases seek to identify a smaller set of proteins that best discriminates between groups of interest in the cohort. The proteins that progress from the discovery to the validation phase are often determined by machine learning (ML) methods such as random forest (RF) or simply by a (parametric) t-test. However, there are several issues in this analytic process as applied to both SOMAScan and ELISA data. First, a (parametric or non-parametric) t-test for the discovery phase may not be suitable because of the strict assumptions it imposes. Second, the most discriminative proteins identified using SOMAScan data may not necessarily show similar discriminating power when examined using ELISA data. Third, because the majority of widely used ML techniques were primarily designed for large datasets, the limited sample sizes often encountered in biomedical data can present challenges. Fourth, and most importantly, an overwhelming variety of statistical techniques is utilized in each phase without an understanding of whether any one method gives more reliable results.
Consequently, even when using comparable data, the results of individual studies are highly heterogeneous, rendering comparisons unstable or even infeasible. Furthermore, apart from the aforementioned challenges, it is common to encounter clinical data with a very small number of subjects, often resulting in cases where the ratio between the number of patients and the number of proteins is less than 5%. Such a low ratio significantly impacts the accuracy of analyses, particularly those that incorporate ML techniques, and leads to numerous challenges within the analysis procedure itself.
Hypothesis
The primary goal of this study is to discover an optimal workflow for identifying biomarkers that best differentiate binary patient groups (e.g., control vs. diseased) using proteomics data derived from SOMAScan technology, under the challenging scenario of extremely small sample sizes (i.e., n < 30). To achieve this goal, we compare and assess 1-stage and multi-stage analytic schemes. The 1-stage scheme employs only one statistical classifier (i.e., a supervised ML method) to determine the top 10 or 30 discriminatory biomarkers for group separation. The multi-stage scheme incorporates multiple statistical methods, with a classifier serving as the last method.
We hypothesize that there is no substantial difference in the selected biomarkers and performance metrics between 1-stage and multi-stage schemes. If so, it can be argued that performing the analysis in more than one stage is inefficient in terms of time and cost. In this context, it is presumed that the selected biomarkers from each scheme are the most discriminatory among the groups of interest. Note that in this study, we concentrated on using only SOMAScan data for both the discovery and validation phases. There are two primary reasons for this. First, among the various technologies used to quantify protein levels, the SOMAScan assay, an aptamer-based platform, has proven successful [14,15,16,17], overcoming several drawbacks associated with ELISA. Second, in our previous study, we identified a significant correlation (i.e., Pearson correlation > 0.8) between the SOMAScan and ELISA assay results for the majority of proteins [10].
Data and methods
Analytic schemes under consideration
In this study, we defined 9 distinct analytic schemes whose primary purpose is to distinguish two groups. Each scheme differs based on the type of classifier, the use of unsupervised ML, and the presence or absence of dimensional reduction. Table 1 provides a list of the schemes under consideration in this study and brief descriptions of each.
Data description
In this study, the nine schemes mentioned above exclusively employed the SOMAScan data. The number of proteins and subjects in the SOMAScan data are 1317 and 26, respectively. Among 26 subjects, 13 subjects were diagnosed with colorectal cancer (CRC), and the remaining 13 subjects were healthy controls. For more details, we refer to Li et al. (2021) [10].
To prevent the results of this study from being data-dependent, 200 simulated datasets were constructed based on the original observed SOMAScan data, each consisting of 1317 proteins and 13 subjects each in the disease and control groups. The number of simulated datasets was restricted by the total computation cost of fitting all 9 schemes (i.e., 10–25 days). We simulated data using a multivariate truncated normal distribution [18] with minimum, maximum, mean, standard deviation, and covariance similar to those of the observed biomarkers. Note that while the levels of each protein may not be normally distributed, the choice of this distribution is based on the practicality and feasibility of the simulation study. The levels of each protein were standardized to have a mean of zero and unit standard deviation before conducting all analyses.
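The simulation step can be sketched as follows. This is a minimal, univariate Python illustration (the study itself used a multivariate truncated normal [18] that preserves the covariance between proteins); the summary statistics below are hypothetical placeholders, not values from the SOMAScan data.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_protein(mean, sd, lo, hi, n):
    """Draw n values from a normal(mean, sd) truncated to [lo, hi]
    by rejection sampling (simplified univariate stand-in for the
    multivariate truncated normal used in the study)."""
    out = np.empty(0)
    while out.size < n:
        draw = rng.normal(mean, sd, size=2 * n)
        out = np.concatenate([out, draw[(draw >= lo) & (draw <= hi)]])
    return out[:n]

# Hypothetical summary statistics for one protein, 26 subjects
x = simulate_protein(mean=5.0, sd=1.2, lo=2.0, hi=9.0, n=26)

# Standardize to mean 0 and unit SD before analysis, as in the paper
z = (x - x.mean()) / x.std()
```

In practice, one such draw would be made per protein (jointly, to respect the covariance), and the standardization applied column-wise to the full 26 × 1317 matrix.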
Unsupervised and supervised machine learning methods
The tested supervised ML methods include RF, support vector machine (SVM), Naïve Bayes (NB), and penalized logistic regression. Schemes 3 and 4 employed K-means clustering and Gaussian mixture (GM) modeling, the two most common unsupervised ML algorithms, for dimensional reduction. This is in contrast to the majority of studies, which employ unsupervised ML models for discovering hidden and intriguing patterns or clusters in unlabeled data. For the K-means clustering in Scheme 3, the number of clusters was fixed at 3 because, with 4 or more clusters, at least one of the generated clusters contained 10 or fewer biomarkers. Upon visual inspection, we found that the smallest cluster did not significantly contribute to the overall protein separability. We calculated the Euclidean distance between the centroids of the clusters constructed by the unsupervised ML models and the biomarkers belonging to each cluster. Then, we selected a maximum of 100 biomarkers with the shortest distance from each cluster. Every ML technique possesses distinct parameters to be tuned, commonly referred to as hyperparameters. A comprehensive explanation of these parameters for the abovementioned ML methods can be found in the Supplementary document.
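The K-means-based reduction described above (clustering the proteins themselves and retaining at most 100 per cluster closest to its centroid) can be sketched as follows. The plain Lloyd's-algorithm implementation and the toy data dimensions are stand-ins for illustration, not the R packages used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm: X is (n_points, n_dims)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy stand-in: 300 "proteins", each a point in 26-subject space
X = rng.normal(size=(300, 26))
labels, centroids = kmeans(X, k=3)

# Keep at most 100 proteins per cluster, closest to that centroid
selected = []
for j in range(3):
    members = np.flatnonzero(labels == j)
    dist = np.linalg.norm(X[members] - centroids[j], axis=1)
    selected.extend(members[dist.argsort()[:100]].tolist())
```

Note the transposition relative to the usual use of clustering: the proteins, not the subjects, are the points being clustered, so the retained set is a reduced panel of biomarkers.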
Details of each scheme
Each scheme was characterized by the presence or absence of unsupervised learning-based dimensional reduction and correlation coefficient-based dimensional reduction, although the final results of most schemes (except for Schemes 2, 6, and 7) were obtained through penalized regression. The objective of including a dimensional reduction approach at the outset is to reduce the difference between the number of subjects and the number of biomarkers in the final stage by deleting biomarkers with similar information. With 1317 biomarkers and 26 subjects, the number of biomarkers may result in overfitting in group prediction, making feature reduction a crucial step. Moreover, some biomarkers may have nearly zero variance, high correlations with other biomarkers, and little relevance to the objective of cancer prediction. Dimensional reduction was performed using unsupervised ML methods (i.e., K-means clustering and GM models), correlation coefficients, the AUC, or combinations thereof. Dimension reduction based on correlation coefficients involved calculating pairwise Pearson correlations and eliminating one member of any pair of biomarkers whose correlation coefficient exceeded 0.9; the biomarker with the higher mean correlation coefficient with the other biomarkers was removed. AUC-based dimension reduction consisted of removing proteins with AUC < 0.70.
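The correlation-based removal rule can be sketched as follows. This is a simplified single pass over pairs; the exact iteration order used in the study is not specified, so the order here is an assumption of the sketch.

```python
import numpy as np

def correlation_filter(X, threshold=0.9):
    """X: (n_subjects, n_proteins). For each pair with |r| > threshold,
    drop the member with the larger mean absolute correlation to all
    other proteins (the rule described in the text)."""
    corr = np.corrcoef(X, rowvar=False)
    p = corr.shape[0]
    # mean absolute correlation of each protein with the others
    mean_abs = (np.abs(corr).sum(axis=0) - 1) / (p - 1)
    drop = set()
    for i in range(p):
        for j in range(i + 1, p):
            if i in drop or j in drop:
                continue
            if abs(corr[i, j]) > threshold:
                drop.add(i if mean_abs[i] >= mean_abs[j] else j)
    return [i for i in range(p) if i not in drop]

rng = np.random.default_rng(3)
base = rng.normal(size=(26, 1))
X = np.hstack([base,                                      # protein 0
               base + 0.01 * rng.normal(size=(26, 1)),    # protein 1, near-duplicate of 0
               rng.normal(size=(26, 1))])                 # protein 2, independent
kept = correlation_filter(X)
```

The AUC-based filter is analogous: compute each protein's univariate AUC against the group labels and keep only those with AUC ≥ 0.70.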
Performance metrics and biomarker selection
Each scheme, except for Scheme 7 (extremely long computational time), was applied to all 200 simulated datasets and the original SOMAScan data. We compared the schemes based on a number of performance metrics calculated via leave-one-out cross-validation (LOOCV) to bypass potential problems of overfitting. Metrics included the Brier score, AUC, sensitivity, specificity, and prediction accuracy. Among them, the Brier score is the most crucial performance metric in this study and was used to rank the analytic schemes. If two or more schemes were tied on the Brier score rounded to three decimal places, the AUC was utilized for further ranking. Considering that the sample size is relatively small, we added a bootstrap step to the analysis to obtain optimism-corrected performance metrics [19]. The final performance metrics were the averages of the performance measures obtained from 200 bootstrap iterations, except for Scheme 7. For Scheme 7, which used the NB method, 50 simulated datasets and 50 bootstrap iterations were utilized due to the extremely long computational time.
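The LOOCV loop can be sketched as follows. The nearest-centroid classifier and the synthetic data here are toy stand-ins (not one of the paper's schemes), used only to show how held-out probabilities feed the Brier score.

```python
import numpy as np

rng = np.random.default_rng(1)

def loocv_probs(X, y, fit_predict):
    """Leave-one-out cross-validation: fit on all subjects but one,
    predict the held-out subject's class-1 probability."""
    probs = np.empty(len(y), dtype=float)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        probs[i] = fit_predict(X[mask], y[mask], X[i])
    return probs

def nearest_centroid_prob(X_tr, y_tr, x):
    """Toy classifier: class-1 probability from distances to the
    two class centroids (an assumption of this sketch)."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(x - c0)
    d1 = np.linalg.norm(x - c1)
    return d0 / (d0 + d1)

# 26 subjects (13 per group), 5 proteins, well-separated groups
y = np.array([0] * 13 + [1] * 13)
X = rng.normal(size=(26, 5)) + 2.0 * y[:, None]
p = loocv_probs(X, y, nearest_centroid_prob)
brier = np.mean((p - y) ** 2)
```

In the study, the classifier inside the loop is one of the scheme's ML methods, and the whole LOOCV evaluation is additionally wrapped in a bootstrap to correct for optimism.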
Note that, in the recent literature, the log loss [20] (also known as logistic loss or cross-entropy loss) is frequently used to evaluate classifiers. However, we chose to focus on the Brier score for three reasons. First, the log loss ranges from 0 to infinity, while the Brier score ranges from 0 to 1; the Brier score is more informative when comparing the schemes because it has definite lower and upper limits. Second, the mathematical formula of the Brier score is easier to understand than that of the log loss. Third, we confirmed that, for the schemes in this study, there is a positive linear relationship between the Brier score and the log loss (Supplementary Fig. 1), signifying that they convey similar information. Table 2 shows concise mathematical expressions for the performance metrics under consideration in this study. For further details, we refer the reader to Fergus and Chalmers (2022) [21].
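For reference, the two metrics compared above can be computed as follows; the example probabilities are illustrative, not values from the study.

```python
import math

def brier(y_true, p_hat):
    """Mean squared difference between predicted probability and the
    binary outcome; bounded in [0, 1]."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

def log_loss(y_true, p_hat, eps=1e-15):
    """Cross-entropy loss; unbounded above, which is why the Brier
    score is preferred here for between-scheme comparison."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y = [1, 0, 1, 1, 0]
good = [0.9, 0.1, 0.8, 0.7, 0.2]  # well-calibrated predictions
bad = [0.6, 0.5, 0.5, 0.4, 0.6]   # poorly calibrated predictions
```

Both metrics rank the "good" predictions ahead of the "bad" ones, consistent with the positive relationship between the two reported above.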
In addition to performance metrics, we compared the 10 or 30 most important biomarkers for distinguishing between the groups selected by each scheme. When the LASSO was the final statistical method, importance was determined by the absolute value of each biomarker's coefficient in the final regression model. When RF or NB was the final statistical method, the pairwise AUC [22] was used to determine the importance of each biomarker.
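A per-biomarker AUC of this kind can be computed directly from the two groups' values via the Mann-Whitney statistic. This sketch illustrates the idea; the exact pairwise-AUC procedure of [22] may differ in detail, and the `importance` convention below is an assumption of the sketch.

```python
def auc_from_groups(cases, controls):
    """AUC as the probability that a randomly chosen case value
    exceeds a randomly chosen control value (ties count one half);
    the Mann-Whitney interpretation of the ROC area."""
    wins = 0.0
    for a in cases:
        for b in controls:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(cases) * len(controls))

def importance(cases, controls):
    """One reasonable ranking convention (an assumption of this
    sketch): credit markers that are either higher or lower in cases."""
    a = auc_from_groups(cases, controls)
    return max(a, 1.0 - a)
```

Ranking all 1317 proteins by this importance and taking the top 10 or 30 mirrors the selection step described above.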
Sensitivity analysis: sample size
Additionally, we varied the sample size of the simulated data (200–1600) to assess whether the results behave differently depending on the sample size. We used the same model parameters and distributions to simulate the modified synthetic data. Using the same procedure as described above, the important biomarkers for discriminating between the two groups and the performance metrics were identified. Given the considerable computation time required to complete the analysis, the sensitivity analysis in this study was limited to Schemes 1, 2, and 5. For Scheme 1, the sample sizes of the simulated datasets were 200, 400, 1000, and 1600. For Scheme 2, they were 200, 600, 1000, and 1600. For Scheme 5, they were limited to 200, 600, and 1600.
We also evaluated a sample size of 18 subjects (Scheme 1 only) to mimic the sample sizes of pilot data in proteomics studies. While such studies often include fewer than five subjects per group, we focus on the scale of omics data used in clinical studies and randomized controlled trials. Such omics data typically include more than 10 but fewer than 30 subjects per group, along with a significantly larger number of proteins. Furthermore, we utilized widely used and currently available R software packages for variable selection and classification, such as glmnet [25], which requires a minimum of 9 subjects per group.
Biological pathways and related diseases
The initial SOMAScan dataset included a distinct identifier for each protein based on the UniProt database [26] (UniProt ID). Given the typically lengthy names of proteins, we opted to use their corresponding gene names. We employed two online databases to ascertain the primary biological pathways and genetic diseases most strongly associated with the top 10 or 30 selected biomarkers from each scheme. First, we utilized the Reactome database (https://reactome.org/) to identify the predominant biological pathway among the biomarkers selected by each scheme. Reactome is an open-source, open-access, meticulously curated, and peer-reviewed pathway database. Second, we used the DAVID database [27] to identify the most frequent genetic disorders to which the selected biomarkers from each scheme contribute the most.
Results
Similarity in performance metrics
Table 3 shows the summary of performance metrics for each scheme considered in this study. Note that the computational time for Scheme 7, which employed the NB method, was significantly longer than for the other methods. As a result, the number of simulated datasets (i.e., 50) and the number of iterations (i.e., 50) for the bootstrap step were restricted. Due to this disparity, the performance metrics of Scheme 7 were not directly compared to those of the other schemes.
In our case, the ratio of the number of patients (n) to the number of proteins (p) is only 2% (n/p = 26/1,317 ≈ 0.02), which is significantly lower than the n/p ratio typically employed in research related to cancer biomarkers [28]. All schemes considered here produced an AUC of (approximately) 1, potentially representing overfitting caused by the so-called curse of dimensionality [29]. The lowest Brier score was achieved by Scheme 3, which employed unsupervised machine learning methods for dimension reduction and restricted the number of biomarkers to a maximum of 300 for the LASSO regression. However, considering that the largest difference in the Brier score between schemes was 0.088, the Brier scores were similar across all schemes.
Scheme 8 and Scheme 9 differed solely in the number of iterations involved in the variable removal process. In Scheme 9, the approach entailed two rounds of dimensional reduction: first utilizing the correlation coefficient criterion, followed by a subsequent reduction based on the AUC. This sequential reduction strategy aims to maximize the n/p ratio for the LASSO, which is the final step in the process. However, despite resulting in a different set of selected biomarkers, this additional variable removal process did not contribute to any improvement in the performance metrics.
The distinction between Scheme 5 and Scheme 8 lay solely in the choice of variable removal method, either the AUC or correlation coefficients. Even with this variation, however, there were no significant disparities in the performance metrics.
Heterogeneity of selected proteins
While performance metrics did not exhibit significant differences across schemes, the sets of proteins selected as discriminatory demonstrated a substantial level of heterogeneity. Table 4 and Supplementary Table 1 show the sets of 10 and 30 selected proteins, respectively, that best discriminate the two subject groups. Note that Scheme 3 yielded a maximum of 14 selected biomarkers, dictated by the algorithms used. As a result, when comparing the selected biomarkers between Scheme 3 and the other schemes, we focused on the top 10 selections and reported the count of overlapping biomarkers rather than overlapping rates. Excluding this specific scheme, we conducted comparisons using the top 30 selected biomarkers and calculated the overlapping rate between each pair of schemes. Supplementary Table 5 displays the rates or counts of overlapping biomarkers for every pair of schemes considered in this study.
Schemes 3 and 4, which used unsupervised ML techniques for dimensional reduction, yielded different collections of selected biomarkers, with only one overlapping biomarker among the top 10. As Table 5 demonstrates, Schemes 8 and 9, which differed solely in the number of variable removal iterations, had an overlapping rate of 60% among the top 30 selected biomarkers; in fact, this was the highest overlapping rate among all possible pairs of schemes. Schemes 5 and 8, which differed solely in the choice of variable removal method, had an overlapping rate of 43% among the top 30 selected biomarkers. This indicates that the choice of variable removal process and the number of iterations employed prior to applying the final regression model impact the selection of biomarkers. Schemes 5, 6, and 7 adopted distinct supervised ML methods as the final step, following a single round of variable removal using the same method; in this case, the variables selected by these schemes exhibited small overlap.
The 10 biomarkers that appeared most frequently across all schemes were HAAO, AIMP1, REG1A, AMH, CSNK2B, PRSS22, SECTM1, CNDP1, CST5, and EFEMP1, listed in order of frequency (Supplementary Fig. 2).
Similarity in biological pathways and associated diseases
Table 6 presents a concise overview of the two most relevant biological pathways and genetic disorders that display significant associations with the top 10 or 30 selected biomarkers from each scheme. Except for Schemes 2 and 7, the selected biomarkers from all schemes demonstrated a strong relationship with genetic disorders resulting from a compromised immune response. While the genetic disorders associated with Scheme 2 were not directly related to immune function, it is noteworthy that the second most important biological pathway identified was the immune system. In Scheme 7, one of the predominant biological pathways was L1CAM interactions; L1CAM is known to be abnormally expressed in several tumor types [30]. In essence, despite the heterogeneity in the selected biomarkers, we identified common biological pathways and relevant genetic disorders derived from them.
Increasing overlapping rate in data with a larger sample size
The initial dataset comprised a total of 26 subjects, with an n/p ratio of 2%. As demonstrated by Table 7 and Supplementary Tables 2–4, within a single scheme, the top 10 selected biomarkers exhibited increased stability as the n/p ratio progressively increased. However, even when the n/p ratio surpassed 100%, significant heterogeneity persisted in the biomarker selection across different schemes. This indicates that while the stability of the model improves with higher n/p ratios, the choice of statistical techniques has a substantial impact on biomarker selection.
Discussion
In a clinical setting, the primary goal of most biomarker-based studies is to identify a limited number of biomarkers that are most predictive of disease, its progression, or response to treatment. In this study, we aimed to compare the performance of prevalent statistical methodologies and analysis pipelines during the process of biomarker selection for optimal patient group prediction, using exclusively SOMAScan data with a limited number of subjects. Statistical methods used for biomarker selection in this study included both supervised and unsupervised ML methods. By utilizing the aforementioned methods either individually or in conjunction, a total of 9 distinct analytical approaches were formulated and subsequently assessed for their respective model performances. Note that the final results were derived from 200 simulated datasets and 200 bootstrap iterations within a single statistical model (with the exception of Scheme 7, which had 50 simulated datasets and 50 bootstrap iterations). The numbers of simulated datasets and bootstrap iterations were predetermined with consideration of the running time required for one bootstrap iteration, which ranged from around 25 s to 20 min on an Apple MacBook equipped with an Apple M2 processor and 16 GB of RAM.
Upon evaluating the performance metrics, we observed that, when the n/p ratio was only 2%, all schemes considered in this study produced exceptionally high values. However, these high values do not necessarily suggest that the statistical models were exceptionally adept at distinguishing between patient groups; instead, they appear to indicate potential overfitting. This suggests that, when dealing with a limited sample size and/or a small n/p ratio, using ML techniques to select important biomarkers and quantify their ability to differentiate between groups of interest requires careful consideration, because the issue of overfitting may arise. Note that, when each scheme was fitted to the original SOMAScan dataset, overfitting was consistently observed (Supplementary Table 6). However, Scheme 2, which exclusively employed SVM, exhibited relatively lower sensitivity and prediction accuracy than the other schemes. SVM is notoriously computationally intensive and highly sensitive to parameter choices compared with other ML techniques [31, 32]. In this study, to keep computation time realistic, significant limitations were imposed on the range of parameter values that could be estimated. As a result, the challenges associated with SVM may have become more apparent, potentially resulting in relatively lower performance metrics. As Supplementary Table 6 implies, this may have been particularly notable when applying Scheme 2 to the original SOMAScan dataset as opposed to the multiple simulated datasets.
Despite the similarities in performance metrics, we observed a substantial degree of heterogeneity in the selected biomarkers across schemes. The same phenomenon was observed when all schemes were applied to the original SOMAScan dataset. This heterogeneity underscores the caution that must be taken in selecting appropriate statistical methodologies for biomarker selection when dealing with proteomics data with a limited sample size (fewer than 30 subjects). Schemes 3 and 4, which used unsupervised ML techniques for dimensional reduction, yielded different sets of selected biomarkers with a low overlapping rate. The pair of Schemes 8 and 9, which differed only in the number of variable removal iterations, achieved the highest rate of overlap (60%) among all pairs of schemes, implying that an additional biomarker removal step does not substantially alter the selected biomarker set. In the case of Schemes 5, 6, and 7, where different supervised ML methods were employed as the final step, the selected biomarkers exhibited minimal overlap. This suggests that although ML algorithms may yield similar performance metrics, there is a notable discrepancy in the identification of the most discriminatory biomarkers. In other words, purely from a statistical modeling perspective, we cannot endorse one particular statistical model over another. However, model selection could potentially be performed by discussing the identified discriminatory biomarkers with bioinformaticians and then choosing, as the final model, the statistical model that produced biomarkers more suitable for the clinical context.
While substantial heterogeneity was observed in the biomarkers selected by different schemes, there was congruence in the biological pathways and diseases associated with the selected biomarkers across these schemes. The biomarkers selected by the majority of schemes exhibited a strong relationship with genetic disorders and/or biological pathways associated with the immune system. In Scheme 7, L1CAM interactions was one of the predominant biological pathways. Given the potentially critical role of L1CAM in tumor progression [33], it can be expected to have an indirect impact on the immune system or be influenced by it.
As the n/p ratio increased, so did the stability of the selected biomarkers. However, despite a sufficiently large n/p ratio, significant heterogeneity persisted in the biomarker selection across all schemes considered in this study. This finding strongly reinforces our fundamental proposition that, regardless of the sample size, when the primary objective of a study is to identify a biomarker that contributes to distinguishing between patient and control groups (or groups with other distinct characteristics), careful consideration must be given to the selection of appropriate statistical techniques and the design of the analytical procedure.
Four caveats should be considered. First, this study yielded findings based on multiple simulated datasets that were derived from a single observed dataset. Note that, in generating simulated datasets from the SOMAScan data, we utilized summary statistics for each protein to ensure that the simulated datasets resembled the observed data. However, for the sake of simplicity and feasibility, we utilized a truncated normal distribution to generate the data. Consequently, it is possible that the original distribution of certain proteins was slightly distorted in the process. This should not affect the comparisons between the schemes that utilized simulated datasets, but it may have introduced some variability when comparing them to the schemes that relied solely on the observed SOMAScan data. Second, this study exclusively employed ML techniques that are currently prevalent in the analysis of proteomics data and are conveniently executable through R packages. There exist numerous ML methodologies for feature selection that might outperform those examined in this study, but their applicability is sometimes hindered by the absence of adequate software support. Third, the comparison between the schemes considered in this study involved simulation-based and heuristic assessments, without delving into a mathematical framework to quantitatively gauge the distinctions. A promising extension of this study would be to demonstrate theoretical differences among the schemes. Last, we defined a dataset with 26 subjects as extremely small. However, we acknowledge that pilot data, which also need to be analyzed for larger studies, sometimes include fewer than five subjects per group. This study does not consider data with fewer than 8 subjects per group, primarily because the optimizers of the R packages we used do not reliably converge with too few events.
For analyses with fewer than 8 subjects per group, variable selection and classification could instead be performed within a Bayesian framework; this exceeds the scope of the current study but is a potential extension of the work.
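The first caveat describes generating simulated data from per-protein summary statistics via a truncated normal distribution. The study itself used R; the following is only an illustrative Python sketch of that idea, in which the protein names, summary statistics, and truncation bounds are all hypothetical placeholders.

```python
# Illustrative sketch (not the authors' R code): simulate protein values
# from per-group summary statistics using a truncated normal distribution.
# All protein names, means, SDs, and bounds below are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

def simulate_protein(mean, sd, lower, upper, n):
    """Draw n values from Normal(mean, sd) truncated to [lower, upper]
    via simple rejection sampling (adequate for wide bounds)."""
    out = []
    while len(out) < n:
        draws = rng.normal(mean, sd, size=n)
        out.extend(d for d in draws if lower <= d <= upper)
    return np.array(out[:n])

# Hypothetical per-group summary statistics (mean, SD) for two proteins
summary = {
    "protein_A": {"control": (10.2, 1.1), "case": (12.8, 1.4)},
    "protein_B": {"control": (5.6, 0.9), "case": (5.4, 0.8)},
}

n_per_group = 13  # 13 controls and 13 cases, as in the simulated data
simulated = {
    prot: {
        grp: simulate_protein(m, s, lower=0.0, upper=m + 4 * s, n=n_per_group)
        for grp, (m, s) in groups.items()
    }
    for prot, groups in summary.items()
}

for prot, groups in simulated.items():
    for grp, values in groups.items():
        print(prot, grp, values.shape)
```

Truncating at zero reflects that protein abundance measurements are non-negative; as noted above, this convenience can slightly distort the tails of the original distribution for some proteins.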
In spite of these caveats, this study is significant in that it offers new insights into formulating a pipeline for analyzing proteomics data under the constraints of a limited sample size. Based on the current findings, we suggest the following. If the ultimate objective of the research is to identify a statistical model that best distinguishes between cohort groups using proteomics data, increasing the \(n\)-\(p\) ratio should be a primary focus before employing ML techniques, to avoid overfitting. When the principal aim of the study is to uncover the biological pathways and genetic disorders common among the selected proteins, the choice of statistical techniques can be regarded as relatively flexible. However, if the main objective is to pinpoint a set of selected proteins that wield significant influence in discriminating cohort groups and to use them in subsequent investigations (e.g., devising a diagnostic tool), meticulous consideration is necessary when opting for statistical models, because the sets of selected proteins may be heterogeneous, contingent on the particular statistical model incorporated into the analytical plan.
Data availability
The observed data, whose summary statistics were used to generate the simulated datasets, are available upon request from Chandra Mohan (cmohan@central.uh.edu). The R code for the simulation is available upon request from Kyung Hyun Lee (Kyung.Hyun.Lee@uth.tmc.edu).
Abbreviations
- ML: Machine Learning
- RF: Random Forest
- AUC: Area Under the Curve
- CRC: Colorectal Cancer
- KNN: K-nearest Neighbor
- SVM: Support Vector Machine
- PCA: Principal Component Analysis
- CI: Confidence Interval
- ROC: Receiver Operating Characteristic
- NB: Naïve Bayes
- GM: Gaussian Mixture
- BIC: Bayesian Information Criterion
- LOOCV: Leave-One-Out Cross-Validation
Funding
This work is supported by the Institutional Scholar Awards (project number 1UL1TR003137) from the University of Texas Health Science Center at Houston.
Author information
Authors and Affiliations
Contributions
KL and CP designed the study. KL analyzed the data and performed the simulation studies. KL and CP drafted and revised the manuscript. CM provided the proteomics data in the required format. SA provided technical support in identifying biological pathways and associated diseases. CM and SA revised the manuscript. All authors agreed to be responsible for all aspects of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lee, K., Assassi, S., Mohan, C. et al. Addressing statistical challenges in the analysis of proteomics data with extremely small sample size: a simulation study. BMC Genomics 25, 1086 (2024). https://doi.org/10.1186/s12864-024-11018-2
DOI: https://doi.org/10.1186/s12864-024-11018-2