Addressing statistical challenges in the analysis of proteomics data with extremely small sample size: a simulation study
BMC Genomics volume 25, Article number: 1086 (2024)
Abstract
Background
One of the most promising approaches for early and more precise disease prediction and diagnosis is through the inclusion of proteomics data augmented with clinical data. Clinical proteomics data is often characterized by its high dimensionality and extremely limited sample size, posing a significant challenge when employing machine learning techniques for extracting only the most relevant information. Although there is a wide array of statistical techniques and numerous analysis pipelines employed in proteomics data analysis, it is unclear which of these methods produce the most efficient, reproducible, and clinically meaningful results.
Results
In this study, we compared 9 unique analysis schemes composed of different machine learning and dimensionality reduction methods for the analysis of simulated proteomics data consisting of 1317 proteins measured in 26 subjects (i.e., 13 controls and 13 cases). In scenarios where the sample size is extremely small (i.e., n < 30), all schemes resulted in exceptionally high performance metrics, indicating potential overfitting. While performance metrics did not exhibit significant differences across schemes, the sets of proteins selected as discriminatory between groups demonstrated a substantial level of heterogeneity. Despite this heterogeneity in the selected proteins, however, their biological pathways and genetic diseases exhibited similarities. A sensitivity analysis conducted using varying sample sizes indicated that, within a scheme, the stability of the set of selected biomarkers improves with larger sample sizes.
Conclusions
When the aim of the study is to identify a statistical model that best distinguishes between cohort groups using proteomics data and to uncover the biological pathways and disorders common among the selected proteins, the majority of widely used analysis pipelines perform similarly. However, if the main objective is to pinpoint a set of selected proteins that wield significant influence in discriminating cohort groups and utilize them for subsequent investigations, meticulous consideration is necessary when opting for statistical models, due to the possibility of heterogeneity in the sets of selected proteins.
Introduction
We live in an era of data overload. Technical and instrumental advances, particularly in many areas of biomedical research, have enabled us to gain more knowledge about the pathological and physiological states of diseases in less time and with less effort. At the same time, technological improvements challenge us to efficiently and effectively extract meaningful information from massive amounts of highly intercorrelated data. Thus, reliable and robust statistical methodologies for dimensionality reduction and feature selection are now more critical than they have ever been before.
Proteins are considered appealing biomarkers because they are convenient indicators of diseases and, as a result, are frequently targets of therapeutic interventions. Panels of proteins can be used to devise cost-effective and non-invasive testing tools for accurate diagnosis. Among a multitude of proteins, identifying those that are differentially expressed (in terms of expression levels) relative to the normal (or disease-free) biological state of cells is of the utmost importance in elucidating the etiology and treatment of a disease. For this reason, statisticians frequently seek a solution to the following classifier-associated question: which set of biomarkers best discriminates between subjects' disease statuses (e.g., presence or absence of a disease, stages of disease progression)?
The majority of biomedical studies have predominantly employed enzyme-linked immunosorbent assay (ELISA) technology to quantify proteins. The main mechanism of ELISA is the antigen-antibody reaction. Its advantages include the simplicity of the procedure, high specificity, and the availability of commercial antibodies [1]. Large-scale investigations using ELISA are often difficult to conduct due to limits on the number of samples that can be analyzed at the same time or due to technical constraints in quantifying low-abundance proteins [2]. To overcome these drawbacks, an aptamer-based technology, the SOMAScan assay, was developed in 2010 by SomaLogic Inc. (Boulder, CO) [3]. Aptamers are chemically synthesized, single-stranded nucleic acid molecules capable of binding to a specific target molecule. They are smaller and cheaper to produce than antibodies [4,5,6]. The SOMAScan assay can detect at least 1300 (up to nearly 11,000 with the latest version) proteins simultaneously [7, 8]. Hereafter, the data from each assay will be referred to as SOMAScan data and ELISA data, respectively.
Current proteomics statistical analyses typically progress in two sequential phases: (1) discovery and (2) validation [9,10,11,12]. The objective of the discovery phase is to identify and quantify as many (untargeted) proteins as possible. The SOMAScan assay is well suited for this phase due to its ability to generate affordable yet massive proteomics data in a short amount of time. The purpose of the validation phase is to analyze and confirm the utility of the biomarkers selected in the previous phase. An independent cohort of patients with the same condition, a limited number of proteins (fewer than 30), and a patient sample size determined by power analysis generally constitute adequate validation. For the validation phase, data from immunological assays such as ELISA or targeted mass-spectrometry assays are used [13]. Both phases seek to identify a smaller set of proteins that best discriminates between groups of interest in the cohort. The proteins that progress from the discovery to the validation phase are often determined by machine learning (ML) methods such as random forest (RF) or simply by a (parametric) t-test. However, there are several issues in this analytic process as applied to both SOMAScan and ELISA data. First, a (parametric or non-parametric) t-test for the discovery phase may not be suitable because of the strict assumptions it imposes. Second, the most discriminative proteins identified using SOMAScan data may not necessarily show similar discriminating power when examined using ELISA data. Third, because the majority of widely used ML techniques were primarily designed for large datasets, the limited sample sizes often encountered in biomedical data can present challenges. Fourth, and most importantly, an overwhelming variety of statistical techniques is utilized in each phase without an understanding of whether any one method gives more reliable results.
Consequently, even when using comparable data, the results of individual studies are highly heterogeneous, rendering comparisons unstable or even infeasible. Furthermore, apart from the aforementioned challenges, it is common to encounter clinical data with a very small number of subjects, often resulting in cases where the ratio between the number of patients and the number of proteins is less than 5%. Such a low ratio significantly impacts the accuracy of analyses, particularly those that incorporate ML techniques, and leads to numerous challenges within the analysis procedure itself.
Hypothesis
The primary goal of this study is to discover an optimal workflow for identifying biomarkers that best differentiate binary patient groups (e.g., control vs. diseased) using proteomics data derived from SOMAScan technology, under the challenging scenario of extremely small sample sizes (i.e., n < 30). To achieve this goal, we compare and assess 1-stage and multi-stage analytic schemes. The 1-stage scheme employs only one statistical classifier (i.e., a supervised ML method) to determine the top 10 or 30 discriminatory biomarkers for group separation. The multi-stage scheme incorporates multiple statistical methods, with a classifier serving as the last method.
We hypothesize that there is no substantial difference in the selected biomarkers and performance metrics between 1-stage and multi-stage schemes. If so, it can be argued that performing the analysis in more than one stage is inefficient in terms of time and cost. In this context, it is presumed that the selected biomarkers from each scheme are the most discriminatory among the groups of interest. Note that in this study, we concentrated on using only SOMAScan data for both the discovery and validation phases. There are two primary reasons for this. First, among the various technologies used to quantify protein levels, the SOMAScan assay, an aptamer-based platform, has proven successful [14,15,16,17], overcoming several drawbacks associated with ELISA. Second, in our previous study, we identified a significant correlation (i.e., Pearson correlation > 0.8) between the SOMAScan and ELISA assay results for the majority of proteins [10].
Data and methods
Analytic schemes under consideration
In this study, we defined 9 distinct analytic schemes whose primary purpose is to distinguish two groups. Each scheme differs based on the type of classifier, the use of unsupervised ML, and the presence or absence of dimensional reduction. Table 1 provides a list of the schemes under consideration in this study and brief descriptions of each.
Data description
In this study, the nine schemes mentioned above exclusively employed the SOMAScan data. The number of proteins and subjects in the SOMAScan data are 1317 and 26, respectively. Among 26 subjects, 13 subjects were diagnosed with colorectal cancer (CRC), and the remaining 13 subjects were healthy controls. For more details, we refer to Li et al. (2021) [10].
To prevent the results of this study from being data-dependent, 200 simulated datasets were constructed based on the original observed SOMAScan data, each consisting of 1317 proteins and 13 subjects each in the disease and control groups. The number of simulated datasets was restricted by the total computation cost of fitting all 9 schemes (i.e., 10–25 days). We simulated data using a multivariate truncated normal distribution [18] with minimum, maximum, mean, standard deviation, and covariance similar to those of the observed biomarkers. Note that while the levels of each protein may not be normally distributed, the choice of this distribution is based on the practicality and feasibility of the simulation study. The levels of each protein were standardized to have a mean of zero and unit standard deviation before conducting all analyses.
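The simulation step can be sketched as follows. This is a minimal, univariate Python illustration (the study itself used a multivariate truncated normal [18] that preserves the covariance between proteins); the summary statistics below are hypothetical placeholders, not values from the SOMAScan data.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_protein(mean, sd, lo, hi, n):
    """Draw n values from a normal(mean, sd) truncated to [lo, hi]
    by rejection sampling (simplified univariate stand-in for the
    multivariate truncated normal used in the study)."""
    out = np.empty(0)
    while out.size < n:
        draw = rng.normal(mean, sd, size=2 * n)
        out = np.concatenate([out, draw[(draw >= lo) & (draw <= hi)]])
    return out[:n]

# Hypothetical summary statistics for one protein, 26 subjects
x = simulate_protein(mean=5.0, sd=1.2, lo=2.0, hi=9.0, n=26)

# Standardize to mean 0 and unit SD before analysis, as in the paper
z = (x - x.mean()) / x.std()
```

In practice, one such draw would be made per protein (jointly, to respect the covariance), and the standardization applied column-wise to the full 26 × 1317 matrix.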
Unsupervised and supervised machine learning methods
The tested supervised ML methods include RF, support vector machine (SVM), Naïve Bayes (NB), and penalized logistic regression. Schemes 3 and 4 employed K-means clustering and Gaussian mixture (GM) modeling, the two most common unsupervised ML algorithms, for dimensional reduction. This is in contrast to the majority of studies, which employ unsupervised ML models for discovering hidden and intriguing patterns or clusters in unlabeled data. For the K-means clustering in Scheme 3, the number of clusters was fixed at 3 because, with 4 or more clusters, at least one of the generated clusters contained 10 or fewer biomarkers. Upon visual inspection, we found that the smallest cluster did not significantly contribute to the overall protein separability. We calculated the Euclidean distance between the centroids of the clusters constructed by the unsupervised ML models and the biomarkers belonging to each cluster. Then, we selected a maximum of 100 biomarkers with the shortest distance from each cluster. Every ML technique possesses distinct parameters to be tuned, commonly referred to as hyperparameters. A comprehensive explanation of these parameters for the abovementioned ML methods can be found in the Supplementary document.
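The K-means-based reduction described above (clustering the proteins themselves and retaining at most 100 per cluster closest to its centroid) can be sketched as follows. The plain Lloyd's-algorithm implementation and the toy data dimensions are stand-ins for illustration, not the R packages used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Plain Lloyd's algorithm: X is (n_points, n_dims)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Toy stand-in: 300 "proteins", each a point in 26-subject space
X = rng.normal(size=(300, 26))
labels, centroids = kmeans(X, k=3)

# Keep at most 100 proteins per cluster, closest to that centroid
selected = []
for j in range(3):
    members = np.flatnonzero(labels == j)
    dist = np.linalg.norm(X[members] - centroids[j], axis=1)
    selected.extend(members[dist.argsort()[:100]].tolist())
```

Note the transposition relative to the usual use of clustering: the proteins, not the subjects, are the points being clustered, so the retained set is a reduced panel of biomarkers.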
Details of each scheme
Each scheme was characterized by the presence or absence of unsupervised learning-based dimensional reduction and correlation coefficient-based dimensional reduction, although the final results of most schemes (except for Schemes 2, 6, and 7) were obtained through penalized regression. The objective of including a dimensional reduction approach at the outset is to reduce the difference between the number of subjects and the number of biomarkers in the final stage by deleting biomarkers with similar information. With 1317 biomarkers and 26 subjects, the number of biomarkers may result in overfitting in group prediction, making feature reduction a crucial step. Moreover, some biomarkers may have nearly zero variance, high correlations with other biomarkers, and little relevance to the objective of cancer prediction. Dimensional reduction was performed using unsupervised ML methods (i.e., K-means clustering and GM models), correlation coefficients, the AUC, or combinations thereof. Dimension reduction based on correlation coefficients involved calculating pairwise Pearson correlations and eliminating one member of any pair of biomarkers whose correlation coefficient exceeded 0.9; the biomarker with the higher mean correlation coefficient with the other biomarkers was removed. AUC-based dimension reduction consisted of removing proteins with AUC < 0.70.
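The correlation-based removal rule can be sketched as follows. This is a simplified single pass over pairs; the exact iteration order used in the study is not specified, so the order here is an assumption of the sketch.

```python
import numpy as np

def correlation_filter(X, threshold=0.9):
    """X: (n_subjects, n_proteins). For each pair with |r| > threshold,
    drop the member with the larger mean absolute correlation to all
    other proteins (the rule described in the text)."""
    corr = np.corrcoef(X, rowvar=False)
    p = corr.shape[0]
    # mean absolute correlation of each protein with the others
    mean_abs = (np.abs(corr).sum(axis=0) - 1) / (p - 1)
    drop = set()
    for i in range(p):
        for j in range(i + 1, p):
            if i in drop or j in drop:
                continue
            if abs(corr[i, j]) > threshold:
                drop.add(i if mean_abs[i] >= mean_abs[j] else j)
    return [i for i in range(p) if i not in drop]

rng = np.random.default_rng(3)
base = rng.normal(size=(26, 1))
X = np.hstack([base,                                      # protein 0
               base + 0.01 * rng.normal(size=(26, 1)),    # protein 1, near-duplicate of 0
               rng.normal(size=(26, 1))])                 # protein 2, independent
kept = correlation_filter(X)
```

The AUC-based filter is analogous: compute each protein's univariate AUC against the group labels and keep only those with AUC ≥ 0.70.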
Performance metrics and biomarker selection
Each scheme, except for Scheme 7 (extremely long computational time), was applied to all 200 simulated datasets and the original SOMAScan data. We compared the schemes based on a number of performance metrics calculated via leave-one-out cross-validation (LOOCV) to bypass potential problems of overfitting. Metrics included the Brier score, AUC, sensitivity, specificity, and prediction accuracy. Among them, the Brier score is the most crucial performance metric in this study and was used to rank the analytic schemes. If two or more schemes were tied on the Brier score rounded to three decimal places, the AUC was utilized for further ranking. Considering that the sample size is relatively small, we added a bootstrap step to the analysis to obtain optimism-corrected performance metrics [19]. The final performance metrics were the averages of the performance measures obtained from 200 bootstrap iterations, except for Scheme 7. For Scheme 7, which used the NB method, 50 simulated datasets and 50 bootstrap iterations were utilized due to the extremely long computational time.
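The LOOCV loop can be sketched as follows. The nearest-centroid classifier and the synthetic data here are toy stand-ins (not one of the paper's schemes), used only to show how held-out probabilities feed the Brier score.

```python
import numpy as np

rng = np.random.default_rng(1)

def loocv_probs(X, y, fit_predict):
    """Leave-one-out cross-validation: fit on all subjects but one,
    predict the held-out subject's class-1 probability."""
    probs = np.empty(len(y), dtype=float)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        probs[i] = fit_predict(X[mask], y[mask], X[i])
    return probs

def nearest_centroid_prob(X_tr, y_tr, x):
    """Toy classifier: class-1 probability from distances to the
    two class centroids (an assumption of this sketch)."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(x - c0)
    d1 = np.linalg.norm(x - c1)
    return d0 / (d0 + d1)

# 26 subjects (13 per group), 5 proteins, well-separated groups
y = np.array([0] * 13 + [1] * 13)
X = rng.normal(size=(26, 5)) + 2.0 * y[:, None]
p = loocv_probs(X, y, nearest_centroid_prob)
brier = np.mean((p - y) ** 2)
```

In the study, the classifier inside the loop is one of the scheme's ML methods, and the whole LOOCV evaluation is additionally wrapped in a bootstrap to correct for optimism.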
Note that, in the recent literature, the log loss [20] (also known as logistic loss or cross-entropy loss) is frequently used to evaluate classifiers. However, we chose to focus on the Brier score for three reasons. First, the log loss ranges from 0 to infinity, while the Brier score ranges from 0 to 1; the Brier score is more informative when comparing the schemes because it has definite lower and upper limits. Second, the mathematical formula of the Brier score is easier to understand than that of the log loss. Third, we confirmed that, for the schemes in this study, there is a positive linear relationship between the Brier score and the log loss (Supplementary Fig. 1), signifying that they convey similar information. Table 2 shows concise mathematical expressions for the performance metrics under consideration in this study. For further details, we refer the reader to Fergus and Chalmers (2022) [21].
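For reference, the two metrics compared above can be computed as follows; the example probabilities are illustrative, not values from the study.

```python
import math

def brier(y_true, p_hat):
    """Mean squared difference between predicted probability and the
    binary outcome; bounded in [0, 1]."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)

def log_loss(y_true, p_hat, eps=1e-15):
    """Cross-entropy loss; unbounded above, which is why the Brier
    score is preferred here for between-scheme comparison."""
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y = [1, 0, 1, 1, 0]
good = [0.9, 0.1, 0.8, 0.7, 0.2]  # well-calibrated predictions
bad = [0.6, 0.5, 0.5, 0.4, 0.6]   # poorly calibrated predictions
```

Both metrics rank the "good" predictions ahead of the "bad" ones, consistent with the positive relationship between the two reported above.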
In addition to performance metrics, we compared the 10 or 30 most important biomarkers for distinguishing between the groups selected by each scheme. When the LASSO was the final statistical method, importance was determined by the absolute value of each biomarker's coefficient in the final regression model. When RF or NB was the final statistical method, the pairwise AUC [22] was used to determine the importance of each biomarker.
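A per-biomarker AUC of this kind can be computed directly from the two groups' values via the Mann-Whitney statistic. This sketch illustrates the idea; the exact pairwise-AUC procedure of [22] may differ in detail, and the `importance` convention below is an assumption of the sketch.

```python
def auc_from_groups(cases, controls):
    """AUC as the probability that a randomly chosen case value
    exceeds a randomly chosen control value (ties count one half);
    the Mann-Whitney interpretation of the ROC area."""
    wins = 0.0
    for a in cases:
        for b in controls:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(cases) * len(controls))

def importance(cases, controls):
    """One reasonable ranking convention (an assumption of this
    sketch): credit markers that are either higher or lower in cases."""
    a = auc_from_groups(cases, controls)
    return max(a, 1.0 - a)
```

Ranking all 1317 proteins by this importance and taking the top 10 or 30 mirrors the selection step described above.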
Sensitivity analysis: sample size
Additionally, we varied the sample size of the simulated data (200–1600) to assess whether the results behave differently depending on the sample size. We used the same model parameters and distributions to simulate the modified synthetic data. Using the same procedure as described above, the important biomarkers for discriminating between the two groups and the performance metrics were identified. Given the considerable computation time required to complete the analysis, the sensitivity analysis in this study was limited to Schemes 1, 2, and 5. For Scheme 1, the sample sizes of the simulated datasets were 200, 400, 1000, and 1600. For Scheme 2, they were 200, 600, 1000, and 1600. For Scheme 5, they were limited to 200, 600, and 1600.
We also evaluated a sample size of 18 subjects (Scheme 1 only) to mimic the sample sizes of pilot data in proteomics studies. While such studies often include fewer than five subjects per group, we focus on the scale of omics data used in clinical studies and randomized controlled trials. Such omics data typically include more than 10 but fewer than 30 subjects per group, along with a significantly larger number of proteins. Furthermore, we utilized widely used and currently available R software packages for variable selection and classification, such as glmnet [25], which requires a minimum of 9 subjects per group.
Biological pathways and related diseases
The initial SOMAScan dataset included a distinct identifier for each protein based on the UniProt database [26] (UniProt ID). Given the typically lengthy names of proteins, we opted to use their corresponding gene names. We employed two online databases to ascertain the primary biological pathways and genetic diseases most strongly associated with the top 10 or 30 selected biomarkers from each scheme. First, we utilized the Reactome database (https://reactome.org/) to identify the predominant biological pathway among the biomarkers selected by each scheme. Reactome is an open-source, open-access, meticulously curated, and peer-reviewed pathway database. Second, we used the DAVID database [27] to identify the most frequent genetic disorders to which the selected biomarkers from each scheme contribute the most.
Results
Similarity in performance metrics
Table 3 shows the summary of performance metrics for each scheme considered in this study. Note that the computational time for Scheme 7, which employed the NB method, was significantly longer than for the other methods. As a result, the number of simulated datasets (i.e., 50) and the number of iterations (i.e., 50) for the bootstrap step were restricted. Due to this disparity, the performance metrics of Scheme 7 were not directly compared to those of the other schemes.
In our case, the ratio of the number of patients (n) to the number of proteins (p) is only 2% (n/p = 26/1,317 ≈ 0.02), which is significantly lower than the n/p ratio typically employed in research related to cancer biomarkers [28]. All schemes considered here produced an AUC of (approximately) 1, potentially representing overfitting caused by the so-called curse of dimensionality [29]. The lowest Brier score was achieved by Scheme 3, which employed unsupervised machine learning methods for dimension reduction and restricted the number of biomarkers to a maximum of 300 for the LASSO regression. However, considering that the largest difference in the Brier score between schemes was 0.088, the Brier scores were similar across all schemes.
Scheme 8 and Scheme 9 differed solely in the number of iterations involved in the variable removal process. In Scheme 9, the approach entailed two rounds of dimensional reduction: first utilizing the correlation coefficient criterion, followed by a subsequent reduction based on the AUC. This sequential reduction strategy aims to maximize the n/p ratio for the LASSO, which is the final step in the process. However, despite resulting in a different set of selected biomarkers, this additional variable removal process did not contribute to any improvement in the performance metrics.
The distinction between Scheme 5 and Scheme 8 lay solely in the choice of variable removal method, either the AUC or correlation coefficients. Even with this variation, however, there were no significant disparities in the performance metrics.
Heterogeneity of selected proteins
While performance metrics did not exhibit significant differences across schemes, the sets of proteins selected as discriminatory demonstrated a substantial level of heterogeneity. Table 4 and Supplementary Table 1 show the sets of 10 and 30 selected proteins, respectively, that best discriminate the two subject groups. Note that Scheme 3 yielded a maximum of 14 selected biomarkers, dictated by the algorithms used. As a result, when comparing the selected biomarkers between Scheme 3 and the other schemes, we focused on the top 10 selections and reported the count of overlapping biomarkers rather than overlapping rates. Excluding this specific scheme, we conducted comparisons using the top 30 selected biomarkers and calculated the overlapping rate between each pair of schemes. Supplementary Table 5 displays the rates or counts of overlapping biomarkers for every pair of schemes considered in this study.
Schemes 3 and 4, which used unsupervised ML techniques for dimensional reduction, yielded different collections of selected biomarkers, with only one overlapping biomarker among the top 10. As Table 5 demonstrates, Schemes 8 and 9, which differed solely in the number of variable removal iterations, had an overlapping rate of 60% among the top 30 selected biomarkers; in fact, this was the highest overlapping rate among all possible pairs of schemes. Schemes 5 and 8, which differed solely in the choice of variable removal method, had an overlapping rate of 43% among the top 30 selected biomarkers. This indicates that the choice of variable removal process and the number of iterations employed prior to applying the final regression model impact the selection of biomarkers. Schemes 5, 6, and 7 adopted distinct supervised ML methods as the final step, following a single round of variable removal using the same method; in this case, the variables selected by these schemes exhibited small overlap.
The 10 biomarkers that appeared most frequently across all schemes were HAAO, AIMP1, REG1A, AMH, CSNK2B, PRSS22, SECTM1, CNDP1, CST5, and EFEMP1, listed in order of frequency (Supplementary Fig. 2).
Similarity in biological pathways and associated diseases
Table 6 presents a concise overview of the two most relevant biological pathways and genetic disorders that display significant associations with the top 10 or 30 selected biomarkers from each scheme. Except for Schemes 2 and 7, the selected biomarkers from all schemes demonstrated a strong relationship with genetic disorders resulting from a compromised immune response. While the genetic disorders associated with Scheme 2 were not directly related to immune function, it is noteworthy that the second most important biological pathway identified was the immune system. In Scheme 7, one of the predominant biological pathways was L1CAM interactions; L1CAM is known to be abnormally expressed in several tumor types [30]. In essence, despite the heterogeneity in the selected biomarkers, we identified common biological pathways and relevant genetic disorders derived from them.
Increasing overlapping rate in data with a larger sample size
The initial dataset comprised a total of 26 subjects, with an n/p ratio of 2%. As demonstrated by Table 7 and Supplementary Tables 2–4, within a single scheme, the top 10 selected biomarkers exhibited increased stability as the n/p ratio progressively increased. However, even when the n/p ratio surpassed 100%, significant heterogeneity persisted in the biomarker selection across different schemes. This indicates that while the stability of the model improves with higher n/p ratios, the choice of statistical techniques has a substantial impact on biomarker selection.
Discussion
In a clinical setting, the primary goal of most biomarker-based studies is to identify a limited number of biomarkers that are most predictive of disease, its progression, or response to treatment. In this study, we aimed to compare the performance of prevalent statistical methodologies and analysis pipelines during the process of biomarker selection for optimal patient group prediction, using exclusively SOMAScan data with a limited number of subjects. Statistical methods used for biomarker selection in this study included both supervised and unsupervised ML methods. By utilizing the aforementioned methods either individually or in conjunction, a total of 9 distinct analytical approaches were formulated and subsequently assessed for their respective model performances. Note that the final results were derived from 200 simulated datasets and 200 bootstrap iterations within a single statistical model (with the exception of Scheme 7, which had 50 simulated datasets and 50 bootstrap iterations). The numbers of simulated datasets and bootstrap iterations were predetermined with consideration of the running time required for one bootstrap iteration, which ranged from around 25 s to 20 min on an Apple MacBook equipped with an Apple M2 processor and 16 GB of RAM.
Upon evaluating the performance metrics, we observed that, when the n/p ratio was only 2%, all schemes considered in this study produced exceptionally high values. However, these high values do not necessarily suggest that the statistical models were exceptionally adept at distinguishing between patient groups; instead, they appear to indicate potential overfitting. This suggests that, when dealing with a limited sample size and/or a small n/p ratio, using ML techniques to select important biomarkers and quantify their ability to differentiate between groups of interest requires careful consideration, because the issue of overfitting may arise. Note that, when each scheme was fitted to the original SOMAScan dataset, overfitting was consistently observed (Supplementary Table 6). However, Scheme 2, which exclusively employed SVM, exhibited relatively lower sensitivity and prediction accuracy than the other schemes. SVM is notoriously computationally intensive and highly sensitive to parameter choices compared with other ML techniques [31, 32]. In this study, to keep computation time realistic, significant limitations were imposed on the range of parameter values that could be estimated. As a result, the challenges associated with SVM may have become more apparent, potentially resulting in relatively lower performance metrics. As Supplementary Table 6 implies, this may have been particularly notable when applying Scheme 2 to the original SOMAScan dataset as opposed to the multiple simulated datasets.
Despite the similarities in performance metrics, we observed a substantial degree of heterogeneity in the selected biomarkers across schemes. The same phenomenon was observed when all schemes were applied to the original SOMAScan dataset. This heterogeneity underscores the caution that must be taken in selecting appropriate statistical methodologies for biomarker selection when dealing with proteomics data with a limited sample size (fewer than 30 subjects). Schemes 3 and 4, which used unsupervised ML techniques for dimensional reduction, yielded different sets of selected biomarkers with a low overlapping rate. The pair of Schemes 8 and 9, which differed only in the number of variable removal iterations, achieved the highest rate of overlap (60%) among all pairs of schemes, implying that an additional biomarker removal step does not substantially alter the selected biomarker set. In the case of Schemes 5, 6, and 7, where different supervised ML methods were employed as the final step, the selected biomarkers exhibited minimal overlap. This suggests that although ML algorithms may yield similar performance metrics, there is a notable discrepancy in the identification of the most discriminatory biomarkers. In other words, purely from a statistical modeling perspective, we cannot endorse one particular statistical model over another. However, model selection could potentially be performed by discussing the identified discriminatory biomarkers with bioinformaticians and then choosing, as the final model, the statistical model that produced biomarkers more suitable for the clinical context.
While substantial heterogeneity was observed in the biomarkers selected by different schemes, there was congruence in the biological pathways and diseases associated with the selected biomarkers across these schemes. The biomarkers selected by the majority of schemes exhibited a strong relationship with genetic disorders and/or biological pathways associated with the immune system. In Scheme 7, L1CAM interactions was one of the predominant biological pathways. Given the potentially critical role of L1CAM in tumor progression [33], it can be expected to have an indirect impact on the immune system or be influenced by it.
As the n/p ratio increased, so did the stability of the selected biomarkers. However, despite a sufficiently large n/p ratio, significant heterogeneity persisted in the biomarker selection across all schemes considered in this study. This finding strongly reinforces our fundamental proposition that, regardless of the sample size, when the primary objective of a study is to identify a biomarker that contributes to distinguishing between patient and control groups (or groups with other distinct characteristics), careful consideration must be given to the selection of appropriate statistical techniques and the design of the analytical procedure.
Four caveats should be considered. First, this study yielded findings based on multiple simulated datasets that were derived from a single observed dataset. Note that, in generating simulated datasets from the SOMAScan data, we utilized summary statistics for each protein to ensure that the simulated datasets resembled the observed data. However, for the sake of simplicity and feasibility, we utilized a truncated normal distribution to generate the data. Consequently, it is possible that the original distribution of certain proteins was slightly distorted in the process. This should not affect the comparisons between the schemes that utilized simulated datasets, but it may have introduced some variability when comparing them to the schemes that relied solely on the observed SOMAScan data. Second, this study exclusively employed ML techniques that are currently prevalent in the analysis of proteomics data and are conveniently executable through R packages. There exist numerous ML methodologies for feature selection that might outperform those examined in this study, but their applicability is sometimes hindered by the absence of adequate software support. Third, the comparison between the schemes considered in this study involved simulation-based and heuristic assessments, without delving into a mathematical framework to quantitatively gauge the distinctions. A promising extension of this study would be to demonstrate theoretical differences among the schemes. Last, we defined a dataset with 26 subjects as extremely small. However, we acknowledge that pilot data, which also need to be analyzed for larger studies, sometimes include fewer than five subjects per group. This study does not consider data with fewer than 8 subjects per group, primarily because the optimizers of the R packages we used do not reliably converge with too few events.
For analyses with fewer than 8 subjects per group, variable selection and classification could instead be performed within a Bayesian framework; this exceeds the scope of the current study but is a potential extension of the work.
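The first caveat describes generating simulated data from per-protein summary statistics via a truncated normal distribution. The study itself used R; the following is only an illustrative Python sketch of that idea, in which the protein names, summary statistics, and truncation bounds are all hypothetical placeholders.

```python
# Illustrative sketch (not the authors' R code): simulate protein values
# from per-group summary statistics using a truncated normal distribution.
# All protein names, means, SDs, and bounds below are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

def simulate_protein(mean, sd, lower, upper, n):
    """Draw n values from Normal(mean, sd) truncated to [lower, upper]
    via simple rejection sampling (adequate for wide bounds)."""
    out = []
    while len(out) < n:
        draws = rng.normal(mean, sd, size=n)
        out.extend(d for d in draws if lower <= d <= upper)
    return np.array(out[:n])

# Hypothetical per-group summary statistics (mean, SD) for two proteins
summary = {
    "protein_A": {"control": (10.2, 1.1), "case": (12.8, 1.4)},
    "protein_B": {"control": (5.6, 0.9), "case": (5.4, 0.8)},
}

n_per_group = 13  # 13 controls and 13 cases, as in the simulated data
simulated = {
    prot: {
        grp: simulate_protein(m, s, lower=0.0, upper=m + 4 * s, n=n_per_group)
        for grp, (m, s) in groups.items()
    }
    for prot, groups in summary.items()
}

for prot, groups in simulated.items():
    for grp, values in groups.items():
        print(prot, grp, values.shape)
```

Truncating at zero reflects that protein abundance measurements are non-negative; as noted above, this convenience can slightly distort the tails of the original distribution for some proteins.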
In spite of these caveats, this study is significant in that it offers new insights into formulating a pipeline for analyzing proteomics data under the constraints of a limited sample size. Based on the current findings, we suggest the following. If the ultimate objective of the research is to identify a statistical model that best distinguishes between cohort groups using proteomics data, increasing the \(n\)-\(p\) ratio should be a primary focus before employing ML techniques, to avoid overfitting. When the principal aim of the study is to uncover the biological pathways and genetic disorders common among the selected proteins, the choice of statistical techniques can be regarded as relatively flexible. However, if the main objective is to pinpoint a set of selected proteins that wield significant influence in discriminating cohort groups and to use them in subsequent investigations (e.g., devising a diagnostic tool), meticulous consideration is necessary when opting for statistical models, because the sets of selected proteins may be heterogeneous, contingent on the particular statistical model incorporated into the analytical plan.
Data availability
The observed data, whose summary statistics were used to generate the simulated datasets, are available upon request from Chandra Mohan (cmohan@central.uh.edu). The R code for the simulation is available upon request from Kyung Hyun Lee (Kyung.Hyun.Lee@uth.tmc.edu).
Abbreviations
- ML: Machine Learning
- RF: Random Forest
- AUC: Area Under the Curve
- CRC: Colorectal Cancer
- KNN: K-nearest Neighbor
- SVM: Support Vector Machine
- PCA: Principal Component Analysis
- CI: Confidence Interval
- ROC: Receiver Operating Characteristic
- NB: Naïve Bayes
- GM: Gaussian Mixture
- BIC: Bayesian Information Criterion
- LOOCV: Leave-One-Out Cross-Validation
Funding
This work is supported by the Institutional Scholar Awards (project number 1UL1TR003137) from the University of Texas Health Science Center at Houston.
Author information
Authors and Affiliations
Contributions
KL and CP designed the study. KL analyzed the data and performed the simulation studies. KL and CP drafted and revised the manuscript. CM provided the proteomics data in the required format. SA provided technical support in identifying biological pathways and associated diseases. CM and SA revised the manuscript. All authors agreed to be responsible for all aspects of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lee, K., Assassi, S., Mohan, C. et al. Addressing statistical challenges in the analysis of proteomics data with extremely small sample size: a simulation study. BMC Genomics 25, 1086 (2024). https://doi.org/10.1186/s12864-024-11018-2
DOI: https://doi.org/10.1186/s12864-024-11018-2