TY - JOUR
T1 - Protein classification by autofluorescence spectral shape analysis using machine learning
AU - Chikkanayakanahalli Mukunda, Darshan
AU - Rodrigues, Jackson
AU - Chandra, Subhash
AU - Mazumder, Nirmal
AU - Vitkin, Alex
AU - Kishore Mahato, Krishna
N1 - Funding Information:
Autofluorescence (AF) is one of the most common analytical tools for detecting proteins and investigating their structural dynamics [23–25]. When excited at ∼280 nm, different proteins produce relatively unique AF spectral shapes based on the varying numbers and spatial arrangements of Trp and Tyr residues in their tertiary structures [24, 26–28]. However, the AF spectra of proteins remain underutilized, and no studies have been reported that use protein AF spectral shapes to classify proteins with ML. Therefore, in the present study, we report a simple and quick ML methodology for classifying fifteen proteins using a multiclass Support Vector Machine (SVM) learning model based on their AF spectral features, ranked and selected by the Minimum Redundancy Maximum Relevance (mRMR) algorithm. SVM is widely used to classify spectroscopic data (Raman, Infrared, Fluorescence, Photoacoustic, etc.) [9,11,12,29] because of its nonlinear processing steps and better outputs, even when the data contain large variability [30]. In addition, the generalization capability of SVM, supported by its solid theoretical foundation, helps prevent overfitting [11,31]. SVM can be used in binary and multiclass analyses [9,32]. It separates the classes using a hyperplane defined by support vectors and margins. The margin is the distance between the separating hyperplane and the nearest data points of each class, and SVM chooses the hyperplane that maximizes it. The support vectors are the data points closest to the hyperplane. If the support vectors change, the hyperplane position also changes accordingly [9,11,29]. SVM uses kernel functions such as RBF, Polynomial, and Linear. The RBF and Polynomial kernels are used for nonlinear classification, whereas the Linear kernel is used for linear classification.
The combination of AF and ML used in the current study holds promise for distinguishing and unambiguously classifying various proteins, even those with substantially similar AF spectra. The authors would like to thank SERB, DST, Govt. of India, New Delhi (Sanction ID: EMR/2016/007700) for funding the project and Manipal School of Life Sciences, Manipal Academy of Higher Education (MAHE), Manipal, India, DBT-BUILDER (BT/INF/22/SP43065/2021) Govt. of India for infrastructure and facilities. DCM and JR thank the Indian Council of Medical Research (ICMR), Government of India, New Delhi, for support through the Senior Research Fellowships (Sanction Nos. 5/3/8/14/ITR-F/2020-ITR and 5/3/8/45/ITR-F/2019-ITR, respectively). KKM would like to thank the ICMR for financial support (Ref. 17x(3)/Adhoc/33/2022-ITR).
Publisher Copyright:
© 2023
PY - 2024/1/15
Y1 - 2024/1/15
N2 - Depending on the relative numbers and spatial arrangement of Tryptophan (Trp; W) and Tyrosine (Tyr; Y) residues, different proteins produce distinct autofluorescence (AF) spectral shapes when excited at ∼280 nm. Yet, given the vast number and heterogeneous forms of proteins in nature, visual analysis and precise identification of proteins based on their AF spectra is challenging, and is further compounded when different proteins produce substantially similar AF spectral shapes. There is, thus, a serious need to develop a methodology to address this problem. The current study proposes a practical technology to quickly identify proteins using machine learning (ML) algorithms based on their AF spectra. Specifically, AF spectra of fifteen different standard proteins of varying origin with distinct structures and Trp/Tyr compositions were recorded; based on the spectral features selected by the Minimum-Redundancy-Maximum-Relevance (mRMR) algorithm, a multiclass Support Vector Machine (SVM) learning model with Radial Basis Function (RBF), Polynomial, and Linear kernels classified the proteins with high accuracies of 99.06%, 99.03%, and 98.29%, respectively. Since protein identification is key to understanding biological functions and to disease diagnosis, the proposed methodology could offer a viable alternative to, and improve on, the existing protein identification techniques.
AB - Depending on the relative numbers and spatial arrangement of Tryptophan (Trp; W) and Tyrosine (Tyr; Y) residues, different proteins produce distinct autofluorescence (AF) spectral shapes when excited at ∼280 nm. Yet, given the vast number and heterogeneous forms of proteins in nature, visual analysis and precise identification of proteins based on their AF spectra is challenging, and is further compounded when different proteins produce substantially similar AF spectral shapes. There is, thus, a serious need to develop a methodology to address this problem. The current study proposes a practical technology to quickly identify proteins using machine learning (ML) algorithms based on their AF spectra. Specifically, AF spectra of fifteen different standard proteins of varying origin with distinct structures and Trp/Tyr compositions were recorded; based on the spectral features selected by the Minimum-Redundancy-Maximum-Relevance (mRMR) algorithm, a multiclass Support Vector Machine (SVM) learning model with Radial Basis Function (RBF), Polynomial, and Linear kernels classified the proteins with high accuracies of 99.06%, 99.03%, and 98.29%, respectively. Since protein identification is key to understanding biological functions and to disease diagnosis, the proposed methodology could offer a viable alternative to, and improve on, the existing protein identification techniques.
UR - http://www.scopus.com/inward/record.url?scp=85171133420&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85171133420&partnerID=8YFLogxK
U2 - 10.1016/j.talanta.2023.125167
DO - 10.1016/j.talanta.2023.125167
M3 - Article
AN - SCOPUS:85171133420
SN - 0039-9140
VL - 267
JO - Talanta
JF - Talanta
M1 - 125167
ER -