TY - GEN
T1 - Mining sequential patterns from protein sequences associated with aggregation diseases
AU - Anup Bhat, B.
AU - Sunilkumar, Tanish
AU - Prabhu, Ayush
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
AB - Protein aggregation is a hallmark of several neurodegenerative diseases. Mining patterns of amino acids from protein sequences aids in identifying regions of interest that can be validated clinically. Only a few studies have demonstrated the feasibility of employing Frequent Itemset Mining (FIM) algorithms for this task. However, these algorithms are not only computationally expensive but also fail to give due consideration to the ordering of amino acids within a protein sequence. In addition, the number of patterns obtained remains sensitive to the input minimum support threshold. To overcome these limitations, the current study focuses on mining sequential patterns using the Top-k Sequential pattern mining algorithm, which not only preserves the amino acid sequence but is also independent of any user-input threshold. Across various protein sequences, the obtained sequential patterns were compared for similarity with frequent patterns, both retaining and removing repeating amino acids. On average, about 89.31% of non-repeating and 68.08% of repeating sequential patterns were similar to the frequent patterns. Furthermore, Jaccard Index values close to 0.58 and 0.48 indicate the proximity of the sequential patterns to the frequent ones despite the absence of user-defined thresholds.
UR - https://www.scopus.com/pages/publications/85210242288
U2 - 10.1109/AIC61668.2024.10730824
DO - 10.1109/AIC61668.2024.10730824
M3 - Conference contribution
AN - SCOPUS:85210242288
T3 - 2024 IEEE 3rd World Conference on Applied Intelligence and Computing, AIC 2024
SP - 184
EP - 189
BT - 2024 IEEE 3rd World Conference on Applied Intelligence and Computing, AIC 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd IEEE World Conference on Applied Intelligence and Computing, AIC 2024
Y2 - 27 June 2024 through 28 June 2024
ER -