TY - GEN
T1 - Species identification based on approximate matching
AU - Patil, Nagamma
AU - Toshniwal, Durga
AU - Garg, Kumkum
PY - 2011/6/9
Y1 - 2011/6/9
N2 - Genomic data mining and knowledge extraction is an important problem in bioinformatics. Existing methods for species identification are based on n-grams. In this paper, we propose a novel approach for identification of species. Given a database of genomic sequences, our proposed work includes extraction of all candidate subsequences that satisfy: length greater than or equal to a given minimum length, a given number of mismatches, and support greater than or equal to a user threshold. These patterns are used as features for the classifier. Classification of genome sequences has been done using data mining techniques, namely Naive Bayes, support vector machine, and nearest neighbor. Individual classifier accuracies are reported. We also show the effect of sampling size on classification accuracy; it was observed that classification accuracy increases with sampling size. Genome data of two species, namely E. coli and yeast, are used to verify the proposed method.
AB - Genomic data mining and knowledge extraction is an important problem in bioinformatics. Existing methods for species identification are based on n-grams. In this paper, we propose a novel approach for identification of species. Given a database of genomic sequences, our proposed work includes extraction of all candidate subsequences that satisfy: length greater than or equal to a given minimum length, a given number of mismatches, and support greater than or equal to a user threshold. These patterns are used as features for the classifier. Classification of genome sequences has been done using data mining techniques, namely Naive Bayes, support vector machine, and nearest neighbor. Individual classifier accuracies are reported. We also show the effect of sampling size on classification accuracy; it was observed that classification accuracy increases with sampling size. Genome data of two species, namely E. coli and yeast, are used to verify the proposed method.
UR - https://www.scopus.com/pages/publications/79957996336
UR - https://www.scopus.com/pages/publications/79957996336#tab=citedBy
U2 - 10.1145/1980422.1980452
DO - 10.1145/1980422.1980452
M3 - Conference contribution
AN - SCOPUS:79957996336
SN - 9781450307505
T3 - Compute 2011 - 4th Annual ACM Bangalore Conference
BT - Compute 2011 - 4th Annual ACM Bangalore Conference
T2 - 4th Annual ACM Bangalore Conference, Compute 2011
Y2 - 25 March 2011 through 26 March 2011
ER -