Secure EMR Classification and Deduplication using MapReduce

Usharani A. V, Girija Attigeri

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)


Healthcare providers generate huge amounts of data every day through registration, lab results, prescriptions, and other activities. This data is stored as Electronic Medical Records (EMRs) in a central repository. Medical record data is voluminous and difficult to read and understand. To give professionals insight into the different domains a patient's case belongs to, it is necessary to extract pointers from a file before assigning it to a particular department for further analysis. This study provides an EMR processing system that automatically classifies EMRs based on important medical terms, using TF-IDF and topic modeling. Automatic classification of EMRs helps healthcare professionals make accurate decisions and provide efficient service, reduces the time taken to process large amounts of data, and supports better organization of patient data. Data stored on the cloud may contain duplicate copies of an EMR at the file level across several storage systems, which increases network bandwidth usage and cost and consumes storage space. Hence, a deduplication mechanism is required to avoid or reduce data redundancy. Adopting cloud computing for healthcare systems necessitates sharing patient data with cloud service providers, which raises security concerns because the data may contain diagnoses, medication, laboratory results, and medical claims. The main aim of this work is to classify EMRs by specialization using the KNN algorithm, optimize storage using deduplication, and protect the data using a DNA encryption algorithm before uploading it to Hadoop. Data redundancy is handled by deduplication based on MD5 hashing. The proposed methodology achieves an accuracy of 90% for EMR classification and also addresses the duplication and security aspects. This, in turn, demonstrates the viability of the approach for healthcare data management.
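The file-level deduplication step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which runs on Hadoop MapReduce; it only shows the general idea of keeping one physical copy per MD5 digest. The function names (`md5_digest`, `deduplicate`), the in-memory dictionaries, and the sample records are assumptions made for illustration.

```python
import hashlib


def md5_digest(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a file's contents."""
    return hashlib.md5(data).hexdigest()


def deduplicate(records: dict[str, bytes]) -> tuple[dict[str, bytes], dict[str, str]]:
    """File-level deduplication (hypothetical helper).

    records maps a file name to its raw contents.
    Returns (store, index):
      store maps digest -> contents, holding one copy per distinct file;
      index maps file name -> digest, so every name can still be resolved.
    """
    store: dict[str, bytes] = {}
    index: dict[str, str] = {}
    for name, content in records.items():
        digest = md5_digest(content)
        # Keep the first copy seen for this digest; later duplicates are skipped.
        store.setdefault(digest, content)
        index[name] = digest
    return store, index


# Illustrative sample data, not taken from the paper.
records = {
    "emr_001.txt": b"Patient A: chest pain, ECG ordered",
    "emr_002.txt": b"Patient B: fracture, X-ray ordered",
    "emr_001_copy.txt": b"Patient A: chest pain, ECG ordered",
}

store, index = deduplicate(records)
print(f"{len(records)} files uploaded, {len(store)} stored")
```

Here three files are submitted but only two distinct contents are stored, while the index still maps every file name to its content. MD5 is not collision-resistant, so a production system would typically verify a byte-level match or use a stronger hash before discarding a copy.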

Original language: English
Pages (from-to): 34404-34414
Number of pages: 11
Journal: IEEE Access
Publication status: Published - 2022

All Science Journal Classification (ASJC) codes

  • Computer Science(all)
  • Materials Science(all)
  • Engineering(all)

