Development of real time analytics of movies review data using PySpark

Research output: Contribution to journal › Article › peer-review


Data plays a vital role in every organization. Data can be divided into structured, semi-structured, and unstructured types. Unstructured data cannot be processed in real time using an RDBMS or Hadoop. Spark is an extension of the Hadoop architecture that combines the strengths of both Hadoop and Storm. Spark supports languages such as Scala, Java, Python, and R. The proposed method uses PySpark to analyze, in real time, a movie review dataset of 50,000 reviews written by 36,409 people for 1,539 movies. Because movie reviews are written by many users in real time, real-time data analysis is necessary. The method finds the users who are most active in writing movie reviews. These analytics may be used to give incentives to active reviewers. Further, information about the more popular movies, based on reviews, can be gained through the analytics. To achieve these tasks, basic map, reduce, and filter operations are applied. The analytics show that the movie with code B002VL2PTU was reviewed by the largest number of people, and that the most reviews written by a single user was 112, by the user with code A3LZGLA88K0LA0. A frequency count of the words in the movie reviews is also computed, and user sentiment can be analyzed using unigrams.
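The map, reduce, and filter steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it uses only the Python standard library so that it runs without a Spark installation, and the comments note the PySpark RDD methods (`map`, `reduceByKey`, `filter`, `flatMap`) that would play the same role. The sample records, the `(user_id, movie_id, review_text)` layout, the helper `count_by_key`, and the threshold `MIN_REVIEWS` are all assumptions made for illustration. The sketch also counts reviews per movie, which matches "number of people" only if each person reviews a movie at most once.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample records: (user_id, movie_id, review_text).
# The real dataset's field layout is not given in the abstract.
reviews = [
    ("U1", "M1", "great movie great cast"),
    ("U2", "M1", "boring plot"),
    ("U1", "M2", "great soundtrack"),
    ("U3", "M1", "good movie"),
    ("U1", "M3", "bad ending"),
]


def count_by_key(pairs):
    """Sum values per key, similar in spirit to rdd.reduceByKey(lambda a, b: a + b)."""
    def merge(acc, pair):
        key, value = pair
        acc[key] = acc.get(key, 0) + value
        return acc
    return reduce(merge, pairs, {})


# map: emit (user, 1) for every review; reduce: add the counts per user.
reviews_per_user = count_by_key(map(lambda r: (r[0], 1), reviews))

# filter: keep users at or above a hypothetical activity threshold.
MIN_REVIEWS = 2
active_users = dict(
    filter(lambda kv: kv[1] >= MIN_REVIEWS, reviews_per_user.items())
)

# Same map/reduce pattern keyed by movie, then pick the most reviewed one.
reviews_per_movie = count_by_key(map(lambda r: (r[1], 1), reviews))
most_reviewed_movie = max(reviews_per_movie.items(), key=lambda kv: kv[1])

# Unigram frequency: split each review into words (flatMap in PySpark), then count.
word_counts = Counter(
    word for _, _, text in reviews for word in text.lower().split()
)

print(active_users)
print(most_reviewed_movie)
print(word_counts.most_common(3))
```

In PySpark the same pipeline would typically start from an RDD or DataFrame of reviews rather than a Python list. The per-word counts could then feed a unigram sentiment step, for example by matching words against a sentiment lexicon; the abstract does not specify which approach was used.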

Original language: English
Pages (from-to): 497-500
Number of pages: 4
Journal: International Journal of Recent Technology and Engineering
Issue number: 6
Publication status: Published - 01-03-2019

All Science Journal Classification (ASJC) codes

  • Engineering(all)
  • Management of Technology and Innovation

