Apache tez: A unifying framework for modeling and building data processing applications

Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, Carlo Curino

Research output: Chapter in Book/Report/Conference proceedingConference contribution

151 Citations (Scopus)


The broad success of Hadoop has led to a fast-evolving and diverse ecosystem of application engines that are building upon the YARN resource management layer. The open-source implementation of MapReduce is being slowly replaced by a collection of engines dedicated to specific verticals. This has led to growing fragmentation and repeated efforts-with each new vertical engine re-implementing fundamental features (e.g. fault-tolerance, security, stragglers mitigation, etc.) from scratch. In this paper, we introduce Apache Tez, an open-source framework designed to build data-flow driven processing runtimes. Tez provides a scaffolding and library components that can be used to quickly build scalable and efficient data-flow centric engines. Central to our design is fostering component re-use, without hindering customizability of the performance-critical data plane. This is in fact the key differentiator with respect to the previous generation of systems (e.g. Dryad, MapReduce) and even emerging ones (e.g. Spark), that provided an d mandated a fixed data plane implementation. Furthermore, Tez provides native support to build runtime optimizations, such as dynamic partition pruning for Hive. Tez is deployed at Yahoo!, Microsoft Azure, LinkedIn and numerous Hortonworks customer sites, and a growing number of engines are being integrated with it. This confirms our intuition that most of the popular vertical engines can leverage a core set of building blocks. We complement qualitative accounts of real-world adoption with quantitative experimental evidence that Tez-based implementations of Hive, Pig, Spark, and Cascading on YARN outperform their original YARN implementation on popular benchmarks (TPC-DS, TPC-H) and production workloads.

Original languageEnglish
Title of host publicationSIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
PublisherAssociation for Computing Machinery (ACM)
Number of pages13
ISBN (Electronic)9781450327589
Publication statusPublished - 27-05-2015
EventACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia
Duration: 31-05-201504-06-2015


ConferenceACM SIGMOD International Conference on Management of Data, SIGMOD 2015

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems


Dive into the research topics of 'Apache tez: A unifying framework for modeling and building data processing applications'. Together they form a unique fingerprint.

Cite this