Abstract
The broad success of Hadoop has led to a fast-evolving and diverse ecosystem of application engines that build upon the YARN resource management layer. The open-source implementation of MapReduce is slowly being replaced by a collection of engines dedicated to specific verticals. This has led to growing fragmentation and repeated effort, with each new vertical engine re-implementing fundamental features (e.g. fault tolerance, security, straggler mitigation) from scratch. In this paper, we introduce Apache Tez, an open-source framework designed to build data-flow driven processing runtimes. Tez provides scaffolding and library components that can be used to quickly build scalable and efficient data-flow centric engines. Central to our design is fostering component re-use without hindering customizability of the performance-critical data plane. This is in fact the key differentiator with respect to the previous generation of systems (e.g. Dryad, MapReduce) and even emerging ones (e.g. Spark), which provided and mandated a fixed data plane implementation. Furthermore, Tez provides native support for building runtime optimizations, such as dynamic partition pruning for Hive. Tez is deployed at Yahoo!, Microsoft Azure, LinkedIn and numerous Hortonworks customer sites, and a growing number of engines are being integrated with it. This confirms our intuition that most of the popular vertical engines can leverage a core set of building blocks. We complement qualitative accounts of real-world adoption with quantitative experimental evidence that Tez-based implementations of Hive, Pig, Spark, and Cascading on YARN outperform their original YARN implementations on popular benchmarks (TPC-DS, TPC-H) and production workloads.
Original language | English |
---|---|
Title of host publication | SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data |
Publisher | Association for Computing Machinery (ACM) |
Pages | 1357-1369 |
Number of pages | 13 |
Volume | 2015-May |
ISBN (Electronic) | 9781450327589 |
Publication status | Published - 27-05-2015 |
Event | ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia; Duration: 31-05-2015 → 04-06-2015 |
Conference
Conference | ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 |
---|---|
Country/Territory | Australia |
City | Melbourne |
Period | 31-05-15 → 04-06-15 |
All Science Journal Classification (ASJC) codes
- Software
- Information Systems