TY - GEN
T1 - Apache Hadoop YARN: Yet another resource negotiator
T2 - 4th Annual Symposium on Cloud Computing, SoCC 2013
AU - Vavilapalli, Vinod Kumar
AU - Murthy, Arun C.
AU - Douglas, Chris
AU - Agarwal, Sharad
AU - Konar, Mahadev
AU - Evans, Robert
AU - Graves, Thomas
AU - Lowe, Jason
AU - Shah, Hitesh
AU - Seth, Siddharth
AU - Saha, Bikas
AU - Curino, Carlo
AU - O'Malley, Owen
AU - Radia, Sanjay
AU - Reed, Benjamin
AU - Baldeschwieler, Eric
PY - 2013/1/1
Y1 - 2013/1/1
N2 - The initial design of Apache Hadoop [1] was tightly focused on running massive MapReduce jobs to process a web crawl. For increasingly diverse companies, Hadoop has become the data and computational agorá - the de facto place where data and computational resources are shared and accessed. This broad adoption and ubiquitous usage have stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs' control flow, which resulted in endless scalability concerns for the scheduler. In this paper, we summarize the design, development, and current state of deployment of the next generation of Hadoop's compute platform: YARN. The new architecture we introduced decouples the programming model from the resource management infrastructure, and delegates many scheduling functions (e.g., task fault-tolerance) to per-application components. We provide experimental evidence demonstrating the improvements we made, confirm improved efficiency by reporting the experience of running YARN in production environments (including 100% of Yahoo! grids), and confirm the flexibility claims by discussing the porting of several programming frameworks onto YARN, viz. Dryad, Giraph, Hoya, Hadoop MapReduce, REEF, Spark, Storm, and Tez.
AB - The initial design of Apache Hadoop [1] was tightly focused on running massive MapReduce jobs to process a web crawl. For increasingly diverse companies, Hadoop has become the data and computational agorá - the de facto place where data and computational resources are shared and accessed. This broad adoption and ubiquitous usage have stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs' control flow, which resulted in endless scalability concerns for the scheduler. In this paper, we summarize the design, development, and current state of deployment of the next generation of Hadoop's compute platform: YARN. The new architecture we introduced decouples the programming model from the resource management infrastructure, and delegates many scheduling functions (e.g., task fault-tolerance) to per-application components. We provide experimental evidence demonstrating the improvements we made, confirm improved efficiency by reporting the experience of running YARN in production environments (including 100% of Yahoo! grids), and confirm the flexibility claims by discussing the porting of several programming frameworks onto YARN, viz. Dryad, Giraph, Hoya, Hadoop MapReduce, REEF, Spark, Storm, and Tez.
UR - http://www.scopus.com/inward/record.url?scp=84893249524&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84893249524&partnerID=8YFLogxK
U2 - 10.1145/2523616.2523633
DO - 10.1145/2523616.2523633
M3 - Conference contribution
AN - SCOPUS:84893249524
SN - 9781450324281
T3 - Proceedings of the 4th Annual Symposium on Cloud Computing, SoCC 2013
BT - Proceedings of the 4th Annual Symposium on Cloud Computing, SoCC 2013
PB - Association for Computing Machinery (ACM)
Y2 - 1 October 2013 through 3 October 2013
ER -