Learning Scheduling Algorithms for Data Processing Clusters SIGCOMM ’19, August 19-23, 2019, Beijing, China