Automated performance tuning of distributed storage system based on deep reinforcement learning

Lu Wang, Wentao Zhang, Yaodong Cheng

Research output: Contribution to journal › Conference article › peer-review

Abstract

Automated performance tuning is a challenging task for a large-scale storage system. Traditional methods rely heavily on the experience of system administrators and cannot adapt to changes in workload and system configuration. Reinforcement learning is a promising machine learning paradigm that learns an optimized strategy through trial-and-error interaction between agents and environments. Combined with the strong feature-learning capability of deep learning, deep reinforcement learning has shown success in many fields. We implemented a performance parameter tuning engine based on deep reinforcement learning for the Lustre file system, a distributed file system widely used in HEP data centres. Three reinforcement learning algorithms are enabled in the tuning engine: Deep Q-learning, A2C, and PPO. Experiments show that, on a small test bed with an IOzone workload, this method can increase random read throughput by about 30% compared to the default Lustre settings. In the future, this method could be applied to other parameter tuning use cases in data centre operations.
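
To make the trial-and-error loop described in the abstract concrete, the following is a minimal sketch of parameter tuning framed as reinforcement learning. It is not the authors' engine: it uses plain tabular Q-learning rather than Deep Q-learning, A2C, or PPO, and it replaces a real Lustre test bed and IOzone run with a toy `simulated_throughput` function. The candidate values in `SETTINGS`, the location of the throughput peak, and all function names are assumptions made purely for illustration.

```python
import random

# Hypothetical candidate values for one tunable parameter (not from the paper).
SETTINGS = [1, 2, 4, 8, 16, 32, 64]
ACTIONS = [-1, 0, 1]  # move to a smaller value, keep, move to a larger value


def simulated_throughput(setting):
    """Toy stand-in for a benchmark run; peaks at setting == 16 (assumption)."""
    return 100.0 - abs(SETTINGS.index(setting) - SETTINGS.index(16)) * 10.0


def step(state, action):
    """Apply an action to the setting index; the reward is the throughput change."""
    next_state = min(max(state + action, 0), len(SETTINGS) - 1)
    reward = (simulated_throughput(SETTINGS[next_state])
              - simulated_throughput(SETTINGS[state]))
    return next_state, reward


def train(episodes=500, steps=20, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in SETTINGS]
    for _ in range(episodes):
        state = rng.randrange(len(SETTINGS))
        for _ in range(steps):
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[state][i])  # exploit
            next_state, reward = step(state, ACTIONS[a])
            # Standard Q-learning update toward the bootstrapped target.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


def greedy_setting(q, start=0, steps=20):
    """Follow the learned greedy policy and return the setting it ends on."""
    state = start
    for _ in range(steps):
        a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        state, _ = step(state, ACTIONS[a])
    return SETTINGS[state]


if __name__ == "__main__":
    q_table = train()
    print("setting chosen by the greedy policy:", greedy_setting(q_table))
```

In a real deployment the reward would come from measured throughput after applying a configuration change, and the state would typically include workload and system metrics, which is where a deep network would replace the table used here.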

Original language: English
Article number: 012090
Journal: Journal of Physics: Conference Series
Volume: 1525
Issue number: 1
State: Published - 7 Jul 2020
Externally published: Yes
Event: 19th International Workshop on Advanced Computing and Analysis Techniques in Physics Research, ACAT 2019 - Saas-Fee, Switzerland
Duration: 11 Mar 2019 → 15 Mar 2019
