Abstract
Automated performance tuning is a difficult task for large-scale storage systems. Traditional methods rely heavily on the experience of system administrators and cannot adapt to changes in workload or system configuration. Reinforcement learning is a promising machine learning paradigm that learns an optimized strategy through trial-and-error interaction between an agent and its environment. Combined with the strong feature-learning capability of deep learning, deep reinforcement learning has shown success in many fields. We implemented a performance parameter tuning engine based on deep reinforcement learning for the Lustre file system, a distributed file system widely used in HEP data centres. The tuning engine supports three reinforcement learning algorithms: Deep Q-learning, A2C, and PPO. Experiments show that, on a small test bed with an IOzone workload, this method can increase random read throughput by about 30% compared with the default Lustre settings. In the future, this method could be applied to other parameter tuning use cases in data centre operations.
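
To make the trial-and-error tuning loop described in the abstract concrete, the following is a minimal sketch and not the authors' engine. It swaps in plain tabular Q-learning for the deep algorithms named above (Deep Q-learning, A2C, PPO) and replaces the real file system and IOzone benchmark with a synthetic throughput function. The tunable `SETTINGS`, the function `simulated_throughput`, the reward definition (measured throughput), and all hyperparameters are illustrative assumptions that do not come from the paper.

```python
import random

# Hypothetical discrete values of a single tunable (not real Lustre defaults).
SETTINGS = [1, 2, 4, 8, 16, 32]
PEAK_INDEX = 3          # assumption: synthetic throughput peaks at SETTINGS[3] == 8
ACTIONS = (-1, 0, 1)    # decrease, keep, or increase the setting index


def simulated_throughput(index, rng):
    """Synthetic stand-in for one benchmark run (assumed shape, noisy)."""
    return 100.0 - 3.0 * (index - PEAK_INDEX) ** 2 + rng.gauss(0.0, 0.5)


def step(index, action):
    """Apply an action and clip the index to the valid range."""
    return min(max(index + ACTIONS[action], 0), len(SETTINGS) - 1)


def best_action(q_row):
    """Index of the highest-valued action (first one wins ties)."""
    return max(range(len(ACTIONS)), key=lambda i: q_row[i])


def train(episodes=400, steps=20, alpha=0.2, gamma=0.5, seed=0):
    """Tabular Q-learning with a linearly decaying epsilon-greedy policy."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in SETTINGS]
    for episode in range(episodes):
        epsilon = max(0.1, 1.0 - episode / episodes)
        state = rng.randrange(len(SETTINGS))
        for _ in range(steps):
            if rng.random() < epsilon:
                action = rng.randrange(len(ACTIONS))   # explore
            else:
                action = best_action(q[state])         # exploit
            nxt = step(state, action)
            # Assumed reward: throughput measured under the new setting.
            reward = simulated_throughput(nxt, rng)
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q


def greedy_setting(q, start=0, steps=10):
    """Follow the learned greedy policy from a starting index."""
    state = start
    for _ in range(steps):
        state = step(state, best_action(q[state]))
    return SETTINGS[state]


if __name__ == "__main__":
    q_table = train()
    print("setting chosen from the lowest value:", greedy_setting(q_table, start=0))
```

In this toy setup the agent only sees noisy throughput measurements and learns which direction to move the setting. A real tuning engine would replace `simulated_throughput` with actual benchmark runs and would likely need a richer state (for example, system metrics), which is where the deep variants become relevant.
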
| Original language | English |
|---|---|
| Article number | 012090 |
| Journal | Journal of Physics: Conference Series |
| Volume | 1525 |
| Issue number | 1 |
| DOIs | |
| State | Published - 7 Jul 2020 |
| Externally published | Yes |
| Event | 19th International Workshop on Advanced Computing and Analysis Techniques in Physics Research, ACAT 2019 - Saas-Fee, Switzerland. Duration: 11 Mar 2019 → 15 Mar 2019 |