Machine learning based performance analysis and prediction of jobs on a HPC cluster

Zhengxiong Hou, Shuxin Zhao, Chao Yin, Yunlan Wang, Jianhua Gu, Xingshe Zhou

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

10 Scopus citations

Abstract

Many universities and research institutes operate small or medium-sized high-performance computing (HPC) clusters, and years of operation have left them with large volumes of job logs. In this paper, we examine and analyze the job logs accumulated on one such cluster. We then study machine learning based methods for analyzing and predicting the performance of parallel jobs. Several machine learning methods, such as multivariate linear fitting and artificial neural networks, are used to build performance prediction models. We compare the errors of the models and select the optimal prediction model for different users. The experimental results show that the selected machine learning algorithms achieve reasonable prediction accuracy.
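The compare-and-select step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the features (node count, requested hours), the tiny synthetic job logs, the chronological 75/25 split, the mean-absolute-error metric, and the mean-runtime baseline standing in for a second model are all assumptions made for illustration. A neural network is omitted to keep the example dependency-free.

```python
from statistics import mean


def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(X, y):
    """Multivariate linear fit y ~ w0 + w1*x1 + ... via the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    w = solve(xtx, xty)
    return lambda x: w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def fit_mean(X, y):
    """Baseline: always predict the mean runtime seen in training."""
    avg = mean(y)
    return lambda x: avg


def mae(model, X, y):
    """Mean absolute error of a fitted model on (X, y)."""
    return mean(abs(model(x) - t) for x, t in zip(X, y))


def select_model_per_user(logs, candidates, train_fraction=0.75):
    """For each user, fit every candidate on the earlier jobs and keep the
    one with the lowest error on the later, held-out jobs."""
    best = {}
    for user, jobs in logs.items():
        cut = int(len(jobs) * train_fraction)
        train, test = jobs[:cut], jobs[cut:]
        X_tr, y_tr = [f for f, _ in train], [t for _, t in train]
        X_te, y_te = [f for f, _ in test], [t for _, t in test]
        errors = {name: mae(fit(X_tr, y_tr), X_te, y_te)
                  for name, fit in candidates.items()}
        best[user] = min(errors, key=errors.get)
    return best


# Hypothetical job logs: ((nodes, requested_hours), runtime_minutes).
logs = {
    # alice's runtimes follow 10 + 20*nodes + 5*hours exactly.
    "alice": [((1, 1), 35), ((2, 1), 55), ((1, 2), 40), ((2, 2), 60),
              ((4, 1), 95), ((4, 3), 105), ((8, 2), 180), ((3, 3), 85)],
    # bob's runtimes hover around an hour regardless of the request.
    "bob": [((1, 2), 50), ((2, 1), 70), ((1, 1), 60), ((8, 1), 55)],
}
candidates = {"linear": fit_linear, "mean": fit_mean}

chosen = select_model_per_user(logs, candidates)
print(chosen)
```

With this made-up data the linear fit is selected for alice, whose runtimes are a linear function of the features, while the mean baseline is selected for bob, where a linear fit on three jobs extrapolates poorly. Real job logs would need far more jobs per user and a proper validation scheme.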

Original language: English
Title of host publication: Proceedings - 2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019
Editors: Hui Tian, Hong Shen, Wee Lum Tan
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 247-252
Number of pages: 6
ISBN (Electronic): 9781728126166
State: Published - Dec 2019
Event: 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019 - Gold Coast, Australia
Duration: 5 Dec 2019 - 7 Dec 2019

Publication series

Name: Proceedings - 2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019

Conference

Conference: 20th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2019
Country/Territory: Australia
City: Gold Coast
Period: 5/12/19 - 7/12/19

Keywords

  • HPC cluster
  • Job log
  • Machine learning
  • Performance analysis
  • Performance prediction
