TY - JOUR
T1 - Machine Learning Models for Chlorophyll-a Forecasting in a Freshwater Lake
T2 - Case Study of Lake Taihu
AU - Sun, Guojin
AU - Zhu, Weitang
AU - Qian, Xiaoyan
AU - Wei, Chunlei
AU - Xie, Pengfei
AU - Shi, Yao
AU - Cao, Xiaoyong
AU - He, Yi
N1 - Publisher Copyright: © 2025 by the authors.
PY - 2025/4
Y1 - 2025/4
AB - Cyanobacterial harmful blooms (Cyano-HABs) have become a critical global environmental issue, threatening freshwater ecosystems by degrading water quality and posing risks to human and aquatic life. Chlorophyll-a (Chl-a), a key biomarker of bloom intensity, offers crucial insights into algal bloom dynamics. However, predicting Chl-a concentrations remains challenging because of the complex interactions among environmental factors. This study uses machine learning (ML) models to predict Chl-a concentrations in Lake Taihu, China, a large eutrophic lake representative of the many freshwater lakes affected by Cyano-HABs. Nine water quality parameters (water temperature, pH, dissolved oxygen, turbidity, electrical conductivity, permanganate index, ammonia nitrogen, total phosphorus, and total nitrogen) were used to develop an ensemble ML model based on XGBoost, which is known for its ability to handle nonlinear relationships and integrate multiple variables. The XGBoost model achieved the highest predictive accuracy, with an R² of 0.78 and an RMSE of 8.97 mg/m³ on the test set, outperforming traditional models such as linear regression, decision trees, multi-layer perceptrons, support vector regression, and random forests. Feature importance analysis identified electrical conductivity, turbidity, and water temperature as the most significant predictors of Chl-a levels. Pearson correlation analysis was used to quantify the relationships between Chl-a concentrations and the other water quality factors, and principal component analysis (PCA), mutual information, Spearman rank correlation coefficients, and SHAP were used to further analyze feature importance and model interpretability. The model’s robustness was tested across multiple monitoring sites in Lake Taihu, demonstrating its potential for broader application in other eutrophic lakes facing similar environmental challenges. By providing a reliable tool for forecasting Chl-a concentrations, this research contributes to the development of early warning systems that can help mitigate the impacts of Cyano-HABs and support more effective water resource management.
KW - chlorophyll-a
KW - Lake Taihu
KW - machine learning
KW - XGBoost
UR - http://www.scopus.com/inward/record.url?scp=105003656271&partnerID=8YFLogxK
U2 - 10.3390/w17081219
DO - 10.3390/w17081219
M3 - Article
AN - SCOPUS:105003656271
SN - 2073-4441
VL - 17
JO - Water (Switzerland)
JF - Water (Switzerland)
IS - 8
M1 - 1219
ER -