TY - JOUR
T1 - Deep Learning-Based Classification of CRISPR Loci Using Repeat Sequences
AU - Liao, Xingyu
AU - Li, Yanyan
AU - Wu, Yingfu
AU - Li, Xingyi
AU - Shang, Xuequn
N1 - Publisher Copyright: © 2025 American Chemical Society.
PY - 2025
Y1 - 2025
N2 - With the widespread application of the CRISPR-Cas system in gene editing and related fields, along with the increasing availability of metagenomic data, the demand for detecting and classifying CRISPR-Cas systems in metagenomic data sets has grown significantly. Traditional classification methods for CRISPR-Cas systems primarily rely on identifying cas genes near CRISPR arrays. However, in cases where cas gene information is absent, such as in metagenomes or fragmented genome assemblies, traditional methods may fail. Here, we present a deep learning-based method, CRISPRclassify-CNN-Att, which classifies CRISPR loci solely based on repeat sequences. CRISPRclassify-CNN-Att utilizes convolutional neural networks (CNNs) and self-attention mechanisms to extract features from repeat sequences. It employs a stacking strategy to address the imbalance of samples across different subtypes and uses transfer learning to improve classification accuracy for subtypes with fewer samples. CRISPRclassify-CNN-Att demonstrates outstanding performance in classifying multiple subtypes, particularly those with larger sample sizes. Although CRISPR loci classification traditionally depends on cas genes, CRISPRclassify-CNN-Att offers a novel approach that serves as a significant complement to cas-based methods, enabling the classification of orphan or distant CRISPR loci. The proposed tool is freely accessible via https://github.com/Xingyu-Liao/CRISPRclassify-CNN-Att.
KW - CRISPR loci classification
KW - CRISPR-Cas system
KW - deep learning
KW - repeat sequences
KW - self-attention mechanisms
UR - http://www.scopus.com/inward/record.url?scp=105003488298&partnerID=8YFLogxK
U2 - 10.1021/acssynbio.5c00174
DO - 10.1021/acssynbio.5c00174
M3 - Article
AN - SCOPUS:105003488298
SN - 2161-5063
JO - ACS Synthetic Biology
JF - ACS Synthetic Biology
ER -