Abstract
How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often suffer from problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark covering 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms. In addition, we evaluated pre-existing AI frameworks, which, unlike individual algorithms, are more flexible and can support different algorithms, including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.
Original language | English
---|---
Journal | Advances in Neural Information Processing Systems
Volume | 37
Publication status | Published - 2024
Event | 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada. Duration: 9 Dec 2024 → 15 Dec 2024