Automated classification of ICD-O-3 morphology code from pathology reports using text-mining and support vector machine
PAN Jin, GONG Weiwei, FEI Fangrong, WANG Meng, ZHOU Xiaoyan, HU Ruying, ZHONG Jieming
Department of Non-communicable Disease Control and Prevention, Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, Zhejiang 310051, China
Abstract Objective To evaluate the accuracy of automated assignment of ICD-O-3 morphology codes from pathology reports using text mining and a support vector machine (SVM), so as to provide a basis for automated tumor coding in Chinese. Methods Tumor report cards of Zhejiang residents from 2017 to 2019 were collected from the Chronic Disease Surveillance Information Management System of Zhejiang Province. Keywords were extracted from the pathology reports according to ICD-O-3, and an SVM was used for automatic classification. The classification results were compared with the codes assigned by 16 professionals with more than two years of experience in tumor coding, and the accuracy rate, recall rate and F-score were calculated to evaluate performance. Results A total of 83,082 cases from 2017 to 2019 were included and grouped into 17 morphological categories; adenocarcinoma, squamous carcinoma and transitional cell carcinoma accounted for 52,877 cases (63.65%). A total of 1,090 keywords were included in the main corpus. The overall F-score, accuracy rate and recall rate were 85.69%, 77.20% and 96.27%, respectively. Conclusion Text mining combined with SVM can improve the efficiency of ICD-O-3 morphology coding; however, its accuracy needs further improvement.
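The three reported metrics can be cross-checked against one another. The sketch below assumes, since the abstract does not state the formula, that the F-score is the balanced F1 (the harmonic mean of the two rates) and that "accuracy rate" plays the role of precision. The function name `f_score` and its `beta` parameter are illustrative and not taken from the paper.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta = 1 gives the harmonic mean (F1).

    Assumption: the abstract's "accuracy rate" is used here as precision.
    """
    if precision <= 0.0 or recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Figures reported in the abstract, expressed as fractions.
reported_precision = 0.7720  # "accuracy rate" of 77.20%
reported_recall = 0.9627     # recall rate of 96.27%

f1 = f_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 85.69%"
```

Under these assumptions the computed value matches the reported F-score of 85.69, which suggests that figure is a percentage on the same scale as the other two rates.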
Received: 09 June 2020
Revised: 21 December 2020
Published: 16 March 2021