Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports

Post written by Sobia Nasir Laique, MD, and Umar Hayat, MD, from the Division of Gastroenterology and Hepatology, Mayo Clinic, Phoenix, Arizona, and Division of Gastroenterology, University of Minnesota, Minneapolis, Minnesota, USA.


Our study set out to demonstrate a new hybrid approach that applies natural language processing (NLP) to charts first converted to machine-readable text by optical character recognition (OCR), referred to here as the OCR/NLP hybrid, to obtain relevant clinical information from scanned colonoscopy and pathology reports. The technology was co-developed by Cleveland Clinic and eHealth Technologies (West Henrietta, NY, USA).

Colonoscopies are routinely performed for colorectal cancer screening in the United States. Reports are often generated in a non-standardized format and are not always integrated into electronic health records. As a result, the information they contain is not readily available for streamlining quality management, participating in endoscopy registries, or reporting patient- and center-specific risk factors predictive of outcomes. The OCR/NLP hybrid would make that information extractable from scanned reports for these purposes.

Compared with manual data extraction, the accuracy of the hybrid OCR/NLP approach to detect polyps was 95.8%, adenomas 98.5%, sessile serrated polyps 99.3%, advanced adenomas 98%, inadequate bowel preparation 98.4%, and failed cecal intubation 99%. Comparison of the dataset collected via NLP alone with that collected using the hybrid OCR/NLP approach showed that the accuracy for almost all variables was >99%.
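For readers who want to see how a per-variable accuracy figure of this kind can be computed, the following is a minimal sketch, assuming accuracy means the fraction of reports on which the automated label agrees with manual review. The function name `accuracy` and the example labels are hypothetical and are not taken from the study.

```python
from typing import Sequence


def accuracy(reference: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Fraction of reports where the automated label matches manual review."""
    if len(reference) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not reference:
        raise ValueError("need at least one report")
    # Count reports where both methods agree (both positive or both negative).
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


# Hypothetical labels for one variable (e.g. "adenoma present") in 8 reports.
# These values are illustrative only; they are not study data.
manual = [True, False, True, True, False, False, True, False]
hybrid = [True, False, True, False, False, False, True, False]

print(f"accuracy = {accuracy(manual, hybrid):.1%}")
```

The same function could be applied once per variable (polyps, adenomas, bowel preparation, and so on), with manual review as the reference standard.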

The results of this proof-of-concept study open a new frontier in large-scale data extraction from scanned reports, which was previously limited by the lack of appropriate technology. The process was expensive and time-consuming but can now potentially be done accurately in a time- and labor-efficient manner. Future multicenter studies evaluating OCR in combination with validated, commercially available NLP tools will help substantiate the use of this novel technology on a larger scale, not only for measuring procedure quality indicators but possibly also for many other applications in healthcare.


Figure 1. Reporting of colonoscopy quality parameters by each method: manual review, natural language processing (NLP) alone, natural language processing/optical character recognition (OCR) hybrid approach.

Read the full article online.

The information presented in Endoscopedia reflects the opinions of the authors and does not represent the position of the American Society for Gastrointestinal Endoscopy (ASGE). ASGE expressly disclaims any warranties or guarantees, expressed or implied, and is not liable for damages of any kind in connection with the material, information, or procedures set forth.
