How to Train Your Radiology AI
Artificial intelligence algorithms may revolutionize medical image analysis -- but only if researchers can get enough data to train them.
While the last few decades have seen remarkable strides in the science and technology of medical imaging, the individuals on the front line -- namely, radiologists -- are experiencing higher levels of professional burnout than the general population or the average medical professional. A key contributor has been the dramatic increase in radiologists' workload. The amount of imaging data has increased disproportionately compared to the number of available trained readers, with sometimes hundreds of images taken for a single patient's disease or injury.
A potential solution to the radiologists' growing burden lies with the rapidly advancing field of artificial intelligence. Many researchers are now working on applying AI algorithms to medical image analysis, which could boost efficiency and accuracy in clinical care. The radiologist of the future could let computers extract information from images, while he or she manages that information in the clinical context of the patient.
However, there is a significant challenge to developing such algorithms. Most require large amounts of annotated medical images -- referred to as “ground-truth data” -- in order to “learn” what injury or disease features look like. For instance, nonmedical image recognition algorithms often train on millions of images to understand what features trees or dogs have in common. In the case of radiological images, there is no aggregated repository of annotated datasets, and issues such as patient privacy and high cost prevent such data from being readily available to AI researchers.
“It's time-consuming to accurately curate, label and anonymize the images, and there are also cultural reasons why some people don't do it,” said Ronald Summers, chief of the Clinical Image Processing Service at the National Institutes of Health Clinical Center. “They may wish to retain the data set for their own private use or believe it has intrinsic economic value that leads them to hesitate to make it available for free.”
To address these challenges, Summers and other researchers are actively working on creative ways around the shortage of training data for AI models. A newer method involves generating synthetic data, including photorealistic medical images created by simulations of light passing through anatomically accurate models. A more human-centric plan of attack involves adding to the collection of labeled data with nonexpert crowdsourcing -- in other words, asking the public to help annotate and segment images. Others have mined information from radiology reports to label the massive amount of unannotated scans housed in hospital archives.
While each method has its own advantages and disadvantages, they collectively bring AI systems one step closer to becoming a reality in the clinic. More training data means better algorithms, which have the ability to contribute to a more accurate and efficient workflow for radiologists.
“I think AI and deep learning will become essential to radiology, leading to improvements in patient care and changing some of the ways that radiologists practice,” said Summers. “I think approaches like synthetic data and crowdsourcing are very exciting areas of active research in which we’re beginning to explore the possibilities.”
The method of generating synthetic medical images began as an attempt to improve screening technology for colorectal cancer. Colonoscopes have a monocular camera with light sources a short distance away, giving the device a wide field of view but producing only 2D images. Gastroenterologists miss more than 20 percent of clinically relevant polyps, in part due to this lack of 3D data. Faisal Mahmood, a postdoctoral fellow in biomedical engineering at Johns Hopkins University, believed he could use AI methods to estimate depth and tissue topography from conventional 2D colonoscopy images.
“Endoscopy images are typically 2D of the colon or esophagus, but there is a lot of topographical 3D information on the [inside of these structures],” Mahmood said. “If we can transfer from 2D to 3D images with existing endoscopes and without any hardware changes, that has a lot of value.”
Mahmood thought about employing an AI approach called a convolutional neural network, or CNN, which is particularly good at image recognition tasks. A CNN was ideal for estimating depth from a 2D colonoscopy image, but Mahmood lacked the ground-truth depth data to train the system. Depth sensors would be impractical to couple to a tiny endoscope, and they would require regulatory approval for use in humans.
To get around this hurdle, he trained the CNN on a data set of more than 200,000 synthetically generated endoscopy images. With computer graphics software, Mahmood created a virtual endoscope camera model and a virtual colon reconstructed from CT scans of a silicone colon phantom. The synthetic images were generated from the interaction between the virtual endoscope and the virtual colon as the endoscope randomly traversed the colon. After training the CNN on these images, he then tested the model for accuracy using real endoscopy images from a pig colon.
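Mahmood's actual pipeline rendered images through a virtual endoscope inside a CT-reconstructed colon model; the point of the approach is that a synthetic scene comes with exact ground-truth depth for free. As a deliberately simplified, hypothetical illustration of that idea, the sketch below computes the depth map a pinhole camera would record from inside an idealized cylindrical "colon" (all geometry and parameter names are invented for this example):

```python
import math

def synthetic_depth_map(width, height, focal, radius, max_depth):
    """Ground-truth depth for a camera centered inside a cylindrical
    'colon', looking down its axis. Each pixel's ray either travels
    down the open lumen (capped at max_depth) or strikes the wall."""
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    depth = []
    for v in range(height):
        row = []
        for u in range(width):
            dx, dy, dz = u - cx, v - cy, focal
            norm = math.sqrt(dx * dx + dy * dy + dz * dz)
            dx, dy = dx / norm, dy / norm      # unit ray direction (x, y parts)
            radial = math.hypot(dx, dy)
            if radial == 0.0:
                row.append(max_depth)          # straight down the open lumen
            else:
                # Ray-cylinder intersection: t * radial == radius
                row.append(min(radius / radial, max_depth))
        depth.append(row)
    return depth

dm = synthetic_depth_map(9, 9, focal=5.0, radius=2.0, max_depth=50.0)
# The center pixel looks down the lumen (far); corner pixels hit the
# nearby wall, so they are much closer.
assert dm[4][4] == 50.0
assert dm[0][0] < dm[4][4]
```

Pairing each rendered image with a depth map like this one is what gives the CNN a supervised training signal that no real endoscope can provide.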
“If you have an anatomically correct model of the organ that you are imaging, you can start generating a lot of synthetic data and model the diversity that doesn’t exist in real data,” said Mahmood. “We’re hoping that synthetic data becomes a movement in the field.”
Despite some success, Mahmood found himself dissatisfied with this particular approach to synthetic data generation, as it failed to produce the realistic, diverse examples necessary for training deep learning models. As a next iteration, his latest work combines deep learning with a newly developed visualization technique called cinematic rendering to create photorealistic medical images. Cinematic rendering works by simulating the propagation of light through tissue models reconstructed from CT images.
Mahmood took this new method of synthetic data generation and applied it to the same depth estimation problem for colonoscopy images. He created a wide range of healthy to pathologic colon tissue images and again validated the model with a pig colon experiment. He also prevented the model from learning patient-specific features by providing a variety of renderings assigned to the same depth value. After being trained on this data set, the deep learning model was able to accurately estimate depth when put to the test on real tissue.
“Synthetic data is a very hot topic. My group is also exploring this approach, and it's very interesting,” said Summers. “I think it will have limitations because synthetic data do not reflect reality, but there may be benefits in using it for data augmentation -- in other words, supplementing the data set to give learning algorithms a sense of the expected range of variation that we might see in reality.”
Radiology by crowds
An alternative solution to the lack of ground-truth data relies on the strength of a crowd. Crowdsourcing solicits contributions from a large group of people -- typically, members of the general public -- to solve an unwieldy problem. As an everyday real-world example, the community-based GPS navigation app Waze crowdsources real-time information about traffic patterns in order to give better directions. Research teams have experimented with crowdsourcing platforms to annotate virtual colonoscopy videos, endoscopic images, lung CT scans and breast cancer histology images.
One recent study employed the Amazon Mechanical Turk platform, a crowdsourcing internet marketplace that pays workers to perform small tasks, to identify polyp candidates and polyp-free regions within a virtual colonoscopy -- basically, a video fly-through of a 3D colon reconstructed from CT scans. Even though the participants didn't necessarily have training in radiology, as a group they achieved a sensitivity of 80.0 percent and specificity of 86.5 percent at picking out video segments that contained a clinically proven polyp. However, a virtual colonoscopy-trained radiologist still managed to outperform the crowd, with a sensitivity of 86.7 percent and specificity of 87.2 percent.
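For reference, sensitivity and specificity are simple ratios over a confusion matrix of true/false positives and negatives. The sketch below recomputes them from hypothetical tallies chosen only to mirror the crowd's reported rates; the counts themselves are not the study's actual data:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN): the fraction of truly positive
    segments the crowd flagged. Specificity = TN / (TN + FP): the
    fraction of polyp-free segments the crowd correctly passed over."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical tallies of crowd calls on video segments:
sens, spec = sensitivity_specificity(tp=80, fn=20, tn=173, fp=27)
print(round(sens, 3), round(spec, 3))  # matches the reported 80.0% / 86.5%
```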
The authors suggest that a radiologist could potentially fast-forward through the regions deemed polyp-free by the crowd with high confidence, allowing for significant time savings and enabling more examinations to be performed. In addition, this newly labeled data could then train deep learning models, since crowdsourced data sets can be generated quickly and for fairly low cost.
“Interpretation of virtual colonoscopy requires significant radiologist intervention and time, even though the majority of false positive candidates can be rejected by individuals with even minimal training,” said study author Ji Hwan Park, a doctoral candidate at Stony Brook University. “We believe that the crowd results could be used in the future to assist radiologists in virtual colonoscopy screenings directly, or in combination with computer-aided detection algorithms.”
But what if members of the crowd disagree with one another? Such “noise” generated by crowdsourcing is inevitable, and inputting inconsistent training data could negatively affect the final result. A group at Technische Universität München has overcome this issue by augmenting and retraining a deep learning model with annotations from the crowd to identify signs of mitosis in breast cancer histology images.
“We handed these images not to the experts but to people in the crowd who might have the expertise and might not. How are they going to do the task? The outcome might be horrible, or you might get noisy annotations,” said study author Shadi Albarqouni, a postdoctoral research associate at Technische Universität München. “This is the research question that we raised. Can we handle these noisy votes? And can we train or fine-tune the deep learning model from these crowd votes?”
Albarqouni first trained his deep learning model, called AggNet, on several images labeled by pathologists as either having cells with signs of mitosis (called mitotic figures) or not. Next, he gave AggNet new unlabeled data, and the model cropped areas of each image that it deemed more than 90 percent likely to contain mitotic figures. These candidates were then passed to the crowdsourcing platform CrowdFlower, where at least 10 people voted on whether or not they contained mitotic figures.
The CrowdFlower recruits were given a brief training on what mitotic figures looked like, with example images. Then, participants had to complete several test questions for quality control purposes, where they were presented with images with known labels given by pathologists. The experiment had more than 100 participants from different countries and backgrounds who received a small payment of a few cents per image annotation.
AggNet takes into account not only the votes, but also the trustworthiness of each voter -- in other words, their accuracy in performing the task at hand. Each annotator received an accuracy rating based on the quality control test at the beginning of the task. If they scored higher than 70 percent accuracy, their annotations were allowed into the study.
The researchers also recorded each voter's sensitivity and specificity in detecting mitotic figures based on the quality control questions, using them to apply weights to their votes. In the end, AggNet takes into account its own prediction as well as these weighted votes from the crowd to come up with aggregated -- and more accurate -- labels.
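AggNet's actual aggregation is built into the network itself; a much simpler stand-in for the same idea is a Dawid-Skene-style weighted vote, in which each annotator's ballot is weighted by the log-likelihood ratio implied by his or her measured sensitivity and specificity. The sketch below is that simplified scheme, not the paper's method, and all the numbers in it are hypothetical:

```python
import math

def aggregate_votes(votes, sens, spec, prior=0.5):
    """Combine binary crowd votes into one label. Each annotator's vote
    is weighted by how often they are right, as estimated from the
    quality-control questions (their sensitivity and specificity)."""
    score = math.log(prior / (1.0 - prior))
    for y, s, c in zip(votes, sens, spec):
        if y == 1:   # annotator says "mitotic figure present"
            score += math.log(s / (1.0 - c))
        else:        # annotator says "no mitotic figure"
            score += math.log((1.0 - s) / c)
    return (1 if score > 0 else 0), score

# Two highly reliable "yes" votes outweigh three unreliable "no" votes:
label, _ = aggregate_votes(
    votes=[1, 1, 0, 0, 0],
    sens=[0.95, 0.9, 0.6, 0.55, 0.6],
    spec=[0.95, 0.9, 0.6, 0.55, 0.6],
)
assert label == 1
```

The design point mirrors the article: a raw majority would have rejected the candidate, but weighting by annotator trustworthiness recovers the likely correct label from noisy votes.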
“Crowdsourcing is very good at providing useful human input where the main requirement is human perception -- for example, a structure or an edge in an image,” Summers said. “There needs to be some very well-focused, well-circumscribed task that you can train a lay person to accomplish, which indeed describes many perceptual tasks in biomedical image analysis.”
Mining messy archives
Another approach involves taking advantage of a vast and largely untapped data resource: the hospital's picture archiving and communication system (PACS). This system houses patient images and accompanying radiological reports, markings and measurements -- but the data are typically unorganized and scattered, not yet in a form that can be used for AI research and development.
In the past five years, Summers and his colleagues successfully used such data to create large-scale annotated image data sets publicly available for research purposes. For instance, a chest X-ray database released by his group in 2017 called ChestX-ray8 contains 108,948 labeled images from 32,717 unique patients, including many with advanced lung disease. Earlier this year, Summers introduced DeepLesion, a comprehensive dataset of 32,735 lesions from 32,120 axial CT slices of 4,427 patients annotated with type, location and size.
To transform the PACS data into something usable, Summers and his colleagues use an algorithm to pick out relevant parts within the radiology report, analyze the information and subsequently label the scan. For example, the annotation of ChestX-ray8 involved several natural language processing techniques, which converted unstructured text into a structured form to enable automatic identification and extraction of information. Reports were mined for words related to disease, such as “mass” or “pneumonia.”
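A production pipeline like the one behind ChestX-ray8 relies on much richer natural language processing (negation and uncertainty handling, mapping terms to medical ontologies) than any keyword match. Purely as a bare-bones, hypothetical illustration of mining reports for disease words, the sketch below tags a report with finding labels while skipping crudely negated sentences; the vocabulary and negation cues are invented for this example:

```python
import re

# Hypothetical finding vocabulary and negation cues -- a real pipeline
# uses far more sophisticated NLP than this keyword match.
FINDINGS = {"mass", "pneumonia", "effusion", "nodule"}
NEGATIONS = re.compile(r"\bno\b|\bwithout\b|negative for", re.IGNORECASE)

def label_report(report):
    """Return the set of finding labels mentioned in a report,
    skipping any sentence that contains a simple negation cue."""
    labels = set()
    for sentence in re.split(r"[.;]", report):
        if NEGATIONS.search(sentence):
            continue  # e.g. "No pleural effusion" should not yield a label
        for word in FINDINGS:
            if re.search(r"\b" + word + r"\b", sentence, re.IGNORECASE):
                labels.add(word)
    return labels

labels = label_report(
    "There is a mass in the right upper lobe. No pleural effusion."
)
# → {'mass'}
```

Labels extracted this way become the weak ground truth attached to each scan, which is what turns an unannotated hospital archive into a training set.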
Other groups have done similar work with natural language processing for radiology reports. A 2016 review paper on the topic identified 67 relevant publications describing natural language processing methods that support practical applications in radiology. Many of these studies demonstrated excellent performance. More recently, a study published earlier this year tested natural language processing-based AI models to identify findings in radiology reports for 96,303 head CT scans. The best model achieved a sensitivity and specificity for all labels of 90.25 percent and 91.72 percent, respectively.
A new role for radiologists
Each approach to increasing the amount of publicly available ground-truth data in medical imaging has its own set of pros and cons. For instance, synthetic data cannot reflect all the nuances of real data -- but it could prove useful for rare conditions, where large amounts of real data simply do not exist. Crowdsourcing and natural language processing methods have the advantage of real data, but Mahmood notes that there is a limit to the amount of real data out there in the world. His lab originally decided to create synthetic data because they had only 2D colonoscopy data with no information on depth.
As for the question of when these AI algorithms will finally arrive at the clinic, no one really knows. The anticipation around AI in medicine and computer-aided detection keeps growing, but Summers says it is difficult to project a timeline.
He is willing to make one prediction, though: AI will not make radiology as a career obsolete, as some may fear. Instead, it could redefine the role of the radiologist to one of an information specialist that leaves the more tedious work to computers -- perhaps counteracting high rates of professional burnout in the field.
“I see [AI] as a partner in improving the care of patients and helping radiologists integrate large amounts of data,” he said. “I’m very optimistic that it will lead to benefits all around.”