CheXNet, an AI model released in 2017 and trained on more than 100,000 chest X-rays, can detect pneumonia with greater accuracy than a panel of board-certified radiologists. It is fast, free, and runs on a single consumer-grade GPU. A hospital can use it to classify a new scan in under a second.

Since then, companies like Annalise.ai, Lunit, Aidoc, and Qure.ai have released models that can detect hundreds of diseases across multiple types of scans with greater accuracy and speed than human radiologists in benchmark tests. Some products can reorder radiologist worklists to prioritize critical cases, suggest next steps for care teams, or generate structured draft reports that fit into hospital record systems. A few, like IDx-DR, are even cleared to operate without a physician reading the image at all. In total, there are over 700 radiology models cleared by the US Food and Drug Administration (FDA), which account for roughly three-quarters of all medical AI devices.

Radiology looks like a field optimized for human replacement, where digital inputs, pattern-recognition tasks, and clear benchmarks predominate. In 2016, Geoffrey Hinton – computer scientist and Turing Award winner – declared that ‘people should stop training radiologists now’. If the most extreme predictions about the effect of AI on employment and wages were true, radiology should be the canary in the coal mine.

But demand for human labor is higher than ever. In 2025, American diagnostic radiology residency programs offered a record 1,208 positions across all radiology specialties, a four percent increase from 2024, and the field’s vacancy rates are at all-time highs. In 2025, radiology was the second-highest-paid medical specialty in the country, with an average income of $520,000, over 48 percent higher than the average salary in 2015.

Three things explain this. First, while models beat humans on benchmarks, the standardized tests designed to measure AI performance, they struggle to replicate this performance in hospital conditions. Most tools can only diagnose abnormalities that are common in their training data, and models often don’t work as well outside of their test conditions. Second, attempts to give models more tasks have run into legal hurdles: regulators and medical insurers are so far reluctant to approve or cover fully autonomous radiology models. Third, even when they do diagnose accurately, models replace only a small share of a radiologist’s job. Human radiologists spend a minority of their time on diagnostics and the majority on other activities, like talking to patients and fellow clinicians.

Artificial intelligence is rapidly spreading across the economy and society. But radiology shows us that it will not necessarily dominate every field in its first years of diffusion, at least until these common hurdles are overcome. Exploiting all of its benefits will involve adapting it to society, and society’s rules to it.

Islands of automation

All AIs are functions or algorithms, called models, that take in inputs and spit out outputs. Radiology models are trained to detect a finding, which is a measurable piece of evidence that helps identify or rule out a disease or condition.
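To make that concrete, here is a minimal sketch of what such a model amounts to in code: a small image classifier that takes one preprocessed chest X-ray and returns the probability of one finding. The architecture, the tensor sizes, and the pneumonia label are illustrative assumptions rather than any vendor’s actual product; the point is the shape of the interface, one image in, one number out.

```python
# A minimal, hypothetical single-finding classifier (not any real product).
import torch
import torch.nn as nn

class SingleFindingClassifier(nn.Module):
    """One preprocessed chest X-ray in, the probability of one finding out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)  # a single output: evidence for one finding

    def forward(self, x):
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.head(h))  # probability the finding is present

model = SingleFindingClassifier().eval()          # untrained here; weights are random
xray = torch.rand(1, 1, 224, 224)                 # stand-in for one grayscale scan
with torch.no_grad():
    p = model(xray).item()
print(f"P(pneumonia present) = {p:.2f}")          # one number answering one question
```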
Most radiology models detect a single finding or condition in one type of image. For example, a model might look at a chest CT and answer a single question: are there lung nodules, are there rib fractures, or what is the coronary artery calcium score? For every individual question, a new model is required. To cover even a modest slice of what they see in a day, a radiologist would need to switch between dozens of models and ask the right questions of each one. Several platforms manage, run, and interpret outputs from dozens or even hundreds of separate AI models across vendors, but each model operates independently, analyzing for one finding or disease at a time. The final output is a list of separate answers to specific questions, rather than a single description of an image (a pattern sketched in miniature below).

Even with hundreds of FDA-approved imaging algorithms on the market, the combined footprint of today’s radiology AI models still covers only a small fraction of real-world imaging tasks. Many cluster around a few use cases: stroke, breast cancer, and lung cancer together account for about 60 percent of models, but only a minority of the radiology imaging volume that is carried out in the US. Other subspecialties, such as vascular, head and neck, spine, and thyroid imaging, currently have relatively few AI products. This is in part due to data availability: the scan needs to be common enough for there to be many annotated examples that can be used to train models. Some scans are also inherently more complicated than others. For example, ultrasounds are taken from multiple angles and do not have standard imaging planes, unlike X-rays.

Once deployed outside of the hospital where they were initially trained, models can struggle. In a standard clinical trial, samples are taken from multiple hospitals to ensure exposure to a broad range of patients and to avoid site-specific effects, such as a single doctor’s technique or how a hospital chooses to calibrate its diagnostic equipment. But when an algorithm is undergoing regulatory approval in the US, its developers will normally test it on a relatively narrow dataset. Of the models in 2024 that reported the number of sites where they were tested, 38 percent were tested on data from a single hospital. Public benchmarks tend to rely on multiple datasets from the same hospital. The performance of a tool can drop by as much as 20 percentage points when it is tested out of sample, on data from other hospitals. In one study, a pneumonia detection model trained on chest X-rays from a single hospital performed substantially worse when tested at a different hospital. Some of these challenges stem from avoidable experimental issues like overfitting, but others are indicative of deeper problems, like differences in how hospitals record and generate data, such as using slightly different imaging equipment. This means that individual hospitals or departments would need to retrain or revalidate today’s crop of tools before adopting them, even if they have been proven elsewhere.

The limitations of radiology models stem from deeper problems with building medical AI. Training datasets come with strict inclusion criteria: the diagnosis must be unambiguous (typically confirmed by a consensus of two to three experts or by a pathology result), and images that are shot at an odd angle, look too dark, or are blurry are excluded. This skews performance towards the easiest cases, which doctors are already best at diagnosing, and away from real-world images.
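Here is a toy version of that platform pattern: several independent single-finding models run over the same scan, and their outputs are collected into a list of separate answers. The registry, the model names, and the scores below are all hypothetical stand-ins; real platforms add routing, thresholds, and worklist integration, but the shape of the output is the same.

```python
# Hypothetical orchestration of independent single-finding models: one scan in,
# a list of separate answers out. Each "model" here is a stand-in, not a real product.
from typing import Callable, Dict
import random

Image = bytes  # placeholder for a preprocessed scan

def stand_in_model() -> Callable[[Image], float]:
    """Build a stand-in single-finding model that returns a probability."""
    def predict(image: Image) -> float:
        return round(random.random(), 2)  # a real model would run inference here
    return predict

# A registry of independent models, each answering exactly one question.
REGISTRY: Dict[str, Callable[[Image], float]] = {
    "lung nodule": stand_in_model(),
    "rib fracture": stand_in_model(),
    "pneumothorax": stand_in_model(),
}

def read_scan(image: Image) -> Dict[str, float]:
    """Run every registered model on the same image and collect their answers."""
    return {finding: model(image) for finding, model in REGISTRY.items()}

if __name__ == "__main__":
    scan = b"chest-ct-0001"  # stand-in for pixel data
    for finding, prob in read_scan(scan).items():
        print(f"{finding}: p = {prob}")  # separate answers, not a single report
```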
These training-data limits show up in practice. In one 2022 study, an algorithm that was meant to spot pneumonia on chest X-rays faltered when the disease presented in subtle or mild forms, or when other lung conditions resembled pneumonia, such as pleural effusions, where fluid builds up around the lungs, or atelectasis, a collapsed lung. Humans also benefit from context: one radiologist told me about a model they use that labels surgical staples as hemorrhages, because of the bright streaks they create in the image.

Medical imaging datasets used for training also tend to have fewer cases from children, women, and ethnic minorities, making their performance generally worse for these demographics. Many lack information about the gender or race of cases altogether, making it difficult to adjust for these issues and address the problem of bias. The result is that radiology models often cover only a narrow slice of the world, though there are scenarios where AI models do perform well, including identifying common diseases like pneumonia or certain tumors.

The problems don’t stop there. Even a model built for the precise question you need, running in the hospital where it was trained, is unlikely to perform as well in clinical practice as it did in the benchmark. In benchmark studies, researchers isolate a cohort of scans, define goals in quantitative metrics, such as sensitivity (the percentage of people with the condition who are correctly identified by the test) and specificity (the percentage of people without the condition who are correctly identified as such), and compare the performance of a model to the score of another reviewer, typically a human doctor. Clinical studies, on the other hand, show how well the model performs in a real healthcare setting without controls.

Since the earliest days of computer-aided diagnosis, there has been a gulf between benchmark and clinical performance. In the 1990s, computer-aided diagnosis systems, effectively rudimentary AI, were developed to screen mammograms: X-rays of the breasts performed to look for breast cancer. In trials, the combination of humans and computer-aided diagnosis systems outperformed humans alone in accuracy when evaluating mammograms. More controlled experiments followed, which pointed to computer-aided diagnosis helping radiologists pick up more cancers at minimal cost. The FDA approved mammography computer-aided diagnosis in 1998, and Medicare started to reimburse the use of computer-aided diagnosis in 2001. The US government paid radiologists $7 more to report a screening mammogram if they used the technology; by 2010, approximately 74 percent of mammograms in the country were read by computer-aided diagnosis alongside a clinician.

But computer-aided diagnosis turned out to be a disappointment. Between 1998 and 2002, researchers analyzed 430,000 screening mammograms from 200,000 women at 43 community clinics in Colorado, New Hampshire, and Washington. At the seven clinics that turned to computer-aided detection software, the machines flagged more images, leading clinicians to conduct 20 percent more biopsies, but uncovering no more cancer than before. Several other large clinical studies had similar findings. Another way to measure performance is to compare computerized help to a second clinician reading every film, called ‘double reading’. Across ten trials and seventeen studies of double reading, researchers found that computer aids did not raise the cancer detection rate but led to patients being called back about ten percent more often.
In contrast, having two readers caught more cancers while slightly lowering callbacks. Computer-aided detection was worse than standard care, and much worse than another pair of eyes. In 2018, Medicare stopped reimbursing doctors more for mammograms read with computer-aided diagnosis than for those read by a radiologist alone.

One explanation for this gap is that people behave differently when they are treating patients day to day than when they are part of laboratory studies or other controlled experiments. In particular, doctors appear to defer excessively to assistive AI tools in clinical settings in a way that they do not in lab settings. They did this even with much more primitive tools than we have today: one clinical trial all the way back in 2004 asked 20 breast screening specialists to read mammogram cases with the computer prompts switched on, then brought in a new group to read the identical films without the software. When guided by computer aids, doctors identified barely half of the malignancies, while those reviewing without the model caught 68 percent. The gap was largest when the computer aids failed to recognize the malignancy itself; many doctors seemed to treat an absence of prompts as reassurance that a film was clean. Another review, this time from 2011, found that when a system gave incorrect guidance, clinicians were 26 percent more likely to make a wrong decision than unaided peers.

Humans in the loop

It would seem as if better models and more automation could together fix the problems of current-day AI for radiology. Without a doctor involved whose behavior might change, we might expect real-world results to match benchmark scores. But regulatory requirements and insurance policies are slowing the adoption of fully autonomous radiology AI.

The FDA splits imaging software into two regulatory lanes: assistive or triage tools, which require a licensed physician to read the scan and sign the chart, and autonomous tools, which do not. Makers of assistive tools simply have to show that their software can match the performance of tools that are already on the market. Autonomous tools have to clear a much higher bar: they must demonstrate that the AI tool will refuse to read any scan that is blurry, comes from an unusual scanner, or is outside its competence. The bar is higher because, once the human disappears, a latent software defect could harm thousands before anyone notices.

Meeting that standard is difficult. Even state-of-the-art vision networks falter on images that lack contrast, are taken from unexpected angles, or contain lots of different artefacts. IDx-DR, a diabetic retinopathy screener and one of the few tools cleared to operate autonomously, comes with guardrails: the patient must be an adult with no prior retinopathy diagnosis; there must be two macula-centred photographs of the fundus (the rear of the eye) with a resolution of at least 1,000 by 1,000 pixels; and if glare, small pupils, or poor focus degrade quality, the software must self-abort and refer the patient to an eye care professional.

Stronger evidence and improved performance could eventually clear both hurdles, but other requirements would still delay widespread use. For example, if you retrain a model, you are required to receive new approval, even if the previous version was approved. This contributes to the market generally lagging behind frontier capabilities. And when autonomous models are approved, malpractice insurers are not eager to cover them.
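The self-abort behavior that the autonomous lane requires, and that IDx-DR’s label spells out, can be pictured as a simple pre-read gate. The sketch below is a hypothetical illustration: the field names, thresholds, and checks are assumptions, not Digital Diagnostics’ actual logic or the FDA’s wording, but the control flow, refuse and refer rather than guess, is the requirement in miniature.

```python
# Hypothetical pre-read guardrail gate for an autonomous screening tool.
# The checks and thresholds are assumptions illustrating "refuse and refer",
# not the real logic of IDx-DR or any FDA submission.
from dataclasses import dataclass

@dataclass
class FundusStudy:
    patient_age: int
    prior_retinopathy: bool
    macula_centered_photos: int
    image_width: int
    image_height: int
    focus_score: float   # 0.0 (unusable) to 1.0 (sharp); an assumed quality metric

MIN_PIXELS = 1000        # at least 1,000 by 1,000 pixels per photograph
MIN_FOCUS = 0.7          # assumed threshold for acceptable focus

def gate(study: FundusStudy) -> str:
    """Return 'read' only if every labeled condition holds; otherwise refer out."""
    if study.patient_age < 18:
        return "refer: adults only"
    if study.prior_retinopathy:
        return "refer: prior retinopathy diagnosis"
    if study.macula_centered_photos < 2:
        return "refer: two macula-centred photographs required"
    if study.image_width < MIN_PIXELS or study.image_height < MIN_PIXELS:
        return "refer: image resolution too low"
    if study.focus_score < MIN_FOCUS:
        return "refer: image quality degraded (glare, small pupils, or poor focus)"
    return "read"        # only now would the autonomous model analyze the images

if __name__ == "__main__":
    ok = FundusStudy(54, False, 2, 1600, 1600, 0.9)
    blurry = FundusStudy(54, False, 2, 1600, 1600, 0.4)
    print(gate(ok), "|", gate(blurry))
```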
Insurers have reasons of their own to be wary. Diagnostic error is the costliest mistake in American medicine, resulting in roughly a third of all malpractice payouts, and radiologists are perennial defendants. Insurers believe that software makes catastrophic payouts more likely than a human clinician does, as a broken algorithm can harm many patients at once. Standard contract language now often includes phrases such as: ‘Coverage applies solely to interpretations reviewed and authenticated by a licensed physician; no indemnity is afforded for diagnoses generated autonomously by software’. One insurer, Berkley, even carries the blunter label ‘Absolute AI Exclusion’.

Without malpractice coverage, hospitals cannot afford to let algorithms sign reports. In the case of IDx-DR, the vendor, Digital Diagnostics, includes a product liability policy and an indemnity clause. This means that if a clinic uses the device exactly as the FDA label prescribes, with adult patients, good-quality images, and no prior retinopathy, the company will reimburse the clinic for damages traceable to algorithmic misclassification.

Today, if American hospitals wanted to adopt AI for fully independent diagnostic reads, they would need to believe that autonomous models deliver enough cost savings or throughput gains to justify pushing for exceptions to credentialing and billing norms. For now, usage is too sparse to make a difference. One 2024 investigation estimated that only 48 percent of radiologists use AI in their practice at all. A 2025 survey reported that only 19 percent of respondents who had started piloting or deploying AI use cases in radiology reported a ‘high’ degree of success.

Better AI, more MRIs

Even if AI models become accurate enough to read scans on their own and are cleared to do so, radiologists may still find themselves busier, rather than out of a career. Radiologists are useful for more than reading scans: a study that followed staff radiologists in three different hospitals in 2012 found that only 36 percent of their time was dedicated to direct image interpretation. The rest went to overseeing imaging examinations, communicating results and recommendations to treating clinicians (and occasionally directly to patients), teaching radiology residents and the technologists who conduct the scans, and reviewing imaging orders and changing scanning protocols. This means that, if AI were to get better at interpreting scans, radiologists may simply shift their time toward other tasks. This would reduce the substitution effect of AI.

As tasks get faster or cheaper to perform, we may also do more of them. In some cases, especially if lower costs or faster turnaround times open the door to new uses, the increase in demand can outweigh the increase in efficiency, a phenomenon known as the Jevons paradox. This has historical precedent in the field: in the early 2000s, hospitals swapped film jackets for digital systems. Hospitals that digitized saw radiologist productivity improve, and the time to read an individual scan went down. A study at Vancouver General Hospital found that the switch boosted radiologist productivity by 27 percent for plain radiography and 98 percent for CT within a year of going filmless. This occurred alongside other advancements in imaging technology that made scans faster to execute. Yet no radiologists were laid off. Instead, the overall American utilization rate for all imaging per 1,000 insured patients increased by 60 percent from 2000 to 2008. This is not explained by a commensurate increase in physician visits.
Instead, each visit was associated with more imaging on average. Before digitization, the nonmonetary price of imaging was high: the median reporting turnaround time for X-rays was 76 hours for patients discharged from emergency departments, and 84 hours for admitted patients. After departments digitized, these times dropped to 38 hours and 35 hours, respectively.

Faster scans give doctors more options. Until the early 2000s, only exceptional trauma cases would receive whole-body CT scans; faster CT turnaround times mean that they are now a common choice. This is a reflection of elastic demand, a concept in economics describing demand that is very sensitive to changes in price. In this case, when these scans got cheaper in terms of waiting time, demand for them increased.

The first decade of diffusion

Over the past decade, improvements in image interpretation have run far ahead of their diffusion. Hundreds of models can spot bleeds, nodules, and clots, yet AI is often limited to assistive use on a small subset of scans in any given practice. And despite predictions to the contrary, head counts and salaries have continued to rise.

Benchmarks alone overstate the promise of AI in radiology. Multi-task foundation models may widen coverage, and different training sets could blunt data gaps. But many hurdles cannot be removed with better models alone: the need to counsel the patient, shoulder malpractice risk, and receive accreditation from regulators. Each hurdle makes full substitution the expensive, risky option and human plus machine the default. Sharp increases in AI capabilities could certainly alter this dynamic, but it is a useful template for the first years of any AI system that benchmarks well at tasks associated with a particular career.

However, there are industries where conditions are different. Large platforms rely heavily on AI systems to triage or remove harmful or policy-violating content. At Facebook and Instagram, 94 percent and 98 percent of moderation decisions, respectively, are made by machines. But many of the more sophisticated knowledge jobs look more like radiology. In many jobs, tasks are diverse, stakes are high, and demand is elastic. When this is the case, we should expect software to, at least initially, lead to more human work, not less.

The lesson from a decade of radiology models is neither optimism about increased output nor dread about replacement. Models can lift productivity, but their implementation depends on behavior, institutions, and incentives. For now, the paradox has held: the better the machines, the busier radiologists have become.

Deena Mousa is a lead researcher at Open Philanthropy. Follow her on Twitter.