March 9, 2018 § Leave a comment
How should we interpret research reporting the performance of #AI in clinical practice?
[This blog collects together in one place a twitter review published 9 March 2018 at https://twitter.com/EnricoCoiera/status/971886744101515265 ]
Today we are reading “Watson for Oncology and breast cancer treatment recommendations: agreement with an expert multidisciplinary tumor board” that has just appeared in the Annals of Oncology.
This paper studies the degree of agreement or “concordance” between Watson for Oncology (WFO) and an expert panel of clinicians on a ’tumor board’. It reports an impressive 93% concordance between human experts and WFO when recommending treatment for breast cancer.
Unfortunately the paper is not open access but you can read the abstract. I’d suggest reading the paper before you read further into this thread. My question to you: Do the paper’s methods allow us to have confidence in the impressive headline result?
We should begin by congratulating the authors on completing a substantial piece of work in an important area. What follows is the ‘review’ I would have written if the journal had asked me. It is not a critique of individuals or technology and it is presented for educational purposes.
Should be believe the results are valid, and secondly do we believe that they generalizable to other places or systems? To answer this we need to examine the quality of the study, the quality of the data analysis, and the accuracy of the conclusions drawn from the analysis.
I like to tease apart the research methods section using PICO headings – Population, Intervention, Comparator, Outcome.
(P)opulation. 638 breast cancer patients presented between 2014-16 at a single institution. However the study excluded patients with colloid, adenocystic, tubular, or secretory breast cancer “since WFO was not trained to offer treatment recommendations for these tumor types”.
So we have our first issue. We are not told how representative this population is of the expected distribution of breast cancer cases at the hospital or in the general population. We need to know if these 3 study years were somehow skewed by abnormal presentations.
We also need to know if this hospital’s case-mix is normal or somehow different to that in others. Its critical because any claim for the results here to generalize elsewhere depends on how representative the population is.
Also, what do we think of the phrase “since WFO was not trained to offer treatment recommendations for these tumor types”? It means that irrespective of how good the research methods are, the result will not necessarily hold for *all breast cancer cases”
All they can claim is that any result holds for this subset of breast cancers . We have no evidence presented to suggest that performance would be the same on the excluded cancers. Unfortunately the abstract and paper results section do not include this very important caveat.
(I)ntervention. I’m looking for a clear description of the intervention to understand 1/ Could someone replicate this intervention independently to validate the results? 2/ What exactly was done so that we can connect cause (intervention) with effect (the outcomes)?
WFO is the intervention. But it is never explicitly described. For any digital intervention, even if you don’t tell me what is inside the black box, I need 2 things: 1/ Describe exactly the INPUT into the system and 2/ describe exactly the OUTPUT from the system.
This paper unfortunately does neither. There is no example of how a cancer case is encoded for WFO to interpret, nor is there is an example of a WFO recommendation that humans need to read.
So the intervention is not reported in enough detail for independent replication, we do not have enough detail to understand the causal mechanism tested, and we don’t know if biases are hidden in the intervention. This makes it hard to judge study validity or generalizability
Digging into an online appendix, we do discover some details about input, and WFO mechanism. It appears WFO is input a feature vector:
It also appears that vector is then input to an unspecified machine learning engine that presumably associates disease vectors with treatments.
So, for all the discussion of WFO as a text-processing machine, at its heart might be a statistical classification engine. It’s a pity we don’t know any more, and it’s a pity we don’t know how much human labour is hidden in preparing feature vectors and training cases.
But there is enough detail in the online appendix to get a feeling about what was in general done. They really should have been much more explicit and included the input feature vector, and output treatment recommendation description in the main paper.
But was WFO really the only intervention? No. there was also a human intervention that needs to be accounted for and it might very well have been responsible for some % of the results.
The Methods reports that two humans (trained senior oncology fellows) entered the data manually into WFO. They read the cases, identified the data that matched the input features, and then decided the scores for each feature. What does this mean?
Firstly, there was no testing of inter-rater reliability. We don’t know if the two humans coded the same cases in the same way. Normally a kappa statistic is provided to measure the degree of agreement between humans and allow us to get a sense for how replicable what they did was.
For example, a low kappa means that agreement is low and that therefore any results are less likely to replicate in a future study. We really need that kappa to trust the methods were robust.
If humans are pre-digesting data for WFO how much of WFO’s performance is due to the way humans seek, identify and synthesize data? We don’t know. One might argue that the hard information detection and analysis task is done by humans and the easier classification task by WFO.
So far we have got to the point where the papers’ results are not that WFO had a 93% concordance with human experts, but rather that, when humans from a single institution read cancer cases from that institution, and extract data specifically in the way that WFO needs it, and also when a certain group of breast cancers are excluded, then concordance is 93%. That is quite a list of caveats already.
(C)ompartor: The treatments recommended by WFO are compared to the consensus recommendations of a human group of experts. The authors rightly noted that treatments might have changed between the time the humans recommended a treatment and WFO gave its recommendation. So they calculate concordance *twice*.
The first comparison is between treatment recommendations of the tumor board and WFO. The second is not so straightforward. All the cases in the first comparison for which there was no human/WFO agreement were taken back to the humans, who were asked if their opinion had changed since they last considered the case. A new comparison was then made between this subset of cases and WFO, and the 93% figure comes from this 2nd comparison.
We now have a new problem. You know you are at risk of introducing bias in a study if you do something to one group that is different to what you do to the other groups, but still pool the results. In this case, the tumor board was asked to re-consider some cases but not others.
The reason for not looking at cases for which there had been original agreement could only be that we can safely assume that the tumor board’s views would not have changed over time. The problem is that we cannot safely assume that. There is every reason to believe that for some of these cases, the board would have changed its view.
As a result, the only outcomes possible at the second comparisons are either no change in concordance or improvement in concordance. The system is inadvertently ’rigged’ to prevent us discovering there was a decrease in concordance over time because the cases that might show a decrease are excluded from measurement.
From my perspective as a reviewer that means I can’t trust the data in the second comparison because of a high risk of bias. The experiment would need to be re-run allowing all cases to be reconsidered. So, if we look at the WFO concordance rate at first comparison, which is now all I think we reasonably can look at, it drops from 93% to 73% (Table 2).
(O)utcome: The outcome of concordance is a process not a clinical outcome. That means that it measures a step in a pathway, but tells us nothing about what might have happened to a patient at the end of the pathway. To do that we would need some sort of conversion rate.
For example if we knew for x% of cases in which WFO suggested something humans had not considered, that humans would change their mind, this would allow us to gauge how important concordance was in shaping human decision making. Ideally we would like to know a ‘number needed to treat’ i.e. how many patients need to have their case considered by WFO for 1 patient to materially benefit e.g. live rather than die.
So whilst process outcomes are great early stepping-stones in assessing clinical interventions, they really cannot tell us much about eventual real world impact. At best they are a technical checkpoint as we gather evidence that a major clinical trial is worth doing.
In this paper, concordance is defined as a tricky composite variable. Concordance = all those cases for which WFO’s *recommendation* agreed with the human recommendation + all those cases in which the human recommendation appeared in a secondary WFO list of *for consideration* treatments.
The very first question I now want to know is ‘how often did human and WFO actually AGREE on a single treatment?” Data from the first human-WFO comparison point indicates that there was agreement on the *recommended* treatment in 46% of cases. That is a very different number to 93% or even 73%.
What is the problem with including the secondary ‘for consideration’ recommendations from WFO? In principle nothing, as we are at this point only measuring process not outcomes, but it might need to be measured in a very different way than it is at present for the true performance to be clear.
Our problem is that if we match a human recommended treatment to a WFO recommendation it is a 1:1 comparison. If we match human recommended to WFO ‘for consideration’ it is 1:x. Indeed from the paper I don’t know what x is. How long is this additional list of ‘for consideration’ treatments. 2,5,10? You can see the problem. If WFO just listed out the next 10 most likely treatments, there is a good chance the human recommended treatment might appear somewhere in the list. If it had listed just 2 that would be more impressive.
In information retrieval, there are metrics that deal with this type of data, and they might have been used here. You could for example report the median rank at which WFO ‘for considerations’ matched human recommendations. IN other words, rather than treating it as a binary yes/no we would better assess performance by measuring where in the list the correct recommendation is made.
Now lets us turn to the results proper. I have already explained my concerns about the headline results around concordance.
Interestingly concordance is reportedly higher for stage II and III disease and lower for stage 1 and 1V. That is interesting, but odd to me. Why would this be the case? Well, this study was not designed to test for whether concordance changes with stage – this is a post-hoc analysis. To do such a study, we would likely have had to have seek equal samples of each stage of disease. Looking at the data, stage I cases are significantly underrepresented compared to other stages. So interesting, but treat with caution.
There are other post-hoc analyses of receptor status and age and the relationship with concordance. Again this is interesting but post-hoc so treat it with caution.
Interestingly concordance decreased with age – and that might have something to do with external factors like co-morbidity starting to affect treatment recommendations. Humans might for example, reason that aggressive treatment may not make sense to a cancer patient with other illnesses and at a certain age. The best practice encoded in WFO might not take such preferences and nuances into account.
Limitations: The authors do a very good job overall in identifying many, but not all, of the issues I raise above. Despite these identified limitations, which I think are significant, they still see WFO’s producing a “high degree of agreement’ with humans. I don’t think the data yet supports that conclusion.
Conclusions: The authors conclude by suggest that as a result of this study WFO might be useful for centers with limited breast cancer resources. The evidence in this study doesn’t yet support such a conclusion. We have some data on concordance, but no data on how concordance affects human decisions, and no data on how changed decisions affects patient outcomes. Those two giant evidence gaps mean it might be a while before we safely trial WFO in real life.
So, in summary, my takeaway is that WFO generated the same treatment recommendation as humans for a subset of breast cancers at a single institution in 46% of cases. I am unclear how much influence human input had in presenting data to WFO, and there is a chance the performance might have been worse without human help (e.g. if WFO did its own text processing).
I look forward to hearing more about WFO and similar systems in other studies, and hope this review can help in framing future study designs. I welcome comments on this review.
[Note: There is rich commentary associated with the original Twitter thread, so it is worth reading if you wish to see additional issues and suggestions from the research community.]
February 11, 2016 § 6 Comments
Have we reached peak e-health yet?
Anyone who works in the e-health space lives in two contradictory universes.
The first universe is that of our exciting digital health future. This shiny gadget-laden paradise sees technology in harmony with the health system, which has become adaptive, personal, and effective. Diseases tumble under the onslaught of big data and miracle smart watches. Government, industry, clinicians and people off the street hold hands around the bonfire of innovation. Teeth are unfeasibly white wherever you look.
The second universe is Dickensian. It is the doomy world in which clinicians hide in shadows, forced to use clearly dysfunctional IT systems. Electronic health records take forever to use, and don’t fit clinical work practice. Health providers hide behind burning barricades when the clinicians revolt. Government bureaucrats in crisp suits dissemble in velvet-lined rooms, softly explaining the latest cost overrun, delay, or security breach. Our personal health files get passed by street urchins hand-to-hand on dirty thumbnail drives, until they end up in the clutches of Fagin like characters.
Both of these universes are real. We live in them every day. One is all upside, the other mostly down. We will have reached peak e-health the day that the downside exceeds the upside and stays there. Depending on who you are and what you read, for many clinicians, we have arrived at that point.
The laws of informatics
To understand why e-health often disappoints requires some perspective and distance. Informed observers again and again see the same pattern of large technology driven projects sucking up all the e-health oxygen and resources, and then failing to deliver. Clinicians see that the technology they can buy as a consumer is more beautiful and more useful that anything they encounter at work.
I remember a meeting I attended with Branko Cesnik. After a long presentation about a proposed new national e-health system, focusing entirely on technical standards and information architectures, Branko piped up: “Excuse me, but you’ve broken the first law of informatics”. What he meant was that the most basic premise for any clinical information system is that it exists to solve a clinical problem. If you start with the technology, and ignore the problem, you will fail.
There are many corollary informatics laws and principles. Never build a clinical system to solve a policy or administrative problem unless it is also solving a clinical problem. Technology is just one component of the socio-technical system, and building technology in isolation from that system just builds an isolated technology .
Breaking the laws of informatics
So, no e-health project starts in a vacuum of memory. Rarely do we need to design a system from first principles. We have many decades of experience to tell us what the right thing to do is. Many decades of what not to do sits on the shelf next to it. Next to these sits the discipline of health informatics itself. Whilst it borrows heavily from other disciplines, it has its own central reason to exist – the study of the health system, and of how to design ways of changing it for the better, supported by technology. Informatics has produced research in volume.
Yet today it would be fair to say that most people who work in the e-health space don’t know that this evidence exists, and if they know it does exist, they probably discount it. You might hear “N of 1” excuse making, which is the argument that the evidence “does not apply here because we are different” or “we will get it right where others have failed because we are smarter”. Sometimes system builders say that the only evidence that matters is their personal experience. We are engineers after all, and not scientists. What we need are tools, resources, a target and a deadline, not research.
Well, you are not different. You are building a complex intervention in a complex system, where causality is hard to understand, let alone control. While the details of your system might differ, from a complexity science perspective, each large e-health project ends up confronting the same class of nasty problem.
The results of ignoring evidence from the past are clear to see. If many of the clinical information systems I have seen were designed according to basic principles of human factors engineering, I would like to know what those principles are. If most of today’s clinical information systems are designed to minimize technology-induced harm and error, I will hold a party and retire, my life’s work done.
The basic laws of informatics exist, but they are rarely applied. Case histories are left in boxes under desks, rather than taught to practitioners. The great work of the informatics research community sits gathering digital dust in journals and conference proceedings, and does not inform much of what is built and used daily.
None of this story is new. Many other disciplines have faced identical challenges. The very name Evidence-based Medicine (EBM), for example, is a call to arms to move from anecdote and personal experience, towards research and data driven decision-making. I remember in the late ‘90s, as the EBM movement started (and it was as much a social movement as anything else), just how hard the push back was from the medical profession. The very name was an insult! EBM was devaluing the practical, rich daily experience of every doctor, who knew their patients ‘best’, and every patient was ‘different’ to those in the research trials. So, the evidence did not apply.
EBM remains a work in progress. All you need to do today is to see a map of clinical variation to understand that much of what is done remains without an evidence base to support it. Why is one kind of prosthetic hip joint used in one hospital, but a different one in another, especially given the differences in cost, hip failure and infection? Why does one developed country have high caesarian section rates when a comparable one does not? These are the result of pragmatic ‘engineering’ decisions by clinicians – to attack the solution to a clinical problem one way, and not another. I don’t think healthcare delivery is so different to informatics in that respect.
Is it time for evidence-based health informatics?
It is time we made the praxis of informatics evidence-based.
That means we should strive to see that every decision that is made about the selection, design, implementation and use of an informatics intervention is based on rigorously collected and analyzed data. We should choose the option that is most likely to succeed based on the very best evidence we have.
For this to happen, much needs to change in the way that research is conducted and communicated, and much needs to happen in the way that informatics is practiced as well:
- We will need to develop a rich understanding of the kinds of questions that informatics professionals ask every day;
- Where the evidence to answer a question exists, we need robust processes to synthesize and summarize that evidence into practitioner actionable form;
- Where the evidence does not exist and the question is important, then it is up to researchers to conduct the research that can provide the answer.
In EBM, there is a lovely notion that we need problem oriented evidence that matters (POEM)  (covered in some detail in Chapter 6 of The Guide to Health Informatics). It is easy enough to imagine the questions that can be answered with informatics POEMs:
- What is the safe limit to the number of medications I can show a clinician in a drop-down menu?
- I want to improve medication adherence in my Type 2 Diabetic patients. Is a text message reminder the most cost-effective solution?
- I want to reduce the time my docs spend documenting in clinic. What is the evidence that an EHR can reduce clinician documentation time?
- How gradually should I roll out the implementation of the new EHR in my hospital?
- What changes will I need to make to the workflow of my nursing staff if I implement this new medication management system?
EBM also emphasises that the answer to any question is never an absolute one based on the science, because the final decision is also shaped by patient preferences. A patient with cancer may choose a treatment that is less likely to cure them, because it is also less likely to have major side-effects, which is important given their other goals. The same obviously holds in evidence-based health informatics (EBHI).
The Challenges of EBHI
Making this vision come true would see some significant long term changes to the business of health informatics research and praxis:
- Questions: Practitioners will need develop a culture of seeking evidence to answer questions, and not simply do what they have always done, or their colleagues do. They will need to be clear about their own information needs, and to be trained to ask clear and answerable questions. There will need to be a concerted partnership between practitioners and researchers to understand what an answerable question looks like. EBM has a rich taxonomy of question types and the questions in informatics will be different, emphasizing engineering, organizational, and human factors issues amongst others. There will always be questions with no answer, and that is the time experience and judgment come to the fore. Even here though, analytic tools can help informaticians explore historical data to find the best historical evidence to support choices.
- Answers: The Cochrane Collaboration helped pioneer the development of robust processes of meta-analysis and systematic review, and the translation of these into knowledge products for clinicians. We will need to develop a new informatics knowledge translational profession that is responsible for understanding informatics questions, and finding methods to extract the most robust answers to them from the research literature and historical data. As much of this evidence does not typically come from randomised controlled trials, other methods than meta-analysis will be needed. Case libraries, which no doubt exist today, will be enhanced and shaped to support the EBHI enterprise. Because we are informaticians, we will clearly favor automated over manual ways of searching for, and summarizing, the research evidence . We will also hopefully excel at developing the tools that practitioners use to frame their questions and get the answers they need. There are surely both public good and commercial drivers to support the creation of the knowledge products we need.
- Bringing implementation science to informatics: We know that informatics interventions are complex interventions in complex systems, and that the effect of these interventions vary depending on the organisational context. So, the practice of EBHI will of necessity see answers to questions being modified because of local context. I suspect that this will mean that one of the major research challenges to emerge from embracing EBHI is to develop robust and evidence-based methods to support localization or contextualisation of knowledge. While every context is no doubt unique, we should be able to draw upon the emerging lessons of implementation science to understand how to support local variation in a way that is most likely to see successful outcomes.
- Professionalization: Along with culture change would come changes to the way informatics professionals are accredited, and reaccredited. Continuing professional education is a foundation of the reaccreditation process, and provides a powerful opportunity for professionals to catch up with the major changes in science, and how those changes impact the way they should approach their work.
There comes a moment when surely it is time to declare that enough is enough. There is an unspoken crisis in e-health right now. The rhetoric of innovation, renewal, modernization and digitization make us all want to believers. The long and growing list of failed large-scale e-health projects, the uncomfortable silence that hangs when good people talk about the safety risks of technology, make some think that e-health is an ill-conceived if well intentioned moment in the evolution of modern health care. This does not have to be.
To avoid peak e-health we need to not just minimize the downside of what we do by avoiding mistakes. We also have to maximize the upside, and seize the transformative opportunities technology brings.
Everything I have seen in medicine’s journey to become evidence-based tells me that this will not be at all easy to accomplish, and that it will take decades. But until we do, the same mistakes will likely be rediscovered and remade.
We have the tools to create a different universe. What is needed is evidence, will, a culture of learning, and hard work. Less Dickens and dystopia. More Star Trek and utopia.
Since I wrote this blog a collection of important papers covering the important topic of how we evaluate health informatics and choose which technologies are fit for purpose has been published in the book Evidence-based Health Informatics.
- Slawson DC, Shaughnessy AF, Bennett JH. Becoming a medical information master: feeling good about not knowing everything. The Journal of Family Practice 1994;38(5):505-13
- Tsafnat G, Glasziou PP, Choong MK, et al. Systematic Review Automation Technologies. Systematic Reviews 2014;3(1):74
- Coiera E. Four rules for the reinvention of healthcare. BMJ 2004;328(7449):1197-99