[For convenience I collect here and slightly rearrange and update my Twitter review of a recent paper comparing the performance of the Babylon chatbot against human doctors. As ever my purpose is to focus on the scientific quality of the paper, identify design weaknesses, and suggest improvements in future studies.]
Here is my peer review of the Babylon chatbot as described in the conference paper at https://marketing-assets.babylonhealth.com/press/BabylonJune2018Paper_Version1.4.2.pdf
Please feel free to correct any misunderstandings I have of the evaluation in the tweets that follow.
To begin, the Babylon engine is a Bayesian reasoner. That’s cool. Not sure if it qualifies as AI.
The evaluation uses artificial patient vignettes which are presented in a structured format to human GPs or a Babylon operator. So the encounter is not naturalistic. It doesn’t test Babylon in front of real patients.
In the vignettes, patients were played by GPs, some of whom were employed by Babylon. So they might know how Babylon liked information to be presented and unintentionally advantaged it. Using independent actors, or ideally real patients, would have had more ecological validity.
A human is part of the Babylon intervention because a human has to translate the presented vignette and enter it into Babylon. In other words, the human heard information that they then translated into terminology Babylon recognises. The impact of this human is not explicitly measured. For example, rather than being a ‘mere’ translator, the human may occasionally have had to make clinical inferences to match case data to Babylon’s capability. If so, the knowledge to do that would be external to Babylon and yet contribute to its performance.
The vignettes were designed to test known capabilities of the system. Independently created vignettes exploring other diagnoses would likely have resulted in a much poorer performance. This tests Babylon on what it knows, not what it might find ‘in the wild’.
It seems the presentation of information was in the OSCE format, which is artificial and not how patients might present. So there was no real testing of consultation and listening skills that would be needed to manage a real world patient presentation.
Babylon is a Bayesian reasoner but no information was presented on the ‘tuning’ of priors required to get this result. This makes replication hard. A better paper would provide the diagnostic models to allow independent validation.
The quality of differential diagnoses by humans and Babylon was assessed by one independent individual. In addition two Babylon employees also rated differential diagnosis quality. Good research practice is to use multiple independent assessors and measure inter-rater reliability.
The safety assessment has the same flaw. Only 1 independent assessor was used and no inter-rater reliability measures are presented when several in-house assessors are added. Non-independent assessors bring a risk of bias.
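To make the inter-rater reliability point concrete, here is a minimal sketch of Cohen's kappa, one common agreement statistic for two raters. The paper does not say which measure would have suited its rating scales, so the choice of kappa, the function name and the rating labels are all my assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Illustrative sketch only: the labels and the choice of kappa are
    assumptions, not something taken from the Babylon paper.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of four differential diagnoses by two assessors.
assessor_1 = ["good", "good", "poor", "poor"]
assessor_2 = ["good", "poor", "good", "poor"]
kappa = cohens_kappa(assessor_1, assessor_2)
```

A kappa near 1 means the assessors agree far more than chance would predict, and a value near 0 means their agreement is no better than chance. With more than two assessors a multi-rater statistic would be needed instead.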
To give some further evaluation, additional vignettes are used, based on MRCGP tests. However any vignettes outside of the Babylon system’s capability were excluded. They only tested Babylon on vignettes it had a chance to get right.
So, whilst it might be ok for limited testing to let Babylon answer only the questions it is good at, the humans did not have a reciprocal right to exclude vignettes they were not good at. This is a fundamental bias in the evaluation design.
A better evaluation model would have been to draw a random subset of cases and present them to both GPs and Babylon.
No statistical testing is done to check if the differences reported are likely due to chance variation. A statistically rigorous study would estimate the likely effect size and use that to determine the sample size needed to detect a difference between machine and human.
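As a rough sketch of what that sample size calculation looks like, the snippet below uses the standard normal-approximation formula for comparing two independent proportions. The accuracy figures, the function name and the assumption that cases are independent (ignoring clustering by doctor) are mine, chosen only to illustrate the method.

```python
import math
from statistics import NormalDist


def cases_per_group(p_human, p_machine, alpha=0.05, power=0.80):
    """Cases needed in EACH group to detect p_human vs p_machine.

    Normal-approximation formula for a two-sided test of two independent
    proportions. The inputs are hypothetical accuracy rates, not values
    taken from the Babylon paper.
    """
    if p_human == p_machine:
        raise ValueError("the two proportions must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_human + p_machine) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_human * (1 - p_human) + p_machine * (1 - p_machine))
    ) ** 2
    return math.ceil(numerator / (p_human - p_machine) ** 2)


# Hypothetical example: 80% versus 90% diagnostic accuracy.
n_needed = cases_per_group(0.80, 0.90)
```

The useful intuition is that the required number of cases grows quickly as the expected difference shrinks, so a study hoping to detect a modest gap between machine and human needs far more than a few dozen vignettes.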
For the first evaluation study, the methods section tells us “The study was conducted in four rounds over consecutive days. In each round, there were up to four “patients” and four doctors.” That should mean each doctor and Babylon should have seen “up to” 16 cases.
Table 1 shows Babylon used on 100 vignettes and doctors typically saw about 50. This makes no sense. Possibly they lump in the 30 Semigran cases reported separately, but that still does not add up. Further, as the methods for Semigran were different, they cannot be added in any case.
There is a problem with Doctor B, who completes 78 vignettes. The others do about 50. Further, looking at Table 1 and Fig 1, Doctor B is an outlier, performing far worse than the others diagnostically. This unbalanced design may mean average doctor performance is penalised by Doctor B or the additional cases they saw.
Good research practice is to report who study subjects are and how they were recruited. All we know is that these were locum GPs paid to do the study. We should be told their age, experience and level of training, perhaps where they were trained, whether they were independent or had a prior link to the researchers doing the study etc. We would like to understand if B was somehow “different” because their performance certainly was.
Removing B from the data set and recalculating results shows humans beating Babylon on every measure in Table 1.
With such a small sample size of doctors, the results are thus very sensitive to each individual case and doctor, and adding or removing a doctor can change the outcomes substantially. That is why we need statistical testing.
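The kind of sensitivity check I am describing can be sketched as a leave-one-doctor-out recalculation of pooled accuracy. The doctor names and counts below are invented placeholders, not the paper's data, and pooling correct answers over cases is just one assumed way of averaging.

```python
def pooled_accuracy(results):
    """Pooled accuracy from {doctor: (n_correct, n_cases)}."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    return correct / total


def leave_one_out(results):
    """Pooled accuracy recomputed with each doctor removed in turn."""
    return {
        left_out: pooled_accuracy(
            {name: r for name, r in results.items() if name != left_out}
        )
        for left_out in results
    }


# Hypothetical placeholder data: (correct diagnoses, cases seen).
hypothetical = {"doc_1": (8, 10), "doc_2": (4, 10), "doc_3": (9, 10)}
overall = pooled_accuracy(hypothetical)
sensitivity = leave_one_out(hypothetical)
```

If the leave-one-out values swing widely around the overall figure, as they do when one subject is an outlier, the headline average is telling us more about who happened to be recruited than about doctors in general.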
There is also a Babylon problem. It sees on average about twice as many cases as the doctors. As we are given no rule for how the additional cases seen by Babylon were selected, there is a risk of selection bias, e.g. what if by chance the ‘easy’ cases were only seen by Babylon?
A better study design would allocate each subject exactly the same cases, to allow meaningful and direct performance comparison. To avoid known biases associated with presentation order of cases, case allocation should be random for each subject.
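A minimal sketch of that allocation scheme: every subject, human or machine, gets the identical set of cases, each in an independently shuffled order. The case identifiers, subject names and seed are hypothetical.

```python
import random


def allocate_cases(case_ids, subjects, seed=0):
    """Give every subject the same cases in an independently random order.

    Sketch only: identifiers and the seed are illustrative assumptions.
    A fixed seed makes the allocation reproducible and auditable.
    """
    rng = random.Random(seed)
    allocation = {}
    for subject in subjects:
        order = list(case_ids)  # copy so each subject gets a fresh shuffle
        rng.shuffle(order)
        allocation[subject] = order
    return allocation


cases = [f"vignette_{i}" for i in range(1, 11)]
subjects = ["doctor_1", "doctor_2", "doctor_3", "chatbot"]
plan = allocate_cases(cases, subjects, seed=42)
```

Because all subjects see exactly the same cases, any difference in scores cannot be explained by one of them drawing an easier set, and the per-subject shuffle spreads order effects such as fatigue or learning across cases.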
For the MRCGP questions Babylon’s diagnostic accuracy is measured by its ability to identify a disease within its top 3 differential diagnoses. It identified the right diagnosis in its top 3 in 75% of 36 MRCGP CSA vignettes, and 87% of 15 AKT vignettes.
For the MRCGP questions we are not given Babylon’s performance when the measure is the top differential. Media reports compare Babylon against historical MRCGP human results. One assumes humans had to produce the correct diagnosis, and were not asked for a top 3 differential.
There is a huge clinical difference between placing a disease somewhere in your top few differential diagnoses and naming it as the top one you elect to investigate. It is also an unfair comparison if Babylon is rated on a top 3 differential and humans on a top 1. Clarity on this aspect would be valuable.
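To show how much the choice of measure matters, here is a small sketch that scores the same ranked differentials under both a top 1 and a top 3 rule. The diagnosis labels are made-up placeholders, not cases from the paper.

```python
def top_k_accuracy(ranked_differentials, true_diagnoses, k):
    """Fraction of cases whose true diagnosis is in the first k of the ranked list."""
    if len(ranked_differentials) != len(true_diagnoses) or not true_diagnoses:
        raise ValueError("need one ranked differential per case")
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_differentials, true_diagnoses)
    )
    return hits / len(true_diagnoses)


# Hypothetical ranked differentials for four cases (most likely first).
differentials = [
    ["dx_a", "dx_b", "dx_c"],
    ["dx_b", "dx_a", "dx_c"],
    ["dx_c", "dx_b", "dx_a"],
    ["dx_d", "dx_e", "dx_f"],
]
truths = ["dx_a", "dx_a", "dx_a", "dx_x"]

top_1 = top_k_accuracy(differentials, truths, k=1)
top_3 = top_k_accuracy(differentials, truths, k=3)
```

In this toy example the very same answers score 25% under a top 1 rule and 75% under a top 3 rule, which is why both sides of a comparison must be scored with the same k.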
In closing, we are only ever given one clear head to head comparison between humans and Babylon, and that is on the 30 Semigran cases. Humans outperform Babylon when the measure is the top diagnosis. Even here though, there is no statistical testing.
So, in summary, this is a very preliminary and artificial test of a Bayesian reasoner on cases for which it has already been trained.
In machine learning this would be roughly equivalent to in-sample reporting of performance on the data used to develop the algorithm. Good practice is to report out of sample performance on previously unseen cases.
The results are confounded by artificial conditions and use of few and non-independent assessors.
There is lack of clarity in the way data are analysed and there are numerous risks of bias.
Critically, no statistical testing is performed to tell us whether any of the differences seen mean anything. Further, with such a small sample of GPs, it is unlikely that this study was adequately powered to detect a difference, if one exists.
So, it is fantastic that Babylon has undertaken this evaluation, and has sought to present it in public via this conference paper. They are to be applauded for that. One of the benefits of going public is that we can now provide feedback on the study’s strengths and weaknesses.