Paper Review: the Babylon Chatbot

June 29, 2018

[For convenience I collect here and slightly rearrange and update my Twitter review of a recent paper comparing the performance of the Babylon chatbot against human doctors. As ever my purpose is to focus on the scientific quality of the paper, identify design weaknesses, and suggest improvements in future studies.]

 

Here is my peer review of the Babylon chatbot as described in the conference paper at https://marketing-assets.babylonhealth.com/press/BabylonJune2018Paper_Version1.4.2.pdf

Please feel free to correct any misunderstandings I have of the evaluation in the tweets that follow.

To begin, the Babylon engine is a Bayesian reasoner. That’s cool. Not sure if it qualifies as AI.
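
For readers unfamiliar with the term, here is a minimal sketch of the kind of Bayesian update such a diagnostic engine performs. The diseases, priors and likelihoods below are invented for illustration; they have nothing to do with Babylon's actual model.

```python
# A toy Bayesian diagnostic update. All numbers are invented for
# illustration; this is not Babylon's model.

priors = {"flu": 0.05, "common_cold": 0.20, "meningitis": 0.001}

# P(fever | disease), again purely illustrative
fever_likelihood = {"flu": 0.90, "common_cold": 0.30, "meningitis": 0.95}

def posterior(priors, likelihood):
    """Bayes' rule for a single observed symptom."""
    unnormalised = {d: priors[d] * likelihood[d] for d in priors}
    total = sum(unnormalised.values())
    return {d: p / total for d, p in unnormalised.items()}

# The ranked posterior is the engine's 'differential diagnosis'
print(sorted(posterior(priors, fever_likelihood).items(),
             key=lambda kv: -kv[1]))
```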

The evaluation uses artificial patient vignettes which are presented in a structured format to human GPs or a Babylon operator. So the encounter is not naturalistic. It doesn’t test Babylon in front of real patients.

In the vignettes, patients were played by GPs, some of whom were employed by Babylon. So they might know how Babylon liked information to be presented and unintentionally advantaged it. Using independent actors, or ideally real patients, would have had more ecological validity.

A human is part of the Babylon intervention, because a human has to translate the presented vignette and enter it into Babylon. In other words, the human heard information that they then translated into terminology Babylon recognises. The impact of this human is not explicitly measured. For example, rather than being a ‘mere’ translator, the human may occasionally have had to make clinical inferences to match case data to Babylon’s capabilities. If so, knowledge external to Babylon would be contributing to its measured performance.

The vignettes were designed to test known capabilities of the system. Independently created vignettes exploring other diagnoses would likely have resulted in much poorer performance. This tests Babylon on what it knows, not on what it might find ‘in the wild’.

It seems the presentation of information was in the OSCE format, which is artificial and not how patients might present. So there was no real testing of consultation and listening skills that would be needed to manage a real world patient presentation.

Babylon is a Bayesian reasoner but no information was presented on the ‘tuning’ of priors required to get this result. This makes replication hard. A better paper would provide the diagnostic models to allow independent validation.

The quality of differential diagnoses by humans and Babylon was assessed by one independent individual. In addition, two Babylon employees also rated differential diagnosis quality. Good research practice is to use multiple independent assessors and to measure inter-rater reliability.

The safety assessment has the same flaw. Only one independent assessor was used, and no inter-rater reliability measures are presented when the several in-house assessors are added. Non-independent assessors bring a risk of bias.

To extend the evaluation, additional vignettes were used, based on MRCGP tests. However, any vignettes outside the Babylon system’s capabilities were excluded. They only tested Babylon on vignettes it had a chance to get right.

So, whilst it might be OK to allow Babylon to answer only questions it is good at for limited testing, the humans did not have a reciprocal right to exclude vignettes they were not good at. This is a fundamental bias in the evaluation design.

A better evaluation model would have been to draw a random subset of cases and present them to both GPs and Babylon.

No statistical testing is done to check if the differences reported are likely due to chance variation. A statistically rigorous study would estimate the likely effect size and use that to determine the sample size needed to detect a difference between machine and human.
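
As a sketch of what that would look like: suppose we wanted to detect a difference between 80% and 90% diagnostic accuracy (hypothetical figures, not from the paper). A standard two-proportion power calculation gives the required sample size:

```python
# Sample size needed to detect a hypothetical 80% vs 90% accuracy
# difference at alpha = 0.05 with 80% power. Numbers are illustrative.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.80, 0.90)   # Cohen's h
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.80, alternative='two-sided')
print(round(n))   # ~98 vignettes per arm under these assumptions
```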

For the first evaluation study, the methods tell us: “The study was conducted in four rounds over consecutive days. In each round, there were up to four ‘patients’ and four doctors.” That should mean each doctor, and Babylon, saw “up to” 16 cases.

Table 1 shows Babylon was used on 100 vignettes while the doctors typically saw about 50. This makes no sense. Possibly they lump in the 30 Semigran cases reported separately, but that still does not add up. Further, as the methods for the Semigran cases were different, they cannot be added in any case.

There is a problem with Doctor B, who completed 78 vignettes while the others did about 50. Looking further at Table 1 and Fig 1, Doctor B is an outlier, performing far worse diagnostically than the others. This unbalanced design may mean average doctor performance is penalised by Doctor B or by the additional cases they saw.

Good research practice is to report who the study subjects are and how they were recruited. All we know is that these were locum GPs paid to do the study. We should be told their age, experience and level of training, perhaps where they were trained, and whether they were independent or had a prior link to the researchers doing the study. We would like to understand whether B was somehow “different”, because their performance certainly was.

Removing B from the data set and recalculating results shows humans beating Babylon on every measure in Table 1.

With such a small sample size of doctors, the results are thus very sensitive to each individual case and doctor, and adding or removing a doctor can change the outcomes substantially. That is why we need statistical testing.

There is also a Babylon problem. It saw on average about twice as many cases as the doctors. As no rule is provided for how the additional cases seen by Babylon were selected, there is a risk of selection bias, e.g. what if by chance the ‘easy’ cases were seen only by Babylon?

A better study design would allocate each subject exactly the same cases, to allow meaningful and direct performance comparison. To avoid known biases associated with presentation order of cases, case allocation should be random for each subject.
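
A minimal sketch of such a design, with hypothetical case IDs and subjects: every subject receives an identical case set, each in an independently shuffled order.

```python
# Balanced random allocation: same cases for every subject,
# independently shuffled presentation order. IDs are hypothetical.
import random

vignettes = [f"case_{i:03d}" for i in range(1, 51)]
subjects = ["babylon", "gp_a", "gp_b", "gp_c", "gp_d"]

rng = random.Random(42)          # fixed seed for a reproducible allocation
schedule = {}
for subject in subjects:
    order = vignettes[:]         # identical case set for every subject
    rng.shuffle(order)           # independent order per subject
    schedule[subject] = order
```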

For the MRCGP questions, Babylon’s diagnostic accuracy is measured by its ability to identify the disease within its top three differential diagnoses. It identified the right diagnosis in its top three in 75% of 36 MRCGP CSA vignettes, and in 87% of 15 AKT vignettes.

For the MRCGP questions we are not given Babylon’s performance when the measure is the top differential. Media reports compare Babylon against historical MRCGP human results. One assumes the humans had to produce the correct diagnosis, and were not asked for a top-three differential.

There is a huge difference clinically between putting a disease somewhere in your top few differential diagnoses and making it the top one you elect to investigate. It is also an unfair comparison if Babylon is rated on a top-three differential and humans on a top-one. Clarity on this aspect would be valuable.

In closing, we are only ever given one clear head-to-head comparison between humans and Babylon, and that is on the 30 Semigran cases. Humans outperform Babylon when the measure is the top diagnosis. Even here, though, there is no statistical testing.
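
Because both saw the same 30 cases, a paired test such as McNemar’s would be the natural check. Here is a sketch with entirely hypothetical cell counts (the paper does not report the paired breakdown):

```python
# McNemar's test on paired correct/incorrect outcomes over the same
# 30 cases. The cell counts below are hypothetical, not the paper's data.
from statsmodels.stats.contingency_tables import mcnemar

#                    Babylon correct | Babylon wrong
table = [[18, 8],   # doctor correct
         [2,  2]]   # doctor wrong

result = mcnemar(table, exact=True)  # exact test on the discordant pairs
print(result.pvalue)  # ~0.11 here: suggestive, not significant at n=30
```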


So, in summary, this is a very preliminary and artificial test of a Bayesian reasoner on cases for which it has already been trained.

In machine learning this would be roughly equivalent to in-sample reporting of performance on the data used to develop the algorithm. Good practice is to report out of sample performance on previously unseen cases.

The results are confounded by artificial conditions and the use of few and non-independent assessors.

There is a lack of clarity in the way the data are analysed, and there are numerous risks of bias.

Critically, no statistical testing is performed to tell us whether any of the differences seen mean anything. Further, the small sample of GPs tested means it is unlikely that this study was adequately powered to detect a difference, if one exists.

So, it is fantastic that Babylon has undertaken this evaluation, and has sought to present it in public via this conference paper. They are to be applauded for that. One of the benefits of going public is that we can now provide feedback on the study’s strengths and weaknesses.


Journal Review: Watson for Oncology in Breast Cancer

March 9, 2018

How should we interpret research reporting the performance of #AI in clinical practice?

[This blog collects together in one place a twitter review published 9 March 2018 at https://twitter.com/EnricoCoiera/status/971886744101515265 ]

Today we are reading “Watson for Oncology and breast cancer treatment recommendations: agreement with an expert multidisciplinary tumor board” that has just appeared in the Annals of Oncology.

https://academic.oup.com/annonc/article/29/2/418/4781689

This paper studies the degree of agreement or “concordance” between Watson for Oncology (WFO) and an expert panel of clinicians on a ’tumor board’. It reports an impressive 93% concordance between human experts and WFO when recommending treatment for breast cancer.

Unfortunately the paper is not open access but you can read the abstract. I’d suggest reading the paper before you read further into this thread. My question to you: Do the paper’s methods allow us to have confidence in the impressive headline result?

We should begin by congratulating the authors on completing a substantial piece of work in an important area. What follows is the ‘review’ I would have written if the journal had asked me. It is not a critique of individuals or technology and it is presented for educational purposes.

Should we believe the results are valid and, secondly, do we believe they generalize to other places or systems? To answer this we need to examine the quality of the study, the quality of the data analysis, and the accuracy of the conclusions drawn from the analysis.

I like to tease apart the research methods section using PICO headings – Population, Intervention, Comparator, Outcome.

(P)opulation. 638 breast cancer patients presented between 2014 and 2016 at a single institution. However, the study excluded patients with colloid, adenocystic, tubular, or secretory breast cancer “since WFO was not trained to offer treatment recommendations for these tumor types”.

So we have our first issue. We are not told how representative this population is of the expected distribution of breast cancer cases at the hospital or in the general population. We need to know if these 3 study years were somehow skewed by abnormal presentations.

We also need to know if this hospital’s case-mix is normal or somehow different to that in others. It’s critical because any claim for the results here to generalize elsewhere depends on how representative the population is.

Also, what do we think of the phrase “since WFO was not trained to offer treatment recommendations for these tumor types”? It means that irrespective of how good the research methods are, the result will not necessarily hold for *all* breast cancer cases.

All they can claim is that any result holds for this subset of breast cancers. We have no evidence to suggest that performance would be the same on the excluded cancers. Unfortunately, the abstract and the paper’s results section do not include this very important caveat.

(I)ntervention. I’m looking for a clear description of the intervention to understand 1/ Could someone replicate this intervention independently to validate the results? 2/ What exactly was done so that we can connect cause (intervention) with effect (the outcomes)?

WFO is the intervention. But it is never explicitly described. For any digital intervention, even if you don’t tell me what is inside the black box, I need 2 things: 1/ Describe exactly the INPUT into the system and 2/ describe exactly the OUTPUT from the system.

This paper unfortunately does neither. There is no example of how a cancer case is encoded for WFO to interpret, nor is there an example of a WFO recommendation that humans need to read.

So the intervention is not reported in enough detail for independent replication, we do not have enough detail to understand the causal mechanism tested, and we don’t know if biases are hidden in the intervention. This makes it hard to judge study validity or generalizability.

Digging into an online appendix, we do discover some details about the input and the WFO mechanism. It appears WFO takes a feature vector as input:

 

[Screenshot from the online supplement (supplement_mdx781.docx): the WFO input feature vector]

It also appears that this vector is then fed into an unspecified machine learning engine that presumably associates disease vectors with treatments.

[Screenshot from the online supplement (supplement_mdx781.docx): the machine learning engine that maps feature vectors to treatments]

So, for all the discussion of WFO as a text-processing machine, at its heart might be a statistical classification engine. It’s a pity we don’t know any more, and it’s a pity we don’t know how much human labour is hidden in preparing feature vectors and training cases.

But there is enough detail in the online appendix to get a feeling for what, in general, was done. They really should have been much more explicit and included the input feature vector and the output treatment recommendation description in the main paper.

But was WFO really the only intervention? No, there was also a human intervention that needs to be accounted for, and it might very well have been responsible for some percentage of the results.

The Methods reports that two humans (trained senior oncology fellows) entered the data manually into WFO. They read the cases, identified the data that matched the input features, and then decided the scores for each feature. What does this mean?

Firstly, there was no testing of inter-rater reliability. We don’t know if the two humans coded the same cases in the same way. Normally a kappa statistic is provided to measure the degree of agreement between humans and give us a sense of how replicable their coding was.

For example, a low kappa means that agreement is low and that therefore any results are less likely to replicate in a future study. We really need that kappa to trust the methods were robust.
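
Here is a sketch of the missing check, using scikit-learn’s implementation of Cohen’s kappa on invented codings of the same cases by the two fellows:

```python
# Cohen's kappa between two raters coding the same cases.
# The labels are invented purely for illustration.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["ER+", "ER+", "HER2+", "ER-", "HER2+", "ER+", "ER-", "ER+"]
rater_2 = ["ER+", "ER+", "HER2+", "ER+", "HER2+", "ER+", "ER-", "ER+"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(round(kappa, 2))  # ~0.79 here; values below ~0.6 usually raise concern
```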

If humans are pre-digesting data for WFO how much of WFO’s performance is due to the way humans seek, identify and synthesize data? We don’t know. One might argue that the hard information detection and analysis task is done by humans and the easier classification task by WFO.

So far we have got to the point where the paper’s result is not that WFO had a 93% concordance with human experts, but rather that, when humans from a single institution read cancer cases from that institution, and extract data specifically in the way that WFO needs it, and when a certain group of breast cancers is excluded, then concordance is 93%. That is quite a list of caveats already.

(C)omparator: The treatments recommended by WFO are compared to the consensus recommendations of a human group of experts. The authors rightly noted that treatments might have changed between the time the humans recommended a treatment and the time WFO gave its recommendation. So they calculate concordance *twice*.

The first comparison is between treatment recommendations of the tumor board and WFO. The second is not so straightforward. All the cases in the first comparison for which there was no human/WFO agreement were taken back to the humans, who were asked if their opinion had changed since they last considered the case. A new comparison was then made between this subset of cases and WFO, and the 93% figure comes from this 2nd comparison.

We now have a new problem. You know you are at risk of introducing bias in a study if you do something to one group that is different to what you do to the other groups, but still pool the results. In this case, the tumor board was asked to re-consider some cases but not others.

The reason for not looking at cases for which there had been original agreement could only be that we can safely assume that the tumor board’s views would not have changed over time. The problem is that we cannot safely assume that. There is every reason to believe that for some of these cases, the board would have changed its view.

As a result, the only outcomes possible at the second comparisons are either no change in concordance or improvement in concordance. The system is inadvertently ’rigged’ to prevent us discovering there was a decrease in concordance over time because the cases that might show a decrease are excluded from measurement.

From my perspective as a reviewer that means I can’t trust the data in the second comparison because of a high risk of bias. The experiment would need to be re-run allowing all cases to be reconsidered. So, if we look at the WFO concordance rate at first comparison, which is now all I think we reasonably can look at, it drops from 93% to 73% (Table 2).

(O)utcome: Concordance is a process measure, not a clinical outcome. That means it measures a step in a pathway, but tells us nothing about what might have happened to the patient at the end of the pathway. To do that we would need some sort of conversion rate.

For example, if we knew that in x% of cases where WFO suggested something the humans had not considered, the humans would change their mind, this would allow us to gauge how important concordance is in shaping human decision making. Ideally we would like to know a ‘number needed to treat’, i.e. how many patients need to have their case considered by WFO for one patient to materially benefit, e.g. live rather than die.

So whilst process outcomes are great early stepping-stones in assessing clinical interventions, they really cannot tell us much about eventual real world impact. At best they are a technical checkpoint as we gather evidence that a major clinical trial is worth doing.

In this paper, concordance is defined as a tricky composite variable. Concordance = all those cases for which WFO’s *recommendation* agreed with the human recommendation + all those cases in which the human recommendation appeared in a secondary WFO list of *for consideration* treatments.

The first question I now want answered is: how often did human and WFO actually AGREE on a single treatment? Data from the first human-WFO comparison indicates that there was agreement on the *recommended* treatment in 46% of cases. That is a very different number to 93%, or even 73%.
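
A toy calculation makes the gap plain. The split below is hypothetical, chosen only to mirror the 46% and 73% figures discussed here:

```python
# Strict vs composite concordance over the same cases.
# Counts are hypothetical, chosen to mirror the figures in the text.
n_cases = 100
top_rec_agrees = 46      # WFO's *recommended* treatment matched the board
consider_only = 27       # board's choice appeared only in 'for consideration'

strict = top_rec_agrees / n_cases                       # 0.46
composite = (top_rec_agrees + consider_only) / n_cases  # 0.73
print(strict, composite)  # same data, two very different headlines
```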

What is the problem with including the secondary ‘for consideration’ recommendations from WFO? In principle nothing, as we are at this point only measuring process not outcomes, but it might need to be measured in a very different way than it is at present for the true performance to be clear.

Our problem is that if we match a human recommended treatment to a WFO recommendation it is a 1:1 comparison. If we match a human recommendation to the WFO ‘for consideration’ list it is 1:x. Indeed, from the paper I don’t know what x is. How long is this additional list of ‘for consideration’ treatments? 2, 5, 10? You can see the problem. If WFO just listed the next 10 most likely treatments, there is a good chance the human-recommended treatment would appear somewhere in the list. If it listed just 2, that would be more impressive.

In information retrieval there are metrics that deal with this type of data, and they might have been used here. You could, for example, report the median rank at which WFO ‘for considerations’ matched human recommendations. In other words, rather than treating it as a binary yes/no, we would better assess performance by measuring where in the list the correct recommendation is made.
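
Here is a sketch of such rank-aware scoring, with invented ranks:

```python
# Rank-aware scoring: record where in WFO's output the board's treatment
# appeared. Ranks are invented; None = not in WFO's output at all.
from statistics import median

ranks = [1, 1, 3, 2, None, 1, 5, 2, None, 4]

found = [r for r in ranks if r is not None]
print("median rank when found:", median(found))          # 2.0
print("mean reciprocal rank:",
      round(sum(1 / r for r in found) / len(ranks), 2))  # 0.48; misses score 0
```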

Now let us turn to the results proper. I have already explained my concerns about the headline results around concordance.

Interestingly, concordance is reportedly higher for stage II and III disease and lower for stage I and IV. That is interesting, but odd to me. Why would this be the case? Well, this study was not designed to test whether concordance changes with stage – this is a post-hoc analysis. To do such a study, we would likely have had to seek equal samples of each stage of disease. Looking at the data, stage I cases are significantly underrepresented compared to other stages. So: interesting, but treat with caution.

There are other post-hoc analyses of receptor status and age and the relationship with concordance. Again this is interesting but post-hoc so treat it with caution.

Interestingly, concordance decreased with age – and that might have something to do with external factors like co-morbidity starting to affect treatment recommendations. Humans might, for example, reason that aggressive treatment does not make sense for a cancer patient of a certain age with other illnesses. The best practice encoded in WFO might not take such preferences and nuances into account.

Limitations: The authors do a very good job overall of identifying many, but not all, of the issues I raise above. Despite these identified limitations, which I think are significant, they still see WFO producing a “high degree of agreement” with humans. I don’t think the data yet supports that conclusion.

Conclusions: The authors conclude by suggesting that, as a result of this study, WFO might be useful for centers with limited breast cancer resources. The evidence in this study doesn’t yet support such a conclusion. We have some data on concordance, but no data on how concordance affects human decisions, and no data on how changed decisions affect patient outcomes. Those two giant evidence gaps mean it might be a while before we can safely trial WFO in real life.

So, in summary, my takeaway is that WFO generated the same treatment recommendation as humans for a subset of breast cancers at a single institution in 46% of cases. I am unclear how much influence human input had in presenting data to WFO, and there is a chance the performance might have been worse without human help (e.g. if WFO did its own text processing).

I look forward to hearing more about WFO and similar systems in other studies, and hope this review can help in framing future study designs. I welcome comments on this review.

[Note: There is rich commentary associated with the original Twitter thread, so it is worth reading if you wish to see additional issues and suggestions from the research community.]

Differences in exposure to negative news media are associated with lower levels of HPV vaccine coverage

May 1, 2017

adam g. dunn

Over the weekend, our new article in Vaccine was published. It describes how we found links between human papillomavirus (HPV) vaccine coverage in the United States and information exposure measures derived from Twitter data.

Our research demonstrates—for the first time—that states disproportionately exposed to more negative news media have lower levels of HPV vaccine coverage. What we are talking about here is the informational equivalent of: you are what you eat.

There are two nuanced things that I think make the results especially compelling. First, they show that Twitter data does a better job of explaining differences in coverage than socioeconomic indicators related to how easy it is to access HPV vaccines. Second, the correlations are stronger for initiation (getting your first dose) than for completion (getting your third dose). If we go ahead and assume that information exposure captures something about acceptance, and that socioeconomic differences (insurance, education, poverty, etc.) signal differences…


#GottaCureEmAll – Pokemon GO teaches healthcare a big lesson

August 1, 2016

If we can believe what we are seeing, Pokemon GO is the world’s most effective, and most widespread, population weight loss intervention. Already, its users spend more time on the game than on other wildly popular mainstream social media platforms like Facebook, Snapchat and Twitter. Over the space of a few weeks, it has prompted millions of children and teens to get off the couch, turn off Netflix, leave the laptop in their bedroom, and walk out into the world to breathe the fresh air. More than a few adults have done the same.

Healthcare should pay attention. While healthcare researchers are slowly coming to grips with ‘new’ ideas like gamification and social media to defeat obesity, the game industry has jumped the queue and may have already done it. Silicon Valley has drawn down on its deep well of expertise in building large and complex software systems, and in embedding such systems into the real world. It has drawn on its deep experience with and understanding of the psychology of online social media, of what makes games ‘fun’, and what makes them ‘sticky’.

I doubt that Niantic, the company behind Pokemon GO, looked to randomized clinical trials to design and implement its system. The world of software moves too fast for that. It has an engineering culture of fail early, fail often. And because of that, it has as much right as scientists to claim that it is driven by experimentation and data, or, as the philosopher Karl Popper would have said, conjecture and refutation.

For those who have not been drawn into the world of Pokemon GO, it may be hard to understand what the fuss is all about. It is just another time-wasting, obsession-inducing computer game. Yes, it is interesting that it uses augmented reality and your physical location as part of the gameplay, but so what? People just walk around collecting different characters, oblivious to what is happening around them. The end result is a different kind of walking screen-time zombie, with the added risk of walking into the traffic or driving into a wall as you play the game.

There is another way to look at it. Firstly, irrespective of the game ‘medium’, the real world ‘message’ is that people are more than happy to exercise, and to engage with others in the real world, with the right motivation. For younger generations who have grown up in a world that is digitally augmented, the digital-social complex is the way to access their lives. Jogging with a fitbit is probably compelling for those who already run or are motivated to exercise. Pokemon GO does something more miraculous. It causes the Lazarus generation to rise up, and to move.

Pokemon GO makes walking the basic currency of the game. If you chance upon the eggs of Pokemon creatures, the only way to make them hatch is to walk a prescribed distance. Some eggs require 10 km of nurturing before they crack. If you want to catch different Pokemon (and if you are a player, you #gottacatchemall), then you will find spawning grounds in parks and open spaces. If you want to top up the items you need to catch Pokemon, then you have to walk from one Pokestop to another.

One of the things that appears to make gambling ‘sticky’ is the uncertainty of reward. Each rare win reinforces the desire to keep trying for a bigger future reward. Pokemon GO has an interesting strategy of combining certainty of reward (eggs hatch after a defined distance is walked) with uncertainty (creatures appear unpredictably, and their behavior and value are unpredictable). As you progress in the game the rewards increase along with your status. Our brains are washed in an addictive dopamine broth with every reward, every step forward.

Pokemon GO also strives for social equity. When a creature appears in a given location, anyone who is there can see it and catch their copy of it. This means that there is real value in finding stronger players than yourself, because they will trigger the arrival of rarer creatures. These stronger players are also likely to have set lures to attract creatures, and the benefit of these lures is also socialized. Stronger players will have obtained their status by walking great distances, and so a sort of social modeling probably takes place influencing the behavior of newer players, further reinforcing the culture of movement.

Mass and spontaneous social congregation is an unexpected side-effect. Many thousands of people have repeatedly been reported rapidly congregating in parks as rare creatures appear. It is a sociologically fascinating emergent property of the game. It can also drive the locals crazy, blocking roads and keeping people awake, as crowds chase the different creatures that appear at night. How wonderful it would be if healthcare could trigger the same mass interest, with thousands queuing when a Zika vaccine becomes available. Indeed, how can we mobilize such a mass response for all sorts of health prevention activities?

The early uses of Pokemon GO in healthcare are examples of simple adoption. Placing Pokestops in the wards and surrounds of a children’s hospital provides distraction and joy to hospitalized children who may be in dire need of fun. There are also, of course, the usual reactionary demands that the game be banned from clinical spaces.

Social media has already taught us a lot about how to deliver health services in new ways. The great potential for augmented reality in healthcare is yet to be tapped. Pokemon GO can teach us even more. We must learn to be more nimble and agile in the way we develop interventions to change behavior and deliver health services. The engineering worldview has much to offer, and it shares the DNA of scientific reasoning so embedded in modern healthcare research. We are entering a time when more and more of the population will be embedded in an online social web; that will be the universe in which we must engage with them, and where healthcare is delivered. And we must embrace that future, because in the end, we #GottaCureEmAll.


Making sense of consent and health records in the digital age

May 8, 2016

There are few more potent touchstones for the public than the protection of their privacy, and this is especially true with our health records. Within these documents lies information that may affect your loved ones, your social standing, employability, and the way insurance companies rate your risk.

We now live in a world where our medical records are digitised. In many nations that information is also moving away from the clinician who captured the record to regional repositories, or even government-run national repositories.

The more widely accessible our records are, the more likely it is that someone who needs to care for us can access them – which is good. It is also more likely that the information might be seen by individuals whom we do not know, and for purposes we would not agree with – which is the bad side of the story.

It appears that there is no easy way to balance privacy with access – any record system represents a series of compromises in design and operation that leave the privacy wishes of some unmet, and the clinical needs of others ignored.

Core to this trade-off is the choice of consent model. Patients typically need to provide their consent for their health records to be seen by others, and this legal obligation continues in the digital world.

Patient consent for others to access their digital clinical records, or e-consent, can take a number of forms. Back in 2004, working with colleagues who had expertise in privacy and security, we first described the continuum of choices between patients opting in or out of consent to view their health records, as well as the trade-offs associated with either choice [1].

Three broad approaches to e-consent are employed.

  1. “Opt Out” systems; in which a population is informed that unless individuals request otherwise, their records will be made available to be shared.
  2. “Opt in” systems; in which patients are asked to confirm that they are happy for their records to be made available when clinicians wish to view them.
  3. Hybrid consent models that combine an implied consent for records to be made available and an explicit consent to view.

Opt-in models assume that only those who specifically give consent will have their health records visible to others; opt-out models assume that record accessibility is the default, and will only be removed if a patient actively opts out. The opt-out model maximises ease of access to, and benefit from, electronic records for clinical decision making, at the possible expense of patient privacy protections. Opt-in models have the reverse benefit, maximising consumer choice and privacy, but at the possible expense of record availability and usefulness in supporting decisions (Figure 1).

Figure 1: Different forms of consent balance clinical access and patient privacy in different proportions (from Coiera and Clarke, 2004)

All of the United Kingdom’s shared records systems now employ hybrid consent models of one form or another. Clinicians can also ‘break the glass’ and access records if the patient is too ill or unable to consent. In the US a variety of consent models are used, and privacy legislation varies from state to state. Patients belonging to a Health Maintenance Organisation (HMO) are typically deemed to have opted in by subscribing to the HMO.

How do we evaluate the risk of one consent model over others?

The last decade has made it very clear that, at least for national systems, there are two conflicting drivers in the selection between consent models. Those who worry about patient privacy and the risks of privacy breaches favour opt-in models. Governments that worry about the political consequences of being seen to invade the privacy of their citizens thus gravitate to this model. Those who worry about having a ‘critical mass’ of consumers enrolled in their record systems, and who do not feel at political risk on the privacy front (perhaps because our privacy as citizens is being so rapidly eroded on so many fronts that we no longer care), seem comfortable to go the opt-out route.

The risk profiles for opt-in and opt-out systems are thus quite different (Figure 2). Opt-out models risk making health records available for patients who, in principle, would object to such access but have not opted out. This may be because they were either not capable of opting out, or were not informed of their ability to do so.

For opt-in models, the greatest risk to a system operator is that important clinical records are unavailable at the time of decision-making, because patients who should have elected to opt in were either not informed that they should have a record, or not easily capable of making that choice.

Other groups, such as those who are informed and do opt out, may be at greater clinical risk because of that choice, but are making a decision aware of the risks.


Figure 2: The risk profiles for opt-in and opt-out patient record systems are different. Opt-out models risk making records available for patients who in principle would object to such access, but were either not capable of opting out or not informed of their ability to do so. For opt-in models, the risk is that important clinical records are unavailable at the time of decision making, because patients who should have elected to opt in were neither informed nor capable of making that choice.

Choosing a consent model is only half of the story

In our 2004 paper, we also made it clear that choosing between opt-in and opt-out was not the end of the matter. There are many different ways in which we can grant access to records to clinicians and others. One can have an opt-in system which gives clinicians free access to all records with minimal auditing – a very risky approach. Alternatively, you can have an opt-out system that places stringent gatekeeper demands on clinicians to prove who they are and that they have the right to access a document, that audits their access, and that allows patients to specify which sections of their record are in or out – a very secure system.


Figure 3 – The different possible functions of consent balance clinical access or patient privacy in different proportions. The diagram is illustrative of the balances only – thus there is no intention to portray the balance between access and privacy as equal in the middle model of e-Consent as an audit trail. (From Coiera and Clarke, 2004)

So, whilst we need to be clear about the risks of opt in versus opt out, we should also recognise that it is only half of the debate. It is the mechanism of governance around the consent model that counts at least as much.

For consumer advocates, “winning the war” to go opt-in is actually just the first part of the battle. Indeed, it might even be the wrong battle to be fighting. It might be even more important to ensure that there is stringent governance around record access, and that it is very clear who is reading a record, and why.

References

  1. Coiera E and Clarke R, e-Consent: The design and implementation of consumer consent mechanisms in an electronic environment. J Am Med Inform Assoc, 2004. 11(2): p. 129-140.


What should a national digital health system look like?

May 1, 2016

What is the role of government in contributing to the nation’s digital health infrastructure? That is not an easy question to answer. Every nation has its own specific variant of a health system, with different emphases on the public or the private, and on central government intervention or laissez-faire commerce. I have in earlier blogs made the point that, despite these differences in national systems, we now collectively have enough experience that we cannot ignore the evidence when crafting national strategies.

Back in 2009, when I explored the implications of these structural differences for government, I came to the conclusion that digital health needed a ‘middle out’ governance model, rather than top-down or bottom-up approaches to strategy. One consequence of the thinking in that paper was that I formed a view that we did not need a centralised national summary care record – a view which left me with fewer friends in government than I used to have! I was only trying to be helpful …

With a new Australian Digital Health Agency, it is now a good time to revisit these questions, to learn from the past, and to come together as an informatics and e-health community, and give ourselves the best possible shot at getting digital health right.

Digging through my papers recently, I came across this briefing paper I wrote for the Secretary of Health in 2008 – well before the middle out and summary care record papers. It was a time when Facebook was in the ascendancy, so I used the term ‘Healthbook’ to portray my ideas for a distributed, federated digital information system. Maybe now is a good time to revisit its spirit, if not the technical details?

‘Healthbook’ – the consumer as catalyst for the creation of a national ehealth infrastructure

E. Coiera, 2 May 2008

Briefing paper to DOHA

Current situation

Australia, like many nations, is struggling to identify a strategic approach to creating a health information infrastructure that is technically feasible, low risk, and affordable.

The current proposal for a national shared electronic health record (SEHR) presumes a centralised, potentially monolithic, structure, where every Australian has a health record summary stored for them, to facilitate health care provision. The mental model is similar to the English NHS’s system, which has cost billions of pounds to implement, and has experienced significant technical and implementation challenges along the way. If Australia were to take a similar centralised approach to the SEHR, then it too would cost several billion dollars, presuming our cost structures are similar to the English NHS, and would face its own technical risks. And after investing that money we are locked into ageing technologies that require continued significant investment. Implementation starts, but it never ends.

A second disadvantage of beginning with a centralised SEHR is that it demands ‘delayed gratification’. There is massive up front investment, substantial pain within the health jurisdictions during implementation, with benefits only arriving after many years, and little for consumers to see or appreciate despite the large sums of money being invested. It also draws resources away from other cheaper, but potentially higher value, elements of the eHealth infrastructure, specifically decision support technologies, which have great capability to reduce harm, improve safety, and deliver efficiency gains through more evidence-based use of investigations and therapeutics.

A different way

An alternative approach has emerged. Imagine that, rather than waiting 5-10 years for a ‘centrally planned’ SEHR (that is what it may take), we achieve many of the same goals in less than 5 years, at significantly less cost to government, in a market-driven and industry-led way, growing organically and flexibly, rapidly adopting technological innovation, and potentially building up new export industries for Australia’s IT industry. Imagine also if this new way had strong support from consumers, because it was all about them and their health care, and not about putting in expensive ‘backroom’ technologies they will never see.

There are three elements to this approach:

1 – The shareable record can be consumer rather than health service focussed: Fuelled by the resources of private industry and consumer demand for access to their records, personal health records are emerging as a major new business sector. The strongest evidence for this is the move by two of the largest IT companies into this space. Microsoft has made its first major step into healthcare with its HealthVault product, and Google Health is emerging as its main competitor. Both offer consumers a service to store their personal health information, and to make it accessible to health providers with consumer consent.

In the US many large health service organizations, for example the VA hospitals and Partners, have millions of their patients using locally developed personal health records. Similar activities are underway here with smaller start-up companies e.g. myvitals.com. Expect a flurry of such companies to appear locally, or arrive from overseas, over the next 12 months.

There is much to be commended about personal health records, but there are also some major limitations, including: the potential for the consumer-created record to be of poor quality, or perceived to be so by clinicians; the lack of interoperability between different systems; the consequent locking-in of one’s records to a single vendor; the poor connectivity between health service provider records and personal health records; the significant risk that personal health information may be used for secondary and commercial purposes; and, for Australians, the very real risk that core national IP – the health records of all Australians – is stored overseas, resulting in a massive transfer of information and wealth offshore.

2 – The rise of social computing. While there has been talk of the internet as an online community since the mid ‘90s, only in the last 2 years has this really taken off, with Facebook, MySpace and others providing a sophisticated social networking experience that has caught the imagination of the average consumer, trained consumers in sophisticated information sharing strategies, and developed software to support this. Consumers are now comfortable carrying out many of their most personal transactions on the web, from banking to finding partners and socializing. Blogging has created a generation that is far more comfortable sharing their personal information than any before.

3 – The continuing rise of search. Google and its competitors continue to prosper. Health information is amongst the top two categories of information searched for. Consumers want information about their health, and continue to turn more to the Internet for that information.

Putting these three together it may now be possible for private industry to create information services that challenge the centralized monolithic SEHR model, and create a rich and flexible ehealth infrastructure on the way.

The idea of a Facebook for health (or ‘healthbook’) is fairly straightforward – it is a web space where you manage your health information and access health information services, in the same way that your internet banking account is the place you manage your wealth, e.g. looking at account balances, paying bills, transferring funds. There will be many competing ‘healthbook’ systems provided by industry, and we can expect companies to offer consumers some or all of the following services:

  1. A personal health record, where you enter your own health information;
  2. Access to health information e.g. search engines, local guidelines, drug information, health leaflets;
  3. A social computing environment in which a personal health record and information can be shared amongst family, friends, clinicians, and groups;
  4. Links to a selected subset of health providers, allowing them to see personal health records, exchange messages (reminders, appointments, results, health messages), and maybe allow you to see some of their records about you e.g. a division of GPs might offer this service, or a private health insurer may negotiate with health service providers to offer this to their clients.

It is important to emphasise that we are not saying that the personal record now becomes the shared health record – it cannot and should not – but the links to different clinical record systems that we might find in a ‘healthbook’ effectively provide the first stage in shared access to clinical records. While such systems will grow organically, and possibly quite quickly, there are several missing pieces and some concerns that need to be addressed, including:

  • Message exchange and access to your records stored by the public hospital system
  • Message exchange and access to your records stored by other health services not part of the particular online consortium you join.
  • Interoperability between systems, allowing consumers to take their personal health information, and linked messages and records, to a different provider.
  • Protections for Australian health information going overseas and being exploited for secondary commercial purposes.
  • Accreditation of healthbook providers to ensure clinical service providers and patients are comfortable in making their clinical records available via them.

If issues such as these were addressed quickly, we may in Australia create business conditions not yet operating anywhere else in the world, and an opportunity for our local IT industry to corner, or at least become highly competitive in, a business clearly destined to become the single largest information technology market.

It thus seems entirely feasible for government to choose not to invest in a monolithic national e-health infrastructure, but to foster competition and rapid expansion of a web- and business-driven infrastructure. Government creates appropriate protections for the community and their personal information while supporting high quality and safe clinical care. Government is a key enabler, working with the professions and individuals to identify incentives and provide the critical missing elements needed to fast-track this world, including regulation, legislation, investment in making jurisdictional systems interoperable, provision of public knowledge and information sources, and investment in evaluation and research to drive evidence-based innovation.

What might happen next

If government steps in to address some of these barriers to fully interconnecting consumer-based personal health records, we could imagine three stages in the evolution of our national eHealth infrastructure:

Stage 1 (next 2 years) – Personal health record systems available and taken up by a few Australians. Some offer access to knowledge services e.g. Healthinsite; some service providers band together to allow their records to be linked to these systems and for messages to be exchanged between providers and consumers within this system. Records might be shareable within these restricted health service organizations. Standards are being developed by NEHTA, ISO and Standards Australia, and industry and the jurisdictions are moving to comply with these as they install eHealth systems.

Stage 2 (2-3 years) – Messaging standards and unique and secure IDs for every Australian (the UPI) are in place and allow communication between providers and any standards-compliant ‘healthbook’. Record portability legislation encourages innovation and competition and avoids monopoly outcomes (similar to mobile telephone number portability, where a consumer can take their phone number and address book from one telco’s handset and swap them to a different one). Some state jurisdictions and primary care divisions provide standard secure web interfaces to any accredited private system, and consumers choose to link to their records in these systems, if they are aware that they are able to. When viewing linked records, they appear in non-standard ways, dependent on the structure of the local system the record sits on. 10% of Australians have a ‘healthbook’ page, with international IT companies amongst the major players, but Australians may end up trusting their health providers and government with their private information, so the biggest user base may be found with Divisions of General Practice, or private health insurance companies. Many other players jockey for dominance.

Stage 3 (3-5 years) – Interoperability standards have allowed any accredited record provider to provide a discoverable web service, so that any healthbook can access these records, with consumer permission. This means when you create your new healthbook account and put in your UHI, the system will find all the records associated with your care that are on the web, and ask you if you want to link them in. When records are browsed from within a consumer space, they have a uniform appearance. So, irrespective of which company’s ‘Healthbook’ you use, a clinician can always find the information they want in the same place, by selecting the ‘common user interface’ option. It is possible to extract elements of provider records into a personal health record manually or automatically. For example, you can extract medication lists, test results, or allergies from your GP system into your personal health record.

For those who choose it, their treating clinician may decide which data gets extracted from the clinical record into the personal summary record. For Australians who are not interested in using a private system, or are unable to do so, a ‘vanilla’ personal health record is made available, possibly via the jurisdictions, that allows a provider to see other linked records for a given patient, with a patient’s consent. Local Australian companies provide the back end service to consumer health sites, with the front end run by large health delivery organizations e.g. public hospital systems, and private insurers. International IT companies provide some of the core technologies underpinning these systems but the data is stored in Australia, protected by legislation from going offshore, or even analyses of the data going offshore.

The Role of government

Government has a role to:

  • Facilitate – through standards activities (NEHTA) and early investment for industry development and research. For example COAG may wish to provide seed funding for 2-4 large-scale implementations e.g. requiring each consortium to include a public hospital system, a primary care organization, and for some % of the industry membership to be locally based. This attracts industry to invest, and creates a competitive climate in which innovation is focussed on delivering to the consumer as the main customer. It should be clear that investment is for start-up, and that all programs need to be self-funding at the end of the projects. There may be incentives for meeting subscription and transaction rate milestones, and for health services incentives for meeting outcome targets e.g. preventative health activities. There may be penalties for failure to deliver, including withholding of payments should benchmarks not be met. There should be some key deliverables that we expect of any such consortia, including:
    1. Working with standards organizations like NEHTA, they should agree on a working record portability standard and mechanism that allows a consumer to extract their personal health record, provider messages, links to clinical records, and any other information such as a future shared health record, and transfer it to another provider;
    2. Consortia should demonstrate interoperability between each other for record mobility between consortia, and for messaging between providers and different consortia.
    3. Working with standards organizations, the consortia should agree on a default ‘common user interface’, which provides a uniform way of accessing linked records, messages, and patient data for clinicians and consumers. There is no obligation to use this interface as different systems will want to ‘value add’ and provide better user experiences for their customers. We want to ensure that clinicians will only need to learn how to access healthbook records once, and always find the information they need in the same place every time – for safety as well as efficiency reasons.
    4. Demonstrated use of a unique personal identifier like the UHI, ensuring secure and safe creation of new accounts, protection of personal information, and ease of access in clinical situations.
    5. Demonstrated security and consent mechanisms so that consumers feel safe using these systems.
  • Protect – the privacy of individuals, and the national IP – through legislation, and where appropriate accreditation. Consumers will need record portability and not be locked into one vendor, so legislation should allow for consumers to extract their digital records from any one vendor and move to another. Consumers and providers will want to know that healthbook systems are accredited before records are linked into them, and that accreditation ensures that records made available this way are not used for any purpose other than clinical care, and only with the consent of consumers.
  • Evaluate – We need benchmarks for this program, both in terms of uptake by citizens, as well as adoption rates, usage and benefits. Evaluation programs for benefits are best run by independent organizations, and this is a clear role for academic institutions.
  • Ensure Access – Ensure all citizens and health service providers have access via a decent broadband system. For those citizens who choose not to be actively engaged, or are unable to be, e.g. the infirm or elderly, create an option of clinician- or health-service-managed e-services where the consumer gives permission for their ‘healthbook’ to be created for them. Facilitate early adoption by service providers with an incentives program (e.g. to make practice records linkable to commercial systems).
  • Innovate – We want Australian industry to have access to new ideas and IP to make them competitive with the US industry in particular, and there is a clear opportunity to support Australian R&D and innovation with targeted support for eHealth innovation programs.
  • Participate – where jurisdictions control medical content such as records or knowledge resources (Healthinsite, service or provider directories), make these available and interoperable with private sector systems. Where government has a specific duty to individuals such as military personnel, provide or auspice services available to citizens e.g. military personnel may have records that cannot be linked for security reasons to commercial systems, so a military system might be needed, which links to all public records, but remains secure.

Appendix – Some benefits and ideas worth capturing at this stage

Benefits of this approach

  • A better informed, better engaged population
  • A transition plan to implementing SEHR functions, not a ‘big bang’ centralised SEHR, which is a single point of failure if things go wrong.
  • Technical and investment risks are lower, as the elements government may want to invest in, e.g. standards, making jurisdictional records compliant, and messaging, are all required under the monolithic SEHR model too. So, if the consumer-driven model does not work, government can in the future elect to step in and complete the ‘last mile’ e.g. with health information exchanges.
  • Most of the implementation risk is borne by private enterprise
  • A shift to preventative healthcare, as consumers build for possibly the first time a place where they actively manage their healthcare, and receive targeted messages and support.
  • Safer care – driven by consumer benchmarking and rating, the use of consumer decision support systems, easier interaction with clinicians via messaging, and a shareable record that allows clinicians to see the bigger clinical picture.
  • Support for the Australian IT industry and research community to become a world leader in a highly lucrative market – if there is to be a new company that becomes the Google of healthcare, why could it not be an Australian company?

Ideas

  • Use the healthbook to send reminders for vaccinations, screening tests and routine check-ups.
  • Support for healthy journeys, e.g. parents with young children accessing information at crucial child development stages, possibly linking up with the community one-stop shop proposal by government.
  • If every high school student has a computer, why can’t they use ‘healthbook’ applications to manage their exercise and eating regimes, through an online social environment where quality information is shared and groups can form (e.g. around coping with anorexia or obesity), providing information and social support?
  • Support for more targeted, efficient access to services, e.g. by providing consumers with health service directories, similar to ‘choose and book’ in the NHS, with the ability to identify providers and make appointments. This would be especially valuable for rural and remote citizens, who could identify services available to them outside their local area.
  • Consumer-based benchmarking of services – similar to the Amazon star rating for books (this will happen anyway – best to support it being as informative and balanced as possible).


Four futures for the healthcare system

February 20, 2016 § 1 Comment

That healthcare systems the world over are under continual pressure to adapt is not in question. With persistent concerns that current arrangements are not sustainable, researchers and policy makers must somehow make plans, allocate resources, and try to refashion delivery systems as best they can.

Such decision-making is almost invariably compromised. Politics makes it hard for any form of consensus to emerge, because political consensus leads to political disadvantage for at least one of the parties. Vested interests, whether commercial or professional, also reduce the likelihood that comprehensive change will occur.

Underlying these disagreements of purpose is a disagreement about the future. Different actors all wish to will different outcomes into existence, and their disagreement means that no particular one will ever arise. The additional confounder, that predicting the future is notoriously hard, seems not to enter the discussion at all.

One way to minimize disagreement and build consensus would be to have all parties agree on what the future is going to be like. With a common recognition of the nature of the future that will befall us, or that we aspire to, it might become possible to work backwards and agree on what must happen today.

Building different scenarios to describe the future

There seem to be two major determinants of the future. The first is the environment within which the health system has to function. The second is our willingness or ability to adapt the health system to meet any particular goal or challenge. Together these two axes generate four very different future scenarios. Each scenario presents very different challenges, and very different opportunities.

[Figure: 4scenarios – a 2×2 matrix of the four scenarios, with the stability of the environment on one axis and the capacity or will for change on the other]
Making Health Services Work: In this quadrant, we are blessed with relatively stable conditions, and even though our capacity or will for change is modest, we can embark on incremental changes in response to projected future needs. We focus on gentle redesign of current health services, tweaking them as we need. The life of a health services researcher is a comfortable one: no one needs or wants a revolution, and there is time and resource enough to solve the problems of the day.

New Ways: Despite forgiving and stable times, in this quadrant we have an appetite for major change. Perhaps we see major changes ahead and recognize that incremental improvements will be insufficient to deal with them. Maybe we see future years with demographic challenges such as clinical workforce shortages and the increasing burden of disease associated with an ageing population. Consequently, more radical models of care are developed, evaluated and adopted. Rather than simply retro-fitting the way things are done, we radically reevaluate how things might be done, envisage new ways of working, and conceive new ways to deliver services.

Turbulence systems: The risks of major shocks to the health system are ever present, including pandemics, weather events of ‘mass dimension’ associated with climate change, and human conflict. It is possible to make preparations for these unstable times. We might imagine that we set about designing some capacity for ‘turbulence’ management into our health services. Such turbulence systems would help us detect emerging shocks as early as possible, and would then reallocate resources as best we can when they arrive. The way that global responses to disease outbreaks have rapidly evolved over the last decade shows what is possible when our focus is on shock detection and response. Similar turbulence systems are evolving to respond to natural disasters and terrorism – so there are already models to learn from. In this quadrant, then, we redesign the health system to be far more adaptive and flexible than it is today, recognizing that the future is not just going to be punctuated by rare external shocks, but that turbulence is the norm, and any system without shock absorbers will quickly shatter.

All hands on deck: In this scenario, health services receive major shocks in the near future, well ahead of our ability to plan for these events. For example, a series of major weather events or a new global pandemic could stretch today’s health system beyond its capability to respond. Another route to this scenario is longer term: we fail to prepare for events like global warming, infectious disease outbreaks or an ageing population, and because of disagreement, underinvestment or poor planning, we do nothing. If such circumstances arrive, then the best thing that everyone in the health system can do is abandon working on the long term and apply our skills wherever they are most needed. In such crisis times, researchers will find themselves at the front lines, with a profound understanding and new respect for what implementation and translation really mean.

Picking a scenario

Which of these four worlds will we live in? It is likely that we have had the great good fortune, over the last few decades, of living in stable and relatively unambitious times, tinkering with a system that we have not had the appetite to change much. It seems likely, however, that instability will increasingly become the norm. I don’t think we will have the luxury of idly imagining some perfect but different future, debating its merits, and then starting to march toward it. There will be too much turbulence about to ever allow us the luxury of knowing exactly what the right system configuration will be. If we are very lucky, and very clever, we will increasingly redesign health services to be turbulence systems. Even if the flight to the future is a bumpy one, the stabilizers we create will help keep the system doing what it is meant to do. ‘All hands on deck’ is the joker in the pack. I personally look forward to never having to work in that quadrant.

[These ideas were first published in a paper my team prepared back in 2007, and since it first appeared, the turbulence has slowly become more frequent …]