A day at the ER: How many diagnoses can AI correctly make?

In early 2023, the news that the chatbot ChatGPT could pass exams at law and business schools, and even the United States Medical Licensing Examination (USMLE), raised very high expectations for some and concerns for others.

Researchers tested ChatGPT on 350 of the 376 publicly available questions from the previous year's USMLE. The USMLE consists of three sub-exams, on which the bot scored between 52.4% and 75%, putting its performance at or near the pass mark of about 60%.¹ Does this mean the AI tool could help us make clinical decisions?

"The results were fascinating, but also quite disturbing"

At the end of his regular shifts in an emergency department, Dr Tamayo-Sarver anonymised the data of 35 to 40 patients and entered his notes on each patient's history (medical history and current clinical presentation) into ChatGPT 3.5. When asked "What are the differential diagnoses for this patient presenting to the emergency department with xy (insert notes here)?", the AI tool did a good job of naming important and common diagnoses. However, this only worked well if the clinical picture was typical and the input was precise and very detailed. For example, a correct diagnosis of radial head subluxation (nursemaid's elbow) required about 200 words of input; a blow-out fracture of another patient's orbital wall required all 600 words of the doctor's notes.

For about 50% of patients, the diagnoses suggested by ChatGPT (six on average) included the correct one, that is, the one the doctor arrived at after completing all his investigations. Not bad, but not a good hit rate for an emergency department either, according to Dr Tamayo-Sarver.^2,3

The problem: many patient cases are not "textbook"

The 50% success rate also meant that life-threatening conditions were often overlooked. For example, a brain tumour was correctly suspected in one patient, but the possibility was missed entirely in two other patients with tumours. Another patient presented with pain in the trunk, and ChatGPT would have diagnosed a kidney stone. In fact, he had an aortic rupture, from which he died during surgery.

The worst failure involved a 21-year-old woman with right lower quadrant pain. The bot immediately suggested appendicitis or an ovarian cyst, among other possibilities. In reality, she had an ectopic pregnancy, a diagnosis that can be fatal if recognised too late. Every medical student learns that an acute abdomen in a woman of childbearing age calls for an investigation to determine whether she could be pregnant. Fortunately, Dr Tamayo-Sarver promptly thought of this (and, as is so often the case in clinical practice, this patient did not expect to be pregnant).^2,3

ChatGPT, however, did not raise the possibility of pregnancy in any of its responses and would not have asked about it at all. For Tamayo-Sarver, this is one of the most important limitations: tools like ChatGPT can only answer what you ask about in the first place. If you are on the wrong track, the AI tool will reinforce that bias by continuing to feed back information that matches your own observations and inputs. A bias caused by an overlooked question or a misconception is thus amplified further by such tools.

Do we need a more sober look at the possibilities of AI?

ChatGPT was able to suggest some good differential diagnoses, but only if it was fed perfect information and the clinical presentation of the disease was absolutely classic. This is probably also why the AI tool was able to pass 60% of the case vignettes in the licensing exam: "Not because it is 'smart', but because the classic cases in the exam have a clear answer that already exists in the database," says Tamayo-Sarver. The USMLE is a test of memorisation, not judgement.

According to Tamayo-Sarver, the art of medicine is primarily about recognising the right narrative or the relevant information. He fears that countless people are already using ChatGPT to diagnose themselves instead of seeing a doctor. ChatGPT provides answers and information that seem excellent to people who are not experts in the field. If the young woman with the ectopic pregnancy had done this, it could have ended in an internal haemorrhage for her.

"In the meantime, we in Silicon Valley and the general public urgently need a much more realistic view of what AI can do today - and of its many, often dangerous, limitations. We need to be very careful to avoid exaggerated expectations of programs like ChatGPT, because in the context of human health, they can literally be life-threatening," Tamayo-Sarver concludes in his field report.^2,3

1. Tran, T. H. ChatGPT Passed a Notoriously Difficult Medical Exam. The Daily Beast.
2. I’m an ER doctor: Here’s what I found when I asked ChatGPT to diagnose my patients. Medium.
3. Tamayo-Sarver, J. I’m an ER doctor: Here’s what I found when I asked ChatGPT to diagnose my patients. Fast Company.
