ChatGPT: When AI answers medical MCQs

This "Large Language Model" can generate text. Is this a simple memorization or the beginning of reasoning? When confronted with medical MCQs however, LLMs does show limits.

About the author: Marc Cavazza is a French physician with a PhD in biomathematics. He has led research teams in several British universities, focusing on brain-computer interfaces and applications of Artificial Intelligence. He has published at most of the major international conferences in the field (IJCAI, AAAI, ECAI, ICML, NeurIPS).

Translated from the original French version.

The press and media have been flooded in recent days with articles announcing the revolutionary new capabilities of artificial intelligence (AI). The latest is its supposed ability to answer all sorts of questions intelligently and to generate text of sufficient quality to fool teachers. These are the LLMs (Large Language Models) known as ChatGPT, GPT-3.5 and, soon, GPT-4.

An LLM is a statistical model that has learned from a phenomenal amount of text (such as the entirety of Wikipedia), using an unsupervised deep learning mechanism (i.e. it learns from raw text, without requiring annotations describing its content or properties). LLMs use a novel learning architecture introduced in 2017, the Transformer (hence the acronym GPT, which stands for Generative Pre-trained Transformer).

In simple terms, an LLM learns a set of probabilities that determine which sequence of words should "respond" to a sentence expressing a query or question. Seen abstractly, an LLM is therefore a system that generates text in response to other text. But, unlike a simple search engine, the text produced is not just a copy-and-paste of text found somewhere on the web; it is original, realistic and grammatically correct output.
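
To make this text-in/text-out mechanism concrete, here is a minimal sketch using the open-source Hugging Face transformers library, with the small GPT-2 model standing in for the much larger models discussed here (ChatGPT itself is not freely downloadable); the prompt is purely illustrative.

```python
# Minimal sketch: a language model continues a prompt with the words it
# judges most probable. GPT-2 is used here only as a small, freely
# available stand-in for much larger LLMs.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The part of the brain that regulates body temperature is"
completion = generator(prompt, max_new_tokens=15, do_sample=False)

# The output is newly generated text, not a passage copied from the training data.
print(completion[0]["generated_text"])
```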

LLMs, more than just a "digital parrot"?

The power of LLMs derives from the fact that a large number of applications can be reduced to this mechanism of generating text from a request (the "prompt"): answering questions, summarising documents, translating, holding a dialogue, completing or drafting text, and so on.

With up to hundreds of billions of parameters, LLMs are extremely complex. The learning phase requires phenomenal computing resources, beyond the reach of a medium-sized research laboratory. This explains why most LLMs are produced by the major technology companies or by foundations they support, such as OpenAI.

These LLMs are the subject of controversy and philosophical questions. First of all, are they really capable of a form of understanding, or are they just 'stochastic parrots'? A more technical question is also raised: are elementary forms of reasoning accessible to LLMs when they learn from text alone?

LLMs take on medical multiple-choice questions

No fewer than three articles1-3 from respectable institutions such as Google or MIT have published the results of experiments using LLMs to answer medical-school examination questions, more specifically MCQs.

Several intentions could be read into this: validating the pontification that medicine is, after all, only "a memorization process", or proving that LLMs can learn medicine better than physicians do. In any case, it seems an opportunity to highlight a feat (solving a complex cognitive task) and perhaps to take on an old rival such as IBM and its Watson Health subsidiary, which has had its share of setbacks in recent times.

My aim is not to indulge in a full-scale critical reading of articles that are, after all, quite technical; I do have an opinion on the subject and, spoiler alert, I will say more about it towards the end of this article. Rather, I would like to encourage readers to take an uninhibited interest in the question. Even though this research lies at the cutting edge of one of the most complex fields, I believe it is possible to get to grips with the problem that LLMs and ChatGPT raise without any advanced knowledge of Machine Learning, armed only with our instinctive problem-solving abilities.

Indeed, physicians have not only suffered through thousands of MCQs during their studies and training, but they have also developed a metacognitive capacity in their approach to medical problems, through awareness of differential diagnosis, of the steps to take, and even, depending on the discipline, of pathophysiological reasoning.

This allows us to approach the problem of LLMs in medicine via two aspects, which are central to any discussion on the application of AI in clinical medicine: evaluation and explanation.

Beyond memorization?

On the evaluation side, you can get an idea of the quality of the answers by considering the difficulty of the MCQs. When comparing the system to human performance, it is important not to be intimidated by the multiple metrics that Machine Learning is so fond of, especially in the absence of data on how the MCQs are distributed in terms of difficulty. As for the explanation, it of course depends on the level of reasoning required by the MCQ, and it is up to you to judge whether or not it reproduces plausible medical reasoning.
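
As a purely illustrative toy example (the numbers below are invented, not taken from the cited articles), here is how a single overall accuracy figure can hide very different performance on recall-type versus reasoning-type questions:

```python
# Toy illustration of why overall accuracy is hard to interpret without the
# difficulty distribution of the MCQs. The data are invented for this sketch.
results = [
    # (question type, model answered correctly?)
    ("recall", True), ("recall", True), ("recall", True), ("recall", True),
    ("reasoning", False), ("reasoning", True), ("reasoning", False), ("reasoning", False),
]

overall = sum(ok for _, ok in results) / len(results)
by_type = {}
for qtype, ok in results:
    by_type.setdefault(qtype, []).append(ok)

print(f"overall accuracy: {overall:.0%}")          # 62%
for qtype, oks in by_type.items():
    print(f"{qtype}: {sum(oks) / len(oks):.0%}")   # recall 100%, reasoning 25%
```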

Let's skip the debate on the value of MCQs as an assessment method in medicine. Let us accept their importance, at least as a methodological tool, while recognising that not all MCQs offer the same level of difficulty, sophistication or even quality.

We can consider that there are several cognitive strategies for answering an MCQ: simple memorization, differential diagnosis and pathophysiological reasoning, to which we can add default strategies derived from the structure of the MCQ itself (answering by elimination, or spotting the only possible answer).

The MCQs used in this work on LLMs are taken from several American databases, such as those built around the United States Medical Licensing Examination (USMLE). They are single-answer MCQs, and it can be considered quite natural that LLMs are able to answer these MCQs: for example, identifying, out of four possibilities, the first clinical sign of tetanus or botulism, or the examination to order as an emergency in a case of suspected malaria.

In a way, this is already a very interesting result. Take, for example, the following MCQ3 from a dataset used to test LLMs (the Massive Multitask Language Understanding database):

Question:
Which of the following controls body temperature, sleep, and appetite?
Answer:
(A) Adrenal glands (B) Hypothalamus (C) Pancreas (D) Thalamus
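
To give an idea of how such a question reaches the model in these experiments, here is a rough sketch of turning an MCQ into a prompt and reading back the chosen letter. The helper ask_llm is hypothetical, and the prompting and scoring protocols of the cited articles differ in their details.

```python
import re

def build_prompt(question, options):
    # Flatten the question and its options into a single block of text.
    lines = [f"Question: {question}"]
    lines += [f"({letter}) {text}" for letter, text in options.items()]
    lines.append("Answer with a single letter:")
    return "\n".join(lines)

def parse_choice(reply):
    # Read the first option letter found in the model's reply.
    match = re.search(r"\b([A-E])\b", reply.upper())
    return match.group(1) if match else None

question = "Which of the following controls body temperature, sleep, and appetite?"
options = {"A": "Adrenal glands", "B": "Hypothalamus", "C": "Pancreas", "D": "Thalamus"}

prompt = build_prompt(question, options)
# reply = ask_llm(prompt)        # hypothetical call to a large language model
# print(parse_choice(reply))     # expected: "B"
```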

If we look at the titles of the three articles cited, they seem to want to go further, explicitly talking about clinical knowledge or even reasoning. Is the goal to acquire clinical knowledge? Yes, but what knowledge, and to what extent? Can we reasonably believe that knowledge is an autonomous entity, and that to be a physician it would be enough to know Harrison's Principles of Internal Medicine by heart?

Unsatisfactory explanations

Can LLMs really reason, even to the point of explaining their choices, which seems to be a prerequisite for any deployment, even in tandem with a physician? One of the three articles2 proposes to justify the "reasoning" of the system, which is a commendable effort at transparency. Unfortunately, it becomes clear very quickly that the explanations offered are rather shaky, without even needing to get into overly technical AI considerations.

I invite readers to look at the examples and make up their own minds: they will quickly be able to spot errors or oddities and to propose a more satisfactory explanation than the one offered in these articles, both in their own speciality and in general.

Finally, one may be puzzled by the ability of LLMs to answer, and to explain their answers to, MCQs that require more complex reasoning, especially when the MCQ itself can be perplexing... For example, the following MCQ comes from the USMLE database3, and its correct answer is supposed to be (A):

Question:
A 65-year-old man with hypertension comes to the physician for a routine health maintenance examination. Current medications include atenolol, lisinopril, and atorvastatin. His pulse is 86/min, respirations are 18/min, and blood pressure is 145/95 mm Hg. Cardiac examination reveals end diastolic murmur. Which of the following is the most likely cause of this physical examination?

Answer:
(A) Decreased compliance of the left ventricle
(B) Myxomatous degeneration of the mitral valve
(C) Inflammation of the pericardium
(D) Dilation of the aortic root
(E) Thickening of the mitral valve leaflets

As stated at the beginning of this article, I do have an opinion on the matter. Let's just say that I would have found it preferable if the titles of the articles had insisted on the possibility of answering certain MCQs with a textual model, rather than claiming to acquire or encode clinical knowledge sui generis. On this last point, there are both practical and theoretical objections.

In practice, we still see quite a few shaky answers, at least when the authors have the courage to try to justify them. In theory, there is a very active debate about whether LLMs can reproduce reasoning at all, something that was paradoxically easier for the older, strongly logic-based AI (note that I am not saying that human reasoning is based on formal logic).

The consensus is rather that, at present, LLMs can only reproduce fairly trivial reasoning. Thus, Yann Le Cun4 compares them to "students who have learned the material by rote but haven't really built deep mental models of the underlying reality." It is conceivable that LLMs could reproduce reasoning found verbatim in some of the texts they were trained on, such as simple syllogisms. But there is no certainty about their ability to generalise such content.

The introduction of Chain of Thought (CoT) prompting2 is intended to give a semblance of coherence to LLM outputs. However, as a formalism, it falls far short of earlier models of reasoning in symbolic AI and looks more like a hack than a real theory.
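
For readers unfamiliar with the term, the contrast below is a purely illustrative sketch of the idea (these are not the prompts used in the cited article): a chain-of-thought prompt asks the model to write out intermediate steps before committing to an answer.

```python
# Purely illustrative: a direct prompt versus a "chain of thought" prompt.
# Neither string is taken from the cited articles.
direct_prompt = (
    "Q: Which of the following controls body temperature, sleep, and appetite?\n"
    "(A) Adrenal glands (B) Hypothalamus (C) Pancreas (D) Thalamus\n"
    "A:"
)

cot_prompt = (
    "Q: Which of the following controls body temperature, sleep, and appetite?\n"
    "(A) Adrenal glands (B) Hypothalamus (C) Pancreas (D) Thalamus\n"
    "A: Let's think step by step."
)
# With the second prompt, the model is expected to generate a few intermediate
# statements before naming an option; whether that text amounts to reasoning
# is precisely what is in question here.
```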

This is a far cry from hypothetico-deductive reasoning, which is precisely the type of reasoning that characterises differential diagnosis. As for pathophysiological reasoning, there are no examples of it in these articles. Moreover, it seems difficult for a text-to-text model to capture the granularity of reasoning needed to handle clinical cases that are, by definition, highly specific, without the risk of confusing one situation with another.

Before claiming that they can discover clinical knowledge, LLMs should therefore be assessed with MCQs requiring underlying reasoning, as we all have been. They should also be able to produce a non-trivial justification for the answers they provide. It is not the least of paradoxes that, when AI tries its hand at medicine, we finally realise that medicine is not just about learning and repeating...

References
  1. Jin, D., Pan, E., Oufattole, N., Weng, W.H., Fang, H. and Szolovits, P., 2021. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14), p.6421.
  2. Liévin, V., Hother, C.E. and Winther, O., 2022. Can large language models reason about medical questions? Preprint.
  3. Singhal, K., Azizi, S., Tu, T., Mahdavi, S.S., Wei, J., Chung, H.W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S. and Payne, P., 2022. Large Language Models Encode Clinical Knowledge. Preprint.
  4. French artificial intelligence researcher Yann Le Cun is considered one of the inventors of deep learning. His work focuses in particular on computer vision, artificial neural networks and image recognition. He heads the artificial intelligence research laboratory at Facebook.