
Prediction of Multidisciplinary Meeting Outcomes

Introduction 

In collaboration with the Interventional Radiology Department at the Georges Pompidou European Hospital (HEGP), we worked closely with Dr. Tom Levy-Boeken to develop a predictive model for multidisciplinary case review outcomes, known in France as Réunion de Concertation Pluridisciplinaire (RCP).


What is an RCP and Why is it Important?


An RCP (Multidisciplinary Case Review) is a crucial meeting where healthcare professionals from different specialties come together to discuss a patient’s case. This collaborative approach ensures a precise diagnosis and helps define an optimal treatment plan by leveraging the expertise of each specialist. These meetings play a fundamental role in complex medical decision-making, particularly in specialized fields such as interventional radiology, where advanced imaging techniques and procedures influence treatment strategies.

However, RCP outcomes are unique to each hospital. The decision-making process is influenced by various factors, including the hospital's available resources, equipment, medical expertise, and protocols. As a result, the same patient case might lead to different decisions in different hospitals, making it challenging to standardize recommendations across healthcare institutions.


Project Goal


Our objective was to develop a Natural Language Processing (NLP) model capable of classifying and predicting the outcome of an RCP report. Trained on HEGP’s data, this model could then serve as a decision-support tool for other hospitals, providing insights into how a leading interventional radiology center like HEGP would handle a specific case.

This project serves as a proof of concept to demonstrate how AI-driven NLP models can assist physicians in making informed decisions by analyzing past RCP reports. Ultimately, the goal is to extend this system to other French hospitals, allowing them to compare their decisions with HEGP’s expertise and facilitating more data-driven, consistent, and well-informed medical decisions.

Data

The dataset consists of multidisciplinary case review (RCP) reports (“Q”), along with the final decision made by the medical team at the end of the meeting (“A”), which is classified into one or more categories (“C”). All files are provided in DOCX format, reflecting real-world medical documentation practices. This structure mirrors a typical question-answer format, where the “Q” represents the full medical case discussion, and the “A” encapsulates the medical decision reached. The classification (“C”) helps categorize outcomes, making the dataset suitable for supervised learning approaches.
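To make this concrete, here is a minimal sketch of how such DOCX reports might be loaded into records for supervised learning. It assumes the python-docx package and a hypothetical file layout (one DOCX per report, with the class labels kept in a separate CSV); the real layout may differ.

```python
# Sketch: reading RCP reports from DOCX files into (text, label) records.
# The folder name, label file, and column names are hypothetical.
from pathlib import Path

import pandas as pd
from docx import Document

def read_docx_text(path: Path) -> str:
    """Concatenate all non-empty paragraphs of a DOCX file into one string."""
    doc = Document(str(path))
    return "\n".join(p.text for p in doc.paragraphs if p.text.strip())

labels = pd.read_csv("labels.csv", index_col="report_id")  # hypothetical label file
records = []
for path in Path("rcp_reports").glob("*.docx"):
    records.append({
        "report_id": path.stem,
        "text": read_docx_text(path),          # "Q": the full case discussion
        "label": labels.loc[path.stem, "C"],   # "C": class of the decision "A"
    })
```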

However, despite this seemingly structured format, the content itself is highly unstructured and inconsistent. The RCP reports are compiled from multiple observations made by different specialists over time, leading to significant variability in wording, phrasing, and organization. Since each report is a synthesis of multiple medical opinions, it does not follow a strict template, making information extraction particularly challenging. The accompanying medical case files that describe a patient’s history are also disorganized, with key details appearing in different locations across reports. Moreover, since different physicians contribute to these notes, terminology and writing styles vary, further complicating automated processing. Additionally, a single patient may undergo multiple RCPs at different time points, with each session leading to a new decision based on the patient’s evolving condition.

Another challenge arises from the fact that all reports are written in French. Many existing NLP models, especially those designed for medical text analysis, are pre-trained on English-language datasets. Applying these models directly to our dataset requires significant adaptation, including fine-tuning on domain-specific French medical texts. The combination of language complexity, lack of standardized structure, and the presence of multiple RCPs per patient makes this dataset a prime candidate for advanced NLP techniques: rule-based methods alone are insufficient, and robust machine learning approaches must be employed to extract key insights, structure the data, and predict RCP outcomes.

The ultimate goal of this project is to develop an AI model capable of assisting medical professionals by analyzing RCP reports and suggesting likely decisions based on historical data from the Georges Pompidou European Hospital. By proving the feasibility of such an approach, this research could pave the way for similar decision-support tools in other hospitals, adapted to their specific resources, medical expertise, and clinical workflows.

Challenges of the data

The dataset presents several challenges related to data inconsistency, anonymization, and labeling, all of which complicate the task of training an effective NLP model.

One of the most significant issues is the lack of uniformity in the data. RCP reports are written by different physicians, each with their own style, structure, and level of detail. Unlike structured medical records that follow a standardized format, these reports are free-text summaries of patient cases, making it difficult to extract key information in a consistent manner. Furthermore, the way information is presented varies significantly—some reports are highly detailed, while others provide only minimal descriptions of the case. Additionally, since RCPs are cumulative discussions that evolve over time, a single report may contain redundant or overlapping information from previous meetings, which could lead to data leakage or unnecessary repetition in the model’s training process.

Another major challenge is data anonymization. The original RCP reports contain sensitive patient information, including names, specific dates, and highly detailed medical histories. Before using this data for model training, it must undergo rigorous anonymization, a process currently being handled by our collaborator, Dr. Tom Levy-Boeken. However, even after removing explicit identifiers, some medical details remain highly specific, making it difficult to generalize the model’s predictions beyond the Georges Pompidou European Hospital. The model could learn patterns too closely tied to the particularities of this hospital’s patient population, reducing its effectiveness when applied to other hospitals with different medical practices and resources.

Finally, labeling poses an additional challenge. Unlike traditional classification problems where labels belong to predefined categories, the decisions made during RCPs (“A”) are full sentences rather than discrete classes. This makes it difficult to directly map decisions to a fixed set of categories. Furthermore, a single decision can implicitly contain multiple medical classes, meaning that a single label might represent multiple overlapping recommendations. This complexity requires careful preprocessing and potentially the development of specialized NLP techniques to extract structured information from these decision texts.

Hypothesis and Solution


The hypothesis underlying this project posits that the characteristics leading to the final decision in a Réunion de Concertation Pluridisciplinaire (RCP) are not easily identifiable and vary based on each individual case. RCP outcomes are influenced by a multitude of factors, including the diversity of medical opinions, varying patient conditions, and the hospital’s resources and protocols. As such, the key features that drive the medical decision-making process in these meetings are complex, heterogeneous, and not straightforward to extract.


To address this, we propose training a large language model (LLM) on a broad range of RCP reports. Exposed to the full spectrum of RCP discussions across different cases, the model can learn from diverse medical scenarios and capture the nuanced relationships between the information discussed and the final decision, uncovering patterns that are not immediately obvious to human analysts and identifying common factors that influence decisions. The goal is to build a predictive tool that generalizes across cases, supporting medical professionals in making well-informed, data-driven decisions.

Preprocessing

Preprocessing of the RCP data is a crucial step to ensure that sensitive information is protected and that the data is ready for model training. First, rigorous anonymization is carried out on the RCP reports, which includes the removal of all personal information, such as names, surnames, job titles, and any other details that could identify a patient or a physician. This step is essential for maintaining confidentiality and ensuring that the model cannot infer any personal information.

Next, the RCP reports are translated into English, since the models in our pipeline, GPT-4 for data generation and the medical language models we fine-tune, are trained primarily on English text and perform better on it. The translation is done using the GPT-4 API, ensuring consistency and high quality.
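A minimal sketch of this translation step, assuming the official OpenAI Python client; the prompt wording is an illustrative stand-in rather than the exact one used:

```python
# Sketch: translating an anonymized French RCP report into English with GPT-4.
# Assumes the openai package; the prompt is an illustrative placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def translate_report(french_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for consistent translations
        messages=[
            {"role": "system",
             "content": "You are a medical translator. Translate this French "
                        "multidisciplinary meeting (RCP) report into English, "
                        "preserving every clinical detail."},
            {"role": "user", "content": french_text},
        ],
    )
    return response.choices[0].message.content
```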

Finally, manual categorization of the labels and sub-labels is performed to structure the decisions made during the meetings. The labels represent the final outcomes of the discussions, while the sub-labels specify additional details or recommendations associated with each decision. This categorization follows a precise classification table, aligning the different medical decisions with coherent categories for the model, thereby facilitating supervised learning.

Labeling Issue

In the dataset, some class/subclass pairs are overrepresented, while others are completely absent. This imbalance presents a challenge for the model, as it may lead to biased predictions towards the more frequent classes. To address this, it becomes necessary to simplify the problem by consolidating the labels into more general categories.

For each label/sub-label pair, a number is assigned, and the distribution of these classes across the dataset is closely observed. The labeling process is performed manually by the physician, based on the original results of the RCP reports. This manual assignment ensures that the labels reflect the medical decisions accurately, but it also contributes to the inconsistencies in the distribution of these classes, requiring further adjustments during model training.
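As an illustration, the numbering and distribution check might look like the sketch below; the pair names are placeholders, not the actual clinical categories:

```python
# Sketch: assigning a number to each label/sub-label pair and inspecting
# the class distribution. Pair names here are hypothetical placeholders.
from collections import Counter

pair_to_id = {
    ("A", "sub1"): 0,
    ("A", "sub2"): 1,
    ("B", "sub1"): 2,
    # ... one entry per pair in the classification table
}

# Physician-assigned pairs for each report (placeholder values)
annotations = [("A", "sub1"), ("B", "sub1"), ("A", "sub1")]
class_ids = [pair_to_id[pair] for pair in annotations]
print(Counter(class_ids))  # exposes over- and under-represented classes
```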

Data Augmentation

In our case, the limited number of RCP reports available presents a significant challenge for training a robust model. To overcome this limitation, data augmentation is crucial to generate synthetic data that can simulate diverse RCP scenarios. By leveraging the GPT-4 API, we can reformulate and generate new RCPs, enriching our dataset and improving the model's ability to generalize.


Step 1: The first step involves removing elements that are too specific to individual cases, such as patient dates, age, and recovery times. These details may inadvertently reveal sensitive information or introduce biases into the model. Additionally, we remove random paragraphs to simulate the natural variability found in RCP reports.

Step 2: The next step involves adding artificial dates and creating new types of patients that are similar in age to the original ones. This ensures that the newly generated RCPs are realistic while maintaining diversity. The text is then summarized and synthesized, creating a new but plausible medical case.

Step 3: Finally, we perform random augmentation by increasing the size of the RCP summaries. This includes adding superfluous sentences (noise) that mimic the informal and unstructured nature of real-world RCPs, ensuring that the model is exposed to a broader variety of possible report formats and medical discussions.

By employing these automated augmentation steps with GPT-4, we significantly expand our dataset, enabling the model to learn from a more diverse range of scenarios and making it more capable of handling the inherent variability in RCP reports. This synthetic data generation process is crucial for enhancing the model’s robustness and improving its predictive capabilities in a real-world medical setting.

To maximize the amount of synthetic data generated from a single RCP, we execute a series of three successive calls to the GPT-4 API for each of the above steps. We randomly change the prompts for each call to ensure that the generated results differ for every synthetic RCP. This variability in the generated data enhances the diversity of the dataset and makes the model more robust to different phrasing and medical case discussions.
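Below is a sketch of this chained-call scheme, again assuming the OpenAI Python client; the prompt pools are short illustrative stand-ins for the real, more detailed prompts:

```python
# Sketch: generating one synthetic RCP through three successive GPT-4 calls,
# one per augmentation step, drawing a prompt at random for each call.
import random

from openai import OpenAI

client = OpenAI()

PROMPT_POOLS = [
    [  # Step 1: strip case-specific details, drop a random paragraph
        "Remove all dates, ages and recovery times, then delete one paragraph at random:",
        "Redact patient-specific timing details and omit a randomly chosen paragraph:",
    ],
    [  # Step 2: add artificial dates and a comparable patient, then summarize
        "Add plausible artificial dates and a similar-aged fictional patient, then summarize the case:",
        "Invent consistent dates and a comparable patient profile, then synthesize the report:",
    ],
    [  # Step 3: enlarge the summary with superfluous, noisy sentences
        "Expand the summary with a few redundant, informal sentences:",
        "Lengthen the report by adding unstructured filler sentences:",
    ],
]

def synthesize_rcp(original_text: str) -> str:
    text = original_text
    for pool in PROMPT_POOLS:  # three chained calls, one per step
        prompt = random.choice(pool)
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": f"{prompt}\n\n{text}"}],
        )
        text = response.choices[0].message.content
    return text
```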


Dataset

For each original RCP, in every class, we created 50 synthetic RCPs based on randomly generated prompt series. This allowed us to expand the dataset and introduce a wider range of variations for model training. The synthetic RCPs were designed to be as diverse as possible, altering various aspects of the original reports while keeping them realistic and medically plausible.

To keep the model evaluation unbiased, we split the dataset so that an original RCP and all of its synthetic variants always fall on the same side of the train/test boundary. If a synthetic rewrite of a test report appeared in the training set, the model would effectively be evaluated on paraphrases of cases it had already seen, a form of data leakage. Grouping each original report with its synthetic versions before splitting prevents this, ensuring that the test set, which contains both original RCPs and their corresponding synthetic versions, measures how well the model generalizes to genuinely unseen cases, both real and generated.
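One way to implement such a leakage-safe split is to group each original report with its synthetic variants and split at the group level, for instance with scikit-learn's GroupShuffleSplit. A sketch, with hypothetical record fields:

```python
# Sketch: an original RCP and all of its synthetic variants share a group id,
# so GroupShuffleSplit keeps them on the same side of the train/test boundary.
from sklearn.model_selection import GroupShuffleSplit

records = [  # placeholder records; origin_id links synthetics to their original
    {"text": "original report ...", "label": 0, "origin_id": "rcp_001"},
    {"text": "synthetic variant ...", "label": 0, "origin_id": "rcp_001"},
    {"text": "another report ...", "label": 1, "origin_id": "rcp_002"},
]
texts = [r["text"] for r in records]
labels = [r["label"] for r in records]
groups = [r["origin_id"] for r in records]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(texts, labels, groups))
train_texts = [texts[i] for i in train_idx]
test_texts = [texts[i] for i in test_idx]
```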

Binary Classification Approach

Given that the project is a proof of concept and the dataset exhibits a highly imbalanced distribution across the five different classes, we initially decided to simplify the problem by training the model for a binary classification task. Specifically, we focused on classifying RCP outcomes into two categories: "invasive treatment" (class C-E) and "non-invasive treatment" (class A-B-D).

This approach was chosen for several reasons. First, the imbalance in the dataset would have made it difficult to effectively train a model on all five classes, as some classes were underrepresented, leading to potential bias in predictions. By grouping the classes into two broader categories—one for invasive treatments and one for non-invasive treatments—we aimed to balance the data and create a more manageable problem for our proof of concept.
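The consolidation itself reduces to a simple mapping over the class letters (a short sketch):

```python
# Sketch: collapsing the five RCP classes into a binary label.
# C and E map to "invasive treatment" (1); A, B and D to "non-invasive" (0).
INVASIVE = {"C", "E"}

def to_binary_label(rcp_class: str) -> int:
    return 1 if rcp_class in INVASIVE else 0

assert to_binary_label("C") == 1 and to_binary_label("B") == 0
```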

This binary classification task allows us to evaluate the model's ability to differentiate between two fundamentally distinct treatment approaches, which are clinically significant and widely applicable in medical decision-making. After successfully demonstrating the feasibility of the binary classification model, we can explore more complex, multi-class classification tasks with the expanded dataset.

Fine-Tuning a Pre-trained Model

Fine-tuning refers to the process of adapting an already pre-trained model to a specific task by further training it on a smaller, task-specific dataset. Rather than training a model from scratch, which requires a massive amount of labeled data and computational resources, fine-tuning allows us to take advantage of the knowledge already learned by a model during its initial training on a large, general dataset. This is especially useful in domains where labeled data is scarce, such as in the medical field, where obtaining labeled data is both time-consuming and expensive.

In our case, where the vocabulary is highly specialized (medical terminology) and the available dataset for training is limited, fine-tuning is a more effective approach. By using a pre-trained model, we can leverage the general knowledge it has acquired, especially in related fields, and then fine-tune it to perform well on our specific task of classifying RCP outcomes.


We decided to fine-tune BioLinkBERT, a model specialized for medical text. BioLinkBERT follows the BERT-base architecture, with 110 million parameters, and is pre-trained on large PubMed corpora enriched with the citation links between documents, which helps it capture medical vocabulary and cross-document context. This makes it an ideal choice for our task, as it already has a good understanding of medical terms and structures, and we only need to adapt it to our specific classification problem.


To fine-tune the BioLinkBERT model, we use the Hugging Face library, which provides an easy-to-use interface for working with pre-trained models. Here's how we can proceed:

Loading the Model: We begin by loading the pre-trained BioLinkBERT model from the Hugging Face Model Hub. The model has already been trained on medical data (PubMed), so it understands medical terminology and language patterns.

Preparing the Dataset: We preprocess our dataset of RCP reports into a format suitable for the model. This typically involves tokenizing the text into smaller pieces (tokens) using the same tokenizer that was used for the pre-training of BioLinkBERT. The text is then encoded into numerical vectors that the model can process.
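A minimal sketch of these first two steps with the Hugging Face transformers library, assuming the michiyasunaga/BioLinkBERT-base checkpoint on the Model Hub (passing num_labels=2 also attaches the binary classification head described next):

```python
# Sketch: loading BioLinkBERT and tokenizing the RCP reports.
# The checkpoint name is assumed to be the public BioLinkBERT-base release.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "michiyasunaga/BioLinkBERT-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# num_labels=2 puts a fresh two-way classification head on top of the encoder
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

train_texts = ["The patient presents with ..."]  # English, anonymized RCP reports
encodings = tokenizer(
    train_texts,
    truncation=True,        # clip reports longer than the 512-token limit
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)
```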

Adding a Classification Head: BioLinkBERT, like many transformer models, is designed for multiple types of tasks. To adapt it to our specific binary classification problem (invasive vs. non-invasive treatment), we add a classification head on top of the model. This classification head is a simple fully connected (dense) layer that takes the output from the last hidden layer of BioLinkBERT and predicts the binary class label. The final output is a probability score between 0 and 1, representing the likelihood of the report belonging to the invasive treatment class.

Choosing the Loss Function: Since this is a binary classification task, we use the Binary Cross-Entropy loss function. This loss function calculates the difference between the predicted probability and the true class label, helping the model learn to predict the correct class. The goal is to minimize this loss during training.

Training: The model is trained using a GPU (in this case, a T4 GPU) to speed up the training process. During training, the model's weights are updated using backpropagation to minimize the loss. Since our dataset is small, we use a smaller learning rate to avoid overfitting and to allow the model to learn subtle adjustments specific to our dataset.

Fine-Tuning: Fine-tuning involves running the model through several epochs of training, adjusting its weights based on the medical RCP data. We apply techniques such as early stopping to prevent overfitting and monitor the model’s performance on a validation set to ensure it generalizes well to unseen data.
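Putting these training pieces together, and continuing from the loading sketch above, a hedged sketch using the Hugging Face Trainer; the hyperparameters are illustrative, not the exact values from our runs, and train_dataset/val_dataset stand for the tokenized splits prepared earlier:

```python
# Sketch: fine-tuning with the Trainer API, with early stopping on the
# validation loss. Hyperparameters are illustrative placeholders.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="biolinkbert-rcp",
    learning_rate=2e-5,                # small LR: gentle updates on a small dataset
    num_train_epochs=10,
    per_device_train_batch_size=8,
    evaluation_strategy="epoch",       # evaluate on the validation set each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,       # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,                        # BioLinkBERT with its binary head
    args=args,
    train_dataset=train_dataset,        # tokenized reports with binary labels
    eval_dataset=val_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
# Note: with num_labels=2 the Trainer minimizes cross-entropy over two logits,
# which is mathematically equivalent to binary cross-entropy for two classes.
```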


By the end of the fine-tuning process, the model will be adapted to classify RCP outcomes based on the specialized medical vocabulary and structure present in our dataset, making it more suitable and accurate for our specific task.

This approach is preferable to using a model built from scratch because it significantly reduces the amount of data and computational resources needed. Fine-tuning a model already familiar with medical language allows us to focus on task-specific nuances and ensure better performance on our specialized classification task.

Results

We achieved an accuracy of 0.76 on the binary classification task and 0.69 on the 5-class classification task. These results are promising, especially considering the initial challenges of working with a small and highly imbalanced dataset. Despite the limited data available for training, the model demonstrated its ability to make reliable predictions, highlighting the effectiveness of fine-tuning and data augmentation techniques. Given the circumstances, these performance metrics are encouraging and suggest that our approach is capable of addressing the task, even with a relatively small and unbalanced dataset.

Post-OC Analysis

Post-OC (Post-Outcome Classification) analysis in NLP refers to the process of evaluating and interpreting the results of a classification model after it has made its predictions. In our case, Post-OC analysis helps to better understand the model's decisions, evaluate its strengths and weaknesses, and refine it for more accurate predictions in the future. Specifically, this analysis allows us to identify patterns in the predictions, such as misclassifications or areas where the model's performance could be improved. It also aids in validating the model’s decision-making process by aligning the predicted outcomes with the actual RCP reports and medical decisions.

Post-OC analysis is performed by examining the confusion matrix, which compares the predicted labels with the true labels. This enables us to identify which classes are being misclassified and why. We can also perform error analysis to detect specific cases where the model might struggle, such as cases with ambiguous or sparse information. Additionally, techniques like model explainability (e.g., SHAP or LIME) can be used to gain insights into the model's decision-making process, revealing the factors influencing the model’s predictions. This in-depth analysis ensures that the model remains transparent and helps to improve its performance by iterating on its predictions.
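For instance, a minimal sketch of the confusion-matrix step with scikit-learn, using placeholder labels:

```python
# Sketch: post-classification error analysis with a confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix

y_true = [1, 0, 1, 1, 0]  # placeholder true binary labels (1 = invasive)
y_pred = [1, 0, 0, 1, 0]  # placeholder model predictions

print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted
print(classification_report(y_true, y_pred,
                            target_names=["non-invasive", "invasive"]))
```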

In our case, the Post-OC analysis reveals that the model has successfully identified the key elements that contribute to the classification of the RCP outcomes in most cases. After a review by the physician, it was confirmed that the model, in a large majority of instances, takes into account the correct factors when making its decisions. This validation provides strong evidence that the model is interpreting the RCP reports in a way that aligns with medical expertise, ensuring that it relies on the most relevant clinical information for classification. Such findings reinforce the model's potential to assist in real-world decision-making and support its deployment in medical contexts.

Conclusion and Future Work

In conclusion, the results obtained from this project are very promising, demonstrating the potential of using AI-driven NLP models to predict the outcomes of multidisciplinary case reviews (RCP). While the model performed well, with an accuracy of 0.76 for binary classification and 0.69 for 5-class classification, the performance could likely improve with more training data. The limited data available can be attributed to the slow process of manual classification and anonymization.


Moving forward, there are several key areas for future work. First, obtaining more training data is essential to improving model performance. One way to achieve this is to train a classification model that automatically labels the data from the "A" files. Additionally, implementing an automated pseudonymization pipeline would streamline the data preparation process. Another important task is refining the data augmentation pipeline by analyzing which elements of the text can be replaced without altering the final prediction, further boosting data diversity.


Further exploration could focus on studying the influence of specific sub-paragraphs in the RCP reports on the final decision. Understanding which sections of the report have the most significant impact on treatment decisions at HEGP would provide valuable insights for improving model predictions and enhancing clinical decision-making support.
