The rise of multimodal language models in drug development

Industry experts Remco Jan Geukes Foppen, Vincenzo Gioia, Alessio Zoccoli and Carlos Velez reflect on the need to ensure data quality in order to gain the full advantage of multimodal language models (MLMs).

Regulatory bodies like the US Food and Drug Administration (FDA) and European Medicines Agency (EMA) are emphasising the importance of data quality for AI applications in healthcare. High-quality, well-structured data and metadata are characterised by accuracy, consistency and completeness, which are essential for reliable insights and trustworthy AI outputs. Addressing data quality challenges, including missing data, inconsistencies and biases, is paramount for realising the full potential of multimodal language models (MLMs) in drug discovery and ensuring patient safety.

MLMs owe their success to two key factors: open access policies and the introduction of next-generation sequencing (NGS).1 Open access to clinical data has made available the large volumes of data that are the raw material needed to train multimodal models. NGS has generated huge and complex datasets, ideal for the integration capabilities of generative AI (GenAI). Clinical genomics, through patient stratification and the validation of therapeutic targets, has highlighted the value of a multimodal approach, based on genomic, clinical and pharmacological data, to identify personalised therapies. These factors have created an ecosystem that allows MLMs to address the challenges of drug development.

MLMs analyse genomic data, images and scientific literature simultaneously, rapidly uncovering correlations and interactions that would typically take years of manual research to find. The automatic generation of molecular structures accelerates drug candidate design by predicting target affinity.

Through increased efficiency, MLMs can lower drug development costs. Early identification and elimination of low-potential drug candidates minimises resource waste, and in preclinical phases MLMs combine laboratory data with computational models for more accurate safety and efficacy predictions. Real-time trial analysis enables quick issue identification, preventing unnecessary protocol modifications. Furthermore, automating complex tasks like histological image analysis reduces costs associated with manual labour and human resources.

MLMs also analyse extensive data volumes to predict clinical trial outcomes more accurately, allowing companies to adjust strategies based on probability of success (PoS). Targets are validated with greater confidence, thereby reducing risk. Advanced stratification enables the design of subgroup-specific drugs, enhancing efficacy and lowering failure risks, particularly in oncology with targeted therapies. Additionally, GenAI enhances clinical trial design by identifying eligible patients and shortening recruitment times.

Managing and integrating data from heterogeneous sources, such as genomic sequences, clinical data, biological images and chemical structures, presents a complex data normalisation challenge. Differences in data formats, data quality and granularity make it difficult to create robust and standardised analytical pipelines. Furthermore, interpreting the patterns generated by multimodal models often requires multidisciplinary skills that are not yet widely available.
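As a rough illustration of the normalisation problem, the sketch below maps records from two sources onto one shared schema and then merges them per patient. The source layouts, field names and the pound-to-kilogram conversion step are all assumptions made up for this example; a real pipeline would need far richer schemas and validation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PatientRecord:
    """Shared schema that every source is mapped onto (hypothetical)."""
    patient_id: str
    weight_kg: Optional[float]
    variant: Optional[str]


LB_TO_KG = 0.45359237  # definition of the international avoirdupois pound


def from_clinical(row: dict) -> PatientRecord:
    # Hypothetical clinical export: weight in pounds, IDs with stray whitespace.
    weight_lb = row.get("weight_lb")
    return PatientRecord(
        patient_id=row["PatientID"].strip().upper(),
        weight_kg=round(weight_lb * LB_TO_KG, 1) if weight_lb is not None else None,
        variant=None,
    )


def from_genomic(row: dict) -> PatientRecord:
    # Hypothetical genomic export: lower-case IDs, no clinical measurements.
    return PatientRecord(
        patient_id=row["sample"].strip().upper(),
        weight_kg=None,
        variant=row.get("variant"),
    )


def merge(a: PatientRecord, b: PatientRecord) -> PatientRecord:
    # Combine two views of the same patient, preferring non-missing values.
    if a.patient_id != b.patient_id:
        raise ValueError("records describe different patients")
    return PatientRecord(
        patient_id=a.patient_id,
        weight_kg=a.weight_kg if a.weight_kg is not None else b.weight_kg,
        variant=a.variant if a.variant is not None else b.variant,
    )


clinical = from_clinical({"PatientID": " p-001 ", "weight_lb": 154.0})
genomic = from_genomic({"sample": "p-001", "variant": "BRAF V600E"})
unified = merge(clinical, genomic)
```

Mapping each source onto an explicit schema first, and merging only afterwards, keeps format differences isolated in one small adapter per source.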

Data quality, GenAI and MLMs

High-quality data is characterised by accuracy, completeness, consistency, timeliness and uniqueness. Data quality is foundational for generating reliable insights and predictions, and for driving consequential decision-making. It sets the upper bound for accurate, relevant and coherent results. The challenge is well known, yet efforts towards improving data quality are still limited. For AI systems to produce trustworthy and explainable results,2 high-quality data and comprehensive metadata are indispensable. Continuous evaluation of data quality metrics is crucial, alongside the prioritisation of measures to mitigate biases.3 This ongoing assessment ensures the integrity and reliability of the data used in AI models and the subsequent outputs. With the repositories that are now available and the ensuing data augmentation, data quality is vital for enhancing the capability of models to understand and generate content across different modalities.

Key to enabling data quality are well-defined metrics and suitable operations. Good data quality helps models produce better results and helps data scientists spend less effort debugging their informatics pipelines.

Some metrics for assessing data quality are:

  • Factual and conceptual accuracy: information, facts and definitions are correct, improving model trustworthiness and reliability
  • Contextual accuracy: accurate context ensures that the information is relevant, self-contained and applicable to the intended use cases, enabling better generation for the models
  • Consistency: consistency with respect to both data and sources. Consistent data improves the ability of models to learn and extract patterns and mitigate biases. Also, consistency across data sources should be taken into consideration; one should not fear having heterogeneous sources, yet it is crucial to afford the model a way to understand information in a cohesive and consistent way.

Common challenges and solutions for data quality

Missing data: this is an obstacle to accuracy as it omits valuable context and information. It also relates to data discontinuity: there are pockets of very rich data and pockets of very sparse data.

Inconsistencies: data should be formatted and labelled in a well-defined way, as ambiguity can mislead both humans and models. This aspect is also crucial for interpretable results.

Duplicate data: redundancy may introduce noise and strengthen biases, leading to poorly performing models.

Data traceability and immutability: metadata pertaining to data sources, quality and context should be meticulously documented, to provide AI applications with the necessary contextual information during data processing.
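The challenges above can be turned into simple, repeatable checks. The following is a minimal sketch using only the Python standard library; the function names, the toy records and the specific definitions (for example, treating labels that differ only by case as inconsistent) are assumptions for illustration, not an established standard.

```python
import hashlib
import json
from collections import Counter


def quality_report(records, required_fields, label_field):
    """Return simple completeness, duplication and label-consistency figures."""
    total = len(records)

    # Completeness: share of records with every required field present and non-empty.
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )

    # Duplicates: records whose full content repeats an earlier record.
    seen = Counter(json.dumps(r, sort_keys=True) for r in records)
    duplicates = sum(count - 1 for count in seen.values())

    # Inconsistencies: labels that differ only by case or surrounding whitespace.
    spellings = {}
    for r in records:
        label = r.get(label_field)
        if label:
            spellings.setdefault(label.strip().lower(), set()).add(label)
    inconsistent = {k: sorted(v) for k, v in spellings.items() if len(v) > 1}

    return {
        "completeness": complete / total if total else 0.0,
        "duplicates": duplicates,
        "inconsistent_labels": inconsistent,
    }


def fingerprint(records):
    """SHA-256 over a canonical JSON dump, to record which data version was used."""
    canonical = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Toy records (hypothetical): one missing dose, one exact duplicate, one label variant.
records = [
    {"id": "1", "tissue": "Liver", "dose_mg": 10},
    {"id": "2", "tissue": "liver", "dose_mg": None},
    {"id": "3", "tissue": "Kidney", "dose_mg": 5},
    {"id": "3", "tissue": "Kidney", "dose_mg": 5},
]
report = quality_report(records, ["id", "tissue", "dose_mg"], "tissue")
data_version = fingerprint(records)
```

Storing the fingerprint alongside a model's outputs is one simple way to trace a result back to the exact dataset it was produced from.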

Data integrity and regulatory compliance: building reliable AI models

There are also legal, ethical and financial challenges. In the life sciences domain, as decision-making plays a crucial role in people’s lives, risks cannot be ignored. Misguided AI predictions pose serious concerns about accountability and liability. In a sense, data quality is not only a technical challenge, but a legal one too. Notably, the FDA issued draft guidance4 on AI in drug development this year that stresses data quality as crucial for reliable AI-driven results and defines “fit-for-use” with metrics like relevance and reliability. Variability in data quality, size and representation can introduce bias and undermine confidence in AI model outputs.

The guidance urges the use of relevant and reliable datasets for training, evaluating and maintaining AI models. A key component is a risk-based credibility assessment framework, which includes defining the question of interest, the ‘context of use’ and assessing the model risk. This assessment includes evaluating the data used in model development and ensuring its adequacy. Furthermore, the guidance highlights the need for lifecycle maintenance, including continuous monitoring5 of model output and accuracy, to address potential data drift and ensure consistent performance over time.
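Continuous monitoring for data drift can be made concrete with a distribution-comparison statistic. The sketch below uses the population stability index (PSI), which is one common choice and is not prescribed by the guidance; the bin count, the equal-width binning and the example values are assumptions for illustration.

```python
import math


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, using equal-width bins from the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


reference = [float(v) for v in range(100)]  # values seen during model development
shifted = [v + 30.0 for v in reference]     # hypothetical later batch with a shift
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
```

A PSI of zero means the binned distributions match; larger values indicate that incoming data has moved away from what the model was developed on and may warrant review.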

In the EU, this topic is reinforced by the EU AI Act,6 which regulates accountability for AI deployment. It mandates that organisations meticulously document the methodologies used to generate AI outputs, clearly articulate the intended purpose of the AI system, and obtain explicit consent for all data used. Furthermore, all data inputs must undergo thorough filtering processes.

High-quality multimodal data, such as well-labelled medical images, genomic sequences and textual annotations, enable these models to identify patterns, generate insights and assist in diagnostics with greater accuracy. Conversely, poor data quality – characterised by errors, biases or inconsistencies – can lead to erratic decision-making, flawed diagnostics and the propagation of misinformation, which may have problematic implications in sensitive areas like healthcare and drug discovery.

A strategic vision for multimodal AI-driven drug development

While AI has long been utilised in drug discovery, its application has largely been confined to ‘point solutions,’ addressing isolated inefficiencies or obstacles. The integration of AI, particularly MLMs, with the vast genomic and clinical datasets enabled by advances in next-generation sequencing and open data policies is paving the way for integrated, end-to-end AI. Because the drug development pipeline inherently relies on integrating diverse data sources – from molecular structures to clinical outcomes – MLMs offer a powerful data-driven framework for uncovering meaningful insights, ultimately aiming to increase PoS and expedite the development of novel therapies.7 By revealing intricate biological relationships that unimodal approaches miss, they provide a richer understanding of complex systems.

If AI in drug development is to transition from fragmented point solutions to integrated, end-to-end approaches, challenges such as managing data complexity, ensuring data quality and addressing ethical considerations related to data privacy and consent must be addressed. The potential impact of multimodal analysis in this field is immense. This shift fosters a systems-thinking approach, reducing redundancies and optimising strategies based on probability of success. It enables management to make data-driven decisions regarding resource allocation, project prioritisation and go/no-go decisions. The refined PoS assessment aims to reduce risk and accelerate the development of more effective and targeted therapies.

The adoption of MLMs has significantly impacted the pharmaceutical industry by enhancing operational efficiency and potentially improving treatment quality for patients. Benefits include quicker access to pharmacological treatments due to faster drug development and marketing processes, as well as reduced treatment costs stemming from overall development efficiencies. Moreover, GenAI’s ability to learn from new data fosters continuous innovation and opens up new therapeutic opportunities.

About the authors

Remco Jan Geukes Foppen, PhD, is an AI and life sciences expert specialising in the pharmaceutical sector. His leadership has driven international commercial success in areas including image analysis, data management, bioinformatics, and advanced clinical trial data analysis leveraging machine learning and federated learning. Remco’s academic background includes a PhD in biology and a master’s degree in chemistry, both from the University of Amsterdam.

Vincenzo Gioia is an AI innovation strategist. He is a business and technology executive, with a 20-year focus on quality and precision for the commercialisation of innovative tools. Vincenzo specialises in AI applied to image analysis, business intelligence and excellence. His focus on the human element of technology applications has led to high rates of solution implementation. He holds a master’s degree from the University of Salerno in political sciences and marketing.

Alessio Zoccoli applies AI for a sustainable future. His deep understanding of industry applications and technical expertise drives innovation in AI-powered solutions for complex business challenges. He specialises in cutting-edge advancements in natural language processing, computer vision and generative AI. He is a senior data scientist and holds a master’s degree from Roma Tre University in computer engineering, where he also held the role of research fellow.

Carlos Velez, PhD, MBA, is a pharmaceutical and biotechnology strategic advisor, with 25 years of experience in consulting, venture capital, corporate strategy and entrepreneurship. Carlos specialises in helping pharmaceutical and biotechnology companies develop their in- and out-licensing strategies, with additional expertise and experience in portfolio assessment and prioritisation, drug candidate valuation and related services. He also develops and presents customised training programs (both live and virtual) for companies seeking to improve their in- and out-licensing processes. He holds a PhD in Pharmacy from the University of North Carolina at Chapel Hill, and an MBA from the Rochester Institute of Technology, US.

References

  1. Geukes Foppen RJ, Gioia V, Zoccoli A, Velez CN. Using Clinical Genomics And AI In Drug Development To Elevate Success. Drug Target Review February Edition. 2025. Available from: https://www.drugtargetreview.com/article/155906/clinical-genomics-ai-drug-success/
  2. Gioia V, Geukes Foppen RJ. “Explainambiguity:” When What You Think Is Not What You Get. [Internet] Life Science Leader. 2024. [cited 2025 Feb]. Available from: https://www.lifescienceleader.com/doc/explainambiguity-when-what-you-think-is-not-what-you-get-0001
  3. Gioia V, Geukes Foppen RJ. Correct But Misleading: AI Hallucinations In Complex Decision-Making. [Internet] Life Science Leader. 2024. [cited 2025 Feb]. Available from: https://www.lifescienceleader.com/doc/correct-but-misleading-ai-hallucinations-in-complex-decision-making-0001
  4. FDA Proposes Framework to Advance Credibility of AI Models Used for Drug and Biological Product Submissions. [Internet] US Food and Drug Administration (FDA). 2025. Available from: https://www.fda.gov/news-events/press-announcements/fda-proposes-framework-advance-credibility-ai-models-used-drug-and-biological-product-submissions
  5. Geukes Foppen RJ, Gioia V, Gupta S, et al. Methodology for Safe and Secure AI in Diabetes Management. [Internet] 2024. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC11672366/
  6. The EU Artificial Intelligence Act: Up-to-date developments and analyses of the EU AI Act. [Internet] Future of Life Institute. Available from: https://artificialintelligenceact.eu/
  7. Geukes Foppen RJ, Gioia V, Zoccoli A, Velez CN. Navigating The AI Revolution: A Roadmap For Pharma’s Future. Drug Target Review February Edition. 2025. Available from: https://www.drugtargetreview.com/article/155906/clinical-genomics-ai-drug-success/