The rise of multimodal language models in drug development

Posted: 12 June 2025 | Alessio Zoccoli, Dr Carlos Velez, Dr Remco Jan Geukes Foppen, Vincenzo Gioia

Industry experts Remco Jan Geukes Foppen, Vincenzo Gioia, Alessio Zoccoli and Carlos Velez reflect on the need to ensure data quality in order to gain the full advantage of multimodal language models (MLMs).

Regulatory bodies like the US Food and Drug Administration (FDA) and European Medicines Agency (EMA) are emphasising the importance of data quality for AI applications in healthcare. High-quality, well-structured data and metadata are characterised by accuracy, consistency and completeness, which are essential for reliable insights and trustworthy AI outputs. Addressing data quality challenges, including missing data, inconsistencies and biases, is paramount for realising the full potential of multimodal language models (MLMs) in drug discovery and ensuring patient safety.

MLMs owe their success to two key factors: open access policies and the introduction of next-generation sequencing (NGS).¹ Open access to clinical data has made available the large volumes of raw material needed to train multimodal models. NGS has generated huge and complex datasets, ideal for the integration capabilities of generative AI (GenAI). Clinical genomics, through patient stratification and the validation of therapeutic targets, has highlighted the value of a multimodal approach, based on genomic, clinical and pharmacological data, to identify personalised therapies. These factors have created an ecosystem that allows MLMs to address the challenges of drug development.

MLMs analyse genomic data, images and scientific literature simultaneously, rapidly uncovering correlations and interactions that would typically take years of manual research to find. The automatic generation of molecular structures accelerates drug candidate design by predicting target affinity.

Through increased efficiency, MLMs can lower drug development costs. Early identification and elimination of low-potential drug candidates minimises wasted resources. In preclinical phases, MLMs combine laboratory data with computational models for more accurate safety and efficacy predictions. Real-time trial analysis enables quick issue identification, preventing unnecessary protocol modifications. Furthermore, automating complex tasks like histological image analysis reduces the cost of manual labour.

MLMs also analyse extensive data volumes to predict clinical trial outcomes more accurately, allowing companies to adjust strategies based on probability of success (PoS). Targets are validated with greater confidence, thereby reducing risk. Advanced stratification enables the design of subgroup-specific drugs, enhancing efficacy and lowering failure risks, particularly in oncology with targeted therapies. Additionally, GenAI enhances clinical trial design by identifying eligible patients and shortening recruitment times.

Managing and integrating data from heterogeneous sources, such as genomic sequences, clinical data, biological images and chemical structures, presents a complex data normalisation challenge. Differences in data formats, quality and granularity make it difficult to create robust and standardised analytical pipelines. Furthermore, interpreting the patterns generated by multimodal models often requires multidisciplinary skills that are not yet widely available.

Data quality, GenAI and MLMs

High-quality data is characterised by accuracy, completeness, consistency, timeliness and uniqueness. Data quality is foundational for generating reliable insights and predictions and for driving consequential decision-making. It sets the upper bound for accurate, relevant and coherent results. The challenge is well known, yet efforts towards improving data quality are still limited. For AI systems to produce trustworthy and explainable results,² high-quality data and comprehensive metadata are indispensable. Continuous evaluation of data quality metrics is crucial, alongside the prioritisation of measures to mitigate biases.³ This ongoing assessment ensures the integrity and reliability of the data used in AI models and of the subsequent outputs. With the repositories now available, and the data augmentation they make possible, data quality is vital to enhancing a model’s capability to understand and generate content across different modalities.
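Two of the dimensions named above, completeness and uniqueness, can be tracked as simple ratios. The following is a minimal sketch only: the records, the field names and both function names are hypothetical, and a real pipeline would define these metrics against its own schema.

```python
from typing import Any

# Hypothetical records; identifiers and values are illustrative only.
RECORDS: list[dict[str, Any]] = [
    {"patient_id": "P001", "gene": "BRCA1", "dose_mg": 50},
    {"patient_id": "P002", "gene": None, "dose_mg": 75},
    {"patient_id": "P002", "gene": None, "dose_mg": 75},  # exact duplicate
    {"patient_id": "P003", "gene": "TP53", "dose_mg": None},
]
FIELDS = ["patient_id", "gene", "dose_mg"]


def completeness(records: list[dict[str, Any]], fields: list[str]) -> float:
    """Fraction of expected field values that are present (not None)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(1 for r in records for f in fields if r.get(f) is not None)
    return present / total


def uniqueness(records: list[dict[str, Any]], fields: list[str]) -> float:
    """Fraction of records that are distinct on the given fields."""
    if not records:
        return 1.0
    distinct = {tuple(r.get(f) for f in fields) for r in records}
    return len(distinct) / len(records)


if __name__ == "__main__":
    print("completeness:", completeness(RECORDS, FIELDS))
    print("uniqueness:", uniqueness(RECORDS, FIELDS))
```

Recomputing such ratios on every data refresh is one way to make the evaluation continuous rather than a one-off audit.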

Key to enabling data quality are well-defined metrics and suitable operations. Good data quality helps models produce better results and reduces the effort data scientists spend debugging their informatics pipelines.

Some metrics for assessing data quality are:

Factual and conceptual accuracy: information, facts and definitions are correct, improving model trustworthiness and reliability
Contextual accuracy: accurate context ensures that the information is relevant, self-contained and applicable to the intended use cases, enabling better generation for the models
Consistency: this applies to both data and sources. Consistent data improves the ability of models to learn and extract patterns and helps mitigate biases. Consistency across data sources should also be considered: heterogeneous sources are not to be feared, yet it is crucial to give the model a way to understand information cohesively and consistently.
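One way to keep heterogeneous sources consistent is to map each source onto a shared schema before the data reaches a model. The sketch below assumes two hypothetical sources, `clinic_a` and `clinic_b`, with invented field names; the mapping tables and the `harmonise` function are illustrative, not an established standard.

```python
from typing import Any

# Hypothetical per-source field mappings onto a shared schema.
FIELD_MAP: dict[str, dict[str, str]] = {
    "clinic_a": {"pid": "patient_id", "wt_kg": "weight_kg"},
    "clinic_b": {"PatientID": "patient_id", "weight_lb": "weight_kg"},
}

# Conversion factors for fields whose unit differs from the shared schema.
UNIT_FACTOR: dict[tuple[str, str], float] = {
    ("clinic_b", "weight_lb"): 0.45359237,  # pounds to kilograms
}


def harmonise(source: str, record: dict[str, Any]) -> dict[str, Any]:
    """Rename fields and convert units so every source yields the same schema."""
    mapping = FIELD_MAP[source]
    out: dict[str, Any] = {"source": source}  # keep the origin for traceability
    for field, value in record.items():
        if field not in mapping:
            # Fail loudly rather than silently dropping unknown information.
            raise KeyError(f"unmapped field {field!r} from source {source!r}")
        factor = UNIT_FACTOR.get((source, field), 1.0)
        if isinstance(value, (int, float)):
            value = value * factor
        out[mapping[field]] = value
    return out


if __name__ == "__main__":
    print(harmonise("clinic_a", {"pid": "P001", "wt_kg": 70.0}))
    print(harmonise("clinic_b", {"PatientID": "P002", "weight_lb": 100}))
```

Raising an error on an unmapped field is a deliberate choice in this sketch: it surfaces schema changes at ingestion time instead of letting them reach the model unnoticed.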

Common challenges and solutions for data quality

Missing data: this is an obstacle to accuracy as it omits valuable context and information. It also relates to data discontinuity: there are pockets of very rich data and pockets of very sparse data.

Inconsistencies: data should be formatted and labelled in a well-defined way, as ambiguity can mislead both humans and models. This aspect is also crucial for interpretable results.

Duplicate data: redundancy may introduce noise and strengthen biases, leading to poorly performing models.

Data traceability and immutability: metadata pertaining to data sources, quality and context should be meticulously documented, so that AI applications have the necessary contextual information during data processing.
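Traceability and immutability can be approximated by storing a content hash next to the source metadata, so that any later change to the data is detectable. This is a minimal sketch under stated assumptions: `ProvenanceRecord`, `register` and `verify` are hypothetical names, and the source label and date are invented for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ProvenanceRecord:
    source: str        # where the data came from
    collected_on: str  # ISO date string supplied by the caller
    sha256: str        # fingerprint of the exact bytes that were registered


def fingerprint(payload: bytes) -> str:
    """Return the SHA-256 hex digest of the payload."""
    return hashlib.sha256(payload).hexdigest()


def register(payload: bytes, source: str, collected_on: str) -> ProvenanceRecord:
    """Create a provenance record for a dataset at ingestion time."""
    return ProvenanceRecord(source, collected_on, fingerprint(payload))


def verify(payload: bytes, record: ProvenanceRecord) -> bool:
    """Check that the payload still matches the registered fingerprint."""
    return fingerprint(payload) == record.sha256


if __name__ == "__main__":
    data = b"ACGTACGT"  # stand-in for a real dataset file
    record = register(data, source="hypothetical_lab", collected_on="2025-01-15")
    print(verify(data, record))         # unchanged data
    print(verify(data + b"A", record))  # modified data
```

A hash does not prevent modification by itself; it only makes modification evident, which is why the record is kept separately from the data it describes.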

Data integrity and regulatory compliance: building reliable AI models

There are also legal, ethical and financial challenges. In the life sciences domain, where decisions play a crucial role in people’s lives, risks cannot be ignored. Misguided AI predictions raise serious concerns about accountability and liability. In this sense, data quality is not only a technical challenge but a legal one too. Notably, in 2025 the FDA issued draft guidance⁴ on AI in drug development that stresses data quality as crucial for reliable AI-driven results, defining “fit-for-use” with metrics like relevance and reliability. Variability in data quality, size and representation can introduce bias and undermine confidence in AI model outputs.

The guidance urges the use of relevant and reliable datasets for training, evaluating and maintaining AI models. A key component is a risk-based credibility assessment framework, which includes defining the question of interest, the ‘context of use’ and assessing the model risk. This assessment includes evaluating the data used in model development and ensuring its adequacy. Furthermore, the guidance highlights the need for lifecycle maintenance, including continuous monitoring⁵ of model output and accuracy, to address potential data drift and ensure consistent performance over time.
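The text above does not name a drift metric. As one possible illustration, the sketch below uses the population stability index (PSI), which compares how a feature is distributed in a baseline sample and in current data. The binning scheme, the `eps` floor for empty bins and the function name are assumptions made for this example.

```python
import math


def psi(baseline: list[float], current: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between two samples of one numeric feature.

    Bins are equal-width over the baseline range; values outside that range
    are counted in the first or last bin.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm below is defined.
        return [max(c / len(values), eps) for c in counts]

    expected, actual = proportions(baseline), proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]  # synthetic baseline
    shifted = [v + 0.5 for v in reference]     # synthetic drifted sample
    print("no drift:", psi(reference, reference))
    print("drifted:", psi(reference, shifted))
```

Identical distributions give a PSI of zero and the value grows as the distributions diverge. Any alert threshold would need to be set for the specific context of use rather than taken from this sketch.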

In the EU, this topic is reinforced by the EU AI Act,⁶ which regulates accountability for AI deployment. It mandates that organisations meticulously document the methodologies used to generate AI outputs, clearly articulate the intended purpose of the AI system, and obtain explicit consent for all data used. Furthermore, all data inputs must undergo thorough filtering processes.

High-quality multimodal data, such as well-labelled medical images, genomic sequences and textual annotations, enable these models to identify patterns, generate insights and assist in diagnostics with greater accuracy. Conversely, poor data quality, characterised by errors, biases or inconsistencies, can lead to erratic decision-making, flawed diagnostics and the propagation of misinformation, which may have problematic implications in sensitive areas like healthcare and drug discovery.

A strategic vision for multimodal AI-driven drug development

While AI has long been utilised in drug discovery, its application has largely been confined to ‘point solutions’ that address isolated inefficiencies or obstacles. The integration of AI, particularly MLMs, with vast genomic and clinical datasets, enabled by advancements in next-generation sequencing and open data policies, is paving the way for integrated, end-to-end AI. Because the drug development pipeline inherently relies on integrating diverse data sources, from molecular structures to clinical outcomes, MLMs offer a powerful data-driven framework for uncovering meaningful insights, ultimately aiming to increase PoS and expedite the development of novel therapies.⁷ By revealing intricate biological relationships that unimodal approaches miss, they provide a richer understanding of complex systems.

If AI in drug development is to transition from fragmented point solutions to integrated, end-to-end approaches, several challenges must be addressed, such as managing data complexity, ensuring data quality and addressing ethical considerations related to data privacy and consent. The potential impact of multimodal analysis in this field is immense. This shift fosters a systems-thinking approach, reducing redundancies and optimising strategies based on probability of success. It enables management to make data-driven decisions on resource allocation, project prioritisation and go/no-go points. The refined PoS assessment aims to reduce risk and accelerate the development of more effective and targeted therapies.
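To make the link between PoS and go/no-go decisions concrete, the sketch below uses a deliberately simple model: cumulative PoS is the product of per-phase success probabilities, treated as independent. All probabilities and the decision threshold are hypothetical numbers chosen for illustration, not industry benchmarks, and the function names are invented for this example.

```python
from math import prod

# Hypothetical phase-transition probabilities (illustrative only).
BASELINE = {"phase_1": 0.6, "phase_2": 0.3, "phase_3": 0.6, "approval": 0.9}


def cumulative_pos(phase_probabilities: dict[str, float]) -> float:
    """Multiply per-phase success probabilities into an overall PoS."""
    for phase, p in phase_probabilities.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {phase!r} must be in [0, 1]")
    return prod(phase_probabilities.values())


def go_no_go(pos: float, threshold: float) -> str:
    """Turn a PoS estimate into a decision against a chosen threshold."""
    return "go" if pos >= threshold else "no-go"


if __name__ == "__main__":
    threshold = 0.1  # hypothetical portfolio threshold
    # Suppose a refined assessment raises the phase 2 estimate from 0.3 to 0.4.
    refined = {**BASELINE, "phase_2": 0.4}
    for label, phases in (("baseline", BASELINE), ("refined", refined)):
        pos = cumulative_pos(phases)
        print(f"{label}: PoS={pos:.4f} -> {go_no_go(pos, threshold)}")
```

Because the phases multiply, a modest improvement in a single phase estimate can move the overall PoS across a decision threshold, which is why the quality of the data behind each estimate matters.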

The adoption of MLMs has significantly impacted the pharmaceutical industry by enhancing operational efficiency and potentially improving treatment quality for patients. Benefits include quicker access to pharmacological treatments due to faster drug development and marketing processes, as well as reduced treatment costs stemming from overall development efficiencies. Moreover, GenAI’s ability to learn from new data fosters continuous innovation and opens up new therapeutic opportunities.

About the authors

Remco Jan Geukes Foppen, PhD, is an AI and life sciences expert specialising in the pharmaceutical sector. His leadership has driven international commercial success in areas including image analysis, data management, bioinformatics and advanced clinical trial data analysis leveraging machine learning and federated learning. Remco’s academic background includes a PhD in biology and a master’s degree in chemistry, both from the University of Amsterdam.

Vincenzo Gioia is an AI innovation strategist. He is a business and technology executive, with a 20-year focus on quality and precision for the commercialisation of innovative tools. Vincenzo specialises in AI applied to image analysis, business intelligence and excellence. His focus on the human element of technology applications has led to high rates of solution implementation. He holds a master’s degree in political sciences and marketing from the University of Salerno.

Alessio Zoccoli applies AI for a sustainable future. His deep understanding of industry applications and technical expertise drives innovation in AI-powered solutions for complex business challenges. He specialises in cutting-edge advancements in natural language processing, computer vision and generative AI. He is a senior data scientist and holds a master’s degree from Roma Tre University in computer engineering, where he also held the role of research fellow.

Carlos Velez, PhD, MBA, is a pharmaceutical and biotechnology strategic advisor, with 25 years’ experience in consulting, venture capital, corporate strategy and entrepreneurship. Carlos specialises in helping pharmaceutical and biotechnology companies develop their in- and out-licensing strategies, with additional expertise and experience in portfolio assessment and prioritization, drug candidate valuation and related services. He also develops and presents customized training programs (both live and virtual) for companies seeking to improve their in- and out-licensing processes. He holds a PhD in Pharmacy from the University of North Carolina at Chapel Hill, and an MBA from the Rochester Institute of Technology, US.

References

1. Geukes Foppen RJ, Gioia V, Zoccoli A, Velez CN. Using Clinical Genomics And AI In Drug Development To Elevate Success. Drug Target Review February Edition. 2025. Available from: https://www.drugtargetreview.com/article/155906/clinical-genomics-ai-drug-success/
2. Gioia V, Geukes Foppen RJ. “Explainambiguity:” When What You Think Is Not What You Get. [Internet] Life Science Leader. 2024. [cited 2025 Feb]. Available from: https://www.lifescienceleader.com/doc/explainambiguity-when-what-you-think-is-not-what-you-get-0001
3. Gioia V, Geukes Foppen RJ. Correct But Misleading: AI Hallucinations In Complex Decision-Making. [Internet] Life Science Leader. 2024. [cited 2025 Feb]. Available from: https://www.lifescienceleader.com/doc/correct-but-misleading-ai-hallucinations-in-complex-decision-making-0001
4. FDA Proposes Framework to Advance Credibility of AI Models Used for Drug and Biological Product Submissions. [Internet] US Food and Drug Administration (FDA). 2025. Available from: https://www.fda.gov/news-events/press-announcements/fda-proposes-framework-advance-credibility-ai-models-used-drug-and-biological-product-submissions
5. Geukes Foppen RJ, Gioia V, Jan R, Gupta S, et al. Methodology for Safe and Secure AI in Diabetes Management. JDST. 2024. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC11672366/
6. The EU Artificial Intelligence Act: Up-to-date developments and analyses of the EU AI Act. [Internet] Future of Life Institute. Available from: https://artificialintelligenceact.eu/
7. Geukes Foppen RJ, Gioia V, Zoccoli A, Velez CN. Navigating The AI Revolution: A Roadmap For Pharma’s Future. Drug Target Review February Edition. 2025. Available from: https://www.drugtargetreview.com/article/155906/clinical-genomics-ai-drug-success/

Related organisations

European Medicines Agency (EMA), US Food and Drug Administration (FDA)
