Industry experts Remco Jan Geukes Foppen, Vincenzo Gioia, Alessio Zoccoli and Carlos Velez reflect on the need to ensure data quality in order to take full advantage of multimodal language models (MLMs).
Regulatory bodies like the US Food and Drug Administration (FDA) and European Medicines Agency (EMA) are emphasising the importance of data quality for AI applications in healthcare. High-quality and well-structured data and metadata are characterised by accuracy, consistency and completeness, which are essential for reliable insights and trustworthy AI outputs. Addressing data quality challenges, including missing data, inconsistencies and biases, is paramount for realising the full potential of multimodal language models (MLMs) in drug discovery and ensuring patient safety.
MLMs owe their success to two key factors: open access policies and the introduction of next-generation sequencing (NGS).1 Access to open clinical data has made available large amounts of data that are the raw material needed to train multimodal models. NGS has generated huge and complex datasets, ideal for the integration capabilities of generative AI (GenAI). Clinical genomics, through patient stratification and the validation of therapeutic targets, has highlighted the value of a multimodal approach, based on genomic, clinical and pharmacological data, to identify personalised therapies. These factors have created an ecosystem that allows MLMs to address the challenges of drug development.
MLMs analyse genomic data, images and scientific literature simultaneously, uncovering correlations and interactions that would typically take years of manual research to find. The automatic generation of molecular structures accelerates drug candidate design by predicting target affinity.
Through increased efficiency, MLMs can lower drug development costs. Early identification and elimination of low-potential drug candidates minimises resource waste. In preclinical phases, MLMs combine laboratory data with computational models for more accurate safety and efficacy predictions. Real-time trial analysis enables quick issue identification, preventing unnecessary protocol modifications. Furthermore, automating complex tasks like histological image analysis reduces costs associated with manual labour and human resources.
MLMs also analyse extensive data volumes to predict clinical trial outcomes more accurately, allowing companies to adjust strategies based on probability of success (PoS). Targets are validated with greater confidence, thereby reducing risk. Advanced stratification enables the design of subgroup-specific drugs, enhancing efficacy and lowering failure risks, particularly in oncology with targeted therapies. Additionally, GenAI enhances clinical trial design by identifying eligible patients and shortening recruitment times.
Managing and integrating data from heterogeneous sources — such as genomic sequences, clinical data, biological images and chemical structures — presents a complex challenge in terms of data normalisation. Differences in data formats, data quality and granularity make it difficult to create robust and standardised analytical pipelines. Furthermore, the interpretation of patterns generated by multimodal models often requires multidisciplinary skills that are not yet widely used.
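As a minimal sketch of what normalisation across heterogeneous sources can involve, the Python example below maps records from two differently structured sources into one common schema with a single unit. The field names, the `Observation` schema and the choice of body weight as the measurement are illustrative assumptions, not something taken from a specific pipeline.

```python
from dataclasses import dataclass

# Conversion factors to kilograms; the set of supported units is an assumption.
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


@dataclass(frozen=True)
class Observation:
    """Hypothetical common schema shared by all sources."""
    patient_id: str
    source: str
    name: str
    value: float
    unit: str


def normalise_weight(record, source, id_field, value_field, unit_field):
    """Map one source-specific record onto the common schema, in kilograms."""
    unit = str(record[unit_field]).strip().lower()
    if unit not in TO_KG:
        # Fail loudly rather than silently mixing units.
        raise ValueError(f"Unsupported unit {unit!r} from source {source!r}")
    return Observation(
        patient_id=str(record[id_field]).strip().upper(),
        source=source,
        name="body_weight",
        value=float(record[value_field]) * TO_KG[unit],
        unit="kg",
    )


# Two hypothetical sources with different field names, formats and units.
clinic_row = {"PatientID": "p7 ", "Weight": "154", "WeightUnit": "lb"}
lab_row = {"subject": "P7", "mass": 70000, "mass_unit": "g"}

observations = [
    normalise_weight(clinic_row, "clinic", "PatientID", "Weight", "WeightUnit"),
    normalise_weight(lab_row, "lab", "subject", "mass", "mass_unit"),
]
```

Rejecting unknown units instead of guessing keeps the pipeline conservative: a record that cannot be normalised is surfaced for review rather than merged with incompatible values.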
Data quality, GenAI and MLMs
High-quality data is characterised by accuracy, completeness, consistency, timeliness and uniqueness. Data quality is foundational for generating reliable insights and predictions and for driving consequential decision-making. It sets the upper bound for accurate, relevant and coherent results. The challenge is well known, yet efforts towards improving data quality are still limited. For AI systems to produce trustworthy and explainable results,2 high-quality data and comprehensive metadata are indispensable. Continuous evaluation of data quality metrics is crucial, alongside the prioritisation of measures to mitigate biases.3 This ongoing assessment ensures the integrity and reliability of the data used in AI models and the subsequent outputs. With the repositories now available and the data augmentation they enable, data quality is vital to a model's ability to understand and generate content across different modalities.
Key to enabling data quality are well-defined metrics and suitable operations. Good data quality helps models produce better results and reduces the effort data scientists spend debugging their informatics pipelines.
Useful metrics for assessing data quality include:
Factual and conceptual accuracy: information, facts and definitions are correct, improving model trustworthiness and reliability
Contextual accuracy: accurate context ensures that the information is relevant, self-contained and applicable to the intended use cases, enabling better generation for the models
Consistency: consistency with respect to both data and sources. Consistent data improves the ability of models to learn and extract patterns and mitigate biases. Also, consistency across data sources should be taken into consideration; one should not fear having heterogeneous sources, yet it is crucial to afford the model a way to understand information in a cohesive and consistent way.
Common challenges and solutions for data quality
Missing data: this is an obstacle to accuracy as it omits valuable context and information. It also relates to data discontinuity, where pockets of very rich data sit alongside pockets of very sparse data.
Inconsistencies: data should be formatted and labelled in a well-defined way, as ambiguity can mislead both humans and models. This aspect is also crucial for interpretable results.
Duplicate data: redundancy may introduce noise and strengthen biases, leading to poorly performing models.
Data traceability and immutability: metadata on data sources, quality and context should be meticulously documented, so that AI applications have the contextual information they need during data processing.
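The first three challenges above lend themselves to simple automated checks. The following is a minimal Python sketch, assuming records arrive as dictionaries. The function name `quality_report`, the field names and the controlled vocabulary are hypothetical and chosen only for illustration.

```python
from collections import Counter


def quality_report(records, required_fields, label_field, allowed_labels):
    """Summarise missing values, exact duplicates and unexpected labels."""
    # Missing data: count records where a required field is absent or empty.
    missing = {
        field: sum(1 for r in records if r.get(field) in (None, ""))
        for field in required_fields
    }
    # Duplicate data: records whose required fields are all identical.
    keys = Counter(tuple(r.get(f) for f in required_fields) for r in records)
    duplicates = sum(count - 1 for count in keys.values() if count > 1)
    # Inconsistencies: labels that fall outside the agreed vocabulary.
    unexpected = sorted(
        {
            r[label_field]
            for r in records
            if r.get(label_field) not in (None, "")
            and r[label_field] not in allowed_labels
        }
    )
    return {
        "total": len(records),
        "missing": missing,
        "duplicates": duplicates,
        "unexpected_labels": unexpected,
    }


# Illustrative records only; not real data.
records = [
    {"patient_id": "P1", "modality": "genomic", "value": "0.8"},
    {"patient_id": "P1", "modality": "genomic", "value": "0.8"},
    {"patient_id": "P2", "modality": "Genomics", "value": ""},
    {"patient_id": "P3", "modality": "imaging", "value": "0.4"},
]

report = quality_report(
    records,
    required_fields=["patient_id", "modality", "value"],
    label_field="modality",
    allowed_labels={"genomic", "imaging", "clinical"},
)
```

In this example the report flags one empty `value`, one duplicated record and the label `Genomics`, which differs from the agreed `genomic` only in spelling. A check like this would typically run before data reaches model training, with the results stored alongside the dataset's metadata.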
Data integrity and regulatory compliance: building reliable AI models
There are also legal, ethical and financial challenges. In the life sciences domain, as decision-making plays a crucial role in people’s lives, risks cannot be ignored. Misguided AI predictions pose serious concerns about accountability and liability. In a sense, data quality is not only a technical challenge, but a legal one too. Notably, the FDA issued draft guidance4 this year emphasising the importance of data quality, defining “fit-for-use” with metrics like relevance and reliability. The FDA’s draft guidance on AI in drug development stresses data quality as crucial for reliable AI-driven results. Variability in data quality, size, and how it’s represented can introduce bias and undermine confidence in AI model outputs.
The guidance urges the use of relevant and reliable datasets for training, evaluating and maintaining AI models. A key component is a risk-based credibility assessment framework, which includes defining the question of interest, the ‘context of use’ and assessing the model risk. This assessment includes evaluating the data used in model development and ensuring its adequacy. Furthermore, the guidance highlights the need for lifecycle maintenance, including continuous monitoring5 of model output and accuracy, to address potential data drift and ensure consistent performance over time.
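One way to monitor for data drift is to compare the distribution of a feature at training time with the distribution seen later. The Python sketch below uses the population stability index (PSI), a common drift statistic. This choice is an assumption made for illustration, since the guidance as described here does not prescribe a particular method, and the bin count and smoothing constant are likewise illustrative.

```python
import math


def population_stability_index(baseline, current, bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(baseline), max(baseline)
    # Equal-width bins over the baseline range; fall back to 1.0 if it is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range go into the first or last bin.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Illustrative samples only: the second sample is shifted upwards.
training_values = [i / 100 for i in range(100)]
recent_values = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(training_values, training_values)
psi_shifted = population_stability_index(training_values, recent_values)
```

Identical samples give a PSI of zero, while the shifted sample gives a clearly positive value. The alert threshold is a policy decision to be set and justified per context of use rather than fixed in code.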
In the EU, this topic is reinforced by the EU AI Act,6 which regulates the accountability for AI deployment. It mandates that organisations meticulously document the methodologies used to generate AI outputs, clearly articulate the intended purpose of the AI system, and obtain explicit consent for all data used. Furthermore, all data inputs must be subject to rigorous consent procedures and undergo thorough filtering processes.
High-quality multimodal data, such as well-labelled medical images, genomic sequences and textual annotations, enable these models to identify patterns, generate insights and assist in diagnostics with greater accuracy. Conversely, poor data quality – characterised by errors, perpetuated biases or inconsistencies – can lead to erratic decision-making, flawed diagnostics and the propagation of misinformation, which may have problematic implications in sensitive areas like healthcare and drug discovery.
A strategic vision for multimodal AI-driven drug development
While AI has long been utilised in drug discovery, its application has largely been confined to ‘point solutions,’ addressing isolated inefficiencies or obstacles. Integration of AI, particularly MLMs, with vast genomic and clinical datasets enabled by advancements in next-generation sequencing and open data policies, is paving the way for integrated, end-to-end AI. Because the drug development pipeline inherently relies on integrating diverse data sources – from molecular structures to clinical outcomes – MLMs offer a powerful data-driven framework for uncovering meaningful insights, ultimately aiming to increase PoS and expedite the development of novel therapies.7 By revealing intricate biological relationships that unimodal approaches miss, they provide a richer understanding of complex systems.
If AI in drug development is to transition from fragmented point solutions to integrated, end-to-end approaches, challenges must be addressed, such as managing data complexity, ensuring data quality and addressing ethical considerations related to data privacy and consent. The potential impact of multimodal analysis in this field is immense. This shift fosters a systems-thinking approach, reducing redundancies and optimising strategies based on probability of success. It enables management to make data-driven decisions regarding resource allocation, project prioritisation, and go/no-go decisions. The refined PoS assessment aims to reduce risk and accelerate the development of more effective and targeted therapies.
The adoption of MLMs has significantly impacted the pharmaceutical industry by enhancing operational efficiency and potentially improving treatment quality for patients. Benefits include quicker access to pharmacological treatments due to faster drug development and marketing processes, as well as reduced treatment costs stemming from overall development efficiencies. Moreover, GenAI’s ability to learn from new data fosters continuous innovation and opens up new therapeutic opportunities.
About the authors
Remco Jan Geukes Foppen, PhD, is an AI and life sciences expert specialising in the pharmaceutical sector. His leadership has driven international commercial success in areas including image analysis, data management, bioinformatics, advanced clinical trial data analysis leveraging machine learning and federated learning. Remco’s academic background includes a PhD in biology and a master’s degree in chemistry, both from the University of Amsterdam.
Vincenzo Gioia is an AI innovation strategist. He is a business and technology executive, with a 20-year focus on quality and precision for the commercialisation of innovative tools. Vincenzo specialises in AI applied to image analysis, business intelligence and excellence. His focus on the human element of technology applications has led to high rates of solution implementation. He holds a master’s degree from the University of Salerno in political sciences and marketing.
Alessio Zoccoli applies AI for a sustainable future. His deep understanding of industry applications and technical expertise drives innovation in AI-powered solutions for complex business challenges. He specialises in cutting-edge advancements in natural language processing, computer vision and generative AI. He is a senior data scientist and holds a master’s degree from Roma Tre University in computer engineering, where he also held the role of research fellow.
Carlos Velez, PhD, MBA, is a pharmaceutical and biotechnology strategic advisor, with 25 years’ experience in consulting, venture capital, corporate strategy, and entrepreneurship. Carlos specialises in helping pharmaceutical and biotechnology companies develop their in- and out-licensing strategies, with additional expertise and experience in portfolio assessment and prioritization, drug candidate valuation, and related services. He also develops and presents customized training programs (both live and virtual) for companies seeking to improve their in- and out-licensing processes. He holds a PhD in Pharmacy from the University of North Carolina at Chapel Hill, and an MBA from the Rochester Institute of Technology, US.
The EU Artificial Intelligence Act: up-to-date developments and analyses of the EU AI Act [Internet]. Future of Life Institute. Available from: https://artificialintelligenceact.eu/