Progress by the Proteomics Standards Initiative
Posted: 7 February 2009
There are compelling reasons for regularising the capture and description of proteomics data. Adhering to community-consensus specifications for the annotation of data sets can increase confidence in results and in the conclusions drawn from them, and supports data re-use; working with standard formats and vocabularies can raise efficiency and facilitates sophisticated approaches to data handling and analysis. The Human Proteome Organization’s Proteomics Standards Initiative (HUPO PSI) is a standards-generating body comprising diverse members of the proteomics community and related trades. It develops reporting guidelines, data formats and vocabulary terms with which to describe the components of a proteomics experiment. This article briefly explores the benefits accruing from the use of reporting standards, for academics and for those in a commercial setting; describes HUPO PSI, its products and the status quo with respect to compatible tools and databases; and closes by pulling back to consider multi-domain investigations in the life sciences.
To underpin understanding of the components and outputs of a bioscience investigation it is important to have access to both the data and the available metadata (in essence, this is ‘data about the data’ such as instrument parameters, protocol and operator details, and the provenance of the sample being analysed)1,2. Focused, appropriate annotation simplifies the assessment of the validity of a data set, supports its comprehension when revisited or presented to others, and facilitates re-use either by the originator or third parties via public databases such as PRIDE3. The re-use of a data set increases the return on the initial investment in the work that generated it, and can reap collateral citations for its originator. Where data are particularly voluminous or complex, as is frequently the case in proteomics, such standard solutions increase in value by facilitating the design and construction of data pipelines and repositories.
The Human Proteome Organization and the Proteomics Standards Initiative
Proteomics is emerging from a long period of growth throughout which new and newly-applied techniques and technologies have been explored and in most cases brought to maturity. Proteoinformatics has followed a similar trajectory, exploring various approaches to the gathering, storage, exchange, presentation and analysis of proteomics data in its various forms. Throughout this period, the Human Proteome Organization’s Proteomics Standards Initiative (HUPO PSI)4 has been developing standard formats and vocabularies with which to capture, annotate and exchange proteomics data, along with sets of Minimum Information (MI) guidelines that specify the subset of the available information that the proteomics community considers vital to support comprehension, validation and re-use.
The Human Proteome Organization (HUPO; http://www.hupo.org)5, founded in 2001, seeks to consolidate national and regional proteome organisations, facilitate large-scale public initiatives, engage in various scientific and pedagogical activities, such as the development and propagation of Standard Operating Procedures (SOPs), and foster standard mechanisms and requirements for the capture and exchange of proteomics data through the work of the Proteomics Standards Initiative (PSI)4. The PSI (http://www.psidev.info) generates data formats and supporting vocabularies for various applications, and reporting requirements documents (MIAPE)6 that contain a fair approximation of the community’s view on the key information to provide when reporting various aspects of a proteomics experiment.
The PSI necessarily operates in an entirely transparent manner: we hold two free-to-attend meetings per year, open to all interested parties, and all PSI projects are described on the website, with links to public email discussion lists and to all our products. Note that all PSI-generated resources are developed under a permissive ‘open source’ model, ensuring full openness and minimally-constrained redistribution rights.
PSI reporting guidelines
As stated, the PSI generates reporting requirements documents that list the information to be provided when reporting the use of, and data generated by, a particular technique. These are known as the MIAPE (Minimum Information About a Proteomics Experiment) guidelines.
The overall scheme is described along with our specific guiding principles in the ‘parent document’6; this is accompanied by a series of technique-specific modules, each containing a detailed list for a particular technique (or group thereof). The precise level of detail required will vary across the various techniques to be found in proteomics workflows, so to guide specific decisions on the data and metadata that should be required by each MIAPE module, two very general criteria are employed:
- Sufficiency: The MIAPE reporting requirements should require sufficient information about a dataset and its experimental context to allow a reader to understand and critically evaluate the interpretation and conclusions, and to support their experimental corroboration.
- Practicability: Achieving ‘MIAPE compliance’ should not be so burdensome as to prohibit the widespread use of the guidelines.
The first of these principles summarises a range of more concrete usage scenarios, all of which require the provision of an appropriately rich experimental description, such as: the discovery (for example, by database search) of data sets generated by specific techniques; the sharing of successfully-employed methods; the assessment of data and analyses in the light of the methods deployed; the informed corroboration of results through the performance of parallel or orthogonal studies; the maintenance of a subset of the records in a data archive for the purposes of rapid search and evaluation (important to those parts of industry falling under the FDA’s 21 CFR Part 11 regulation requiring the archiving of all electronic data); and the promotion of intercompatibility between public repositories and software tools (i.e. encouraging the development of resources built on the assumption that specific [meta]data will be present in all data sets certified as MIAPE compliant).
However, there is a trade-off between the depth to which an experiment could be described and the time the average experimentalist can be expected to devote to generating an appropriate description; this is the motivation for the second principle. The prevalence of computer technology in the lab offers some scope for automation; instrument manufacturers and LIMS/analysis software vendors will accommodate such guidelines where there is demand, simplifying the reporting process through the provision of MIAPE-compliant export facilities. This is also true of PSI-generated formats and the associated vocabularies, as evidenced by the uptake of the PSI-developed mzData format by many mass spectrometer manufacturers. (N.B. mzData will soon be superseded by mzML, described below.)
The existing MIAPE modules (four published, three undergoing the PSI document approval process prior to submission for publication) are described below. Each module relates to a particular technology or group thereof:
- Column chromatography
[In process] The use of columns, of all scales and flow rates
- Capillary electrophoresis
[In process] The performance of any of the wide range of capillary electrophoresis protocols (e.g. MEKC, CZE, CGE, IEF, CEC, etc.)
- Mass spectrometry
[Published]7 The use of a mass spectrometer; the generation of peak lists from raw data; quantitation based on the use of an isotopic or chemical label (applying that label counts as sample handling, however, and is therefore captured elsewhere)
- Mass spectrometry informatics
[Published]8 The use of software to analyse MS data (mass spectra, ion chromatograms). This includes search engines that assign peptides, proteins or biological class membership; the matching of assigned peptides, proteins or de novo sequence against a named database; the use of quality control measures
- Gel electrophoresis
[Published]9 The use of gel-based electrophoretic separation techniques, whether single- or multi-dimensional, native or denaturing; assorted visualisation techniques including ‘electroblotting’; image acquisition
- Gel image informatics
[In process] The processing, analysis and interrelation of gel images to characterise/identify spots or measure (relative) intensities/volumes; image warping and ‘registration’
- Molecular interaction experiments
[Published]10 The use of any of a range of techniques to determine a set of interacting molecules, within the context of a particular experiment. This includes such techniques as yeast two-hybrid and tandem affinity purification assays. N.B. This checklist is published separately to the main suite of MIAPE modules, under the title MIMIx (Minimum Information about a Molecular Interaction eXperiment).
Legacy data sets are problematic. Poor annotation does not mean that a data set is without worth, though quality is more difficult to assess. The MIAPE recommendation for such data sets is that they should be re-annotated to the fullest possible extent, but that data and metadata should never be ‘created’ to supplement real data. Where required data are irretrievably lost, or were never available in a practical sense, this should be clearly indicated.
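The re-annotation rule above can be illustrated with a minimal sketch. It assumes a metadata record held as a simple dictionary; the field names, the required-field list and the ‘not available’ marker are all hypothetical and are not taken from any MIAPE module. The point is only that absent values are flagged explicitly rather than filled in with guesses.

```python
# Hypothetical marker and field names, invented for illustration only;
# no MIAPE module prescribes these exact strings.
NOT_AVAILABLE = "not available"


def annotate_legacy(record, required):
    """Return a copy of the record with absent required fields flagged.

    Existing values are never altered and no value is invented: a field
    that is missing or empty is marked explicitly as unavailable.
    """
    annotated = dict(record)
    for name in required:
        if annotated.get(name) in (None, ""):
            annotated[name] = NOT_AVAILABLE
    return annotated


legacy = {"gel_type": "2D", "stain": ""}
print(annotate_legacy(legacy, ["gel_type", "stain", "scanner"]))
```

Because the function works on a copy, the original legacy record is left untouched, which keeps the distinction between what was actually recorded and what was added during re-annotation.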
If a particular technique is not covered, reports should attempt to match existing modules in terms of depth of coverage; ideally the provision or expansion of a MIAPE module for that technique should then be raised on a discussion list or at a PSI meeting.
PSI data formats and supporting vocabularies
The various PSI data formats (for which there are always accompanying vocabularies) roughly parallel the divisions between the MIAPE modules described above. In both cases these divisions represent natural breaks in a workflow (for example, mass spectra can be reanalysed without having to be regenerated by an instrument, therefore there is a natural division). Note that while the MIAPE documents do not specifically require that PSI data formats and vocabulary terms are used, it is the case that PSI data formats are designed to support those guidelines. Also note that all the products mentioned below are available from the PSI website (http://www.psidev.info).
The Molecular Interaction Format (MIF), the counterpart to the MIMIx guidelines described above, is designed to support the exchange of data about molecular interactions (e.g., protein-protein, protein-small molecule). Amongst other roles, MIF supports the exchange of data between databases of known molecular interactions under the IMEx agreement (http://imex.sourceforge.net) fostered by PSI, and more generally, is used for data import into analysis or visualisation tools such as Cytoscape (http://www.cytoscape.org).
As with all the formats described in this article, MIF works in tandem with a dedicated controlled vocabulary to perform its function. Separating out detailed descriptors from a format ensures that it will remain stable for longer periods of time (required changes are usually minor and can therefore generally be handled by updating the vocabulary alone); this encourages developers to invest more time in producing MIF-friendly tools.
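The benefit of separating descriptors from the format can be sketched as follows. This is an illustration under stated assumptions, not the MIF schema: the record layout, the ‘EX:’ accessions and the term names are all invented. The data records refer to vocabulary terms only by accession, so supporting a new method means adding a term to the vocabulary while the record structure stays the same.

```python
# Hypothetical controlled vocabulary; accessions and names are invented
# and do not correspond to real PSI vocabulary terms.
vocabulary = {
    "EX:0001": "two hybrid",
    "EX:0002": "affinity purification",
}


def unknown_terms(entries, cv):
    """Return the accessions used by the entries that the vocabulary lacks."""
    used = {entry["method"] for entry in entries}
    return sorted(used - set(cv))


# The 'format' here is just a fixed record layout: interactors plus a
# method given as a vocabulary accession.
entries = [
    {"interactors": ("A", "B"), "method": "EX:0001"},
    {"interactors": ("A", "C"), "method": "EX:0003"},
]

before_update = unknown_terms(entries, vocabulary)
vocabulary["EX:0003"] = "a newly described method"  # vocabulary-only change
after_update = unknown_terms(entries, vocabulary)
print(before_update, after_update)
```

In this sketch the second record becomes valid after the vocabulary update alone; nothing that reads or writes the record layout has to change.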
The prevalence of mass spectrometry in proteomics, coupled with the pitfalls inherent in such a complex analytical process (especially when one includes peptide and protein identity assignment), ensured that the PSI and others11 made the provision of standards for this technique a priority. A working group featuring committed representatives of all the main instrument vendors in addition to academics moved quickly to produce draft reporting guidelines and a generic format for handling mass spectrometry data and metadata (with accompanying controlled vocabulary). That format, mzData, was received favourably by the community and implemented in many different software tools and databases3,12 (see also http://www.psidev.info/index.php?q=node/95). Recently that same group drafted a revised and updated format, mzML, based on a merger of mzData and a second, equivalent generic format named mzXML13. mzML will work in tandem with AnalysisXML (working title), which holds descriptors and data from the process of analysing mass spectra.
The separation of proteins or peptides using gel electrophoresis is a complicated and labor-intensive process that generates a large quantity of metadata and data. To meet the needs of those performing gel electrophoresis (of whatever dimensionality), two formats, with accompanying vocabulary terms, are under development by a group again comprising both researchers and technology vendors.
GelML is a data exchange format for describing gel electrophoresis experiments. It addresses the preparation and running of a gel, and the generation of an image by scanning, autoradiography or other methods. In late 2007, GelML completed the PSI document process and was released as a stable Version 1.0. To date, two tools have been developed to assist in building a GelML file: an Excel spreadsheet and a form-based Java application14. The PSI-Gel group is also working on a format to hold descriptions of gel image manipulation and analysis, including the interrelation of gels and the quantitation of gel features.
Other separation techniques
Prefractionation of complex samples by making use of the physicochemical properties of their components, such as net charge, size or mass, is crucial for controlling sample complexity and revealing lower-abundance proteins. Separation or depletion by column chromatography is an important technique in proteomics. There is also a host of ‘minor’ methods (liquid phase isoelectric focusing of various kinds, capillary electrophoresis, etc.) whose use is not frequent enough to justify a significant investment of effort by PSI (the focus on core techniques stems from the limited number of person-hours available to this volunteer organisation).
To ensure that all these techniques can be adequately described (either to meet a MIAPE requirement, or for some other purpose) PSI has generated SepML, which is a general format capable of holding a fairly well-structured description of a variety of separations.
Integration with other domains’ (candidate) standards
Searching the internet with the phrase ‘biological data standards’ returned well over a million hits at the time of writing (using the most popular search engine). Numerous standards-generating projects exist within the biosciences, each serving its own domain. Minimum information guidelines, data formats and controlled vocabularies for these other domains are unlikely to dovetail with those generated by PSI by chance alone. Coordinating mechanisms are required if there is to be any hope of straightforwardly integrating data sets generated in multi-domain studies (or from different studies, perhaps via public databases). Happily, there are groups addressing all three areas that can offer coordination.
The MIBBI project (http://www.mibbi.org)15 promotes ongoing minimum information guidelines projects (to the wider community, and crucially, to each other) through its ‘Portal’ page. Moreover, MIBBI will soon produce a suite of modular guidelines covering a wide range of domains through its ‘Foundry’ activity. These complementary modules will be based on pre-existing community-sponsored recommendations, but will be designed so as to work well together, however they are combined.
For data formats, there are two ways to tackle incompatibility. The first is to develop new formats using a common design template (non-trivial); the second is to provide a common format for generic workflow features (such as project aims, sample source, description of a person, etc.) that can also ‘wrap up’ other formats (i.e., holding a description of the file, and either specifying its location or physically including it within the common-format file). The FuGE model16 can operate in both these modes: as a ‘wrapper’ for other formats, and as the foundational model for any number of domain-specific formats. For example, the GelML format, described above, is built on the FuGE model.
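The ‘wrapper’ mode can be sketched as a small data structure. This is a hypothetical illustration of the idea only; the class and attribute names are invented and do not reproduce the FuGE object model. A wrapped item carries a description of the file and then either points to its location or embeds its content, but not both.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WrappedData:
    """Hypothetical wrapper entry; not a class from the FuGE model."""

    description: str
    data_format: str
    location: Optional[str] = None    # pointer to an external file
    embedded: Optional[bytes] = None  # or the file content itself

    def __post_init__(self):
        # Exactly one of the two ways of supplying the data must be used.
        if (self.location is None) == (self.embedded is None):
            raise ValueError("give exactly one of 'location' or 'embedded'")


by_reference = WrappedData(
    description="spectra from run 1",
    data_format="mzML",
    location="data/run1.mzML",  # illustrative path only
)
by_inclusion = WrappedData(
    description="gel description",
    data_format="GelML",
    embedded=b"<placeholder/>",  # stand-in bytes, not real GelML content
)
print(by_reference.location, len(by_inclusion.embedded))
```

Whichever mode is used, the generic parts of the description (here just a free-text description and a format name) look the same to any tool reading the common-format file.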
ISA-Tab is a cross-domain tab-delimited format that holds high-level workflow descriptions and pointers to data files (all would normally travel together in an overarching archive file). The motivation for using a tab-delimited format is that familiar spreadsheet software can be used for viewing and editing, removing the dependence on the provision of new custom software. ISA-Tab and the ISAcreator tool (http://isatab.sourceforge.net/isacreator.html) will primarily be used to feed the BioInvestigation Index (http://www.ebi.ac.uk/bioinvindex), which in turn will ‘vertically integrate’ a number of domain-specific databases.
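A practical consequence of choosing a tab-delimited layout is that such files can be read with general-purpose tools as well as spreadsheets. The sketch below uses Python's standard csv module on a small in-memory table; the column names and file names are invented for illustration and are not taken from the ISA-Tab specification.

```python
import csv
import io

# Hypothetical tab-delimited content; the columns are illustrative and
# do not follow the real ISA-Tab column definitions.
TABLE = (
    "sample\ttechnique\tdata_file\n"
    "liver_01\tmass spectrometry\tliver_01.mzML\n"
    "liver_01\tgel electrophoresis\tliver_01_gel.xml\n"
)


def files_by_technique(text):
    """Group the data-file pointers by the technique that produced them."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        grouped.setdefault(row["technique"], []).append(row["data_file"])
    return grouped


print(files_by_technique(TABLE))
```

The table itself holds only the high-level description and the pointers; the data files it names would travel alongside it in the archive.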
Controlled vocabularies can be rather informal artefacts, often structured around the file format they support. When they are restructured as ontologies (which cluster concepts according to type) it becomes possible to harmonise and even integrate term sets being used by different bioscience communities. The OBO Foundry17 is a registry for ontologies much like the MIBBI Portal (q.v.), but it does not seek to produce new or refactored ontologies, preferring to let member projects decide how overlaps and inconsistencies should be resolved on an ad hoc basis. However, the Ontology of Biomedical Investigations (OBI)18, itself a core OBO Foundry ontology, takes an approach more similar to that of the MIBBI Foundry, in that it maintains a single ontology that can be used by many kinds of biological and biomedical scientist, offering the chance to interrelate data sets based solely on the semantics of the terms used.
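How clustering concepts by type allows data sets to be interrelated can be shown with a toy example. The term names and the ‘is a’ links below are invented and are not taken from OBI or any OBO Foundry ontology; the sketch only shows that two data sets annotated with different terms can be related through the nearest type they share.

```python
# Hypothetical 'is a' hierarchy (child -> parent); not real ontology content.
IS_A = {
    "gel separation": "protein separation",
    "column separation": "protein separation",
    "protein separation": "laboratory procedure",
    "spectrum acquisition": "laboratory procedure",
}


def lineage(term, is_a=IS_A):
    """Return the term followed by its ancestors, nearest first."""
    chain = [term]
    while term in is_a:
        term = is_a[term]
        chain.append(term)
    return chain


def nearest_shared_type(term_a, term_b, is_a=IS_A):
    """Return the most specific type common to both terms, or None."""
    ancestors_of_a = lineage(term_a, is_a)
    for candidate in lineage(term_b, is_a):
        if candidate in ancestors_of_a:
            return candidate
    return None


print(nearest_shared_type("gel separation", "column separation"))
print(nearest_shared_type("gel separation", "spectrum acquisition"))
```

A flat controlled vocabulary has no such links, so the same two annotations would simply be different strings with nothing to relate them.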
The HUPO Proteomics Standards Initiative continues to develop and release reporting standards for proteomics. The ongoing voluntary effort by this constructively diverse group has found success in providing useful and appropriate standards to the community it serves, several of which have been the subject of significant publications. Furthermore, the PSI participates in (and in the case of MIBBI, initiated15) a series of cross-domain projects enabling the creation of reports on multi-domain investigations; for example, combining habitat data, microarray transcriptomics data, protein expression data and measurements of metabolic fluxes to create a real biological overview of a system.
References
- Quackenbush, J. Standardizing the standards. Mol Syst Biol 2, 2006.0010 (2006).
- Anon. Nat Methods 3(6), 415 (2006).
- Hermjakob, H., Apweiler, R. The Proteomics Identifications Database (PRIDE) and the ProteomExchange Consortium: making proteomics data accessible. Expert Rev Proteomics 3(1), 1-3 (2006).
- Hermjakob, H. The HUPO Proteomics Standards Initiative – Overcoming the Fragmentation of Proteomics Data. Practical Proteomics 2-1, 34-38 (2006).
- Hanash, S., Celis, J.E. The Human Proteome Organization: a mission to advance proteome knowledge. Mol Cell Proteomics 1(6), 413-414 (2002).
- Taylor, C.F., Paton, N.W., Lilley, K.S., Binz, P.-A., Julian, Jr., R.K., Jones, A.R., Zhu, W., Apweiler, R., Aebersold, R., Deutsch, E.W., Macht, M., Mann, M., Neubert, T.A., Patterson, S.D., Seymour, S.L., Tsugita, A., Xenarios, I., Hermjakob, H. The Minimum Information About a Proteomics Experiment (MIAPE). Nature Biotechnol 25, 887-893 (2007).
- Chris F Taylor, Pierre-Alain Binz, Ruedi Aebersold, Michel Affolter, Robert Barkovich, Eric W Deutsch, David M Horn, Andreas Hühmer, Martin Kussmann, Kathryn Lilley, Marcus Macht, Matthias Mann, Dieter Müller, Thomas A Neubert, Janice Nickson, Scott D Patterson, Roberto Raso, Kathryn Resing, Sean L Seymour, Akira Tsugita, Ioannis Xenarios, Rong Zeng & Randall K Julian, Jr. Guidelines for reporting the use of mass spectrometry in proteomics. Nature Biotechnol 26, 860-861 (2008).
- Pierre-Alain Binz, Robert Barkovich, Ronald C Beavis, David Creasy, David M Horn, Randall K Julian, Jr, Sean L Seymour, Chris F Taylor & Yves Vandenbrouck. Guidelines for reporting the use of mass spectrometry informatics in proteomics. Nature Biotechnol 26, 862 (2008).
- Frank Gibson, Leigh Anderson, Gyorgy Babnigg, Mark Baker, Matthias Berth, Pierre-Alain Binz, Andy Borthwick, Phil Cash, Billy W Day, David B Friedman, Donita Garland, Howard B Gutstein, Christine Hoogland, Neil A Jones, Alamgir Khan, Joachim Klose, Angus I Lamond, Peter F Lemkin, Kathryn S Lilley, Jonathan Minden, Nicholas J Morris, Norman W Paton, Michael R Pisano, John E Prime, Thierry Rabilloud, David A Stead, Chris F Taylor, Hans Voshol, Anil Wipat & Andrew R Jones. Guidelines for reporting the use of gel electrophoresis in proteomics. Nature Biotechnol 26, 863-864 (2008).
- Orchard S, Salwinski L, Kerrien S, Montecchi-Palazzi L, Oesterheld M, Stümpflen V, Ceol A, Chatr-aryamontri A, Armstrong J, Woollard P, Salama JJ, Moore S, Wojcik J, Bader GD, Vidal M, Cusick ME, Gerstein M, Gavin AC, Superti-Furga G, Greenblatt J, Bader J, Uetz P, Tyers M, Legrain P, Fields S, Mulder N, Gilson M, Niepmann M, Burgoon L, De Las Rivas J, Prieto C, Perreau VM, Hogue C, Mewes HW, Apweiler R, Xenarios I, Eisenberg D, Cesareni G, Hermjakob H. The minimum information required for reporting a molecular interaction experiment (MIMIx). Nature Biotechnol 25(8), 894-898 (2007).
- Pedrioli PG, Eng JK, Hubley R, Vogelzang M, Deutsch EW, Raught B, Pratt B, Nilsson E, Angeletti RH, Apweiler R, Cheung K, Costello CE, Hermjakob H, Huang S, Julian RK, Kapp E, McComb ME, Oliver SG, Omenn G, Paton NW, Simpson R, Smith R, Taylor CF, Zhu W, Aebersold R. A common open representation of mass spectrometry data and its application to proteomics research. Nature Biotechnol 22(11), 1459-66 (2004).
- Jones, A.R., Gibson, F. An Update on Data Standards for Gel Electrophoresis. Proteomics 7(S1), 35-40 (2007).
- Taylor CF, Field D, Sansone SA, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, Brazma A, Brinkman RR, Michael Clark A, Deutsch EW, Fiehn O, Fostel J, Ghazal P, Gibson F, Gray T, Grimes G, Hancock JM, Hardy NW, Hermjakob H, Julian RK Jr, Kane M, Kettner C, Kinsinger C, Kolker E, Kuiper M, Le Novère N, Leebens-Mack J, Lewis SE, Lord P, Mallon AM, Marthandan N, Masuya H, McNally R, Mehrle A, Morrison N, Orchard S, Quackenbush J, Reecy JM, Robertson DG, Rocca-Serra P, Rodriguez H, Rosenfelder H, Santoyo-Lopez J, Scheuermann RH, Schober D, Smith B, Snape J, Stoeckert CJ Jr, Tipton K, Sterk P, Untergasser A, Vandesompele J, Wiemann S. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nature Biotechnol 26(8), 889-896 (2008).
- Jones AR, Pizarro A, Spellman P, Miller M, FuGE Working Group. FuGE: Functional Genomics Experiment Object Model. OMICS 10(2), 179-184 (2006).
- Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ, OBI Consortium, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone SA, Scheuermann RH, Shah N, Whetzel PL, Lewis S. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnol 25(11), 1251-1255 (2007).