Third in a series on data management.
Anyone who has traveled internationally for business knows that standards are important. Before traveling to a new country, one needs to figure out the shape of the electrical outlet plugs in order to be able to charge laptops and cell phones. Brazil, I recently learned, isn’t even standardized within the country. Different outlets may have different sized pins and even different voltages.
Data standards are critical for nearly everything one might want to do with data: sharing, exchange, integration, publishing, and reproducible research. Notably absent from that list, unfortunately, is the most common operation: analysis. Data standards are not generally necessary for the initial primary analysis of a dataset, and so data producers are not generally motivated to go through the trouble of identifying, learning, and implementing data standards. But that’s a topic for another blog post…
Of course, one can have too much of a good thing. The problem is captured brilliantly in the cartoon below, one of my all-time XKCD favorites.
As Andrew Tanenbaum said, according to wikiquote.org: “The nice thing about standards is that you have so many to choose from.”
A few years ago I had reason to explore the data standards landscape in the context of omics data, which led to a paper titled “A sea of standards for omics data: sink or swim?” Our main conclusions were:
- Even the experts don’t always agree on what constitutes a standard.
- There are too many standards out there, and it is too complex and dynamic a landscape for a mere mortal to navigate.
- There is sometimes (often) no single “right” standard to use; the best standard for a given situation depends on the use case.
- Resources are needed to assist would-be standards adopters to navigate this confusing and ever-changing landscape.
As part of our exploration, we set out to identify candidates for standards to be used in storing and exchanging a genomic dataset, the type of data with which I was most familiar at the time. We used biosharing.org’s classification scheme for what constitutes a data standard:
- Content standard — minimum information checklist
- Semantic standard — ontologies and terminologies
- Syntax standard — format for data exchange.
We were able to identify 15 different standards that seemed potentially applicable. (15!!) Three of those standards turned out to have been deprecated, but that was not readily apparent to the non-expert.
Discovery metabolomics data standards: the wild west
Recently I have become involved in a project involving metabolomics data. One thought has gone through my head multiple times: “…and we thought genomics data management was hard.” I can identify far fewer standards in the metabolomics space. That’s a good thing, I suppose. At the most basic level, I sought a “minimum information” checklist. CIMR (Core Information for Metabolomics Reporting) seems to be the agreed-upon content standard. But try as I might, I couldn’t seem to find documentation in the form of a table or bulleted list enumerating components such as raw data, biological metadata, experimental metadata, and processed results. The CIMR web page points to 7 different PDF files, each of which describes in great detail the components needed for different categories of metabolomics experiments: mammalian, microbial, plants, environment, etc. It would seem more ontologically appropriate to specify a minimum set of data elements across ANY biological experiment in those areas, and not have them tangled up in metabolomics specifications. Unfortunately, data standards are developed by the people who show up, and often those people are only able to show up for the duration of a funded project. The best-laid plans and the best-managed projects can only get so far in this ever-changing technology landscape, with scraps of percent effort and competing priorities.
Regarding semantic standards, there are again many to choose from in metabolomics. Unique identifiers include: InChI ID, InChI Key, PubChem compound ID, KEGG ID, SMILES, ChemSpider ID, LipidMaps ID, and ChEBI ID, among others. InChI Key is somewhat of a gold standard, BUT here’s where things can get tricky. Mass spectrometry, a primary technology used for metabolomics, cannot differentiate between certain molecules. For example, if you have two molecules that are largely identical except for the location of the double bonds in a long carbon chain, those molecules have unique InChI IDs, but mass spectrometry cannot differentiate between them. One might compare the challenge to a DNA microarray with a probe that is not unique to a single gene but rather is complementary to multiple different genes. Even if you have decided to use the GenBank accession number as an identifier, you cannot specify one or the other gene using that assay. The ChEBI ontology has some degree of hierarchy and subsumption, which in theory could help address this problem, but it is nowhere near complete.
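One way to live with this ambiguity is to record every candidate identifier for a measured feature rather than forcing a single one. The Python sketch below is only an illustration of that idea, not an established schema: the `Annotation` class, the feature names, and the identifier strings are all hypothetical placeholders (they are not real InChI Keys).

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Annotation:
    """One measured feature and every identifier it could correspond to."""

    feature_id: str
    candidates: FrozenSet[str]

    @property
    def is_ambiguous(self) -> bool:
        # More than one candidate means the assay could not tell them apart.
        return len(self.candidates) > 1


# Placeholder strings standing in for identifiers; NOT real InChI Keys.
peak = Annotation("feature_001", frozenset({"ISOMER_A_KEY", "ISOMER_B_KEY"}))
standard = Annotation("feature_002", frozenset({"COMPOUND_C_KEY"}))

for annotation in (peak, standard):
    status = "ambiguous" if annotation.is_ambiguous else "resolved"
    print(annotation.feature_id, status, sorted(annotation.candidates))
```

Keeping the full candidate set means a downstream user can see that the double-bond isomers were never resolved, instead of trusting an identifier that claims more than the assay can deliver.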
Finally, exchange format has been a challenge. Labs tend to have their own respective ways of generating and sharing results data. In theory, exchange formats such as ISA-Tab and mzML are suitable for metabolomics data, but in practice they are rarely used. Datasets are often saved as Excel spreadsheets, often with multiple worksheets within those spreadsheets. They are human-readable, but not easily machine-readable. Some use the fill color of cells to impart semantic meaning (e.g., “Valid”, “Internal standard out range”). One file I came across was 4 megabytes in size and took several minutes to open, when it opened at all. (Often it crashed Excel or reported that the file was corrupt.) Within a single worksheet, in some cases rows represented samples while columns represented molecular species, and in some cases that was reversed. In one instance, the second row contained column names and actual values started at row 17. Rows 4-14 held QC data starting only after the 11th column. The first column of the first several rows was used for metadata regarding color coding, and which samples were not included due to insufficient volume for analysis.
Not only are the file contents at issue, but their formats as well. The ideal would be both human and machine readable, though in practice, some degree of compromise is required for both.
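To make the machine-readability problem concrete, here is a minimal Python sketch of what it takes to pull records out of a worksheet laid out like the ones described above. It assumes the sheet has already been exported to a list of rows of text cells; the function name `tidy_worksheet`, its parameters, and the toy data are all made up for illustration, and the row positions must be supplied by a human who has looked at the file.

```python
from typing import Dict, List, Sequence


def tidy_worksheet(
    rows: Sequence[Sequence[str]],
    header_row: int,
    first_data_row: int,
    samples_in_columns: bool = False,
) -> List[Dict[str, str]]:
    """Turn a loosely laid-out worksheet into one record per sample.

    Row numbers are 1-based, as a spreadsheet user would count them.
    If samples_in_columns is True the grid is transposed first, and the
    row numbers then refer to the transposed grid.
    """
    grid = [list(r) for r in rows]
    if samples_in_columns:
        # Pad ragged rows so that transposing does not drop cells.
        width = max(len(r) for r in grid)
        grid = [r + [""] * (width - len(r)) for r in grid]
        grid = [list(col) for col in zip(*grid)]
    header = grid[header_row - 1]
    records = []
    for row in grid[first_data_row - 1:]:
        if not any(cell.strip() for cell in row):
            continue  # skip blank spacer rows
        # Ignore columns that have no header name.
        records.append({name: value for name, value in zip(header, row) if name})
    return records


# Toy sheet (invented values): legend in row 1, header in row 2,
# a QC row in row 3, and sample values starting at row 4.
sheet = [
    ["Legend: fill color marks validity", "", ""],
    ["sample_id", "glucose", "lactate"],
    ["QC", "", ""],
    ["S1", "5.1", "1.2"],
    ["S2", "4.8", "1.5"],
]
records = tidy_worksheet(sheet, header_row=2, first_data_row=4)

# The same data with samples in columns instead of rows.
flipped = [
    ["sample_id", "S1", "S2"],
    ["glucose", "5.1", "4.8"],
    ["lactate", "1.2", "1.5"],
]
flipped_records = tidy_worksheet(
    flipped, header_row=1, first_data_row=2, samples_in_columns=True
)
```

Note what the sketch cannot do: the header position, the data start row, and the orientation all had to be passed in by hand, and anything encoded only in cell fill color is lost entirely. That per-file detective work is exactly the cost a shared exchange format is meant to remove.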
The Road Ahead
There is reason for hope. The Metabolomics Standards Initiative began in 2005, with 5 working groups for the areas of biological context, chemical analysis, data processing, ontology, and data exchange. They developed the CIMR specifications described above as well as a tiered system of metabolite identification:
- Identified metabolites
- Putatively annotated compounds
- Putatively characterized compound classes
- Unknown compounds
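In a dataset, the tier could travel with each reported metabolite as an explicit field. The Python sketch below is one hedged way to encode it; the numbering 1 through 4 simply follows the order of the list above, and the `IdentificationLevel` class, the `meets` helper, and the example results are hypothetical.

```python
from enum import IntEnum


class IdentificationLevel(IntEnum):
    """Tiers of metabolite identification, numbered in list order (assumed)."""

    IDENTIFIED = 1
    PUTATIVELY_ANNOTATED = 2
    PUTATIVELY_CHARACTERIZED_CLASS = 3
    UNKNOWN = 4


def meets(level: IdentificationLevel, threshold: IdentificationLevel) -> bool:
    """Return True if `level` is at least as confident as `threshold`.

    Lower numbers mean higher confidence in this encoding.
    """
    return level <= threshold


# Invented example results: (feature name, identification tier).
results = [
    ("feature_001", IdentificationLevel.IDENTIFIED),
    ("feature_002", IdentificationLevel.PUTATIVELY_CHARACTERIZED_CLASS),
    ("feature_003", IdentificationLevel.UNKNOWN),
    ("feature_004", IdentificationLevel.PUTATIVELY_ANNOTATED),
]

# Keep only features annotated at the second tier or better.
confident = [
    name
    for name, level in results
    if meets(level, IdentificationLevel.PUTATIVELY_ANNOTATED)
]
print(confident)
```

Storing the tier as data, rather than as a cell color or a note in a legend, lets anyone reusing the file filter on confidence without guessing.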
In 2012 an initiative called COSMOS (COordination Of Standards In MetabOlomicS) was funded in the EU. This group is coordinated by EMBL-EBI and comprises several European metabolomics data providers. But funding for the effort is limited, and standards are an ongoing chore. Until the appropriate incentives are in place for both standards developers and would-be standards users, this aspect of data management faces an uphill battle.
— Jessie Tenenbaum, PhD
Heading images by (WT-shared) Shoestring at wts wikivoyage [CC BY-SA 4.0-3.0-2.5-2.0-1.0 (http://creativecommons.org/licenses/by-sa/4.0-3.0-2.5-2.0-1.0)], via Wikimedia Commons