Wednesday 10th February p.m. - Room 516

Semantic Data Exchange

  • Gos Micklem
  • Richard Smith
  • James Taylor
  • Arek Kasprzyk
  • Soichi Ogishima
  • Brad Chapman
  • Christian Zmasek
  • Peter Cock
  • Hideyuki Morita
  • Ryosuke Ishiwata
  • Kei Ono
  • Shinobu Okamoto
  • Alberto Labarga

Discussion on possibilities/need for improving data exchange between e.g. InterMine, Galaxy, BioMart, Cytoscape…

Would typing of arbitrary data exchange improve communication between cytoscape/ biomart/ intermine/galaxy and reduce need for glue code/ make life easier/ more robust?

Current situation:

It was felt that the current situation wasn't so bad: however it would be nice to avoid bespoke plugins for communication.

Galaxy ( it is the responsibility of the user to make sure that the type of data used is correct for the analysis being performed.

Galaxy/BioMart ( communication: neither biomart or Galaxy deal with semantics and communication works fine.

Galaxy/UCSC table browser ( <details needed>

Cytoscape ( & BioMart: REST API used. Perhaps SOAP would be better but this still requires some development. REST was fine for data retrieval.

BioMart: the Mart deployer decides on the meta-data layer in biomart - the GUI uses this layer rather than the data. It assumes that the user understands what you are talking about. Sometimes points to ontologies. Gui provides number of options but there is nothing to stop the user mis-naming things. cf UCSC table browser: BioMart? has place-holders for column descriptions, though current these aren't used.

Interoperation of Marts: this is the only place where must get the semantics correct. If one mart calls something a UniProt identifier and the other one does too then essential that they are refering to the same identifier. Perhaps would be good to have controlled name-space for this and/or a hand-shake to check that do have matching values.

InterMine ( multiple organisms can use the same identifiers e.g. across 12 drosophila genomes. InterMines can ask each other what data they provide but can't be sure name-spaces are compatible. Would be nice if InterMine or BioMart systems can talk and discover what they have and how they could communicate. Agreed that would be good to have more formal description so that intended InterMine talking to BioMart backend could be easier.

Discussed whether passing a header for column-based data would be useful.

Available data-describing controlled vocabularies: OICR cancer data experience is that there are rather limited naming systems.

Thought to be a good idea to expose/ export current naming systems. The Cancer Genome Atlas (TCGA: have done some thinking along these lines.

Galaxy: has xml to describe file formats: biopython/bioperl/bioruby/biojava have more-or-less agreed filenames.

Just thinking of FASTA format for sequence there are quite a number of Flavours:

  • DNA vs protein sequences
  • use of ambiguity codes or not
  • softmasked sequence vs N-masking

Likewise for files that transmit data about genome features:

  • 0-based coordinates
  • 1-based coordinates
  • in-between-base coordinates

A naming scheme would be useful to capture this kind of complexity.

Would be useful to have namespace that allows one to assert something is a particular file format. With an appropriate URI one would know what kind of file was arriving. Likewise if receive column of identifiers from two sources it would be useful to know that they are referring to the same thing.

File formats vs labelling of data itself:

BioMart/ InterMine can both provide column meaning "GO identifier", but where to assert they are the same thing?

GO_ID + URI is all needed

Is there a need for a central naming authority/ namespace provider? Given work on UniProt already, is EBI an natural location for this? DDBJ/ DBCLS/ NCBI?

In a practical sense: if we want to take notions of semantics that are already applied to RDF: what do we apply to a column of data? Want a standard identifier for saying a column is a column of GO identifiers.

If every item in a column has the same RDF type, then can attach the URI of that type to the column and that is the unique identifier that we can use to be clear about what the column contents are.

Agreed that we should use consistent URIs to label columns.

Kei: PSI-MI EBI website: global definitions:

   --> database citation
     --> feature database
       --> gene ontology (double click for definition)

Semantics needs to be regulated regardless of the technology (So RDF isn't necessarily the point here)

Agreed that would be good if allow semantics to be published/ exchanged rather than absolute requirement. Absolute requirement causes stasis and makes things harder. If make it easy then especially large-scale users/providers can start to comply.

Data exchange conclusions:

  • A namespace for file formats would be useful.
  • A namespace for column of tabular data would be useful. Could also be used to describe data in other formats e.g. XML, though this could be rather verbose.
  • Investigate whether the above exist.
  • At the moment namespaces for columns is probably more important than URIs for each data element in a column.
  • Agreed that worthwhile to pass URIs to describe columns. Agreed that arbitrary human-friendly names are also good.
  • Agreed to dump all BioMart/ InterMine column headings out, find the common/commonly-used ones and work on naming.

Discussion turned to genome builds:

There is no-where to go to find out if entities/ coordinates come from the same versions of genomes. Agreed Versioning is important.

BioMart/ UCSC do have versions available but not necessarily using the same namespaces.

biomart has place-holders for versions and could easily expose these.

Issue with resources generated from'old' genome versions e.g. affy chips: difficult to force people to use just one version of the genome.

Can make gene identifiers unique by organism-specific prefix, or by qualifier.

Ensembl ( does a good job and plans on supporting all genomes. ensembl: systematic about mapping their versions to others e.g. from UCSC.

Assembly version and ensembl gene-build version are sufficient to resolve all ambiguities.

modMine ( the modENCODE project records the genome against which data were generated. Liftover of data between genomes is provided. Currently, when export data, does not record the genome version.

Galaxy save the data orginally sent - this makes things simpler in the sense that the data actually used for some analysis are available indefinitely. Galaxy remembers what analysis was done but users may not remember why/ what doing. Galaxy encourages people to record the genome build version.

Genome version summary:

  • Investigate whether is there a standard available for describing genome version
  • Consider whether to base naming on ensembl genome/ annotation versions

Thoughts on RDF:

If everyone is expressing their data in RDF with a common underlying naming scheme then large-scale data integration will be easier, whether with a conventional warehouse or with a triple store.

Data warehouses are perhaps the wrong place to start with RDF: should be the orginal data sources. If warehouses self-define identifiers then potential problems if/when original sources start generating their own identifiers.

For modENCODE data/ OICR cancer genome data we are the data originators so could generate IDs.

modENCODE as example: would we provide data as RDF? Could there be advantages? How would the community feel about this? RDF is vehicle: requires ontologies. Felt that the first thing to try would be to represent meta-data as RDF.

Agreed that production-oriented groups have to deliver things now to the research community - but awareness of semantics is important. Balance between cleanliness/ rigour and getting job down now.

In some senses RDF vs databases is not an either/ or. RDF view: does force once to think about defining things very clearly.

Care is needed: can do RDF badly and will still work (locally) but not good for interoperation. Similar situation with ontology use in databases.