Changes between Initial Version and Version 1 of Data_exchange

2010/02/12 16:37:36 (13 years ago)



Wednesday 10th February p.m. - Room 516

Semantic Data Exchange

Gos Micklem
Richard Smith
James Taylor
Arek Kasprzyk
Soichi Ogishima
Brad Chapman
Christian Zmasek
Peter Cock
Hideyuki Morita
Ryosuke Ishiwata
Kei Ono
Shinobu Okamoto
Alberto Labarga

Discussion on possibilities/need for improving data exchange between
e.g. InterMine, Galaxy, BioMart, Cytoscape...

Would typing of arbitrary data exchange improve communication between
Cytoscape/BioMart/InterMine/Galaxy, reduce the need for glue code and
make life easier/more robust?

Current situation:

It was felt that the current situation wasn't so bad; however, it would
be nice to avoid bespoke plugins for communication.

   Galaxy: it is the responsibility of the user to make sure that the
   type of data used is correct for the analysis being performed.

   Galaxy/BioMart communication: neither BioMart nor Galaxy deals with
   semantics, and communication works fine.

   Galaxy/UCSC table browser: <details needed>

   Cytoscape: REST API used.  Perhaps SOAP would be better but
   this still requires some development.  REST was fine for data
   retrieval.

   BioMart: the Mart deployer decides on the meta-data layer in
   BioMart - the GUI uses this layer rather than the data.  It assumes
   that the user understands what you are talking about.  Sometimes
   points to ontologies.  The GUI provides a number of options but there is
   nothing to stop the user mis-naming things.  cf. UCSC table browser:
   BioMart has place-holders for column descriptions, though currently
   these aren't used.

    Interoperation of Marts: this is the only place where the
    semantics must be correct.  If one mart calls something a UniProt
    identifier and the other one does too, then it is essential that they are
    referring to the same identifier.  Perhaps it would be good to have a
    controlled name-space for this and/or a hand-shake to check that they do
    have matching values.

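A rough sketch of what such a hand-shake could look like, assuming each mart can offer a namespace label plus a small sample of values for a column.  The ColumnOffer structure, the "example:" namespace labels, the sample identifiers and the 0.5 overlap threshold are all made up for illustration; none of this is an existing BioMart feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnOffer:
    """What one mart says about a column it can join on (hypothetical structure)."""
    namespace: str         # agreed label for the identifier scheme (assumed)
    sample_ids: frozenset  # a small sample of values taken from the column


def handshake(a: ColumnOffer, b: ColumnOffer, min_overlap: float = 0.5) -> bool:
    """Return True if both sides name the same namespace AND their samples overlap.

    The overlap is measured against the smaller sample; the threshold is arbitrary.
    """
    if a.namespace != b.namespace:
        return False
    if not a.sample_ids or not b.sample_ids:
        return False
    shared = a.sample_ids & b.sample_ids
    smaller = min(len(a.sample_ids), len(b.sample_ids))
    return len(shared) / smaller >= min_overlap


# Made-up sample values, for illustration only.
mart_a = ColumnOffer("example:uniprot_accession", frozenset({"P12345", "Q67890", "O11111"}))
mart_b = ColumnOffer("example:uniprot_accession", frozenset({"P12345", "Q67890", "A22222"}))
mart_c = ColumnOffer("example:uniprot_accession", frozenset({"ABC1_HUMAN", "XYZ2_MOUSE"}))

print(handshake(mart_a, mart_b))  # same label, overlapping values
print(handshake(mart_a, mart_c))  # same label, but the values do not match
```

The second call is the case the notes worry about: both marts use the same name, but the values show they mean different identifiers.
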
   InterMine: multiple organisms can use the same identifiers,
   e.g. across the 12 Drosophila genomes.  InterMines can ask each other
   what data they provide but can't be sure the name-spaces are
   compatible.  It would be nice if InterMine or BioMart systems could talk
   and discover what they have and how they could communicate.  Agreed
   that it would be good to have a more formal description so that the intended
   InterMine talking to a BioMart backend could be easier.

Discussed whether passing a header for column-based data would be

   Available data-describing controlled vocabularies: OICR cancer data
   experience is that there are rather limited naming systems.

   Thought to be a good idea to expose/export current naming systems.
   The Cancer Genome Atlas (TCGA) have done
   some thinking along these lines.

   Galaxy: has XML to describe file formats.
   Biopython/BioPerl/BioRuby/BioJava have more-or-less agreed
   filenames.

Just thinking of FASTA format for sequence there are quite a number of

 - DNA vs protein sequences
 - use of ambiguity codes or not
 - softmasked sequence vs N-masking

Likewise for files that transmit data about genome features:
 - 0-based coordinates
 - 1-based coordinates
 - in-between-base coordinates

A naming scheme would be useful to capture this kind of complexity.

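To illustrate why a label for the coordinate convention matters, here is a minimal sketch that converts an interval between three conventions.  The scheme names ("zero_based_closed", "one_based_closed", "interbase") are invented labels, and treating the 0-based and 1-based variants as end-inclusive is an assumption; a real naming scheme would have to spell this out.

```python
# Hypothetical labels for three coordinate conventions (not an existing standard).
#   zero_based_closed: first base is 0, end position is included
#   one_based_closed:  first base is 1, end position is included
#   interbase:         positions fall between bases; 0 is before the first base
SCHEMES = ("zero_based_closed", "one_based_closed", "interbase")


def to_interbase(start, end, scheme):
    """Convert an interval in the given scheme to in-between-base coordinates."""
    if scheme == "interbase":
        return start, end
    if scheme == "zero_based_closed":
        return start, end + 1
    if scheme == "one_based_closed":
        return start - 1, end
    raise ValueError(f"unknown coordinate scheme: {scheme!r}")


def from_interbase(start, end, scheme):
    """Convert an in-between-base interval to the given scheme."""
    if scheme == "interbase":
        return start, end
    if scheme == "zero_based_closed":
        return start, end - 1
    if scheme == "one_based_closed":
        return start + 1, end
    raise ValueError(f"unknown coordinate scheme: {scheme!r}")


def convert(start, end, source, target):
    """Convert an interval from one labelled scheme to another."""
    return from_interbase(*to_interbase(start, end, source), target)


# The first ten bases of a sequence, written three ways.
print(convert(1, 10, "one_based_closed", "zero_based_closed"))
print(convert(1, 10, "one_based_closed", "interbase"))
```

Without the scheme label travelling with the file, the receiver cannot tell whether (1, 10) and (0, 9) describe the same ten bases.
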
Would be useful to have a namespace that allows one to assert something
is a particular file format.  With an appropriate URI one would know
what kind of file was arriving.  Likewise, if one receives a column of
identifiers from two sources it would be useful to know that they are
referring to the same thing.

File formats vs labelling of data itself:

BioMart/InterMine can both provide a column meaning "GO identifier",
but where to assert they are the same thing?

GO_ID + URI is all that is needed

Is there a need for a central naming authority/namespace provider?
Given work on UniProt already, is EBI a natural location for this?
DDBJ/DBCLS/NCBI?

In a practical sense: if we want to take notions of semantics that are
already applied to RDF, what do we apply to a column of data?  Want a
standard identifier for saying a column is a column of GO identifiers.

If every item in a column has the same RDF type, then can attach the
URI of that type to the column and that is the unique identifier that
we can use to be clear about what the column contents are.

Agreed that we should use consistent URIs to label columns.

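One way this could look in practice, as a sketch only: a tab-separated table whose first line carries the human-friendly names and whose second line carries one type URI per column.  The two-line header layout and the example.org URIs are assumptions made for illustration; no such convention was agreed at the meeting.

```python
import csv
import io


def read_typed_table(text):
    """Parse TSV text with a names line followed by a type-URI line (assumed layout)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    names = next(reader)
    uris = next(reader)
    if len(names) != len(uris):
        raise ValueError("every column needs exactly one type URI")
    rows = [row for row in reader if row]
    return names, uris, rows


def shared_columns(names_a, uris_a, names_b, uris_b):
    """Map each type URI present in both tables to the pair of local column names."""
    by_uri_b = dict(zip(uris_b, names_b))
    return {
        uri: (name, by_uri_b[uri])
        for name, uri in zip(names_a, uris_a)
        if uri in by_uri_b
    }


# Placeholder URIs and made-up rows; the local column names differ between sources.
GO = "http://example.org/type/go_identifier"
GENE = "http://example.org/type/gene_symbol"

source_a = f"GO ID\tGene\n{GO}\t{GENE}\nGO:0008150\tabc\n"
source_b = f"symbol\tgoTerm.identifier\n{GENE}\t{GO}\nabc\tGO:0008150\n"

names_a, uris_a, rows_a = read_typed_table(source_a)
names_b, uris_b, rows_b = read_typed_table(source_b)
print(shared_columns(names_a, uris_a, names_b, uris_b))
```

The local names ("GO ID" vs "goTerm.identifier") stay human-friendly; only the URI is used to decide that two columns hold the same kind of thing.
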
Kei: PSI-MI EBI website:  global definitions:

   molecular-interaction
   --> database citation
     --> feature database
       --> gene ontology (double click for definition)

Semantics needs to be regulated regardless of the technology (so RDF
isn't necessarily the point here).

Agreed that it would be good to allow semantics to be published/exchanged
rather than making this an absolute requirement.  An absolute requirement
causes stasis and makes things harder.  If it is made easy then
large-scale users/providers especially can start to comply.

Data exchange conclusions:
*** A namespace for file formats would be useful.

*** A namespace for columns of tabular data would be useful.  Could
    also be used to describe data in other formats e.g. XML, though
    this could be rather verbose.
*** Investigate whether the above exist.
    Are the Ontology Lookup Service
    and/or the Life Science Resource Name Project
    applicable?

*** At the moment namespaces for columns are probably more important
    than URIs for each data element in a column.

Agreed that it is worthwhile to pass URIs to describe columns.  Agreed that
arbitrary human-friendly names are also good.

*** Agreed to dump all BioMart/InterMine column headings out, find
    the common/commonly-used ones and work on naming.

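A first pass at the last action could be as simple as the sketch below: normalise each dumped heading and report those that appear in more than one source.  The headings shown are invented examples, not real BioMart or InterMine dumps, and the normalisation rule (lower-case, punctuation collapsed to spaces) is only one possible choice.

```python
import re
from collections import defaultdict


def normalise(heading):
    """Lower-case a heading and collapse runs of non-alphanumerics to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", heading.lower()).strip()


def common_headings(dumps):
    """Given {source: [headings]}, return {normalised heading: sorted sources}
    for every heading that occurs in more than one source."""
    seen = defaultdict(set)
    for source, headings in dumps.items():
        for heading in headings:
            seen[normalise(heading)].add(source)
    return {h: sorted(sources) for h, sources in seen.items() if len(sources) > 1}


# Invented example headings, for illustration only.
dumps = {
    "biomart": ["GO ID", "Gene name", "Chromosome"],
    "intermine": ["go_id", "Gene Name", "Organism"],
}
print(common_headings(dumps))
```

Matching on text alone only produces candidates; whether two headings really mean the same thing still has to be checked by hand before a shared name is assigned.
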
========== Discussion turned to genome builds:

There is nowhere to go to find out if entities/coordinates come from
the same versions of genomes.  Agreed versioning is important.

  BioMart/UCSC do have versions
  available but not necessarily using the same namespaces.

  BioMart has place-holders for versions and could easily expose these.

Issue with resources generated from 'old' genome versions e.g. Affy
chips: difficult to force people to use just one version of the

Can make gene identifiers unique by organism-specific prefix, or by

Ensembl does a good job and plans on
supporting all genomes.  Ensembl is systematic
about mapping their versions to others e.g. from UCSC.

Assembly version and Ensembl gene-build version are sufficient to
resolve all ambiguities.

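Taking that statement at face value, exported data could carry a two-part version label, sketched below.  The GenomeBuild structure and the placeholder version strings are hypothetical; the split into "coordinates comparable" and "gene identifiers comparable" is one reading of the discussion, not something that was agreed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenomeBuild:
    """Version label attached to exported data (hypothetical structure)."""
    assembly: str    # assembly version the coordinates refer to
    gene_build: str  # gene-build (annotation) version the gene identifiers refer to


def _require(build: Optional[GenomeBuild]) -> GenomeBuild:
    # Refuse to guess when the export did not record a genome version.
    if build is None:
        raise ValueError("genome build not recorded; cannot compare safely")
    return build


def coordinates_comparable(a: Optional[GenomeBuild], b: Optional[GenomeBuild]) -> bool:
    """Coordinates can be compared directly only on the same assembly."""
    return _require(a).assembly == _require(b).assembly


def gene_ids_comparable(a: Optional[GenomeBuild], b: Optional[GenomeBuild]) -> bool:
    """Gene identifiers are compared only when assembly and gene build both match."""
    return _require(a) == _require(b)


# Placeholder version strings, not real release names.
dataset_1 = GenomeBuild(assembly="assembly-A", gene_build="build-1")
dataset_2 = GenomeBuild(assembly="assembly-A", gene_build="build-2")

print(coordinates_comparable(dataset_1, dataset_2))
print(gene_ids_comparable(dataset_1, dataset_2))
```

When the two labels differ, the data would need a liftover or an identifier mapping before being combined.
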
modMine: the modENCODE project
records the genome against which data were generated.  Liftover of
data between genomes is provided.  Currently, when data are exported, the
genome version is not recorded.

Galaxy saves the data originally sent - this makes things simpler in the
sense that the data actually used for some analysis are available
indefinitely.  Galaxy remembers what analysis was done but users may not remember why/what
they were doing.  Galaxy encourages people to record the genome build

Genome version summary:
* Investigate whether there is a standard available for describing genome version *
* Consider whether to base naming on Ensembl genome/annotation versions *

=========== Thoughts on RDF:

If everyone is expressing their data in RDF with a common underlying
naming scheme then large-scale data integration will be easier,
whether with a conventional warehouse or with a triple store.

Data warehouses are perhaps the wrong place to start with RDF: it should
be the original data sources.  If warehouses self-define identifiers
then there are potential problems if/when the original sources start generating
their own identifiers.

For modENCODE data/OICR cancer genome data we are the data
originators so could generate IDs.

modENCODE as example: would we provide data as RDF?  Could there be
advantages?  How would the community feel about this?  RDF is a vehicle:
it requires ontologies.  Felt that the first thing to try would be to
represent meta-data as RDF.

Agreed that production-oriented groups have to deliver things now to
the research community - but awareness of semantics is important.
Balance between cleanliness/rigour and getting the job done now.

In some senses RDF vs databases is not an either/or.  RDF view: does
force one to think about defining things very clearly.

Care is needed: can do RDF badly and it will still work (locally) but is not
good for interoperation.  Similar situation with ontology use in