Changes between Initial Version and Version 1 of Data_exchange

Timestamp:
2010/02/12 16:37:36 (14 years ago)
Author:
gmicklem
  • Data_exchange

    v1 v1  
     1 
     2Wednesday 10th February p.m. - Room 516 
     3 
     4Semantic Data Exchange 
     5 
     6Gos Micklem 
     7Richard Smith 
     8James Taylor 
     9Arek Kasprzyk 
     10Soichi Ogishima 
     11Brad Chapman 
     12Christian Zmasek 
     13Peter Cock 
     14Hideyuki Morita 
     15Ryosuke Ishiwata 
     16Kei Ono 
     17Shinobu Okamoto 
     18Alberto Labarga 
     19 
     20------------------------------------------------------------------------------ 
     21Discussion on possibilities/need for improving data exchange between 
     22e.g. InterMine, Galaxy, BioMart, Cytoscape... 
     23 
     24Would typing of arbitrary data exchange improve communication between
     25Cytoscape/BioMart/InterMine/Galaxy and reduce the need for glue code,
     26make life easier and make things more robust?
     27 
     28 
     29Current situation: 
     30 
     31It was felt that the current situation wasn't so bad: however it would 
     32be nice to avoid bespoke plugins for communication. 
     33 
     34   Galaxy (http://main.g2.bx.psu.edu/): it is the responsibility of the user to make sure that the 
     35   type of data used is correct for the analysis being performed. 
     36 
     37   Galaxy/BioMart (http://www.biomart.org) communication: neither BioMart nor Galaxy deals with
     38   semantics, and communication works fine.
     39 
     40   Galaxy/UCSC table browser 
     41   (http://genome.ucsc.edu/cgi-bin/hgTables?command=start): <details 
     42   needed> 
     43 
     44   Cytoscape (http://www.cytoscape.org)/BioMart: REST API used.  Perhaps SOAP would be better but 
     45   this still requires some development.  REST was fine for data 
     46   retrieval. 
     47 
     48   BioMart: the Mart deployer decides on the meta-data layer in
     49   BioMart - the GUI uses this layer rather than the data.  It assumes
     50   that the user understands what you are talking about.  Sometimes
     51   it points to ontologies.  The GUI provides a number of options but there is
     52   nothing to stop the user mis-naming things.  cf UCSC table browser:
     53   BioMart has place-holders for column descriptions, though currently
     54   these aren't used.
     55 
     56    Interoperation of Marts: this is the only place where one must get the
     57     semantics correct.  If one mart calls something a UniProt
     58     identifier and the other one does too, then it is essential that they are
     59     referring to the same identifier.  Perhaps it would be good to have a
     60     controlled name-space for this and/or a hand-shake to check that they do
     61     have matching values.
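
A minimal sketch of what such a hand-shake might look like, assuming each mart can report a namespace URI for a column plus a small sample of its identifiers. The dict keys, the `handshake` function and the example.org namespace are hypothetical and not part of BioMart; the accessions are only sample strings.

```python
def handshake(local, remote):
    """Compare two column descriptions before exchanging data.

    Each argument is a dict with a 'namespace' string and a 'sample'
    collection of identifiers (both keys are assumptions of this sketch).
    Returns (compatible, overlap_fraction).
    """
    # Different namespaces: the columns cannot be assumed to match.
    if local["namespace"] != remote["namespace"]:
        return False, 0.0
    local_ids = set(local["sample"])
    remote_ids = set(remote["sample"])
    if not local_ids or not remote_ids:
        return False, 0.0
    # Same namespace: check that some sampled values really coincide.
    shared = local_ids & remote_ids
    overlap = len(shared) / min(len(local_ids), len(remote_ids))
    return bool(shared), overlap


# Hypothetical namespace URI; the identifiers are illustrative sample values.
mart_a = {"namespace": "http://example.org/ns/uniprot", "sample": ["P04637", "P38398"]}
mart_b = {"namespace": "http://example.org/ns/uniprot", "sample": ["P38398", "P01308"]}
print(handshake(mart_a, mart_b))
```

Checking sampled values as well as the namespace label guards against two marts using the same name for different identifier sets.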
     62 
     63   InterMine (http://www.intermine.org): multiple organisms can use the same identifiers
     64   e.g. across 12 Drosophila genomes.  InterMines can ask each other
     65   what data they provide but can't be sure name-spaces are
     66   compatible.  Would be nice if InterMine or BioMart systems could talk
     67   and discover what they have and how they could communicate.  Agreed
     68   that it would be good to have a more formal description so that the intended
     69   InterMine talking to a BioMart backend could be easier.
     70 
     71 
     72Discussed whether passing a header for column-based data would be 
     73useful. 
     74 
     75   Available data-describing controlled vocabularies: OICR cancer data 
     76   experience is that there are rather limited naming systems. 
     77 
     78   Thought to be a good idea to expose/ export current naming systems. 
     79   The Cancer Genome Atlas (TCGA: http://cancergenome.nih.gov/) have done 
     80   some thinking along these lines. 
     81 
     82   Galaxy: has XML to describe file formats:
     83   Biopython/BioPerl/BioRuby/BioJava have more-or-less agreed
     84   filenames.
     85 
     86Just thinking of the FASTA format for sequences, there are quite a number of
     87flavours:
     88 - DNA vs protein sequences 
     89 - use of ambiguity codes or not 
     90 - softmasked sequence vs N-masking 
     91 
     92Likewise for files that transmit data about genome features: 
     93 - 0-based coordinates 
     94 - 1-based coordinates 
     95 - in-between-base coordinates 
     96 
     97A naming scheme would be useful to capture this kind of complexity. 
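
As an illustration of why the coordinate convention needs a name, the sketch below tags an interval with its convention and converts between two of the conventions listed above. The labels "0-based-half-open" and "1-based-inclusive" and the `Interval` class are assumptions made for this example, not an agreed naming scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A feature interval tagged with its coordinate convention."""
    start: int
    end: int
    system: str  # hypothetical labels: "0-based-half-open" or "1-based-inclusive"


def to_one_based(iv: Interval) -> Interval:
    """Convert to 1-based, inclusive coordinates."""
    if iv.system == "1-based-inclusive":
        return iv
    if iv.system == "0-based-half-open":
        # Only the start shifts; the half-open end equals the inclusive end.
        return Interval(iv.start + 1, iv.end, "1-based-inclusive")
    raise ValueError(f"unknown coordinate system: {iv.system}")


def to_zero_based(iv: Interval) -> Interval:
    """Convert to 0-based, half-open coordinates."""
    if iv.system == "0-based-half-open":
        return iv
    if iv.system == "1-based-inclusive":
        return Interval(iv.start - 1, iv.end, "0-based-half-open")
    raise ValueError(f"unknown coordinate system: {iv.system}")


# The first ten bases of a sequence in each convention.
first_ten = Interval(0, 10, "0-based-half-open")
print(to_one_based(first_ten))
```

Without the `system` label the two intervals (0, 10) and (1, 10) look like different features, which is the ambiguity a naming scheme would remove.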
     98 
     99It would be useful to have a namespace that allows one to assert that something
     100is a particular file format.  With an appropriate URI one would know
     101what kind of file was arriving.  Likewise, if one receives a column of
     102identifiers from two sources it would be useful to know that they are
     103referring to the same thing.
     104 
     105File formats vs labelling of data itself: 
     106 
     107BioMart/ InterMine can both provide a column meaning "GO identifier",
     108but where to assert they are the same thing?
     109
     110GO_ID + URI is all that is needed
     111 
     112Is there a need for a central naming authority/ namespace provider? 
     113Given work on UniProt already, is EBI a natural location for this?
     114DDBJ/ DBCLS/ NCBI? 
     115 
     116In a practical sense: if we want to take notions of semantics that are 
     117already applied to RDF: what do we apply to a column of data?  Want a 
     118standard identifier for saying a column is a column of GO identifiers. 
     119 
     120If every item in a column has the same RDF type, then one can attach the
     121URI of that type to the column and that is the unique identifier that
     122we can use to be clear about what the column contents are.
     123 
     124Agreed that we should use consistent URIs to label columns. 
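
One way this could look in practice is a tab-separated table with two header rows: human-friendly names first, then one URI per column. This layout, the example.org URIs and the function names are assumptions for illustration only; no such convention was agreed in the discussion.

```python
import csv
import io

# Hypothetical column-type URIs; a real scheme would come from an agreed namespace.
GO_ID_URI = "http://example.org/column-types/go-identifier"
UNIPROT_URI = "http://example.org/column-types/uniprot-accession"
GENE_SYMBOL_URI = "http://example.org/column-types/gene-symbol"


def parse_typed_table(text):
    """Read a tab-separated table whose first row holds human-friendly
    names and whose second row holds one column-type URI per column."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    names, uris, data = rows[0], rows[1], rows[2:]
    if len(names) != len(uris):
        raise ValueError("each column needs exactly one URI")
    return names, uris, data


def shared_columns(uris_a, uris_b):
    """Return the URIs present in both tables, in the order of the first."""
    other = set(uris_b)
    return [uri for uri in uris_a if uri in other]


# Two sources that use different headings for the same kind of column.
table_a = f"GO term\tProtein\n{GO_ID_URI}\t{UNIPROT_URI}\nGO:0008150\tP04637\n"
table_b = f"Symbol\tGO identifier\n{GENE_SYMBOL_URI}\t{GO_ID_URI}\nTP53\tGO:0008150\n"

_, uris_a, _ = parse_typed_table(table_a)
_, uris_b, _ = parse_typed_table(table_b)
print(shared_columns(uris_a, uris_b))
```

The human-friendly headings differ between the two tables, but the shared URI is enough to tell that both carry a column of GO identifiers.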
     125 
     126Kei: PSI-MI EBI website:  global definitions: 
     127http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI 
     128 
     129   molecular-interaction 
     130   --> database citation 
     131     --> feature database 
     132       --> gene ontology (double click for definition) 
     133      http://www.ebi.ac.uk/ontology-lookup/?termId=MI%3A0448 
     134 
     135Semantics needs to be regulated regardless of the technology (So RDF 
     136isn't necessarily the point here) 
     137 
     138Agreed that it would be good to allow semantics to be published/ exchanged
     139rather than making this an absolute requirement.  An absolute requirement causes stasis
     140and makes things harder.  If it is made easy then especially
     141large-scale users/providers can start to comply.
     142 
     143 
     144Data exchange conclusions: 
     145*** A namespace for file formats would be useful. 
     146 
     147*** A namespace for columns of tabular data would be useful.  Could
     148    also be used to describe data in other formats e.g. XML, though
     149    this could be rather verbose.
     150*** Investigate whether the above exist. 
     151    Ontology Lookup Service (http://www.ebi.ac.uk/ontology-lookup)
     152    and/or Life Science Resource Name Project (http://www.lsrn.org)
     153    applicable?
     154 
     155*** At the moment namespaces for columns are probably more important
     156    than URIs for each data element in a column.
     157 
     158Agreed that it is worthwhile to pass URIs to describe columns.  Agreed that
     159arbitrary human-friendly names are also good.
     160 
     161*** Agreed to dump all BioMart/ InterMine column headings out, find 
     162    the common/commonly-used ones and work on naming. 
     163 
     164 
     165 
     166========== Discussion turned to genome builds: 
     167 
     168There is nowhere to go to find out if entities/ coordinates come from
     169the same versions of genomes.  Agreed that versioning is important.
     170 
     171  BioMart/ UCSC do have versions 
     172  available but not necessarily using the same namespaces. 
     173 
     174  BioMart has place-holders for versions and could easily expose these.
     175 
     176Issue with resources generated from 'old' genome versions e.g. Affy
     177chips: it is difficult to force people to use just one version of the
     178genome.
     179 
     180Can make gene identifiers unique by organism-specific prefix, or by 
     181qualifier. 
     182 
     183Ensembl (http://www.ensembl.org) does a good job and plans on
     184supporting all genomes.  Ensembl is systematic
     185about mapping their versions to others e.g. from UCSC.
     186 
     187Assembly version and Ensembl gene-build version are sufficient to
     188resolve all ambiguities. 
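
A sketch of how an identifier could carry its assembly and gene-build versions as qualifiers. The colon-separated layout, the `QualifiedId` type and the example values are assumptions for illustration; the discussion did not settle on a syntax, and this layout only works if none of the parts contains a colon.

```python
from typing import NamedTuple


class QualifiedId(NamedTuple):
    """A gene identifier qualified by the versions it was defined against."""
    assembly: str    # genome assembly version
    gene_build: str  # gene-build (annotation) version
    identifier: str  # the identifier itself


def format_qualified(q: QualifiedId) -> str:
    """Serialise as 'assembly:gene_build:identifier' (hypothetical layout)."""
    return f"{q.assembly}:{q.gene_build}:{q.identifier}"


def parse_qualified(text: str) -> QualifiedId:
    """Parse the hypothetical 'assembly:gene_build:identifier' layout."""
    parts = text.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected assembly:gene_build:identifier, got {text!r}")
    return QualifiedId(*parts)


def same_versions(a: QualifiedId, b: QualifiedId) -> bool:
    """True when two identifiers refer to the same assembly and gene build."""
    return a.assembly == b.assembly and a.gene_build == b.gene_build


# Illustrative values only.
gene = QualifiedId("GRCh37", "56", "ENSG00000139618")
print(format_qualified(gene))
```

Comparing the qualifiers before comparing coordinates or identifiers gives a cheap check that two data sets were generated against the same genome version.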
     189 
     190 
     191modMine (http://intermine.modencode.org): the modENCODE project
     192records the genome against which data were generated.  Liftover of
     193data between genomes is provided.  Currently, when data are exported, the
     194genome version is not recorded.
     195 
     196 
     197Galaxy saves the data originally sent - this makes things simpler in the
     198sense that the data actually used for some analysis are available
     199indefinitely.  Galaxy remembers what analysis was done but users may not remember why it was done or what
     200they were doing.  Galaxy encourages people to record the genome build
     201version.
     202 
     203Genome version summary: 
     204* Investigate whether there is a standard available for describing genome version *
     205* Consider whether to base naming on Ensembl genome/ annotation versions *
     206 
     207 
     208 
     209=========== Thoughts on RDF: 
     210 
     211If everyone is expressing their data in RDF with a common underlying 
     212naming scheme then large-scale data integration will be easier, 
     213whether with a conventional warehouse or with a triple store. 
     214 
     215Data warehouses are perhaps the wrong place to start with RDF: it should
     216be the original data sources.  If warehouses self-define identifiers
     217then there are potential problems if/when original sources start generating
     218their own identifiers.
     219 
     220For modENCODE data/ OICR cancer genome data we are the data 
     221originators so could generate IDs. 
     222 
     223modENCODE as example: would we provide data as RDF?  Could there be
     224advantages?  How would the community feel about this?  RDF is a vehicle:
     225it requires ontologies.  Felt that the first thing to try would be to
     226represent meta-data as RDF.
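
To make "meta-data as RDF" concrete, the sketch below writes a few metadata fields for one data set as N-Triples using only the standard library. The subject URI and the field values are made up; the predicates reuse Dublin Core terms purely as an example vocabulary, which is an assumption of this sketch and not something the group chose.

```python
# Dublin Core terms namespace, used here only as an example vocabulary.
DCT = "http://purl.org/dc/terms/"


def escape_literal(value: str) -> str:
    """Escape the characters that may not appear raw in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def metadata_to_ntriples(subject_uri: str, metadata: dict) -> str:
    """Emit one triple per metadata field: <subject> <dcterms:field> "value" ."""
    lines = []
    for key, value in sorted(metadata.items()):
        lines.append(f'<{subject_uri}> <{DCT}{key}> "{escape_literal(value)}" .')
    return "\n".join(lines)


# Hypothetical data set URI and values.
print(
    metadata_to_ntriples(
        "http://example.org/dataset/1",
        {"title": "ChIP-seq of factor X", "creator": "Example Lab"},
    )
)
```

Starting with metadata keeps the number of triples small while still forcing a choice of subject URIs and predicates, which is where the naming questions above come in.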
     227 
     228Agreed that production-oriented groups have to deliver things now to
     229the research community - but awareness of semantics is important.
     230Balance between cleanliness/ rigour and getting the job done now.
     231 
     232In some senses RDF vs databases is not an either/ or.  RDF view: does
     233force one to think about defining things very clearly.
     234 
     235Care is needed: one can do RDF badly and it will still work (locally) but it is not
     236good for interoperation.  There is a similar situation with ontology use in
     237databases.