| 1 | |
| 2 | Wednesday 10th February p.m. - Room 516 |
| 3 | |
| 4 | Semantic Data Exchange |
| 5 | |
| 6 | Gos Micklem |
| 7 | Richard Smith |
| 8 | James Taylor |
| 9 | Arek Kasprzyk |
| 10 | Soichi Ogishima |
| 11 | Brad Chapman |
| 12 | Christian Zmasek |
| 13 | Peter Cock |
| 14 | Hideyuki Morita |
| 15 | Ryosuke Ishiwata |
| 16 | Kei Ono |
| 17 | Shinobu Okamoto |
| 18 | Alberto Labarga |
| 19 | |
| 20 | ------------------------------------------------------------------------------ |
| 21 | Discussion on possibilities/need for improving data exchange between |
| 22 | e.g. InterMine, Galaxy, BioMart, Cytoscape... |
| 23 | |
| 24 | Would typing of arbitrary data exchange improve communication between |
| 25 | Cytoscape/BioMart/InterMine/Galaxy, reduce the need for glue code, and |
| 26 | make life easier/more robust? |
| 27 | |
| 28 | |
| 29 | Current situation: |
| 30 | |
| 31 | It was felt that the current situation wasn't so bad: however it would |
| 32 | be nice to avoid bespoke plugins for communication. |
| 33 | |
| 34 | Galaxy (http://main.g2.bx.psu.edu/): it is the responsibility of the user to make sure that the |
| 35 | type of data used is correct for the analysis being performed. |
| 36 | |
| 37 | Galaxy/BioMart (http://www.biomart.org) communication: neither BioMart nor Galaxy deals with |
| 38 | semantics, and communication works fine. |
| 39 | |
| 40 | Galaxy/UCSC table browser |
| 41 | (http://genome.ucsc.edu/cgi-bin/hgTables?command=start): <details |
| 42 | needed> |
| 43 | |
| 44 | Cytoscape (http://www.cytoscape.org)/BioMart: REST API used. Perhaps SOAP would be better but |
| 45 | this still requires some development. REST was fine for data |
| 46 | retrieval. |
| 47 | |
| 48 | BioMart: the Mart deployer decides on the meta-data layer in |
| 49 | BioMart - the GUI uses this layer rather than the data. It assumes |
| 50 | that the user understands what you are talking about. Sometimes it |
| 51 | points to ontologies. The GUI provides a number of options but there is |
| 52 | nothing to stop the user mis-naming things. cf UCSC table browser: |
| 53 | BioMart has place-holders for column descriptions, though currently |
| 54 | these aren't used. |
| 55 | |
| 56 | Interoperation of Marts: this is the only place where one must get the |
| 57 | semantics correct. If one mart calls something a UniProt |
| 58 | identifier and the other one does too then it is essential that they are |
| 59 | referring to the same identifier. Perhaps it would be good to have a |
| 60 | controlled name-space for this and/or a hand-shake to check that they do |
| 61 | have matching values. |
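One way such a hand-shake might look is sketched below: two marts first compare the namespace label each attaches to a column, then exchange a sample of values and measure the overlap. The function name, the namespace URI and the identifiers are all hypothetical; nothing like this was specified in the session.

```python
def handshake(left_namespace, left_ids, right_namespace, right_ids):
    """Return the fraction of left_ids also found in right_ids.

    Returns 0.0 when the two sides declare different namespaces or when
    the left side sends no identifiers, so a caller can treat any low
    score as "do not join on this column".
    """
    if left_namespace != right_namespace:
        return 0.0
    left, right = set(left_ids), set(right_ids)
    if not left:
        return 0.0
    return len(left & right) / len(left)


# Hypothetical namespace URI and made-up identifiers, for illustration only.
NAMESPACE = "http://example.org/namespaces/protein-identifier"
score = handshake(NAMESPACE, ["ID1", "ID2", "ID3", "ID4"],
                  NAMESPACE, ["ID2", "ID3", "ID4", "ID5"])
print(score)
```

A matching namespace label with a very low overlap would suggest that the two marts use the same name for different things, which is the failure the notes above are worried about.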
| 62 | |
| 63 | InterMine (http://www.intermine.org): multiple organisms can use the same identifiers |
| 64 | e.g. across 12 Drosophila genomes. InterMines can ask each other |
| 65 | what data they provide but can't be sure name-spaces are |
| 66 | compatible. Would be nice if InterMine or BioMart systems could talk |
| 67 | and discover what they have and how they could communicate. Agreed |
| 68 | that it would be good to have a more formal description so that the intended |
| 69 | work on InterMine talking to a BioMart backend could be easier. |
| 70 | |
| 71 | |
| 72 | Discussed whether passing a header for column-based data would be |
| 73 | useful. |
| 74 | |
| 75 | Available data-describing controlled vocabularies: OICR cancer data |
| 76 | experience is that there are rather limited naming systems. |
| 77 | |
| 78 | Thought to be a good idea to expose/ export current naming systems. |
| 79 | The Cancer Genome Atlas (TCGA: http://cancergenome.nih.gov/) have done |
| 80 | some thinking along these lines. |
| 81 | |
| 82 | Galaxy: has XML to describe file formats. |
| 83 | Biopython/BioPerl/BioRuby/BioJava have more-or-less agreed |
| 84 | file format names. |
| 85 | |
| 86 | Considering just the FASTA format for sequences, there are quite a number of |
| 87 | flavours: |
| 88 | - DNA vs protein sequences |
| 89 | - use of ambiguity codes or not |
| 90 | - softmasked sequence vs N-masking |
| 91 | |
| 92 | Likewise for files that transmit data about genome features: |
| 93 | - 0-based coordinates |
| 94 | - 1-based coordinates |
| 95 | - in-between-base coordinates |
| 96 | |
| 97 | A naming scheme would be useful to capture this kind of complexity. |
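As a small illustration of why the coordinate convention needs to be labelled, the sketch below converts one feature between two of the conventions listed above. The function and convention names are assumptions made for this example. It takes "0-based" to mean 0-based, end-exclusive intervals (as in BED) and "1-based" to mean 1-based, end-inclusive intervals (as in GFF); in-between-base coordinates number the gaps between bases and so coincide numerically with the 0-based, end-exclusive form.

```python
def one_based_closed_to_zero_based_half_open(start, end):
    """Convert 1-based, end-inclusive coordinates to 0-based, end-exclusive."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start, end):
    """Convert 0-based, end-exclusive coordinates to 1-based, end-inclusive."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def feature_length(start, end, convention):
    """Length in bases; the arithmetic depends on the declared convention."""
    if convention == "1-based-closed":
        return end - start + 1
    if convention == "0-based-half-open":
        return end - start
    raise ValueError("unknown coordinate convention: " + convention)


# The same ten-base feature written in both conventions.
print(one_based_closed_to_zero_based_half_open(11, 20))
print(feature_length(11, 20, "1-based-closed"))
print(feature_length(10, 20, "0-based-half-open"))
```

Without a label saying which convention a file uses, the pair (11, 20) is ambiguous by one base, which is exactly the kind of complexity a naming scheme would need to capture.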
| 98 | |
| 99 | Would be useful to have a namespace that allows one to assert something |
| 100 | is a particular file format. With an appropriate URI one would know |
| 101 | what kind of file was arriving. Likewise if one receives a column of |
| 102 | identifiers from two sources it would be useful to know that they are |
| 103 | referring to the same thing. |
| 104 | |
| 105 | File formats vs labelling of data itself: |
| 106 | |
| 107 | BioMart/InterMine can both provide a column meaning "GO identifier", |
| 108 | but where to assert they are the same thing? |
| 109 | |
| 110 | GO_ID + URI is all that is needed |
| 111 | |
| 112 | Is there a need for a central naming authority/ namespace provider? |
| 113 | Given work on UniProt already, is EBI a natural location for this? |
| 114 | DDBJ/ DBCLS/ NCBI? |
| 115 | |
| 116 | In a practical sense: if we want to take notions of semantics that are |
| 117 | already applied to RDF: what do we apply to a column of data? Want a |
| 118 | standard identifier for saying a column is a column of GO identifiers. |
| 119 | |
| 120 | If every item in a column has the same RDF type, then one can attach the |
| 121 | URI of that type to the column and that is the unique identifier that |
| 122 | we can use to be clear about what the column contents are. |
| 123 | |
| 124 | Agreed that we should use consistent URIs to label columns. |
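A minimal sketch of what URI-labelled columns could look like in practice is given below. Each column header carries a human-friendly name plus the URI of the type shared by every value in the column, and two tables are matched on the URI rather than on the name. The class name, the headings and the example.org URIs are hypothetical placeholders, not real BioMart or InterMine values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnLabel:
    name: str      # arbitrary human-friendly heading
    type_uri: str  # URI asserting what every value in the column is


def matching_columns(left, right):
    """Return (left_name, right_name) pairs whose type URIs are identical."""
    right_by_uri = {col.type_uri: col.name for col in right}
    return [(col.name, right_by_uri[col.type_uri])
            for col in left if col.type_uri in right_by_uri]


# Hypothetical URIs: no central naming authority has been agreed.
GO_ID = "http://example.org/column-types/go-identifier"
GENE_ID = "http://example.org/column-types/gene-identifier"

mart_header = [ColumnLabel("go_id", GO_ID), ColumnLabel("gene", GENE_ID)]
mine_header = [ColumnLabel("GO identifier", GO_ID)]

print(matching_columns(mart_header, mine_header))
```

The two sources keep their own headings; only the URI has to agree for a receiver to know that both columns hold the same kind of identifier.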
| 125 | |
| 126 | Kei: PSI-MI EBI website: global definitions: |
| 127 | http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI |
| 128 | |
| 129 | molecular-interaction |
| 130 | --> database citation |
| 131 | --> feature database |
| 132 | --> gene ontology (double click for definition) |
| 133 | http://www.ebi.ac.uk/ontology-lookup/?termId=MI%3A0448 |
| 134 | |
| 135 | Semantics needs to be regulated regardless of the technology (So RDF |
| 136 | isn't necessarily the point here) |
| 137 | |
| 138 | Agreed that it would be good to allow semantics to be published/exchanged |
| 139 | rather than making this an absolute requirement. An absolute requirement |
| 140 | causes stasis and makes things harder. If it is made easy then |
| 141 | large-scale users/providers especially can start to comply. |
| 142 | |
| 143 | |
| 144 | Data exchange conclusions: |
| 145 | *** A namespace for file formats would be useful. |
| 146 | |
| 147 | *** A namespace for columns of tabular data would be useful. Could |
| 148 | also be used to describe data in other formats e.g. XML, though |
| 149 | this could be rather verbose. |
| 150 | *** Investigate whether the above exist. |
| 151 | Ontology Lookup Service (http://www.ebi.ac.uk/ontology-lookup) |
| 152 | and/or Life Science Resource Name Project (http://www.lsrn.org) |
| 153 | applicable? |
| 154 | |
| 155 | *** At the moment namespaces for columns are probably more important |
| 156 | than URIs for each data element in a column. |
| 157 | |
| 158 | Agreed that worthwhile to pass URIs to describe columns. Agreed that |
| 159 | arbitrary human-friendly names are also good. |
| 160 | |
| 161 | *** Agreed to dump all BioMart/ InterMine column headings out, find |
| 162 | the common/commonly-used ones and work on naming. |
| 163 | |
| 164 | |
| 165 | |
| 166 | ========== Discussion turned to genome builds: |
| 167 | |
| 168 | There is nowhere to go to find out if entities/coordinates come from |
| 169 | the same versions of genomes. Agreed that versioning is important. |
| 170 | |
| 171 | BioMart/ UCSC do have versions |
| 172 | available but not necessarily using the same namespaces. |
| 173 | |
| 174 | BioMart has place-holders for versions and could easily expose these. |
| 175 | |
| 176 | Issue with resources generated from 'old' genome versions e.g. Affy |
| 177 | chips: difficult to force people to use just one version of the |
| 178 | genome. |
| 179 | |
| 180 | Can make gene identifiers unique by organism-specific prefix, or by |
| 181 | qualifier. |
| 182 | |
| 183 | Ensembl (http://www.ensembl.org) does a good job and plans on |
| 184 | supporting all genomes. Ensembl is systematic |
| 185 | about mapping its versions to others e.g. those from UCSC. |
| 186 | |
| 187 | Assembly version and Ensembl gene-build version are sufficient to |
| 188 | resolve all ambiguities. |
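The sketch below shows one way exported data could carry this pair of versions so that a receiver can check compatibility before combining datasets. The class and function names and the version strings are hypothetical. The rule that coordinates need only a matching assembly, while gene identifiers need both assembly and gene build to match, is an assumption made for the example rather than something stated in the session.

```python
from typing import NamedTuple


class GenomeVersion(NamedTuple):
    assembly: str    # genome assembly version
    gene_build: str  # annotation (gene-build) version on that assembly


def coordinates_comparable(a, b):
    """Assumed rule: coordinates are comparable when the assembly matches."""
    return a.assembly == b.assembly


def gene_ids_comparable(a, b):
    """Assumed rule: gene identifiers need assembly and gene build to match."""
    return a == b


# Placeholder version strings, not real release names.
exported = GenomeVersion(assembly="assembly-A", gene_build="build-1")
received = GenomeVersion(assembly="assembly-A", gene_build="build-2")

print(coordinates_comparable(exported, received))
print(gene_ids_comparable(exported, received))
```

Attaching such a record to every export would give a receiver somewhere to look to find out whether two sets of coordinates come from the same genome version.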
| 189 | |
| 190 | |
| 191 | modMine (http://intermine.modencode.org): the modENCODE project |
| 192 | records the genome against which data were generated. Liftover of |
| 193 | data between genomes is provided. Currently, exported data do |
| 194 | not record the genome version. |
| 195 | |
| 196 | |
| 197 | Galaxy saves the data originally sent - this makes things simpler in the |
| 198 | sense that the data actually used for some analysis are available |
| 199 | indefinitely. Galaxy remembers what analysis was done but users may not remember why or what |
| 200 | they were doing. Galaxy encourages people to record the genome build |
| 201 | version. |
| 202 | |
| 203 | Genome version summary: |
| 204 | * Investigate whether there is a standard available for describing genome version * |
| 205 | * Consider whether to base naming on Ensembl genome/annotation versions * |
| 206 | |
| 207 | |
| 208 | |
| 209 | =========== Thoughts on RDF: |
| 210 | |
| 211 | If everyone is expressing their data in RDF with a common underlying |
| 212 | naming scheme then large-scale data integration will be easier, |
| 213 | whether with a conventional warehouse or with a triple store. |
| 214 | |
| 215 | Data warehouses are perhaps the wrong place to start with RDF: it should |
| 216 | be the original data sources. If warehouses self-define identifiers |
| 217 | then there are potential problems if/when original sources start generating |
| 218 | their own identifiers. |
| 219 | |
| 220 | For modENCODE data/ OICR cancer genome data we are the data |
| 221 | originators so could generate IDs. |
| 222 | |
| 223 | modENCODE as example: would we provide data as RDF? Could there be |
| 224 | advantages? How would the community feel about this? RDF is a vehicle: |
| 225 | it requires ontologies. Felt that the first thing to try would be to |
| 226 | represent meta-data as RDF. |
| 227 | |
| 228 | Agreed that production-oriented groups have to deliver things now to |
| 229 | the research community - but awareness of semantics is important. |
| 230 | Balance between cleanliness/rigour and getting the job done now. |
| 231 | |
| 232 | In some senses RDF vs databases is not an either/or. RDF view: does |
| 233 | force one to think about defining things very clearly. |
| 234 | |
| 235 | Care is needed: one can do RDF badly and it will still work (locally) but not |
| 236 | be good for interoperation. Similar situation with ontology use in |
| 237 | databases. |