Implementation Bootcamp
- Implementation tips and tricks / FAQ
  - Triplestores
  - Ontologizing
    - What are some life science semantic web initiatives I should know about?
    - I have to build an OWL model of my data - how do I go about doing this?
    - Which version of Protege should I use?
    - How do I align ontologies?
    - What are the similarities and differences between the various shared names proposals?
    - UniProt PURLS
  - Web services
  - Generating RDF
    - How do I convert OBO to RDF?
    - Which XSL processor should I use to convert legacy XML to RDF?
    - What to do with RDFa metadata?
    - I have a database in an RDBMS, how can I convert it directly to RDF?
    - How granular should my returned RDF be?
    - In my generated RDF, what namespace URI do I use to prefix my terms?
    - Where do I validate my RDF/XML?
    - Which module (Perl) can I use to serialize my RDF statements?
  - SPARQL
  - Reasoning
Implementation tips and tricks / FAQ
Here we collect useful tips and tricks so that we don't all have to reinvent the wheel. Please add any answers and/or questions!
Triplestores
I have to build a triplestore; what tool should I use, and under what circumstances?
Virtuoso seems to be the triplestore of choice at the moment, but it does suffer from problems with data import. We (Wilkinson lab, Belleau/Bio2RDF, Dumontier lab) have considerable experience with this and will write tutorials about it; the link will be added here soon! There is also a wiki page that lists the various available choices, and a page about 4store.
How do I access a triplestore from Perl/Ruby/Python/Java etc.?
In Python, you have RDFLib. For Java, Jena is your best option. The PerlRDF site lists the options for Perl. Ruby has ActiveRDF, which seemed to have issues with the adapters (solved now!).
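For example, here is a minimal Python sketch using the SPARQLWrapper package (an assumption; it is not mentioned above, and the endpoint URL is a hypothetical placeholder) to query any SPARQL-capable triplestore:

# Query a remote SPARQL endpoint from Python (SPARQLWrapper is an
# assumed dependency; the endpoint URL is a placeholder).
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/sparql")
endpoint.setQuery("""
    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])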
Ontologizing
What are some life science semantic web initiatives I should know about?
Aside from initiatives that were explicitly mentioned at the hackathon, here are some other ones not to ignore:
- A number of biomedical ontologies are collected by the NCBO BioPortal.
- For evolutionary biologists and phyloinformaticists there is CDAO.
- For taxonomists and biodiversity informaticists there is the TDWG ontology and DarwinCore.
Some issues are discussed by Rod Page and in this paper, for example.
Maybe these already do what you're trying to do, but if you have to ontologize your own problem domain, make an effort to align your terms with these efforts. The idea is that we're building one big graph that connects everything!
I have to build an OWL model of my data - how do I go about doing this?
To understand what you're supposed to do, read:
- The OWL 2 primer to learn about OWL,
- A Semantic Web Primer for Object-Oriented Software Developers if you know anything about OO modeling in software, as it will help you avoid a lot of common pitfalls,
- Semantic Web for the Working Ontologist if you want an in-depth explanation with lots of examples, and
- The hackathon wiki page on OWL.
Then use Protege to actually build the ontology.
It is highly recommended that you "make friends" with someone who has a deep understanding of OWL, and the consequences of various OWL constructs, as you go through your learning experience. While the existing tutorials are good for telling you what is possible, they aren't always entirely clear about the consequences of choosing one encoding method versus another, and this dramatically affects your ability to reason over your data. Unfortunately, there are few shortcuts - OWL is hard!
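If you prefer to script a first draft of the model rather than (or before) opening Protege, a minimal sketch with Python's RDFLib is shown below; RDFLib and all of the class and property names here are illustrative assumptions, not a recommended design.

# A tiny OWL sketch built with RDFLib (class/property names are made up).
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/myontology#")

g = Graph()
g.bind("ex", EX)

# Two classes and a subclass axiom.
g.add((EX.Protein, RDF.type, OWL.Class))
g.add((EX.Kinase, RDF.type, OWL.Class))
g.add((EX.Kinase, RDFS.subClassOf, EX.Protein))

# An object property with a domain and a range.
g.add((EX.interactsWith, RDF.type, OWL.ObjectProperty))
g.add((EX.interactsWith, RDFS.domain, EX.Protein))
g.add((EX.interactsWith, RDFS.range, EX.Protein))

# Human-readable labels make the terms easier for others to reuse.
g.add((EX.Protein, RDFS.label, Literal("Protein")))
g.add((EX.Kinase, RDFS.label, Literal("Kinase")))

print(g.serialize(format="xml"))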
Which version of Protege should I use?
Protege 3 and Protege 4 are "philosophically" different, and represent a split in the global ontology community that runs roughly along the lines of the "OBO-fans" and the "OWL-DL-fans" (that's over-simplifying the situation, but I think it is by-and-large correct). The two development communities had different target-audiences in mind when developing the software, and those audiences are reflected in the decisions made. Protege 4 uses the Manchester OWL API "under the hood", and is somewhat more capable of manipulating OWL than Protege 3 is. On the other hand, if you are planning to use Protege to generate RDF data ("individuals") manually, then Protege 3 might be more useful for you.
How do I align ontologies?
At the VoCamp/TDWG satellite meeting in Montpellier (2009), a working group devoted to this topic developed the following recommendations:
For ontology integration, our work has led us to conclude that:
- instance data should be fully ontologized. For example, our phenoscape use case could not be completed because phenoscape uses XML literals to express trait post-composition. These traits were consequently inaccessible for the purpose of data integration.
- ontologies should be designed as reusable modules rather than monolithic artifacts. Aligning CDAO with DarwinCore was relatively easy because DarwinCore doesn't have a lot of structure (which is a good thing from the perspective of re-use), although DarwinCore still needs to be ontologized.
- data integration is most easily achieved by developing small adaptor ontologies rather than by merging large (and potentially well-established and "stable") artifacts; see the adaptor-ontology sketch after this list. Merging large ontologies has a greater potential for irreconcilable incongruities. Adapting smaller ontologies requires immediate reconciliation, but insulates the practitioner from irrelevant inconsistencies, and implementations are likely to be more efficient and scalable. Nevertheless, if two domains have significant overlap, it is probably better to merge them and reconcile the inconsistencies, thereby decreasing the overall noise in subsequent use of the domain.
- URIs (URLs) for terms should be carefully constructed, predictable and stabilized, perhaps using PURLs. For example, several queries failed to produce expected results due to omission of www prefixes or # suffixes in URLs.
- several tools (Homonto developed by BGee, LOOM) and a lot of research (Ontology Matching) have already gone into the problem of ontology alignment. However, expert knowledge for manual alignment is often still necessary.
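To illustrate the adaptor-ontology recommendation above, the sketch below bridges two hypothetical ontologies with equivalence and subclass axioms instead of merging them. All URIs and term names are invented placeholders, and RDFLib is just an assumed convenience, not something the working group endorsed.

# A small "adaptor" ontology that maps terms between two existing
# ontologies without modifying either one (all URIs are placeholders).
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDFS

ONTO_A = Namespace("http://example.org/ontologyA#")
ONTO_B = Namespace("http://example.org/ontologyB#")

adaptor = Graph()
adaptor.bind("a", ONTO_A)
adaptor.bind("b", ONTO_B)

# The two ontologies use different names for the same concept.
adaptor.add((ONTO_A.CharacterMatrix, OWL.equivalentClass, ONTO_B.Matrix))

# One ontology's term is narrower than the other's.
adaptor.add((ONTO_A.MolecularCharacter, RDFS.subClassOf, ONTO_B.Character))

# Properties can be mapped as well.
adaptor.add((ONTO_A.hasState, OWL.equivalentProperty, ONTO_B.state))

adaptor.serialize("adaptor.owl", format="xml")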
What are the similarities and differences between the various shared names proposals?
Shared names proposals such as LSRN? UniProt?
MDW: See above, and ask specific questions that I can try to answer myself, or invite the representatives from the other proposals to answer. (Get well soon, Alan!!!!)
Here are the SharedName requirements:
- It must be clearly stated what the intended referent of each URI is supposed to be, i.e. that the URI denotes some particular record from some particular database.
- Information about the URI and its referent, including such a statement, must be made available, and in order to leverage existing protocol stacks, it must be obtainable via HTTP. (We'll call such information "URI documentation".)
- URI documentation must be provided in RDF.
- Provision of URI documentation must be an ongoing concern. The ability to provide it may have to outlive the original database or the database's creator.
- The provider of the URI documentation must be responsive to community needs, such as the need to have mistakes fixed in a timely manner.
- URI documentation must be open so that it can be replicated and reused.
UniProt PURLS
UniProt uses its own PURLs, not only for its own data but also for all the cross-references that we link to.
- For example, purl.uniprot.org/HGNC/37122 is redirected to http://www.genenames.org/data/hgnc_data.php?hgnc_id=37122
- We do this because we have to maintain stable links into the future: when one of our cross-reference databases changes its URLs, we change the redirection. One maintenance location.
- When merging datasets, people have to collapse these different URLs into one, using either a regexp or owl:sameAs statements (see the sketch at the end of this section).
- The benefit of using UniProt PURLs, when available, is that there is an ongoing maintenance effort behind them.
- They are documented in dbxref for the external datasets, and there is work being done to make this available in OWL/RDF, including the internal datasets.
Remember, don't get too hung up about this. Pick a URI for your data set and change it when required -- see the page URI for the results of the discussion with data providers here at the hackathon. Kaizen: build something and then keep on improving it ;)
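As a small illustration of collapsing identifiers with owl:sameAs, the sketch below links the UniProt PURL and target URL from the HGNC example above (used here purely as an example), assuming Python's RDFLib:

# State that a UniProt PURL and a source-specific URL identify the same
# thing, so a sameAs-aware store or reasoner can merge them.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()

uniprot_purl = URIRef("http://purl.uniprot.org/HGNC/37122")
source_url = URIRef("http://www.genenames.org/data/hgnc_data.php?hgnc_id=37122")

g.add((uniprot_purl, OWL.sameAs, source_url))

print(g.serialize(format="turtle"))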
Web services
I have an analysis tool, how do I expose it as a semantic web resource?
SADI provides one solution. Luke McCarthy gave a Java tutorial on Thursday, 11 February 2010, and Mark Wilkinson gave the Perl tutorial on the same day. Edward Kawas from the Wilkinson lab has produced movies detailing how to create services in Perl for SADI, and Mark will be doing the voice-over for these movies and putting them up on YouTube in the second week of February 2010. Mark will add a link here. The same will be done for the Java side once we have the extra-cool Java functionalities coded and ~stable. In particular, Luke McCarthy and Paul Gordon have been working together at the Hackathon finding simple ways to put SADI Java services into the Google Cloud... so you might not even have to consume your own compute resources to achieve this!
When someone calls GET on my URLs, what should I return in order to be semantic webby?
You should probably use Content Negotiation via the HTTP Accept: header. That way you can return HTML and RDF under the same URI. However, it's fine for the time being to just accept a ".rdf" suffix or an "&format=rdf" URL parameter.
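As a rough server-side sketch of this idea (Python's Flask is used purely for illustration; the framework and the record-description functions are assumptions, not anything prescribed above):

# Return RDF or HTML for the same URI depending on the Accept header.
from flask import Flask, Response, request

app = Flask(__name__)

# Hypothetical record descriptions; a real service would build these
# from its own database.
def describe_as_rdf(record_id):
    return (f'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
            f'xmlns:dc="http://purl.org/dc/elements/1.1/">'
            f'<rdf:Description rdf:about="http://example.org/resource/{record_id}">'
            f'<dc:title>{record_id}</dc:title>'
            f'</rdf:Description></rdf:RDF>')

def describe_as_html(record_id):
    return f"<html><body><h1>{record_id}</h1></body></html>"

@app.route("/resource/<record_id>")
def resource(record_id):
    accept = request.headers.get("Accept", "")
    if "application/rdf+xml" in accept:
        return Response(describe_as_rdf(record_id), mimetype="application/rdf+xml")
    # Default to HTML for browsers and anything else.
    return Response(describe_as_html(record_id), mimetype="text/html")

if __name__ == "__main__":
    app.run()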
At UniProt, we use content negotiation for our OWL documentation and our actual data.
> wget --header='Accept: application/rdf+xml' http://purl.uniprot.org/core/recommendedName
[...]
Saving to: `recommendedName.rdf'

> wget --header='Accept: text/html' http://purl.uniprot.org/core/recommendedName
[...]
Saving to: `recommendedName.html'
MDW: As an aside, the idea of content-negotiation has been extensively discussed within the Semantic Web for Healthcare and Life Science community, and it was not widely welcomed. The point of the Semantic Web is that things should be explicit, so there is some preference given to explicitly indicating (in your RDF metadata) that any given URI is going to return one syntax or another. (though I have to agree, I am quite a fan of content-negotiation, given that this is exactly the problem that it was designed to solve!! :-) )
MDW: Going back as far as 2004, when the LSID specification was being finalized, this issue was a top priority, so there is a sub-community of bioinformatics data providers who have thought about this problem for many many years! :-) This has led to a variety of "shared names" proposals, including the Science Commons, Semantic Science, LSRN, and others. In SADI (and now LSRN, since my lab has taken over the LSRN project in the past 2 months) we have decided to work with the Semantic Science shared-names proposal from Michel Dumontier. He has developed an ontology (I will provide a link to this as soon as Michel decides that the ontology is "final"... within days!!). The ontology defines how a URI should "behave" during resolution, depending on the kind of "thing" that the URI represents - e.g. a biological/physical entity, a database record, or a particular representation of a database record in HTML, XML, RDF, etc. Within the SADI project, we will be writing all of our support code to make compliance with the Semantic Science ontology as automatic as possible. We are also in the process of doing the same for URIs resolved through the LSRN resolution system... so if you use SADI or LSRN, you should get compliance with this ontology "for free" within the next week or two! In my opinion, this is one of the most important issues we have addressed at this hackathon! The Semantic Web works SO much better if we are careful to pay attention to what our URIs REPRESENT: things, records, or representations of records. It sounds tedious, but we're doing everything we can to shield the data providers from having to think deeply about the problem, and trying to encode the complexity in our respective codebases.
How do I create a SADI service?
Links to tutorials coming soon!!!
Generating RDF
How do I convert OBO to RDF?
If your input data is represented in OBO format, you can use the obo2rdf.pl script provided by ONTO-PERL.
Also, you can convert your OBO file to OWL here.
Which XSL processor should I use to convert legacy XML to RDF?
Pierre Lindenbaum uses xsltproc, which works fine. For example, see:
- http://plindenbaum.blogspot.com/2010/02/linkedinxslt-foaf-people-from.html
- http://plindenbaum.blogspot.com/2010/02/searching-for-genotypes-with-sparql.html
When the XML source is too large to fit in the memory of xsltproc, I (Pierre Lindenbaum) use a custom tool named xsltstream that runs a new XSLT transformation for each chunk of data. For example, say you want to convert the XML files of dbSNP (ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/, e.g. ds_ch1.xml.gz is 1099375 KB) with dbsnp2rdf.xsl (http://code.google.com/p/lindenb/source/browse/trunk/src/xsl/dbsnp2rdf.xsl). Download xsltstream from http://code.google.com/p/lindenb/downloads/list and then invoke:
java -jar xsltstream.jar -x dbsnp2rdf.xsl -q "Rs" 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch1.xml.gz' |\
  grep -v "rdf:RDF" |\
  grep -v "<?xml"
Result:
(...)
<o:SNP rdf:about="http://www.ncbi.nlm.nih.gov/snp/2854">
  <dc:title>rs2854</dc:title>
  <o:taxon rdf:resource="http://www.ncbi.nlm.nih.gov/taxonomy/9606"/>
  <o:het rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0.24</o:het>
  <o:hasMapping>
    <o:Mapping>
      <o:build rdf:resource="urn:void:ncbi:build:Celera/36_3"/>
      <o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
      <o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">196613685</o:start>
      <o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">196613686</o:end>
      <o:orient>+</o:orient>
    </o:Mapping>
  </o:hasMapping>
  <o:hasMapping>
    <o:Mapping>
      <o:build rdf:resource="urn:void:ncbi:build:HuRef/36_3"/>
      <o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
      <o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">194069483</o:start>
      <o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">194069484</o:end>
      <o:orient>+</o:orient>
    </o:Mapping>
  </o:hasMapping>
  <o:hasMapping>
    <o:Mapping>
      <o:build rdf:resource="urn:void:ncbi:build:reference/36_3"/>
      <o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
      <o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">221460932</o:start>
      <o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">221460933</o:end>
      <o:orient>+</o:orient>
    </o:Mapping>
  </o:hasMapping>
</o:SNP>
<o:SNP rdf:about="http://www.ncbi.nlm.nih.gov/snp/2866">
  <dc:title>rs2866</dc:title>
  <o:taxon rdf:resource="http://www.ncbi.nlm.nih.gov/taxonomy/9606"/>
  <o:het rdf:datatype="http://www.w3.org/2001/XMLSchema#float">0.50</o:het>
  <o:hasMapping>
    <o:Mapping>
      <o:build rdf:resource="urn:void:ncbi:build:Celera/36_3"/>
      <o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
      <o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">220636770</o:start>
      <o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">220636771</o:end>
      <o:orient>-</o:orient>
    </o:Mapping>
  </o:hasMapping>
  <o:hasMapping>
    <o:Mapping>
      <o:build rdf:resource="urn:void:ncbi:build:HuRef/36_3"/>
      <o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
      <o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">217734218</o:start>
      <o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">217734219</o:end>
      <o:orient>-</o:orient>
    </o:Mapping>
  </o:hasMapping>
  <o:hasMapping>
    <o:Mapping>
      <o:build rdf:resource="urn:void:ncbi:build:reference/36_3"/>
      <o:chrom rdf:resource="urn:void:ncbi:chromosome:9606/chr1"/>
      <o:start rdf:datatype="http://www.w3.org/2001/XMLSchema#int">245407000</o:start>
      <o:end rdf:datatype="http://www.w3.org/2001/XMLSchema#int">245407001</o:end>
      <o:orient>-</o:orient>
    </o:Mapping>
  </o:hasMapping>
</o:SNP>
(...)
Rutger Vos is using a stylesheet that uses XSLT 2.0 features, which libxslt (on which xsltproc is based) doesn't support. He therefore uses Saxon to transform NeXML into RDF.
What to do with RDFa metadata?
RDFa embedded inside HTML or XML can be turned into proper RDF/XML triples using this xsl stylesheet: http://nexml-dev.nescent.org/nexml/xslt/RDFa2RDFXML.xsl
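If you would rather drive that transformation from Python than from the command line, something like the following works; lxml is an assumption here (it wraps libxslt, so it only handles XSLT 1.0 stylesheets), and the input file name is a placeholder.

# Apply an XSL stylesheet to an RDFa-annotated (X)HTML/XML document.
from lxml import etree

stylesheet = etree.parse("RDFa2RDFXML.xsl")   # downloaded from the URL above
transform = etree.XSLT(stylesheet)

doc = etree.parse("my_page.xhtml")            # hypothetical RDFa-annotated page
rdfxml = transform(doc)

print(str(rdfxml))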
I have a database in an RDBMS, how can I convert it directly to RDF?
Possibly using a Protege plug-in (which one?) or by providing a web service. There is D2RQ, which works okay but is somewhat lacking performance-wise. However, this really depends on whether or not you intend to publish your database as a SPARQL endpoint. The poll that Pierre Lindenbaum and Mark Wilkinson took over the past couple of days suggests that only 5 data providers (within Tweet-shot of us) currently provide SQL access to their data resources. This does not seem to bode well for having data providers set up SPARQL endpoints: why would they open themselves to a new, unfamiliar technology when they don't open themselves to a well-known, tested, and highly powerful technology?
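For a one-off or scripted conversion (as opposed to a live mapping layer like D2RQ), the general idea looks roughly like the Python sketch below; the database schema, table, and URIs are all hypothetical placeholders, and RDFLib is an assumed dependency.

# Pull rows out of a relational database and emit one RDF resource per
# row (schema and URIs are made-up placeholders).
import sqlite3
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/genes/")

conn = sqlite3.connect("mydb.sqlite")   # hypothetical database
rows = conn.execute("SELECT id, symbol, description FROM gene")

g = Graph()
g.bind("ex", EX)

for gene_id, symbol, description in rows:
    subject = EX[str(gene_id)]
    g.add((subject, RDF.type, EX.Gene))
    g.add((subject, RDFS.label, Literal(symbol)))
    g.add((subject, RDFS.comment, Literal(description)))

g.serialize("genes.rdf", format="xml")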
Mark Wilkinson's team have tried to make a compelling argument that exposing resources via SADI Web Services gives you the best of both worlds - a highly-granular control over what data you expose, how you expose it, and over the distribution of large numbers of requests over your compute-resources; yet the SHARE client helps make it *appear* that the entire world is one big SPARQL endpoint (on steroids, since you can SPARQL data that doesn't even exist until you ask the question!)
Mark's opinion (biased!) is that SADI Web Services are a better way to expose RDF data compared to SPARQL endpoints. Moreover, it doesn't require you to change your existing data infrastructure in any way - you don't need to have a triple-store to expose your data as triples via SADI. With a Web Service-based exposure, you can migrate your data gradually/modularly, a few properties at a time, rather than attempting to move your entire database to the Semantic Web in one shot... and gain experience as you go. Given that it is currently not (natively) possible to SPARQL query over multiple endpoints, you aren't losing anything by going the SADI route either. Finally, all of your resources (both database and analytical tools) are exposed in exactly the same way, meaning that they are all accessed by clients in exactly the same way, simplifying client design.
How granular should my returned RDF be?
There was a very brief discussion of this issue on Thursday; the answer was "be pragmatic". Highly granular data (like absolute expression-level changes for microarrays) might not be appropriate for conversion into RDF because it explodes the size of the dataset in a circumstance where (a) the dataset is generally going to be used as a whole anyway, (b) there are completely adequate parsers for existing file formats, and (c) the benefit of being able to reason over an RDF representation of the data is limited, or absent. On the other hand, there is no reason (in Rutger Vos's opinion) why an atomic datum (such as a single site in a sequence) that is considered a resource under the data model in use shouldn't return a brief description of itself upon resolution, provided that the context within which that resource has meaning can be located (e.g. by referring to the defining resource using rdfs:isDefinedBy).
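As a small sketch of that last suggestion (all URIs here are invented placeholders, and RDFLib is an assumed tool), an atomic datum can carry a one-line description plus a pointer back to the resource that gives it context:

# A minimal self-description for an atomic datum, pointing back to the
# resource that defines its context (all URIs are placeholders).
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/alignments/")

g = Graph()
g.add((EX.site42, RDFS.label, Literal("site 42 of alignment 1")))
g.add((EX.site42, RDFS.isDefinedBy, EX.alignment1))

print(g.serialize(format="turtle"))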
In my generated RDF, what namespace URI do I use to prefix my terms?
It is best to use a real, resolvable URL (unlike typical XML namespaces, which need not resolve to anything), preferably one pointing to an OWL file, e.g. xmlns:foo="http://example.org/terms.owl#", so that a construct like "foo:bar" can be resolved.
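For example, a short RDFLib sketch that binds such a namespace (the URL is the same illustrative placeholder used above):

# Bind a resolvable namespace prefix and mint a term in it.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDFS

FOO = Namespace("http://example.org/terms.owl#")

g = Graph()
g.bind("foo", FOO)   # serialized as xmlns:foo="http://example.org/terms.owl#"

g.add((FOO.bar, RDFS.label, Literal("bar")))
print(g.serialize(format="xml"))   # foo:bar resolves to ...terms.owl#bar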
Where do I validate my RDF/XML?
http://www.w3.org/RDF/Validator/
Which module (Perl) can I use to serialize my RDF statements?
If you want to generate RDF/XML data, there are four modules available; for details, please refer to this page.
SPARQL
How do I learn SPARQL?
Here are the links used in Francois' SPARQL lecture:
http://delicious.com/fbelleau/bh2010:sparql
There is also a SparqlTutorial page here on the wiki.
Where can I practice SPARQL queries?
The following site can be used to easily create some SPARQL queries and learn more about the syntax:
http://www.semantic-systems-biology.org/biogateway/querying
Note that you can select sample (pre-canned) queries (biological and ontological) and modify them; it is also possible to build your own SPARQL queries using that interface.
Finally, your SPARQL query result can also be visualized using the following browser:
http://www.semantic-systems-biology.org/biogateway/sparql-viewer/
Can SPARQL be converted into SQL, or vice versa?
The Ultrawrap project provides useful information.
Can SPARQL be converted into XQuery, or vice versa?
Reasoning
What is reasoning?
Reasoning can be used in a few ways. It can be used to infer data that was not hard-coded into the database, i.e. triples are added to the store that you did not explicitly encode.
An example:
:ProteinA :interactsWith :ProteinB
Without reasoning, the query
SELECT ?with WHERE { :ProteinB :interactsWith ?with }
will return no results. When the OWL statement
:interactsWith owl:inverseOf :interactsWith
is added (making :interactsWith its own inverse, i.e. a symmetric property), the same query will return :ProteinA as ?with.
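If you want to see this inference actually happen, here is a minimal sketch using the Python packages rdflib and owlrl (neither is mentioned elsewhere on this page; they are simply one convenient in-memory option):

# Materialize the inverse-property inference with an OWL-RL reasoner.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL
import owlrl

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.ProteinA, EX.interactsWith, EX.ProteinB))
# Declaring the property as its own inverse makes it symmetric.
g.add((EX.interactsWith, OWL.inverseOf, EX.interactsWith))

query = """
    SELECT ?with
    WHERE { <http://example.org/ProteinB> <http://example.org/interactsWith> ?with }
"""

print(list(g.query(query)))   # empty: the reverse triple is not asserted

# Add the OWL-RL deductive closure to the graph.
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

print(list(g.query(query)))   # now includes ProteinA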
You can also introduce classes, as shown by the SADI example, to make queries more readable. For example, "give me all human proteins known to be kinases":
SELECT ?p WHERE { ?p a :Protein . ?p :classifiedWith <http://purl.uniprot.org/keywords/418> . ?p :organism <http://purl.uniprot.org/taxonomy/9606> . }
works, but
SELECT ?p WHERE { ?p a :HumanKinase }
is much more readable to a human/biologist. This :HumanKinase type can be generated on the fly using OWL inference (see the sketch below for one way to define such a class).
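One way to define such a class so that a reasoner can populate it on the fly is sketched below. The up: namespace, the use of RDFLib, and the exact modelling are assumptions made for illustration; only the keyword and taxonomy URIs come from the query above.

# Define :HumanKinase as Protein AND classifiedWith keyword 418 AND
# organism taxon 9606, so a reasoner can classify matching individuals.
from rdflib import Graph, Namespace, BNode, URIRef
from rdflib.collection import Collection
from rdflib.namespace import OWL, RDF

UP = Namespace("http://purl.uniprot.org/core/")   # assumed prefix for :Protein etc.
EX = Namespace("http://example.org/")

g = Graph()

def has_value(prop, value):
    """Build an owl:Restriction node: prop owl:hasValue value."""
    restriction = BNode()
    g.add((restriction, RDF.type, OWL.Restriction))
    g.add((restriction, OWL.onProperty, prop))
    g.add((restriction, OWL.hasValue, value))
    return restriction

members = [
    UP.Protein,
    has_value(UP.classifiedWith, URIRef("http://purl.uniprot.org/keywords/418")),
    has_value(UP.organism, URIRef("http://purl.uniprot.org/taxonomy/9606")),
]

# owl:intersectionOf on a named class is a complete definition, so a
# reasoner will infer membership in HumanKinase automatically.
intersection_list = BNode()
Collection(g, intersection_list, members)
g.add((EX.HumanKinase, RDF.type, OWL.Class))
g.add((EX.HumanKinase, OWL.intersectionOf, intersection_list))

print(g.serialize(format="turtle"))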
Reasoning can also be used to do quality control. For example, when assigning the keyword "complete proteome" to a UniProt entry, a chromosomal location must be known. When an entry is found with the keyword "complete proteome" but no chromosomal location, a contradiction (i.e. an error) is raised. This allows us to notify a curator and fix the oversight. There are, however, some differences in how OWL is used when it is used as a schema and when it is used for reasoning; there's a little bit about this topic here.