Transforming the data for Sequence Read Archive (SRA) to RDF
Just a proof of concept, We choose to use the XML data from the Sequence Reads Archive I've first tried to use XSLT to transform the data but it took to much time to analyse the XSD schemas for SRA and make the stylesheets so I wrote this short Java program that loads the DOM and export the RDF to stdout.
I pasted the sources ( sorry quick'n stupid): https://gist.github.com/67bb728957abb16a680b
for example: SRA010050.run.xml looks like this:
<?xml version="1.0" encoding="UTF-8"?> <RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RUN alias="GSM424847_1" instrument_model="Illumina Genome Analyzer" run_cente r="unspecified" total_data_blocks="1" accession="SRR029634"> <EXPERIMENT_REF accession="SRX012521" refname="root_control_1"/> <DATA_BLOCK> <FILES> <FILE filename="DM1.fastq" filetype="fastq"/> </FILES> </DATA_BLOCK> <RUN_ATTRIBUTES> <RUN_ATTRIBUTE> <TAG>quality_book_char</TAG> <VALUE>@</VALUE> </RUN_ATTRIBUTE> <RUN_ATTRIBUTE> <TAG>quality_scoring_system</TAG> <VALUE>log odds</VALUE> </RUN_ATTRIBUTE> </RUN_ATTRIBUTES> </RUN> (...)
And here is the RDF version. Here I used some simple urn as the URIs (parsed successfully with the W3C validator) ...:
<?xml version="1.0" encoding="UTF-8"?><rdf:RDF xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:my="urn:mynamespace:"> <my:Run rdf:about="urn:sra:run:SRR029634"> <my:accession>SRR029634</my:accession> <my:alias>GSM424847_1</my:alias> <my:instrumentModel>Illumina Genome Analyzer</my:instrumentModel> <my:runCenter>unspecified</my:runCenter> <my:totalDataBlocks>1</my:totalDataBlocks> <my:hasExperiment> <my:Experiment rdf:about="urn:sra:experiment:SRX012521"> <my:accession>SRX012521</my:accession> <my:refname>root_control_1</my:refname> </my:Experiment> </my:hasExperiment> <my:hasDataBlock> <my:DataBlock> <my:hasFile> <my:File> <my:filename>DM1.fastq</my:filename> <my:filetype rdf:resource="urn:sra:filetype:fastq"/> </my:File> </my:hasFile> </my:DataBlock> </my:hasDataBlock> <my:hasRunAttribute> <my:RunAttribute> <my:tag>quality_book_char</my:tag> <my:value>@</my:value> </my:RunAttribute> </my:hasRunAttribute> <my:hasRunAttribute> <my:RunAttribute> <my:tag>quality_scoring_system</my:tag> <my:value>log odds</my:value> </my:RunAttribute> </my:hasRunAttribute> </my:Run> (...)
Using XSLT
The XSLT transformations are a valuable way to transform any XML source to RDF. For example, have a look at those two posts (warning/self promotion ! ) where a set of stylesheets was used to extract some RDF from different sources of XML data:
- http://plindenbaum.blogspot.com/2010/02/linkedinxslt-foaf-people-from.html
- http://plindenbaum.blogspot.com/2010/02/searching-for-genotypes-with-sparql.html
Links
- SRA http://www.ncbi.nlm.nih.gov/sra
- the XSD files for SRA: http://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/sra/doc/SRA
- XML files for DRA000039 ftp://ftp.ncbi.nih.gov/sra/Submissions/DRA000/DRA000039/
Attachments
- Photo 1.jpg (2.0 MB) - added by kawaji 15 years ago.