[[Image(Photo 1.jpg,200px)]]

= Transforming the data for Sequence Read Archive (SRA) to RDF =

This is just a proof of concept: we chose to use the XML data from the Sequence Read Archive (SRA).
I first tried to use '''XSLT''' to transform the data, but it took too much time to analyse the '''XSD schemas''' for SRA and write the stylesheets, so I wrote a short Java program that loads the DOM and exports the RDF to stdout.

I pasted the sources (sorry, it is quick and dirty): https://gist.github.com/67bb728957abb16a680b

For example, '''SRA010050.run.xml''' looks like this:
{{{
<?xml version="1.0" encoding="UTF-8"?>
<RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RUN alias="GSM424847_1" instrument_model="Illumina Genome Analyzer" run_center="unspecified" total_data_blocks="1" accession="SRR029634">
    <EXPERIMENT_REF accession="SRX012521" refname="root_control_1"/>
    <DATA_BLOCK>
      <FILES>
        <FILE filename="DM1.fastq" filetype="fastq"/>
      </FILES>
    </DATA_BLOCK>
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE>
        <TAG>quality_book_char</TAG>
        <VALUE>@</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>quality_scoring_system</TAG>
        <VALUE>log odds</VALUE>
      </RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
  (...)
}}}
And here is the RDF version. I used some simple ''urn'' identifiers as the URIs (the output was parsed successfully by the W3C validator):
{{{
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:my="urn:mynamespace:">
  <my:Run rdf:about="urn:sra:run:SRR029634">
    <my:accession>SRR029634</my:accession>
    <my:alias>GSM424847_1</my:alias>
    <my:instrumentModel>Illumina Genome Analyzer</my:instrumentModel>
    <my:runCenter>unspecified</my:runCenter>
    <my:totalDataBlocks>1</my:totalDataBlocks>
    <my:hasExperiment>
      <my:Experiment rdf:about="urn:sra:experiment:SRX012521">
        <my:accession>SRX012521</my:accession>
        <my:refname>root_control_1</my:refname>
      </my:Experiment>
    </my:hasExperiment>
    <my:hasDataBlock>
      <my:DataBlock>
        <my:hasFile>
          <my:File>
            <my:filename>DM1.fastq</my:filename>
            <my:filetype rdf:resource="urn:sra:filetype:fastq"/>
          </my:File>
        </my:hasFile>
      </my:DataBlock>
    </my:hasDataBlock>
    <my:hasRunAttribute>
      <my:RunAttribute>
        <my:tag>quality_book_char</my:tag>
        <my:value>@</my:value>
      </my:RunAttribute>
    </my:hasRunAttribute>
    <my:hasRunAttribute>
      <my:RunAttribute>
        <my:tag>quality_scoring_system</my:tag>
        <my:value>log odds</my:value>
      </my:RunAttribute>
    </my:hasRunAttribute>
  </my:Run>
  (...)
}}}
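To illustrate the "load the DOM, print the RDF" approach, here is a minimal sketch. It is not the program from the gist: the class name `SraRunToRdf` and all of its helper methods are hypothetical, it only handles a subset of the RUN element (three of its attributes, EXPERIMENT_REF and RUN_ATTRIBUTE), and it links the experiment with a plain `rdf:resource` instead of a nested node. The `urn:` URIs and the `my:` namespace are the ones used in the output above.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical sketch: converts the RUN elements of an SRA run.xml file to RDF/XML. */
final class SraRunToRdf {
    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Appends a literal property element; does nothing when the value is empty. */
    static void literal(StringBuilder out, String indent, String property, String value) {
        if (value.isEmpty()) return;
        out.append(indent).append("<my:").append(property).append('>')
           .append(escape(value)).append("</my:").append(property).append(">\n");
    }

    /** Returns the text of the first descendant with the given tag name, or "". */
    static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent();
    }

    /** Loads the XML into a DOM tree. */
    static Document parse(InputStream in) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("cannot parse the SRA XML", e);
        }
    }

    /** Walks the DOM and builds the RDF/XML document as a string. */
    static String toRdf(Document dom) {
        StringBuilder out = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        out.append("<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"")
           .append(" xmlns:my=\"urn:mynamespace:\">\n");
        NodeList runs = dom.getElementsByTagName("RUN");
        for (int i = 0; i < runs.getLength(); i++) {
            Element run = (Element) runs.item(i);
            out.append("  <my:Run rdf:about=\"urn:sra:run:")
               .append(escape(run.getAttribute("accession"))).append("\">\n");
            literal(out, "    ", "accession", run.getAttribute("accession"));
            literal(out, "    ", "alias", run.getAttribute("alias"));
            literal(out, "    ", "instrumentModel", run.getAttribute("instrument_model"));
            // Simplification: the experiment is referenced by URI only.
            NodeList refs = run.getElementsByTagName("EXPERIMENT_REF");
            for (int j = 0; j < refs.getLength(); j++) {
                Element ref = (Element) refs.item(j);
                out.append("    <my:hasExperiment rdf:resource=\"urn:sra:experiment:")
                   .append(escape(ref.getAttribute("accession"))).append("\"/>\n");
            }
            // Each TAG/VALUE pair becomes an anonymous my:RunAttribute node.
            NodeList attributes = run.getElementsByTagName("RUN_ATTRIBUTE");
            for (int j = 0; j < attributes.getLength(); j++) {
                Element attribute = (Element) attributes.item(j);
                out.append("    <my:hasRunAttribute>\n      <my:RunAttribute>\n");
                literal(out, "        ", "tag", text(attribute, "TAG"));
                literal(out, "        ", "value", text(attribute, "VALUE"));
                out.append("      </my:RunAttribute>\n    </my:hasRunAttribute>\n");
            }
            out.append("  </my:Run>\n");
        }
        return out.append("</rdf:RDF>\n").toString();
    }

    /** Usage: java SraRunToRdf SRA010050.run.xml > SRA010050.run.rdf */
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.print(toRdf(parse(in)));
        }
    }
}
```

Building the output with string concatenation keeps the sketch short; a more careful version would probably write the RDF through an XML writer or an RDF library instead.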
= Using XSLT =
'''XSLT''' transformations are a valuable way to transform any XML source into RDF. For example, have a look at these two posts (''warning: self-promotion!'') where a set of stylesheets was used to extract RDF from different sources of XML data:

  * http://plindenbaum.blogspot.com/2010/02/linkedinxslt-foaf-people-from.html
  * http://plindenbaum.blogspot.com/2010/02/searching-for-genotypes-with-sparql.html
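
For the SRA data, the same idea could look like the sketch below. It is only an illustration, not one of the stylesheets from the posts above: the class name `XsltToRdf` and the tiny embedded stylesheet are hypothetical, and the stylesheet maps only the `accession` and `alias` attributes of each RUN, using the same `urn:` URIs and `my:` namespace as the Java output. It relies on the standard `javax.xml.transform` API shipped with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: applies a small XSLT 1.0 stylesheet to an SRA run.xml document. */
final class XsltToRdf {
    /** Illustrative stylesheet: one my:Run resource per RUN element. */
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
        + " xmlns:my='urn:mynamespace:'>"
        + "<xsl:output method='xml' encoding='UTF-8'/>"
        + "<xsl:template match='/'>"
        + "<rdf:RDF><xsl:apply-templates select='RUN_SET/RUN'/></rdf:RDF>"
        + "</xsl:template>"
        + "<xsl:template match='RUN'>"
        // The attribute value template {@accession} builds the subject URI.
        + "<my:Run rdf:about='urn:sra:run:{@accession}'>"
        + "<my:accession><xsl:value-of select='@accession'/></my:accession>"
        + "<my:alias><xsl:value-of select='@alias'/></my:alias>"
        + "</my:Run>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms the given XML text with STYLESHEET and returns the RDF/XML text. */
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    /** Usage: java XsltToRdf SRA010050.run.xml > SRA010050.run.rdf */
    public static void main(String[] args) throws IOException {
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.print(transform(xml));
    }
}
```

In practice the stylesheet would live in its own `.xsl` file, so that it can also be run with a command-line processor such as xsltproc.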

= Links =

  * SRA: http://www.ncbi.nlm.nih.gov/sra
  * The XSD files for SRA: http://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/sra/doc/SRA
  * XML files for DRA000039: ftp://ftp.ncbi.nih.gov/sra/Submissions/DRA000/DRA000039/