Transforming the data for Sequence Read Archive (SRA) to RDF

Just a proof of concept, We choose to use the XML data from the Sequence Reads Archive I've first tried to use XSLT to transform the data but it took to much time to analyse the XSD schemas for SRA and make the stylesheets so I wrote this short Java program that loads the DOM and export the RDF to stdout.

I pasted the sources ( sorry quick'n stupid):  https://gist.github.com/67bb728957abb16a680b

for example: SRA010050.run.xml looks like this:

 <?xml version="1.0" encoding="UTF-8"?>
<RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RUN alias="GSM424847_1" instrument_model="Illumina Genome Analyzer" run_cente
r="unspecified" total_data_blocks="1" accession="SRR029634">
    <EXPERIMENT_REF accession="SRX012521" refname="root_control_1"/>
    <DATA_BLOCK>
      <FILES>
        <FILE filename="DM1.fastq" filetype="fastq"/>
      </FILES>
    </DATA_BLOCK>
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE>
        <TAG>quality_book_char</TAG>
        <VALUE>@</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>quality_scoring_system</TAG>
        <VALUE>log odds</VALUE>
      </RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
(...)

And here is the RDF version. Here I used some simple urn as the URIs (parsed successfully with the W3C validator) ...:

<?xml version="1.0" encoding="UTF-8"?><rdf:RDF xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:my="urn:mynamespace:">
	<my:Run rdf:about="urn:sra:run:SRR029634">
		<my:accession>SRR029634</my:accession>
		<my:alias>GSM424847_1</my:alias>
		<my:instrumentModel>Illumina Genome Analyzer</my:instrumentModel>
		<my:runCenter>unspecified</my:runCenter>
		<my:totalDataBlocks>1</my:totalDataBlocks>
		<my:hasExperiment>
			<my:Experiment rdf:about="urn:sra:experiment:SRX012521">
				<my:accession>SRX012521</my:accession>
				<my:refname>root_control_1</my:refname>
			</my:Experiment>
			
		</my:hasExperiment>
		<my:hasDataBlock>
			<my:DataBlock>
				<my:hasFile>
					<my:File>
						<my:filename>DM1.fastq</my:filename>
						<my:filetype rdf:resource="urn:sra:filetype:fastq"/>
					</my:File>
					
				</my:hasFile>
			</my:DataBlock>
			
		</my:hasDataBlock>
		<my:hasRunAttribute>
			<my:RunAttribute>
				<my:tag>quality_book_char</my:tag>
				<my:value>@</my:value>
			</my:RunAttribute>
			
		</my:hasRunAttribute>
		<my:hasRunAttribute>
			<my:RunAttribute>
				<my:tag>quality_scoring_system</my:tag>
				<my:value>log odds</my:value>
			</my:RunAttribute>
			
		</my:hasRunAttribute>
	</my:Run>
(...)

Using XSLT

The XSLT transformations are a valuable way to transform any XML source to RDF. For example, have a look at those two posts (warning/self promotion ! ) where a set of stylesheets was used to extract some RDF from different sources of XML data:

Links

Attachments