Queries
Targeted queries to be resolved by Semantic Web technologies.
One of our goals is to implement an advanced search engine -- something like a Wolfram Alpha for biology. On this page, we gather meaningful and challenging queries for Semantic Web technology and derive some goals of the BioHackathon 2010 from them.
Expected query types
Queries may range from simple Google-like keyword or phrase searches to ones that take biological objects as input. This section lists the anticipated query types in no particular order.
- Gene ID
- Technical terms
- Phrase search (e.g. "ortholog genes of gene A")
- DNA or protein sequence
- :
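As a rough illustration of how a search front end might tell these input types apart, here is a minimal Python sketch. The function name `classify_query`, the length cutoff, and the identifier pattern are all assumptions made for this example; real identifier schemes and sequence formats would need their own rules, and a long word made only of amino-acid letters would still be misread as a protein sequence.

```python
import re

# All of the rules below are illustrative assumptions, not an agreed specification.
DNA_LETTERS = set("ACGTN")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MIN_SEQUENCE_LENGTH = 20  # assumed cutoff to avoid treating short words as sequences
# Assumed identifier shape: digits only, or letters followed by digits (e.g. TLR4).
IDENTIFIER = re.compile(r"^(?:\d+|[A-Za-z]+[-_:]?\d+[A-Za-z0-9]*)$")


def classify_query(text):
    """Guess the type of a raw query string with simple heuristics."""
    query = text.strip()
    if not query:
        return "empty"
    if re.search(r"\s", query):
        return "phrase"  # anything with internal whitespace is a phrase search
    letters = set(query.upper())
    if len(query) >= MIN_SEQUENCE_LENGTH:
        if letters <= DNA_LETTERS:
            return "dna_sequence"
        if letters <= PROTEIN_LETTERS:
            return "protein_sequence"
    if IDENTIFIER.match(query):
        return "identifier"
    return "term"


if __name__ == "__main__":
    for example in ["TLR4", "helicase", "ortholog genes of gene A", "ACGT" * 6]:
        print(example, "->", classify_query(example))
```

The checks are ordered from most to least specific, so a phrase is never tested as a sequence and a plain word falls through to the technical-term case.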
Prospected queries
From TREC Genomics Track
TREC (Text REtrieval Conference) held a competition for information extraction (IE) and text mining (TM) researchers and raised the following 14 challenging questions, for which answering passages are to be retrieved automatically. Some of these may also be considered targets for the Semantic Web.
- <T1>What [ANTIBODIES] have been used to detect protein TLR4?
- <T2>What [BIOLOGICAL SUBSTANCES] have been used to measure toxicity in response to cytarabine?
- <T3>What [CELL OR TISSUE TYPES] express members of the mammalian TIM gene family?
- <T4>What [DISEASES] are associated with lysosomal abnormalities in the nervous system?
- <T5>What [DRUGS] have been tested in mouse models of Alzheimer's disease?
- <T6>What centrosomal [GENES] are implicated in diseases of brain development?
- <T7>What [MOLECULAR FUNCTIONS] does helicase protein NS3 play in HCV (Hepatitis C virus)?
- <T8>What [MUTATIONS] in apolipoprotein genes are associated with disease?
- <T9>Which [PATHWAYS] are possibly involved in the disease ADPKD?
- <T10>What [PROTEINS] does epsin1 interact with during endocytosis?
- <T11>What Streptococcus pneumoniae [STRAINS] are resistant to penicillin and erythromycin?
- <T12>What [SIGNS OR SYMPTOMS] of anxiety disorder are related to lipid levels?
- <T13>What [TOXICITIES] are associated with cytarabine?
- <T14>What [TUMOR TYPES] are associated with Rb1 mutations?
Is there any way of translating a natural language query into a SPARQL one, or do we need to develop one?
Some "buffer" layer that bridges the gap between a highly structured language such as SPARQL and an unstructured one (i.e., natural language) might be needed.
Swoogle may be useful. Also, information extraction (IE) technology is crucial for identifying key concepts in the literature.
Natural language processing (NLP) tools used for IE in the life sciences, such as MEDIE or Enju, can be used.
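One very small piece of such a buffer layer could be a template step that reads the bracketed answer type out of a TREC-style question and emits a SPARQL skeleton for it. The sketch below is only an illustration: the `ANSWER_CLASS` mapping, the `ex:` prefix, and the `http://example.org/` namespace are hypothetical, and the remaining constraints of each question would still have to come from NLP or IE tools.

```python
import re

# Optional topic tag such as <T1>, followed by the question text.
TOPIC = re.compile(r"^(?:<(?P<id>T\d+)>)?\s*(?P<text>.*)$")
# The bracketed answer type, e.g. [ANTIBODIES] or [CELL OR TISSUE TYPES].
SLOT = re.compile(r"\[([A-Z ]+)\]")

# Hypothetical mapping from answer types to class names in an assumed vocabulary.
ANSWER_CLASS = {
    "ANTIBODIES": "ex:Antibody",
    "DRUGS": "ex:Drug",
    "PROTEINS": "ex:Protein",
}


def parse_question(line):
    """Split a TREC-style question into topic id, answer type, and text."""
    text = line.strip().lstrip("- ")
    match = TOPIC.match(text)
    slot = SLOT.search(match.group("text"))
    if slot is None:
        raise ValueError("no [ANSWER TYPE] slot found in: " + line)
    return {
        "id": match.group("id"),
        "answer_type": slot.group(1),
        "text": match.group("text"),
    }


def sparql_skeleton(parsed):
    """Return a SPARQL skeleton for a parsed question, or None if the type is unmapped."""
    cls = ANSWER_CLASS.get(parsed["answer_type"])
    if cls is None:
        return None
    return (
        "PREFIX ex: <http://example.org/>\n"
        "SELECT DISTINCT ?answer WHERE {\n"
        f"  ?answer a {cls} .\n"
        "  # remaining constraints must be filled in from the question text\n"
        "}"
    )


if __name__ == "__main__":
    question = "<T1>What [ANTIBODIES] have been used to detect protein TLR4?"
    print(sparql_skeleton(parse_question(question)))
```

Only the answer type is handled here; turning the rest of the sentence ("used to detect protein TLR4") into triple patterns is exactly the open problem raised above.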
- Resource Description Framework (RDF)
- SPARQL Query Language for RDF
- RDF Data Access Use Cases and Requirements
- Swoogle Search
- T1: ?something pred:Detect ?uniprotid . ?something pred:isA ProteinType:antibody . ?uniprotid pred:hasName ProteinName:TLR4 . ?uniprotid pred:isA idType:uniprot .
- T2: ?something pred:Measure degreeOf:ToxicityOfCytarabine . ?something pred:isA Entity:BiologicalSubstance .
- T3: ?somegene pred:expressIn ?cell_or_tissue . ?cell_or_tissue pred:isA Entity:Cell or Tissue . ?somegene pred:isA familyOf:mammalian TIM gene .
- T4: ?disease pred:associate medical_genetics:lysosomal abnormalities . ?disease pred:isA umls:DiseaseorSyndrome . medical_genetics:lysosomal abnormalities pred:occurredIn Tissue:nervous system .
- T5: ?drug pred:tested DiseaseModel:mouse . ?drug pred:target disease:Alzheimer's disease . ?drug pred:isA Entity:Drug .
- T6: ?gene pred:isA Entity:Gene . ?gene pred:isA CellularLocalization:centrosome . ?gene pred:associate/implicate ?disease . ?disease pred:isA umls:DiseaseorSyndrome . ?disease pred:hasRelation go:brain development .
- T7: ?something pred:isA go:molecular function . ?uniprotid pred:hasName ProteinName:NS3 . ?uniprotid pred:isA idType:uniprot . ?uniprotid pred:isA go:helicase activity . ?uniprotid pred:play ?something . ?something pred:occurredIn Organism:Hepatitis C virus .
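To make the intent of these draft patterns concrete, the following Python sketch evaluates the pattern for the TLR4 antibody question (T1) against a handful of toy triples, in the way a SPARQL engine matches a basic graph pattern. The `ex:` resources and the data are invented for illustration only, and the matcher is a naive backtracking search, not how a real triple store is implemented.

```python
def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # A variable must keep the value it was bound to earlier.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


# Toy data: every ex: resource below is made up for this example.
TRIPLES = [
    ("ex:antibody1", "pred:Detect", "ex:protein1"),
    ("ex:antibody1", "pred:isA", "ProteinType:antibody"),
    ("ex:protein1", "pred:hasName", "ProteinName:TLR4"),
    ("ex:protein1", "pred:isA", "idType:uniprot"),
    ("ex:reagent2", "pred:Detect", "ex:protein1"),
    ("ex:reagent2", "pred:isA", "ProteinType:enzyme"),  # not an antibody
]

# The draft pattern for T1, written as (subject, predicate, object) tuples.
T1 = [
    ("?something", "pred:Detect", "?uniprotid"),
    ("?something", "pred:isA", "ProteinType:antibody"),
    ("?uniprotid", "pred:hasName", "ProteinName:TLR4"),
    ("?uniprotid", "pred:isA", "idType:uniprot"),
]

answers = sorted({b["?something"] for b in match(T1, TRIPLES)})

if __name__ == "__main__":
    print(answers)
```

In this toy data only `ex:antibody1` satisfies all four patterns, because `ex:reagent2` fails the antibody type constraint.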
From FlyMine Templates
InterMine/FlyMine provides a curated set of typical query templates. As these are checked by biologists, most of them seem to be meaningful compared with computationally generated queries based on combinations of available resources.
- Disease [Human] --> Genes + D. melanogaster homologues (filter on BLAST E-Value).
- For a specified human disease, show all associated human genes and their Drosophila homologues as identified with BLAST. Results are filtered by a specified E-Value; the lower the E-Value, the better the match. (Data Source: Homophila).
- EST clone [A. gambiae] --> Chromosomal location + Gene + D. melanogaster orthologue + Pathway.
- For a particular A. gambiae EST clone, show its chromosomal location, the corresponding A. gambiae gene, the D. melanogaster orthologue and the pathway. (Data Source: VectorBase, InParanoid, KEGG, Reactome, FlyReactome).
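The logic of the first template (disease to genes to homologues, filtered on E-Value) can be sketched in a few lines of Python. This is not the InterMine API: the `Homologue` record, the function name, and all gene and disease names below are placeholders invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Homologue:
    """One disease-gene-homologue row; the field names are assumptions."""
    disease: str
    human_gene: str
    fly_gene: str
    e_value: float


def homologues_for_disease(records, disease, max_e_value):
    """Return rows for one disease with E-Value <= max_e_value, best match first."""
    hits = [r for r in records if r.disease == disease and r.e_value <= max_e_value]
    return sorted(hits, key=lambda r: r.e_value)


# Placeholder rows, not real Homophila data.
RECORDS = [
    Homologue("disease_X", "HGENE1", "FGENE1", 1e-50),
    Homologue("disease_X", "HGENE2", "FGENE2", 1e-3),
    Homologue("disease_Y", "HGENE3", "FGENE3", 1e-80),
]

if __name__ == "__main__":
    for row in homologues_for_disease(RECORDS, "disease_X", 1e-10):
        print(row.human_gene, row.fly_gene, row.e_value)
```

Because a lower E-Value means a better match, the threshold is an upper bound and the results are sorted in ascending order.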
From participants
Please add your own use cases here:
- Q: I have a set of predicted genes from a newly sequenced genome of a Theileria species. What is a unique characteristic of this set from the view point of the biological pathway? Is it natural that this organism lacks thiolase in the mevalonate pathway? (Toshiaki Katayama)
- To resolve this sample question, we need a taxonomically classified list of biological modules in the pathway, filled with a phylogenetic profile of orthologous genes among related species. Then we can confirm the absence with protein2genome mapping, the predicted subcellular localization of the enzyme, and knowledge of reactions in the organelle (apicoplast) from related journal articles and reviews. Most of the components are already in databases, but none of them are semantically linked with each other.
- Q: What is the largest gene expression regulation network ever known coming with expression level data? (Itoshi Nikaido)
- What is actually wanted is a "graph structure" and "feature labels" for each node.
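For the regulation network question, the requested result (a graph structure plus feature labels per node) could look like the following Python sketch. It assumes the network is available as a list of regulator-target pairs and that "largest" means the connected component with the most genes; both are assumptions, and the gene names and expression values are toy placeholders.

```python
from collections import defaultdict, deque


def largest_component(edges):
    """Return the node set of the largest connected component.

    Regulation edges are directed, but direction is ignored here because
    only the size of the connected network is being compared.
    """
    adjacency = defaultdict(set)
    for regulator, target in edges:
        adjacency[regulator].add(target)
        adjacency[target].add(regulator)
    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy network and expression levels, invented for this example.
EDGES = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
EXPRESSION = {"g1": 2.5, "g2": 0.4, "g3": 1.1, "g4": 3.0, "g5": 0.9}

component = largest_component(EDGES)
# Feature labels for each node of the selected network.
labels = {gene: EXPRESSION[gene] for gene in sorted(component)}

if __name__ == "__main__":
    print(labels)
```

The answer is then the pair of the edge list restricted to `component` (the graph structure) and `labels` (the feature labels).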
Attachments
- TREC_QA_Sample_enju.txt (10.2 KB), added by yy 15 years ago: a parsed result of the TREC questions using Enju.
- TREC_QA_Sample_metamap.txt (34.1 KB), added by yy 15 years ago: a parsed result of the TREC questions using MetaMap.