Version 18 (modified by yy, 7 years ago)

--

Queries

Targeted queries to be resolved by Semantic Web technologies.

One of our goals is to implement an advanced search engine, something like a Wolfram Alpha for biology. On this page, we gather meaningful and challenging queries for Semantic Web technology and use them to define some of the goals of the BioHackathon 2010.

Expected query types

Queries may range from a simple Google-like keyword or phrase search to ones that take biological objects as input. This section lists prospective query types in no particular order.

  • Gene ID
  • Technical terms
  • Phrase search (e.g. "ortholog genes of gene A")
  • DNA or protein sequence
  • :
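As a rough illustration of how such inputs might be told apart before being routed to a search backend, the following Python sketch classifies a raw query string. Everything in it is an assumption made for illustration: the function name `classify_query`, the 20-character minimum for sequences, and the shape of a gene ID are hypothetical and not taken from any particular database.

```python
import re

# Assumed alphabets; real sequence detection would need to be more careful
# (ambiguity codes, FASTA headers, lowercase masking, and so on).
DNA_CHARS = set("ACGTN")
PROTEIN_CHARS = set("ACDEFGHIKLMNPQRSTVWY")

# Hypothetical gene ID shape: digits only, or a short letter prefix plus digits.
GENE_ID_RE = re.compile(r"^(?:\d+|[A-Za-z]{2,6}[_:]?\d{3,})$")


def classify_query(text: str, min_seq_len: int = 20) -> str:
    """Return one of 'dna', 'protein', 'gene_id', 'phrase', 'term'."""
    query = text.strip()
    # Sequences are often pasted with line breaks, so ignore whitespace here.
    compact = "".join(query.split()).upper()
    if len(compact) >= min_seq_len and compact.isalpha():
        if set(compact) <= DNA_CHARS:
            return "dna"
        if set(compact) <= PROTEIN_CHARS:
            return "protein"
    if GENE_ID_RE.match(query):
        return "gene_id"
    if len(query.split()) > 1:
        return "phrase"
    return "term"
```

This heuristic is deliberately crude: a long phrase made only of amino acid letters would be mistaken for a protein sequence, so a real system would probably combine it with dictionary or database lookups.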

Prospective queries

From TREC Genomics Track

TREC (Text REtrieval Conference) held a competition for information extraction (IE) and text mining (TM) researchers and posed the following 14 challenging questions, for which answer passages are to be retrieved automatically. Some of these may also be considered targets for the Semantic Web.

  • <T1>What [ANTIBODIES] have been used to detect protein TLR4?
  • <T2>What [BIOLOGICAL SUBSTANCES] have been used to measure toxicity in response to cytarabine?
  • <T3>What [CELL OR TISSUE TYPES] express members of the mammalian TIM gene family?
  • <T4>What [DISEASES] are associated with lysosomal abnormalities in the nervous system?
  • <T5>What [DRUGS] have been tested in mouse models of Alzheimer's disease?
  • <T6>What centrosomal [GENES] are implicated in diseases of brain development?
  • <T7>What [MOLECULAR FUNCTIONS] does helicase protein NS3 play in HCV (Hepatitis C virus)?
  • <T8>What [MUTATIONS] in apolipoprotein genes are associated with disease?
  • <T9>Which [PATHWAYS] are possibly involved in the disease ADPKD?
  • <T10>What [PROTEINS] does epsin1 interact with during endocytosis?
  • <T11>What Streptococcus pneumoniae [STRAINS] are resistant to penicillin and erythromycin?
  • <T12>What [SIGNS OR SYMPTOMS] of anxiety disorder are related to lipid levels?
  • <T13>What [TOXICITIES] are associated with cytarabine?
  • <T14>What [TUMOR TYPES] are associated with Rb1 mutations?

Is there any way of translating a natural language query into a SPARQL one, or do we need to develop one?
Some "buffer" layer that bridges the gap between a highly structured language such as SPARQL and an unstructured one (i.e., natural language) might be needed. Swoogle may be useful. Also, information extraction (IE) technology is crucial for identifying key concepts in the literature. Natural language processing (NLP) technology used for IE in the life sciences, such as MEDIE, can be used.

# T1
?something pred:Detect ProteinName:TLR4 .
?something pred:isA ProteinType:antibody .
# T2
?something pred:Measure degreeOf:ToxicityOfCytarabine .
?something pred:isA Entity:BiologicalSubstance .
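
One very simple form of such a "buffer" layer is a set of hand-written templates keyed on the bracketed answer type of the TREC questions. The Python sketch below is only an illustration under stated assumptions: the class mapping, the two regular-expression templates, and the function names are hypothetical, and the `pred:`, `ProteinName:`, `ProteinType:`, `degreeOf:` and `Entity:` prefixes are the placeholder vocabulary used in the triple patterns above rather than a real ontology. Prefix declarations are omitted because no namespaces are defined on this page, so the generated string is not yet an executable query.

```python
import re

# Hypothetical mapping from TREC answer-type tags to placeholder classes.
ANSWER_TYPE_CLASS = {
    "ANTIBODIES": "ProteinType:antibody",
    "BIOLOGICAL SUBSTANCES": "Entity:BiologicalSubstance",
}

# Hand-written templates: a regex over the rest of the question, plus a
# function that turns the captured term into one triple pattern.
TEMPLATES = [
    (re.compile(r"have been used to detect protein\s+(\S+)\?"),
     lambda x: f"?something pred:Detect ProteinName:{x} ."),
    (re.compile(r"have been used to measure toxicity in response to\s+(\S+)\?"),
     lambda x: f"?something pred:Measure degreeOf:ToxicityOf{x.capitalize()} ."),
]

ANSWER_TYPE_RE = re.compile(r"\[([A-Z ]+)\]")


def question_to_patterns(question: str) -> list:
    """Translate a templated question into a list of triple patterns."""
    tag = ANSWER_TYPE_RE.search(question)
    if tag is None or tag.group(1) not in ANSWER_TYPE_CLASS:
        raise ValueError("unknown or missing answer type")
    rest = question[tag.end():]
    for regex, make_pattern in TEMPLATES:
        hit = regex.search(rest)
        if hit:
            type_pattern = f"?something pred:isA {ANSWER_TYPE_CLASS[tag.group(1)]} ."
            return [make_pattern(hit.group(1)), type_pattern]
    raise ValueError("no template matches this question")


def to_sparql(patterns: list) -> str:
    """Wrap triple patterns in a SELECT query (prefix declarations omitted)."""
    body = "\n".join("  " + p for p in patterns)
    return "SELECT ?something WHERE {\n" + body + "\n}"
```

Calling `to_sparql(question_to_patterns("What [ANTIBODIES] have been used to detect protein TLR4?"))` yields a SELECT query containing the two T1 patterns. Template matching of this kind does not generalize to free-form questions, which is where IE and NLP tools would come in.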

From FlyMine Templates

InterMine/FlyMine provides a curated set of typical query templates. As these are checked by biologists, most of them seem to be more meaningful than computationally generated queries based on combinations of available resources.

  • Disease [Human] --> Genes + D. melanogaster homologues (filter on BLAST E-Value).
    • For a specified human disease, show all associated human genes and their Drosophila homologues as identified with BLAST. Results are filtered by a specified E-Value; the lower the E-Value, the better the match. (Data Source: Homophila).
  • EST clone [A. gambiae] --> Chromosomal location + Gene + D. melanogaster orthologue + Pathway.
    • For a particular A. gambiae EST clone, show its chromosomal location, the corresponding A. gambiae gene, the D. melanogaster orthologue and the pathway. (Data Source: VectorBase, InParanoid, KEGG, Reactome, FlyReactome).
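
To make the shape of such a template concrete, the first one (disease to genes to homologues, filtered on E-Value) can be sketched in Python as a filter over in-memory records. This is not how InterMine implements templates; the `HomologyHit` record, the function name, and all identifiers and E-Values in `HITS` are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HomologyHit:
    """One disease / human gene / fly homologue association (hypothetical)."""
    disease: str
    human_gene: str
    fly_gene: str
    e_value: float


def disease_to_fly_homologues(hits, disease, max_e_value):
    """Return hits for one disease with E-Value <= max_e_value, best first."""
    selected = [h for h in hits
                if h.disease == disease and h.e_value <= max_e_value]
    # A lower E-Value means a better match, so sort in ascending order.
    return sorted(selected, key=lambda h: h.e_value)


# Placeholder data, for illustration only.
HITS = [
    HomologyHit("disease_X", "HUMAN_GENE_1", "FLY_GENE_1", 1e-50),
    HomologyHit("disease_X", "HUMAN_GENE_1", "FLY_GENE_2", 1e-5),
    HomologyHit("disease_X", "HUMAN_GENE_2", "FLY_GENE_3", 1e-20),
    HomologyHit("disease_Y", "HUMAN_GENE_3", "FLY_GENE_4", 1e-80),
]

best = disease_to_fly_homologues(HITS, "disease_X", max_e_value=1e-10)
```

With the placeholder data, `best` contains the FLY_GENE_1 and FLY_GENE_3 hits in that order. In a Semantic Web setting, the same template would correspond to a graph pattern with a filter on the E-Value.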

From participants

Please add your own use cases here:

  • Q: I have a set of predicted genes from a newly sequenced genome of a Theileria species. What is a unique characteristic of this set from the viewpoint of biological pathways? Is it natural that this organism lacks thiolase in the mevalonate pathway? (Toshiaki Katayama)
    • To resolve this sample question, we need a taxonomically classified list of biological modules in the pathway, filled with a phylogenetic profile of orthologous genes among related species. Then we can confirm the absence with protein2genome mapping, the predicted subcellular localization of the enzyme, and knowledge of the reaction in the organelle (apicoplast) from related journal articles and reviews. Most of the components are already in databases, but none of them are semantically linked with each other.
  • Q: What is the largest known gene expression regulatory network that comes with expression level data? (Itoshi Nikaido)
    • What is actually wanted is a "graph structure" with "feature labels" for each node.
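
The "graph structure" with "feature labels" in the last use case can be pictured with a small Python sketch. It assumes that "largest" means the largest connected component, treats regulation edges as undirected for that purpose, and uses expression level as the only node label. The gene names and values are placeholders, not real data.

```python
from collections import defaultdict, deque


def build_network(edges):
    """Build an undirected adjacency map from (regulator, target) pairs."""
    graph = defaultdict(set)
    for regulator, target in edges:
        graph[regulator].add(target)
        graph[target].add(regulator)
    return dict(graph)


def largest_component(graph):
    """Return the node set of the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def labelled_nodes(component, expression):
    """Attach the expression-level label to each node (None if unknown)."""
    return {node: expression.get(node) for node in component}


# Placeholder regulation edges and expression levels.
edges = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
expression = {"g1": 2.5, "g2": 0.1, "g3": 7.0, "g4": 1.0}

network = build_network(edges)
core = largest_component(network)
labels = labelled_nodes(core, expression)
```

Here `core` holds the three connected genes g1, g2 and g3, and `labels` maps each of them to its expression level. Answering the question over real data would require the regulation edges and the expression values to be linked by shared identifiers.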
