Changes between Version 7 and Version 8 of TextMiningReport

2010/03/07 02:06:43 (10 years ago)



  • TextMiningReport

    v7 v8  
     24== For the paper == 
     26Modern biology increasingly depends on computational tools to process, analyze, interpret, and integrate large collections of heterogeneous data. As described above, biological knowledge is now often represented as a collection of triples (entity-relationship-entity) that can be queried and mined to discover and create knowledge. 
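To make the entity-relationship-entity idea concrete, here is a minimal sketch in Python of a triple collection and a wildcard pattern query. The facts, the identifiers (`geneA`, `proteinA`, and so on) and the helper names `match` and `objects_of` are hypothetical and purely illustrative; a real system would more likely use an RDF store queried with SPARQL.

```python
from typing import Iterable, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical example facts; the identifiers are illustrative only.
TRIPLES: Set[Triple] = {
    ("geneA", "encodes", "proteinA"),
    ("proteinA", "interacts_with", "proteinB"),
    ("proteinB", "involved_in", "pathwayX"),
}


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for subj, pred, obj in triples:
        if (s is None or subj == s) and (p is None or pred == p) and (o is None or obj == o):
            yield (subj, pred, obj)


def objects_of(triples: Iterable[Triple], subject: str, predicate: str) -> List[str]:
    """Return the sorted objects related to `subject` by `predicate`."""
    return sorted(obj for _, _, obj in match(triples, s=subject, p=predicate))


if __name__ == "__main__":
    # Which entities does proteinA interact with?
    print(objects_of(TRIPLES, "proteinA", "interacts_with"))
```

Leaving any position of the pattern unset is what makes such a collection minable: the same function answers "what does X relate to?" and "what relates to Y?".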
     28However, biological databases alone cannot capture the richness of scientific information and argumentation contained in the literature, nor can they support the novel ways in which scientists will interrogate them.  
     29A considerable fraction of the existing data in biology consists of natural language texts that describe and communicate new discoveries, so scientific papers are a resource of crucial importance for the life sciences. As the volume of scholarly communication grows, it becomes increasingly difficult to find, connect and curate specific core scientific statements. 
     31The biomedical text mining community has long worked on developing reliable information extraction applications. Both named entity recognition and conceptual analysis are needed to map natural language texts to a formal representation of the objects and concepts they describe, with direct links to online resources that explicitly expose the semantics of those concepts. 
     33Various web tools allow researchers to search literature databases and integrate semantic information extracted from text with external databases and ontologies. Our work concentrated on Whatizit, Reflect and Medie. 
     35Semantic Web technologies provide a platform for extracting statements and facts from the existing literature and sharing them in a way that allows computational agents to discover, aggregate and interpret them. The advantages of Semantic Web approaches are clear; ideally, each concept in a statement, and the statement itself, will have a unique identity that connects every instance of the statement across the web of published material. 
     37During the Text Mining sessions of the BioHackathon reported here, we aimed to investigate how to automatically annotate atomic components of research papers in the life sciences and how to express those annotations in RDF. 
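As an illustration of what "expressing an annotation in RDF" can look like, the sketch below serializes one text annotation as N-Triples. The vocabulary namespace `http://example.org/vocab#`, the property names (`annotates`, `surfaceText`, `refersTo`) and all URIs are assumptions made for this example; they are not the vocabulary used during the BioHackathon sessions.

```python
def nt_literal(text: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def annotation_to_ntriples(
    annotation_uri: str,
    document_uri: str,
    surface_text: str,
    concept_uri: str,
    vocab: str = "http://example.org/vocab#",  # hypothetical vocabulary
) -> str:
    """Describe one annotation (document, matched text, concept) as three triples."""
    lines = [
        f"<{annotation_uri}> <{vocab}annotates> <{document_uri}> .",
        f"<{annotation_uri}> <{vocab}surfaceText> {nt_literal(surface_text)} .",
        f"<{annotation_uri}> <{vocab}refersTo> <{concept_uri}> .",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # All URIs below are placeholders, not real resources.
    print(
        annotation_to_ntriples(
            "http://example.org/annotation/1",
            "http://example.org/paper/1",
            "protein kinase",
            "http://example.org/concept/protein-kinase",
        )
    )
```

Linking the annotation to a concept URI, rather than storing only the matched string, is what lets the same concept be recognized across different papers.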
     39Beyond the SPARQL and Linked Data approaches, we investigated how to embed annotations in the output of existing services, and found RDFa to be a suitable technology for this. RDFa (Resource Description Framework in attributes) is a W3C Recommendation that adds a set of attribute-level extensions to XHTML for embedding rich metadata within Web documents. Its mapping to the RDF data model allows RDF triples to be embedded within XHTML documents and extracted again by compliant user agents. 
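The following sketch shows the embed-and-extract idea on a small XHTML fragment, using only the Python standard library. It is not a compliant RDFa processor: it assumes well-formed XHTML, handles only the `about`, `property` and `content` attributes, does not expand prefixes such as `dc:` into full URIs, and does not support nested `property` elements. The class name `SimpleRDFaExtractor`, the function `extract_triples` and the sample document are all hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# A hypothetical annotated fragment; the URI and values are placeholders.
SAMPLE = """
<div xmlns:dc="http://purl.org/dc/elements/1.1/" about="http://example.org/paper/1">
  <span property="dc:title">A hypothetical paper</span>
  <meta property="dc:creator" content="A. Author" />
</div>
"""


class SimpleRDFaExtractor(HTMLParser):
    """Collect (subject, property, literal) triples from a tiny RDFa subset."""

    def __init__(self) -> None:
        super().__init__()
        self.triples: List[Tuple[str, str, str]] = []
        self._subjects: List[str] = []  # one entry per currently open element
        self._pending = None  # (subject, property, text parts, element depth)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        inherited = self._subjects[-1] if self._subjects else ""
        # `about` sets a new subject; otherwise inherit from the parent element.
        subject = attributes.get("about", inherited)
        self._subjects.append(subject)
        if "property" in attributes:
            if "content" in attributes:
                # The literal value is given explicitly in the attribute.
                self.triples.append((subject, attributes["property"], attributes["content"]))
            else:
                # The literal value is the element's text; collect it until the end tag.
                self._pending = (subject, attributes["property"], [], len(self._subjects))

    def handle_data(self, data):
        if self._pending is not None:
            self._pending[2].append(data)

    def handle_endtag(self, tag):
        if self._pending is not None and self._pending[3] == len(self._subjects):
            subject, prop, parts, _ = self._pending
            self.triples.append((subject, prop, "".join(parts).strip()))
            self._pending = None
        if self._subjects:
            self._subjects.pop()


def extract_triples(xhtml: str) -> List[Tuple[str, str, str]]:
    parser = SimpleRDFaExtractor()
    parser.feed(xhtml)
    parser.close()
    return parser.triples


if __name__ == "__main__":
    for triple in extract_triples(SAMPLE):
        print(triple)
```

Because the metadata lives in attributes of the ordinary markup, the same document remains readable in a browser while a user agent can recover the triples from it.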
     41RDFa-enabled output can be seen in the Science Commons text annotation service ( and (