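Informationsbeschaffung aus digitalen Textressourcen - Domänenadaptive Verfahren zur Strukturierung heterogener Textdokumente [Information Retrieval from Digital Text Resources: Domain-Adaptive Methods for Structuring Heterogeneous Text Documents]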
Key: Sch16-1
Author: Sebastian Schmidt
Date: January 2016
Kind: @phdthesis
Abstract: In today's information society, users are increasingly confronted with the so-called information overload problem. They are often overwhelmed by the huge amount of textual resources, most of them available digitally, when trying to identify information relevant to their needs. So far, users are mostly limited to full-text search because more elaborate tools that would let them specify different aspects of their information need are lacking. Elaborate search tools that allow a precise definition of information needs exist only in specific domains. One of the main reasons is that the mostly unstructured nature of digital textual resources does not allow access to specific information within the documents, which would be required to realize such tools. A structured representation of the documents, in which the meaning of individual text fragments for the entities described in the documents is known, would allow this access. The goal of this thesis is to investigate approaches that automatically transform documents into structured representations. Existing approaches with similar aims are often tailored to specific applications and thus cannot easily be applied to other applications or domains. Deploying them in new domains currently requires a redesign of the approaches or significant manual effort for their adaptation. Based on this observation, this thesis aims to develop domain-adaptive approaches for structuring textual documents. A major challenge in designing appropriate methods is the heterogeneity of application domains, in particular with regard to document formats, text lengths, and domain-specific terminology. A study of five selected heterogeneous domains revealed common types of information across domains. Based on this finding, different methods were designed to identify information in heterogeneous documents for three of these types.
As a design requirement, deploying the methods to a new domain should require only little manual effort. This requirement enables good domain adaptation of the methods. To reduce the manual effort needed, techniques from the field of machine learning, such as Active Learning, were applied. Furthermore, freely available and domain-independent knowledge bases were integrated. The approaches were implemented and evaluated using data sets from the observed domains. Results showed that information of individual types can be identified while still maintaining good domain adaptivity. Finally, a concept was presented that combines methods for the identification of information with the goal of structuring entire documents. An implementation and evaluation of this concept revealed that structuring can be achieved through a combination of different methods, where each method identifies only a single type of information. The domain-adaptive methods presented in this dissertation enable the creation of structured representations from unstructured digital textual resources. This simplifies the realization of various tools for information retrieval. The resulting possibilities for developing new information retrieval tools reduce the overload problem users experience when trying to identify relevant information.