Posted on 2023-01-18, 18:15. Authored by Hong Quang Nguyen.
Submission note: A thesis submitted in total fulfilment of the requirements for the degree of Doctor of Philosophy to the School of Engineering and Mathematical Sciences, Faculty of Science, Technology and Engineering, La Trobe University, Bundoora.
Schema integration aims to create a mediated schema as a uniform interface to a set of data sources, and to create schema mappings between this mediated schema and the source schemas. The mediated schema and schema mappings are critical for interoperability in numerous data-sharing applications such as enterprise integration, commerce and bioinformatics. Extensible Markup Language (XML) has been widely used for representing and sharing data due to its promising interoperability support. Unfortunately, XML sources give rise to the problem of schema heterogeneity, in which the same concepts and relations can be represented in many different ways, with different choices of elements and structures. Such heterogeneity leads to substantial semantic and structural conflicts. To resolve these conflicts and reduce user effort, systematic approaches are required to create the mediated schema and mappings. In this thesis, we propose a systematic approach for integrating schemas of multiple heterogeneous XML sources. Our key contributions are fivefold. First, we propose a schema integration framework that combines three traditionally separate tasks: matching, clustering and merging. Second, we integrate the salient semantic aspects of the domain by concept clustering, resolving semantic conflicts to create a set of concepts. Third, we further integrate the structural aspects of the domain by relation clustering, resolving structural conflicts to create relations between the concepts. Besides the commonly used similarity measurement, the conflict resolution incorporates a new type of relevance measurement that estimates how relevant a concept or a relation is to the domain. Fourth, both the concepts and relations are merged into a unique double-layered mediated schema that aims to achieve a high degree of completeness and minimality: retaining all of the concepts and relations without redundancy.
Finally, we validate the applicability of the integration results in query answering over multiple sources. Our experimental results on both real and synthetic datasets show an improvement compared with PORSCHE, COMA++ and RONDO with respect to (i) the mediated-schema quality, based on precision, recall, F-measure, and schema minimality; and (ii) the execution performance, based on execution time and scale-up performance.
Center or Department
Faculty of Science, Technology and Engineering. School of Engineering and Mathematical Sciences.
Thesis type
Ph. D.
Awarding institution
La Trobe University
Year Awarded
2011
Rights Statement
This thesis contains third party copyright material which has been reproduced here with permission. Any further use requires permission of the copyright owner. The thesis author retains all proprietary rights (such as copyright and patent rights) over all other content of this thesis, and has granted La Trobe University permission to reproduce and communicate this version of the thesis. The author has declared that any third party copyright material contained within the thesis made available here is reproduced and communicated with permission. If you believe that any material has been made available without permission of the copyright owner please contact us with the details.