Knowledge Management and XML: Derivation of Synthetic Views over Semistructured Data

Cannataro, M.; Guzzo, A.; Pugliese, A.

doi:10.1145/568235.568242

One effect of the expansion of the World Wide Web is the production of a huge amount of data, differentiated by type and available to a large number of different users. Furthermore, the constant progress of computer hardware technology over the past three decades has led to the availability of powerful computers, data collection equipment, and storage media; this technology gives a great boost to the database and information industry by enabling transaction management, information retrieval, and data analysis over massive amounts of heterogeneous data. Moreover, the explosion of the Internet increases the availability of data in different formats: structured (e.g. relational), semistructured (e.g. HTML, XML) and unstructured (e.g. plain text, audio/video) data [2]. Thus, new data management systems, able to take advantage of these heterogeneous data, are emerging and will play a vital role in the information industry.

Knowledge Management is concerned with the technological, economic and organizational aspects related to (i) the creation, distribution, diversification and sharing of knowledge in complex organizations and (ii) the management of information flows, processes and interactions with external knowledge [8].

Figure 1 summarizes the steps (each represented as a different level of the pyramid) through which knowledge is typically extracted from basic data. The first three levels concern the management of explicit knowledge (i.e. codified, structured or semistructured, and completely available). In particular, starting from the bottom, the first level is concerned with storing and exchanging "factual" knowledge, essentially corresponding to basic data. Technologies used here comprise Databases [17], Data Repositories, Archive Sharing tools and the emerging Extensible Markup Language (XML) [18]. The second level regards "conceptual knowledge" modeling, i.e. the definition of concepts and of the relationships among them. Such knowledge is typically represented by means of diagram-based formalisms for both information and related processes [9]. The Unified Modeling Language (UML) is currently one of the most promising modeling languages, oriented towards the specification, implementation and documentation of complex software systems, but also used for modeling company processes not strictly related to software. The third level is concerned with the organization and integration of information represented according to heterogeneous formalisms. Techniques used here are essentially those of Data Warehousing (DW) [10]. Data warehouses are integrated repositories of data extracted from multiple heterogeneous sources, organized under a unified schema and at a single site, in order to facilitate management and decision making. Data Warehousing technologies include data cleaning, data integration, and Online Analytical Processing (OLAP), i.e. analysis techniques based on aggregation and summarization. The highest level regards Knowledge Discovery, i.e. the uncovering of new, implicit and potentially useful knowledge from large amounts of data. The core phase of knowledge discovery is Data Mining [10], an interactive, iterative, multi-step process, comprising in particular pattern searching and possible refinements based on domain experts' knowledge.

In the context of explicit knowledge management, the Extensible Markup Language naturally finds its place. XML is a language for semistructured data [1, 5], defined by the World Wide Web Consortium (W3C) [13], which is designed to allow marking, transferring and reusing information by means of a standard method for defining document structure and format.
Its metalanguage features have been used in knowledge management typically for (i) the semi-automatic production of documents, (ii) the reuse of semistructured information and its integration in heterogeneous systems, and (iii) the creation of knowledge maps for the organization and sharing of information.

The increasing quantity of available semistructured data, and the use of XML for their description and exchange, opens new research themes related to the management of XML data and to knowledge extraction from them. In this scenario, our proposal consists of a system for the synthesization of XML documents that attempts to extract their semantics and to derive synthetic versions of them by means of a multidimensional interpretation [10]. In the context of Knowledge Management, data synthesization can be regarded as a new way of extracting knowledge, by discovering and aggregating (useful) core information and by neglecting (useless) details.
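
To make the idea of aggregating core information while neglecting details more concrete, the following minimal Python sketch rolls up a small XML document along one dimension, in the spirit of an OLAP aggregation. It is only an illustration under stated assumptions and not the system proposed here: the sample document, the element and attribute names (`sales`, `sale`, `region`, `product`, `amount`, `group`, `synthesized-by`) and the function `synthesize` are all hypothetical, and summation is just one possible aggregation operator.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical input document: all names and values are illustrative only.
SOURCE = """
<sales>
  <sale region="North" product="pen"><amount>10</amount></sale>
  <sale region="North" product="ink"><amount>5</amount></sale>
  <sale region="South" product="pen"><amount>7</amount></sale>
</sales>
"""


def synthesize(xml_text, dimension, measure):
    """Return a synthetic XML view: `measure` summed per value of `dimension`.

    `dimension` is read from an attribute of each record and `measure` from a
    child element; every other detail of the record is dropped.
    """
    root = ET.fromstring(xml_text)
    totals = defaultdict(float)
    for record in root:
        key = record.get(dimension)
        value = record.findtext(measure)
        if key is None or value is None:
            # Semistructured data: a record may lack the dimension or measure.
            continue
        totals[key] += float(value)

    # Build the synthetic document, one <group> element per dimension value.
    summary = ET.Element(root.tag, {"synthesized-by": dimension})
    for key in sorted(totals):
        group = ET.SubElement(summary, "group", {dimension: key})
        ET.SubElement(group, "total-" + measure).text = str(totals[key])
    return summary


if __name__ == "__main__":
    view = synthesize(SOURCE, "region", "amount")
    print(ET.tostring(view, encoding="unicode"))
```

Calling `synthesize(SOURCE, "product", "amount")` instead would produce a different synthetic view of the same document, which is what a multidimensional interpretation suggests: the same data can be summarized along whichever dimension is of interest.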