Introduction

The discipline of Web Data Integration (WDI) is an extension of Data Integration.

Data Integration is the process of aggregating data from different sources into a homogeneous view. It includes data access, transformation, mapping, quality assurance and fusion of data from multiple sources. In effect, it presents fragmented data as if it were a single database that can be queried.
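
As a rough illustration of the mapping, transformation and fusion steps, the following Python sketch combines two small record sets with different schemas into one view. Every name in it (the sample rows, the field mappings, and the functions map_record, normalise, fuse and integrate) is hypothetical and chosen only for this example. Real integration pipelines involve far more than this, particularly around data access and quality assurance.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]

# Hypothetical sample data: the same company described by two sources
# that use different field names and formats.
crm_rows: List[Record] = [
    {"CompanyName": "Acme Ltd", "Country": "UK", "Employees": "120"},
]
web_rows: List[Record] = [
    {"name": "acme ltd", "country_code": "UK", "website": "https://acme.example"},
]

# Mapping: source field name -> field name in the homogeneous view.
CRM_MAPPING = {"CompanyName": "name", "Country": "country", "Employees": "employees"}
WEB_MAPPING = {"name": "name", "country_code": "country", "website": "website"}


def map_record(row: Record, mapping: Dict[str, str]) -> Record:
    """Rename the source fields to the target schema, dropping unmapped fields."""
    return {target: row[source] for source, target in mapping.items() if source in row}


def normalise(record: Record) -> Record:
    """Transformation: bring values into one consistent format."""
    out = dict(record)
    out["name"] = out["name"].strip().title()
    if "employees" in out:
        out["employees"] = int(out["employees"])
    return out


def fuse(records: List[Record]) -> List[Record]:
    """Fusion: merge records that describe the same entity (same name and country)."""
    merged: Dict[Tuple[str, str], Record] = {}
    for rec in records:
        key = (rec["name"].lower(), rec["country"])
        merged.setdefault(key, {}).update(rec)
    return list(merged.values())


def integrate() -> List[Record]:
    """Produce the single homogeneous view over both sources."""
    mapped = [normalise(map_record(r, CRM_MAPPING)) for r in crm_rows]
    mapped += [normalise(map_record(r, WEB_MAPPING)) for r in web_rows]
    return fuse(mapped)
```

Calling integrate() yields one record for the company, carrying the employee count from the first source and the website from the second. The matching rule used here (case-insensitive name plus country) is a deliberate simplification; in practice, deciding that two records refer to the same entity is usually much harder.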

Web Data Integration extends and specialises Data Integration by treating the Web as a collection of views of databases that are accessible over Web protocols, including:

  • Open data catalogues

  • Government data catalogues

  • Web applications and sites

    • UI

    • API

  • The Semantic Web (SPARQL)

  • HTML Embedded Structured Data

  • HTML Data Tables

  • Spreadsheets

  • PDFs

  • Online encyclopedias

The technical challenges of Web Data Integration lie in homogenising disparate data whose form reflects the variety of its web sources, whereas Data Integration is primarily concerned with making heterogeneous databases homogeneous.

Quality and veracity of data are more important in WDI than in Data Integration: internal data is generally of higher quality and is more implicitly trusted than data collected from an external web source.
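
One simple way to act on this is to validate each web-sourced record before it is combined with trusted data. The sketch below is a minimal, assumed example: the function quality_issues, the required fields and the rule for the employees field are all hypothetical, and they stand in for whatever checks a given dataset actually needs.

```python
from typing import Any, Dict, List, Sequence


def quality_issues(
    record: Dict[str, Any],
    required: Sequence[str] = ("name", "country"),
) -> List[str]:
    """Return a list of problems found in a record; an empty list means it passed."""
    issues: List[str] = []

    # Required fields must be present and non-blank.
    for field in required:
        if not str(record.get(field) or "").strip():
            issues.append(f"missing {field}")

    # If an employee count is given, it must be a non-negative integer.
    employees = record.get("employees")
    if employees is not None and (not isinstance(employees, int) or employees < 0):
        issues.append("invalid employees")

    return issues
```

Records that return a non-empty list could be rejected, or set aside for review, rather than silently merged. Checks of this kind address only basic quality. They do not establish veracity, which typically requires comparing the data against independent sources.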

Web data may be combined with internal data as part of an organization’s Data Integration lifecycle.

By collecting and homogenising this data, organizations can use it to feed their next generation of data-driven business applications, analytics and AI platforms, with the aim of growing revenue and gaining competitive advantage.

Web Data Integration providers such as import.io aim to allow the enterprise to use and build on web data with the same high levels of trust and confidence that are associated with internal datasets.