Data Integration

One definition that follows for data integration is that:

Data integration is the process of consolidating data from a set of heterogenous data sources into a homogenous data set or view on the data.

It is, in effect, offering the fragmented data up as if it were a single database for query.

The scale of challenge

Data integration is a hard problem, it is "AI Complete", meaning that completely automated solutions are unlikely. The reasons for this are three-fold:

  1. System level: different platforms

  2. Logical: schema and data homogenity

  3. Social: locating data, getting access to data

As it is AI Complete, we must aim for 2 goals:

  1. Reduce the effort needed to set up an integration application.

  2. Enable the system to perform gracefully with uncertainty

Solving data integration

To break this out more, there are a number of properties that we want from the homogenous database:

  1. accurately and completly represent the content of all data sources;

  2. represent the entities using a single data model and schema;

  3. not contain duplicate records representing an entity (unless expected);

  4. resolve any conflicting data about entities.

In order to achieve this, data integration needs to be able to resolve the different types of heterogenity between different data sources.

Data Source Heterogenity

There are five different types of heterogenity that we need to solve for in the data:

  1. Data Access

  2. Data Encoding

  3. Data Model

  4. Data Structure

  5. Data Semantics

As we will see, the application of standards in each of these areas can reduce or eliminate the heterogenity in these areas.

Access Heterogenity

There are many different layers to be being able to access data, and we must bridge these layers in order to consolidate our data.

Table 1. Access Levels
Level Examples

Protocol

HTTP/REST, JDBC, SOAP

Data Exchange Format

JSON, Protobuf, Parquet

Query Language

Full Languages, e.g. SQL Predefined; Canned queries, e.g. web forms/APIs,

Canned queries, e.g. web forms/APIs

Data dumps

Restrictions

Thoughput, cost, rights

Note that although databases may hae a full language to construct any desired query on the data, in reality pre-defined indexes may be required for such.

Encoding Heterogenity

This is the difference in the encoding of the data and meta-data. This often does not apply when using some standardised access methods, such as JDBC, where any underlying differences in how the data is encoded is factored out by the access layer. You can see that it would apply however in data dumps, for example:

Table 2. Encoding Examples
Type Examples

Text

Different binary encodings of character sets, e.g. UTF-8

Numbers

Different encoding of numbers, e.g. fixed length, varint, little vs big endian

Delimiter

Which character is the C of CSV

In practice, this is the easiest bridge to build between data sources due to there being good support for most different standards and charsets in most platforms and programming languages.

Data Model Heterogenity

This is the difference in the type of Data Model that is used to represent data. This is importantly different from the schema, which describes how we model the attributes of entities, but rather this is how we model the entities themselves and the relationships between them.

Table 3. Data Model Classes
Class Relationships Examples

Relational

Row-to-Row

RDBMS, e.g. MySQL. Note that a "cell" can itself have hierarchical data stored within it, but the relationships are modelled from tow to row.

Hierarchical

Nested data

XML/JSON; Document Stores where the data is stored inline "nested" rather than normalized into multiple tables.

Object Oriented

Inheritance, Object-to-Object

Object Oriented databases

Graph

Vertex-to-Vertex, Directed

RDF, Graph databases

Relational++

Row-to-Row, Inheritance, Nested

PostgreSQL: a "cell" can itself have hierarchical data stored within it, there exists table inheritance, but the relationships are modelled between rows

Structural Heterogenity

Structural heterogenity is the difference in the data structures, such as tables, used to model the same data. This can include differences in:

  1. Alternative modelling

  2. Normalization

  3. Nesting vs Foreign Key

Alternative modelling

Consider how you capture data for dogs and cats. Do you have a dogs table and a cats table? Do you have an animals table with an species column? Do you have an is_cat column if you know it’s only cats and dogs? There are many ways to model the same data.

Nomalization

Normalization refers to the level of different tables you break things out to. For example, let’s say that each person has a list of hobbies. In a RDBMS you could have a person and a hobby table, and a join table to track who has what hobbies. Alternatively, you could just put the string values in a "hobbies" array within the person row. Or you could remove the join table and track a list of ids within the person row.

Higher levels of normalization are generally thought to be "cleaner" representations of data, whereas you can get better real-world performance from some denormalization due to the disk/memory access required to fetch a person and their hobbies.

Nesting vs Foreign Key

Similar to normalization, this is a question of whether the data is stored in a normalized form, or nested within.

For example, let’s suppose you have people and their pets. Each pet only ever has one owner. In PostgreSQL you could either have a people and pet table, or alternatively you could have a pets column within the person table that stored a JSON blob that contained a list of pets, their names, ages, and so on.

Semantic Heterogenity

Semantic heterogenity is the differences in the meaning of the schema and data of the different data sources.

Data heterogenity

Table 4. Data heterogenity
Class Examples

Numeric

Can be power difference (e.g. GB vs MB), currency (e.g. USD vs GBP), or unit type (e.g. km vs miles)

Enum

Different representations in textual (e.g. Male vs M), or numeric (e.g. 2 ⇒ Manager) encodings

Text

Synonyms (e.g. Street vs St.)

Homonyms (e.g. transliteration of Arabic)

Spellings (e.g. color vs colour)

Conceptual, e.g. same name but different meaning

Date/Time

Implicit timezone

Schematic heterogenity

Table 5. Schematic heterogenity
Class Examples

Attribute naming

Synonyms across sources, e.g. Last Name vs Surname

Attribute composition

e.g. Name vs First Name & Last Name

Object Identity

The same item, e.g. product, is often represented in multiple data source, or sometimes even by multiple unlinked records within the same Data Source.

The reason that we could have duplicates in a data sources could include:

  • Human error, such as typos

  • Lack of consistent global identifier (primary key)

Data conflicts

Multiple data sources can have different values for the same entity attribute. This could be decause of:

  • Errors

  • Old data in one source

  • Disagreement - one source actually thinks that the value is different

In these cases there normally needs to be an automated way of deciding what the "source of truth" is for a given attribute.