Sources of data on the Web

Data catalogues and marketplaces

Data catalogues and marketplaces collect and host data sets and their metadata. Access to some data sets is free, while others are paid.

These data sets may be provided as CSV files, spreadsheets, XML, JSON, or even database dumps.

Some examples of sites offering public sector data sets are data.gov.uk, data.gov and publicdata.eu.

Commercial marketplaces include Factual and Dun & Bradstreet.

You can access a list of data catalogues at Portal Watch.

Web Applications and Sites

There are many web applications that publicly expose information, either through their HTML UI or HTTP APIs. This information might come in the form of:

  • Tables

  • Search results

  • Lists of entities

  • Entity "profile" pages

The data exposed through these UIs and APIs can be collected, but it is only a small slice of the data siloed by these companies, and what can be collected is restricted to whatever canned queries are available. Furthermore, the number of results that can be fetched may be lower than desired, as there may be artificial limits in place within the UI or API.

Some of the APIs require authentication, which may place further limits on usage.

Data freedom, and the rise of protectionist behaviour

As companies have come to realise the value of Web Data Integration, some have taken a protectionist view of their own data and employed techniques and third-party services to try to limit access to their publicly accessible data by others, in order to reduce competition or give themselves a market advantage.

This is certainly not in line with the free-market approach that most of these companies espouse; rather, the data is treated as a trade secret.

While it is certainly the case that companies need to protect themselves against malicious behaviour, and against the potential cost of non-human access to their data, they must open their eyes and realise that all their competitors will do the same thing. A protectionist web data arms race will serve no-one apart from those producing the weaponry.

It is certainly my hope that over time these companies will come to see that gathering publicly available data is not something you do yourself in secret while castigating others for doing the same.

The Semantic Web

The concept of the "Semantic Web" extends the web by using HTTP URIs (URLs) to uniquely identify both entities and the types of relationships between them.

This type of data is represented in RDF, normally as "triples", where each triple is a subject-predicate-object statement, e.g. <Matthew Painter> <graduated from> <Cambridge University>.
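
To make this concrete, the same statement could be written as a single RDF triple in Turtle syntax; the example.org URIs below are illustrative placeholders rather than real identifiers:

<http://example.org/person/matthew-painter>
    <http://example.org/ontology/graduatedFrom>
    <http://dbpedia.org/resource/University_of_Cambridge> .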

SPARQL is the SQL-like query language that is standardized for querying Linked Data. The W3C keeps a list of public SPARQL endpoints.
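
As a sketch of what a SPARQL query looks like, the following asks for every subject related to Cambridge University via the illustrative property used above:

SELECT ?person
WHERE {
  ?person <http://example.org/ontology/graduatedFrom>
          <http://dbpedia.org/resource/University_of_Cambridge> .
}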

RDF and SPARQL have a low commercial adoption rate.

HTML Embedded Structured Data

Structured data can be embedded in HTML pages, and while this can be very useful, it is often incomplete. As such, it can form a solid source for some, but rarely all, of the data you wish to collect.

The adoption of these different formats continues to grow:

[Figure: adoption of HTML-embedded structured data]

Embedding mechanisms

RDFa

RDFa was first proposed in 2004 and became a W3C Recommendation in 2008. Its use has largely been superseded by Microdata and JSON-LD, both of which are simpler formats.
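
For comparison with the Microdata and JSON-LD examples below, a minimal RDFa (Lite) fragment describing a person might look like this; the content simply mirrors those later examples:

<div vocab="http://schema.org/" typeof="Person">
  <span property="name">George Bush</span>, the
  <span property="disambiguatingDescription">41st President of the United States</span>
</div>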

[Figure: RDFa adoption]

Microdata

Microdata is an alternative, less verbose markup format than RDFa:

<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">George Bush</span>, the
  <span itemprop="disambiguatingDescription">41st President of the United States</span>
  is the father of
  <div itemprop="children" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">George W. Bush</span>, the
    <span itemprop="disambiguatingDescription">43rd President of the United States</span>.
  </div>
</div>

It was proposed in 2009 by the WHATWG as part of the HTML5 specification and standardization process.

JSON-LD

JSON-LD is a JSON representation of linked data, which is both concise and human readable.

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Person",
  "name": "George Bush",
  "disambiguatingDescription": "41st President of the United States",
  "children": {
    "@type": "Person",
    "name": "George W. Bush",
    "disambiguatingDescription": "43rd President of the United States"
  }
}
</script>

Microformats

Microformats date back to 2003, and comprise a small set of fixed formats:

Table 1. Microformats

Format       Entities
hCard        people, companies, organizations, places
XFN          relationships between people
hCalendar    calendaring and events
hListing     classifieds
hReview      reviews of products, businesses, events

The main problem with microformats was their lack of extensibility. They still exist, but are effectively deprecated, and their use has been superseded by schema.org embedded data.
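
For illustration, a minimal hCard sketch looks like this; the class names come from the hCard specification and the content mirrors the earlier examples:

<div class="vcard">
  <span class="fn">George Bush</span>,
  <span class="title">41st President of the United States</span>
</div>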

Schemas

Schema.org

Google, Yahoo, Bing, and Yandex all had the same problem: the fact that the web is a set of unstructured HTML pages and other documents hampered their ability to answer questions for their users that went beyond pointing them to a web page. To improve their customer experience, they needed better, structured data on each page that followed a consistent schema. You can see that they were trying to solve Data Integration in general for the web!

So, in order to get access to embedded structured data on the web pages they were crawling, they came together in 2011 to create a single, homogeneous schema for representing much of the data on the web: schema.org. There are currently over 600 entity types that can be represented by schema.org, and as of 2014 over 5 million websites provided schema.org data.

It is worth pointing out that even with Google’s very deep technical pockets, they did not start out trying to solve this programmatically, although perhaps now that they have a lot of training data this is in play.

Along with schema.org came new, simpler ways of embedding RDF data in HTML: RDFa, Microdata and JSON-LD.

The reason behind this is simple: better search listings, driven by better structured data, gave better conversion, so there was a commercial driver to ensure that no-one was overtaken in the SEO arms race. They managed to create a positive feedback loop.

[Figure: a rich snippet in search results]

It is worth pointing out that companies can get so large that they are seemingly immune to such SEO drivers; Amazon, for example, has no such embedded linked data.

[Figure: schema.org adoption]

OpenGraph

OpenGraph is a proprietary Facebook schema and embedding mechanism based on RDFa that allows companies to define how their pages look when consumed within Facebook. Facebook rolled it out in 2010.
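
A page typically declares a handful of og:* meta tags in its head; for example (the URLs below are placeholders):

<head>
  <meta property="og:title" content="George Bush" />
  <meta property="og:type" content="profile" />
  <meta property="og:url" content="http://example.org/people/george-bush" />
  <meta property="og:image" content="http://example.org/images/george-bush.jpg" />
</head>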

Adoption

As of 2017, approximately 38% of HTML pages had embedded structured data, and approximately 28% of domains had embedded data. Clearly there is a lot of structured data already available on the web.

Over 5 million sites had already embedded schema.org data in 2014.

Common Crawl

The Common Crawl project crawls approximately 3 billion HTML pages every year. Many websites are represented by a selection of their pages, but the aim is not to crawl any site completely.

The Web Data Commons project extracts all of the structured data embedded in the HTML pages of the Common Crawl corpus, analyzes it, and makes it available for download.

[Figure: Web Data Commons]

Web Data Commons is a very good way of getting some representative data where you need a sample across many sites, or for a particular entity type, but should not be considered for deep collection purposes due to its incompleteness.

HTML Tables

There are hundreds of millions of high-quality HTML data tables on the web, many of which are within Wikipedia.

[Figure: HTML tables on the web]

Part of the issue here is working out the schemas for the data, cleaning the data, and so on.

Web Data Commons - Web Tables Corpus

The Web Tables Corpus is a large public corpus of 233 million relational Web Tables.

[Figure: Web Tables Corpus]

These tables were filtered out of 10.2 billion raw tables extracted from Common Crawl.

Google Tables Search is a Google Research project to make web tables searchable.

[Figure: Google Tables Search]

Wikipedia

Several projects exist to try to capture the content of Wikipedia in a structured format.

DBpedia

DBpedia describes over 6 million things, of which over 5 million are classified in a consistent schema using 760 classes and 2,739 properties, including:

  • 1.5M people

  • 1M places

  • 250k organizations

Altogether there are 13 billion RDF triples, of which 1.7 billion come from the English edition of Wikipedia; the data set also contains 29 million links to external web pages and 50 million links to other data sets.

[Figure: DBpedia]

Wikidata

Wikidata is a collaboratively edited knowledge base hosted by the Wikimedia Foundation. It is a common source of open data that can be used by Wikimedia projects such as Wikipedia, and by anyone else, under a public domain license.

[Figure: Wikidata]