Open Governmental Datasets
One of the main areas of interest here at the LiDRC is linked government data. In this post, DERI researcher Evangelos Kalampokis surveys recent activity in this area from around the globe.
The public sector collects, produces, reproduces and disseminates a large amount of information in many areas of activity, such as social, economic, geographic, business and education. It is widely accepted that this information is a significant primary material for digital products and services that could contribute to economic growth. The main obstacle to reusing this information is that it is kept isolated in various proprietary information systems and formats. Recently, not only practitioners but also governments around the world have realised the importance of publishing governmental data using open standards and have begun to work in this direction. Furthermore, in June 2009 Tim Berners-Lee invited governments not only to publish their data on the web using open standards but also to focus on publishing linked government data. The latter approach will enable data from different sources to be combined in a standardised manner, and thus the development of services and applications that provide added value to society.
Hence, during the last year, practitioners and governments worldwide have worked towards two goals:
- To create catalogues of government data which contain downloadable files in well-known formats such as XML, CSV and RDF.
- To create applications that provide government data as linked data through RESTful APIs, query interfaces such as SPARQL endpoints, etc.
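To make the relationship between the two goals concrete, the following minimal Python sketch turns rows of a downloadable CSV file into RDF triples in N-Triples syntax. It is an illustration only, not the method used by any initiative described here: the `http://example.org/` namespace, the function names and the sample column names are all hypothetical, and a real conversion would normally reuse established vocabularies and typed values rather than plain string literals.

```python
import csv
import io
from typing import List
from urllib.parse import quote

# Hypothetical namespace used only for this illustration.
BASE = "http://example.org/"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text: str, id_column: str) -> List[str]:
    """Emit one triple per non-empty cell, using id_column to mint the subject."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}resource/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            # Skip the identifier column, surplus cells and empty cells.
            if column is None or column == id_column or not value:
                continue
            predicate = f"<{BASE}property/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Hypothetical sample data in the spirit of a city facilities dataset.
sample = "id,name,city\n1,Central Library,Vancouver\n"
for line in csv_to_ntriples(sample, "id"):
    print(line)
```

Each cell becomes a statement about a resource with its own URI, which is what allows data from different sources to be combined once they share identifiers.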
Here we present the most significant initiatives worldwide in both of these directions, i.e. catalogues of government data and linked government data applications.
Catalogues of open government data. Although there can be various sources of governmental data in different locations on the web (e.g., websites of different public agencies), here we present initiatives that aim at collecting and making available in a specific location on the web a number of public administration related data sets.
Different data formats are used in these catalogues. These formats can be categorised into three groups: raw data formats (e.g. XML, CSV, TXT, XLS), geo-spatial data formats (e.g. SHP, KML), and RDF.
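The three groups can be sketched as a simple lookup from file extension to format group. This is a minimal illustration in Python, assuming that a file's extension reliably indicates its format; the `FORMAT_GROUPS` mapping and the `classify_format` function are hypothetical names introduced here, and only the formats named above are included.

```python
from pathlib import PurePosixPath

# Groups and example formats as listed in the text; the mapping is illustrative.
FORMAT_GROUPS = {
    "raw": {"xml", "csv", "txt", "xls"},
    "geo-spatial": {"shp", "kml"},
    "rdf": {"rdf"},
}


def classify_format(filename: str) -> str:
    """Return the format group for a file name, or 'unknown' if unrecognised."""
    extension = PurePosixPath(filename).suffix.lower().lstrip(".")
    for group, extensions in FORMAT_GROUPS.items():
        if extension in extensions:
            return group
    return "unknown"


# Hypothetical file names, for illustration only.
for name in ["schools.csv", "boundaries.shp", "census.rdf", "notes.pdf"]:
    print(name, "->", classify_format(name))
```

A catalogue survey could apply such a function to each listed file to count how many datasets fall into each group.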
- Data.gov (United States)
- Data.gov includes about 600 "raw" datasets in CSV/TXT, XML and SHP formats and more than 100,000 geo-spatial datasets such as administrative and political boundaries.
- Data.australia.gov.au (Australia)
- Data.australia.gov.au is the home of Australian government information datasets. It contains about 80 datasets in various formats such as CSV, XML, XLS and SHP about education (e.g. school locations, student enrolments), environment (e.g. water consumption), geography (e.g. administrative boundaries), and other topics.
- NYC Data Mine (United States)
- The New York City Data Mine provides many sets of public data produced by City agencies. This catalogue includes about 100 raw datasets in formats such as XLS, XML, CSV and RSS describing various facilities of the city, statistical data, etc. In addition, it contains a catalogue of about 90 geographical datasets in GDB, MAP and SHP formats. The geo-datasets cover, among others, the election districts, school districts and health centre districts of New York City.
- DataSF (United States)
- DataSF is a catalogue for datasets published by the City and County of San Francisco. Currently it has more than 100 datasets grouped in the following categories: admin and finance, environment, geography, health, housing, public safety, public works and transportation. The data is provided as downloadable files in TXT, XML, XLS, SHP and KML formats.
- Vancouver’s Open Data Catalogue (Canada)
- The Open Data Catalogue of the City of Vancouver contains about 20 datasets in CSV, XLS, KML and SHP formats. The datasets include information about schools, libraries, fire halls, voting division boundaries, water networks, etc.
- Recovery.gov (United States)
- Recovery.gov was created by the American Recovery and Reinvestment Act of 2009 to provide unprecedented transparency about how Recovery funds are being used and to increase accountability to guard against fraud, waste and abuse. The download center section of the website provides a number of datasets: Financial and Activity Reports filed by federal agencies receiving funds (in XLS format) and recipient federal contract award data by state (in CSV format).
- Open Gov (Sweden)
- Opengov.se is a private initiative aiming to collect and highlight available public datasets in Sweden. It contains a commentable catalogue of 20 government datasets, their formats and usage restrictions. Data is available only in Swedish.
- The NSW data catalogue (Australia)
- The New South Wales government in Australia has published a catalogue of governmental data covering statistics, geo-spatial data, election results, etc. The available datasets (63 at the moment) are grouped by government activity and by the public agency that provides the data. This catalogue can be characterised as a single point of access for datasets provided by various governmental web sites. Hence the NSW data catalogue does not include downloadable files but links to the governmental web sites that contain them.
- Data.govt.nz (New Zealand)
- Data.govt.nz is a directory of New Zealand government datasets. This directory could also be characterised as a single point of access for governmental datasets since it includes links to other governmental web sites that contain the actual files.
Linked Government Data.
- TWC Data-Gov (United States)
- The Tetherless World Constellation at Rensselaer Polytechnic Institute undertook the task of transforming a number of raw datasets from Data.gov into linked data. At the time of writing, 116 datasets have been transformed to RDF and are presented in the relevant catalogue.
- US Census Data (United States)
- The US Census dataset provides one billion triples of population statistics and basic geospatial data from the 2000 US Census. This dataset has been published as Linked Data and through a SPARQL endpoint at http://www.rdfabout.com/sparql.
- Data.gov.uk (UK)
- Perhaps due to the involvement of Tim Berners-Lee, the data.gov.uk effort is one of the most ambitious initiatives on this list. The site has recently been launched and is now available as a public beta version. Read more about the internals of data.gov.uk in a recent blog post.
- Ordnance Survey (UK)
- The Ordnance Survey is Great Britain’s national mapping agency. It publishes data about administrative and voting regions in Great Britain as Linked Data. At the moment it is the most advanced linked government data initiative. It provides access to the data through a RESTful API as well as through a SPARQL search interface.
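As a rough illustration of how a client might address one of the SPARQL endpoints mentioned above, the Python sketch below builds a query URL in the style of the SPARQL protocol, where the query text is passed in a `query` parameter of a GET request. It uses the US Census endpoint address given in the list; whether that endpoint is reachable, and which result formats it returns, are not verified here, so the sketch only constructs the URL and does not send the request. The function name `build_sparql_url` and the generic query are assumptions made for this example, and the function assumes the endpoint address has no query string of its own.

```python
from urllib.parse import urlencode

# Endpoint address as given in the text; its availability is not verified here.
ENDPOINT = "http://www.rdfabout.com/sparql"

# A generic query that asks for any ten triples; it assumes nothing about
# the vocabulary used by the dataset.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in a 'query' parameter."""
    return f"{endpoint}?{urlencode({'query': query})}"


url = build_sparql_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL could then be fetched with any HTTP client, typically with an `Accept` header naming the desired result format.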
Conclusion. Publishing government data on the web as open or linked data is an important goal of governments around the world. These initiatives can be grouped into two categories: creating catalogues of downloadable government data files in common formats; and creating applications that publish government data as linked data and provide RESTful APIs and interfaces such as SPARQL query endpoints.
As researchers in the linked data field we need to further study and analyse these efforts in order to provide useful conclusions about:
- The categorisation of the available datasets in terms of a topic taxonomy.
- The data formats that have to be better supported in linked data publishing tools.
- Best practices and guidelines for publishing linked data in the government sector.
 European Union (2003). “Directive 2003/98/EC of the European Parliament and of the Council on the re-use of public sector information” Official Journal of the European Union, 345, 90.