Open data is a goldmine for journalists. Government databases and public repositories offer a wealth of information on everything from census stats to campaign finances. These sources can fuel investigative reporting and data-driven stories.
Navigating open data portals takes some know-how. You'll need to search effectively, assess dataset quality, and import files for analysis. But with the right skills, you can uncover valuable insights hidden in public data to inform your reporting.
Open Data Sources for Journalism
Major Open Data Repositories
Open data refers to data that is freely available to the public for use and republication without restrictions from copyright, patents or other mechanisms of control (government census data, weather data)
Major government open data sources in the United States include:
Data.gov: The U.S. government's central catalog for open data, featuring datasets from various federal agencies (Census Bureau, Department of Labor)
U.S. Census Bureau: Demographic data, including population statistics, housing data, and economic indicators
Bureau of Labor Statistics (BLS): Labor market data, such as employment statistics, wage data, and consumer price indexes
Bureau of Economic Analysis (BEA): Economic data, including GDP, personal income, and industry-specific economic indicators
Federal Election Commission (FEC): Campaign finance data for federal elections, including candidate fundraising and expenditures
Key international government open data sources include:
UN Data: Datasets related to UN initiatives, such as the Sustainable Development Goals, population statistics, and human development indicators
World Bank Open Data: Economic and development data for countries worldwide, including GDP, poverty rates, and social indicators
EU Open Data Portal: Datasets from various EU institutions and member states, covering topics like economy, environment, and public health
OECD (Organisation for Economic Co-operation and Development): Economic and social data for OECD member countries, including GDP, employment, and education statistics
Specialized Data Journalism Sources
Important non-governmental open data repositories for data journalism include:
Kaggle: A platform for data science competitions and community-contributed datasets on various topics (COVID-19 data, election results)
Google Public Data Explorer: Datasets from providers such as the World Bank, Eurostat, and the U.S. Census Bureau, accessible through Google's interactive exploration tools
Registry of Open Data on AWS: Datasets hosted on AWS, covering topics like satellite imagery, genomic data, and web crawl data
GitHub: Open-source datasets and data-related projects shared by the developer community, often focused on specific topics or use cases (police violence data, climate data)
Specialized data journalism sources cater to the specific needs and interests of investigative reporters and newsrooms:
NICAR (National Institute for Computer-Assisted Reporting) Data Library: Datasets curated by Investigative Reporters and Editors, covering topics like government spending, crime, and elections
ProPublica Data Store: Datasets used in ProPublica's investigative reporting, including data on healthcare, criminal justice, and political influence
BuzzFeed News' GitHub repositories: Data and code used in BuzzFeed News' data journalism projects, covering topics like political campaigns, social media trends, and public opinion polls
Navigating Open Data Portals
Browsing and Searching for Datasets
Open data portals typically offer a web-based interface for browsing and searching, letting users explore the catalog of available datasets
Navigation often includes:
Categories: Datasets grouped by broad topics or themes (education, environment, transportation)
Tags: Keywords assigned to datasets to indicate specific subtopics or characteristics (air quality, crime statistics)
Featured content: Curated selections of popular or timely datasets highlighted by the portal administrators
Keyword searches narrow results to datasets relevant to the user's interests or research questions
Search queries can often be filtered by additional criteria, such as file format (CSV, JSON), date range, publishing agency, or other metadata
Advanced search options may include Boolean operators (AND, OR) or phrase searches to refine results further
Assessing Dataset Metadata and Previews
Data portals provide summary information and metadata about each dataset to help users understand the content and assess its relevance:
Description: A brief overview of the dataset's contents, purpose, and potential use cases
Time period covered: The date range or specific time points included in the dataset (2010-2020, monthly snapshots)
Update frequency: How often the dataset is refreshed with new data (daily, annually)
File formats: The available file types for downloading the data (CSV, JSON, XLS)
Data dictionary: Definitions and descriptions of each variable or field in the dataset
Use limitations: Any restrictions or terms of use governing how the data can be utilized or shared
Preview functionality lets users see excerpts or samples of the data before downloading full datasets:
Previews can be in the form of tables, charts, or maps, depending on the data type and portal features
Assessing previews helps determine if the data's structure, granularity, and content match the intended analysis or visualization needs
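Previews can also be emulated programmatically: pandas will read a limited number of rows from a remote CSV so you can inspect structure before committing to a full download. A minimal sketch; the URL is a placeholder, not a real dataset:

import pandas as pd

# Placeholder URL -- substitute the direct CSV link from the portal's download page
CSV_URL = "https://example.gov/data/air_quality.csv"

# Read only the first 100 rows to inspect structure without parsing the full file
preview = pd.read_csv(CSV_URL, nrows=100)
print(preview.shape)   # (rows read, number of columns)
print(preview.dtypes)  # inferred type of each column
print(preview.head())  # first few records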
APIs (application programming interfaces) provided by some portals allow programmatic querying and access to data (a sketch follows this list):
APIs enable integration of open data directly into data analysis workflows, scripts, or applications
Common protocols include REST (Representational State Transfer) and SOAP (Simple Object Access Protocol)
API documentation specifies the available endpoints, query parameters, and authentication requirements for accessing data programmatically
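Many government portals, including Data.gov, run on CKAN, which exposes a REST search API. A minimal sketch of a keyword query filtered to CSV resources; endpoint details may differ on other portals:

import requests

# CKAN-style search endpoint (Data.gov runs CKAN; other portals may differ)
BASE = "https://catalog.data.gov/api/3/action/package_search"

# Keyword query, filtered to CSV resources, returning the first 5 matches
params = {"q": "campaign finance", "fq": "res_format:CSV", "rows": 5}
resp = requests.get(BASE, params=params, timeout=30)
resp.raise_for_status()
result = resp.json()["result"]

print(f"{result['count']} matching datasets")
for pkg in result["results"]:
    print(pkg["title"])
    for res in pkg.get("resources", []):
        if res.get("format", "").upper() == "CSV":
            print("  ", res.get("url"))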
Assessing Open Data Quality
Key Dimensions of Data Quality
Data quality dimensions provide a framework for evaluating the fitness of open data for specific use cases:
Accuracy: The degree to which data values match the real-world entities they represent
Completeness: The extent to which all relevant data elements are present and populated
Consistency: The absence of contradictions or discrepancies within and across datasets
Timeliness: The currency and availability of data relative to the intended analysis or decision-making needs
Relevance: The applicability and utility of the data for the specific research questions or journalism project
High-quality data is crucial for reliable data journalism, ensuring accurate insights, meaningful visualizations, and credible storytelling
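A quick pandas audit can screen a dataset against several of these dimensions before deeper analysis. A sketch, assuming a hypothetical inspections.csv with an inspection_date column:

import pandas as pd

df = pd.read_csv("inspections.csv")  # hypothetical file

# Completeness: share of missing values per column
print(df.isna().mean().sort_values(ascending=False).head(10))

# Consistency: fully duplicated rows that could distort counts
print(f"{df.duplicated().sum()} duplicated rows")

# Timeliness: most recent record in the assumed date column
dates = pd.to_datetime(df["inspection_date"], errors="coerce")
print(f"Latest record: {dates.max()}")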
Evaluating Data Reliability and Limitations
Metadata documentation provides important context about the data, including:
Collection methodology: The procedures, instruments, and sampling strategies used to gather the data
Variable definitions: Clear explanations of what each data field represents and how it is measured
Data processing steps: Any transformations, aggregations, or derivations applied to the raw data
Limitations or caveats: Known issues, biases, or constraints that may affect the interpretation or use of the data
The original source and methodology of data collection impact reliability:
Data from reputable government agencies (U.S. Census Bureau) or established organizations (World Health Organization) is often more reliable than data from unknown or unverified sources
Transparent and well-documented data collection processes increase confidence in the data's accuracy and representativeness
The age and update frequency of data impacts its timeliness and relevance:
Out-of-date data may not accurately reflect current realities or trends, especially for fast-changing phenomena (unemployment rates, public opinion)
Data updated on a regular basis (monthly, quarterly) is more suitable for tracking changes over time than one-time or infrequent snapshots
Potential sources of error or bias in the data need to be assessed and accounted for:
Sampling bias occurs when the data collection process systematically over- or under-represents certain groups or characteristics (online surveys may underrepresent older populations)
Measurement error arises from inaccurate or inconsistent data collection instruments or procedures (self-reported income may be subject to recall bias)
Missing data, either due to nonresponse or data entry issues, can skew analyses if not properly handled (imputation, weighting)
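A toy comparison (all values invented) of how the handling choice shifts a summary statistic:

import pandas as pd

# Invented incomes with two missing responses
s = pd.Series([32000, 41000, None, 58000, None, 75000])

mean_dropped = s.dropna().mean()            # listwise deletion
mean_imputed = s.fillna(s.median()).mean()  # median imputation

print(f"Mean after dropping missing values: {mean_dropped:.0f}")  # 51500
print(f"Mean after median imputation:       {mean_imputed:.0f}")  # 50833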
Usage terms, licenses, or any access restrictions need to be reviewed to ensure compliance and appropriate use of the data:
Open data licenses (Creative Commons, Open Data Commons) specify conditions for attribution, modification, and redistribution
Some datasets may have use restrictions based on privacy concerns, national security, or commercial interests
Importing Open Data for Analysis
Handling Diverse File Formats
Open data is published in various file formats, each with its own structure and characteristics:
CSV (comma-separated values): Tabular data with columns separated by commas and rows by line breaks, widely compatible and easy to parse
JSON (JavaScript Object Notation): Hierarchical data format using key-value pairs and arrays, common for web APIs and NoSQL databases
XML (eXtensible Markup Language): Structured data format using tags to define elements and attributes, often used for metadata and data exchange
Spreadsheet formats like XLS (Microsoft Excel) or ODS (OpenDocument Spreadsheet) store tabular data with additional formatting and formulas
Tabular data in CSV or spreadsheet formats can often be directly imported into analysis tools:
Excel: Use the "Data" tab to import from text/CSV or open files directly, specifying delimiters and data types
R: Use the read.csv() function to import CSV files or packages like readxl for Excel files
Python: Use the pandas library and read_csv() function for CSV files or read_excel() for Excel files
JSON and XML data may require parsing and transformation steps to extract the relevant data elements:
R: Use the jsonlite package and fromJSON() function to parse JSON data or the xml2 package for XML data
Python: Use the json module and json.load() function for JSON parsing or the xml.etree.ElementTree module for XML parsing
Extracted data can be restructured into tabular DataFrames for further analysis and manipulation
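A sketch of that JSON-to-table step, assuming a hypothetical saved API response with a top-level "results" list:

import json
import pandas as pd

# Hypothetical saved API response; the structure is assumed, not guaranteed
with open("response.json") as f:
    payload = json.load(f)

records = payload["results"]     # assumed list of dicts
df = pd.json_normalize(records)  # nested keys become dotted column names
print(df.columns.tolist())
print(df.head())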
Data Cleaning and Preprocessing
Data cleaning and preprocessing steps are often necessary after importing to ensure data quality and consistency (a combined sketch follows this list):
Handling missing values: Identifying and deciding how to treat missing data points (deletion, imputation)
Reformatting data types: Converting data fields to appropriate formats (string to date, integer to float)
Merging related data tables: Combining datasets based on common keys or identifiers to enable joint analysis
Deduplication: Removing duplicate records or entries that may distort analyses or aggregations
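A combined pandas sketch of the four steps above, on hypothetical payments and vendors tables:

import pandas as pd

payments = pd.read_csv("payments.csv")  # placeholder files and columns
vendors = pd.read_csv("vendors.csv")

# Handle missing values: drop rows missing the join key, zero-fill amounts
payments = payments.dropna(subset=["vendor_id"])
payments["amount"] = pd.to_numeric(payments["amount"], errors="coerce").fillna(0)

# Reformat data types: string to date
payments["paid_on"] = pd.to_datetime(payments["paid_on"], errors="coerce")

# Merge related tables on a common key
merged = payments.merge(vendors, on="vendor_id", how="left")

# Deduplicate exact repeats
merged = merged.drop_duplicates()
merged.info()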
Large datasets may require programmatic approaches to efficiently import and process the data:
The pandas library in Python provides powerful functions for data loading, cleaning, and transformation (dropna(), fillna(), merge())
The dplyr package in R offers similar data manipulation capabilities (filter(), mutate(), left_join())
Splitting large datasets into smaller chunks or using database connections can help overcome memory constraints
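A sketch of chunked processing with pandas, assuming a large placeholder CSV with a numeric amount column:

import pandas as pd

total = 0.0
rows = 0
# Stream the file in 100,000-row chunks instead of loading it whole
for chunk in pd.read_csv("large_dataset.csv", chunksize=100_000):
    total += chunk["amount"].sum()
    rows += len(chunk)

print(f"{rows} rows, total amount {total:,.2f}")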
Relational databases like SQLite can be used to import and store data for more complex querying and aggregation needs:
SQL (Structured Query Language) allows for flexible data retrieval, filtering, and joining across multiple tables
Python's sqlite3 module or R's RSQLite package provide interfaces for interacting with SQLite databases
Importing data into a database enables efficient querying and integration with other data sources or applications
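A minimal sqlite3/pandas round trip under the same assumptions (placeholder file, table, and column names):

import sqlite3
import pandas as pd

df = pd.read_csv("contracts.csv")  # placeholder file
conn = sqlite3.connect("opendata.db")
df.to_sql("contracts", conn, if_exists="replace", index=False)

# SQL handles grouping and filtering that would strain memory in pandas alone
query = """
    SELECT agency, COUNT(*) AS n, SUM(amount) AS total
    FROM contracts
    WHERE amount > 0
    GROUP BY agency
    ORDER BY total DESC
    LIMIT 10
"""
print(pd.read_sql_query(query, conn))
conn.close()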