
Open data is a goldmine for journalists. Government databases and public repositories offer a wealth of information on everything from census stats to campaign finances. These sources can fuel investigative reporting and data-driven stories.

Navigating open data portals takes some know-how. You'll need to search effectively, assess dataset quality, and import files for analysis. But with the right skills, you can uncover valuable insights hidden in public data to inform your reporting.

Open Data Sources for Journalism

Major Open Data Repositories

  • Open data refers to data that is freely available to the public for use and republication without restrictions from copyright, patents or other mechanisms of control (government census data, weather data)
  • Major government open data sources in the United States include:
    • Data.gov: The U.S. government's central catalog for open data, featuring datasets from various federal agencies (Census Bureau, Department of Labor)
    • Census.gov: Demographic data collected by the U.S. Census Bureau, including population statistics, housing data, and economic indicators
    • Bureau of Labor Statistics (BLS): Labor market data, such as employment statistics, wage data, and consumer price indexes
    • Bureau of Economic Analysis (BEA): Economic data, including GDP, personal income, and industry-specific economic indicators
    • Federal Election Commission (FEC): Campaign finance data for federal elections, including candidate fundraising and expenditures
  • Key international government open data sources include:
    • UN Data: Datasets related to UN initiatives, such as the Sustainable Development Goals, population statistics, and human development indicators
    • World Bank Open Data: Economic and development data for countries worldwide, including GDP, poverty rates, and social indicators
    • EU Open Data Portal: Datasets from various EU institutions and member states, covering topics like economy, environment, and public health
    • OECD (Organisation for Economic Co-operation and Development): Economic and social data for OECD member countries, including GDP, employment, and education statistics

Specialized Data Journalism Sources

  • Important non-governmental open data repositories for data journalism include:
    • Kaggle: A platform for data science competitions and community-contributed datasets on various topics (COVID-19 data, election results)
    • Google Public Data Explorer: Datasets from various sources, including the World Bank, Eurostat, and the U.S. Census Bureau, accessible through Google's data exploration interface
    • Registry of Open Data on AWS: Datasets hosted on AWS, covering topics like satellite imagery, genomic data, and web crawl data
    • GitHub: Open-source datasets and data-related projects shared by the developer community, often focused on specific topics or use cases (police violence data, climate data)
  • Specialized data journalism sources cater to the specific needs and interests of investigative reporters and newsrooms:
    • NICAR (National Institute for Computer-Assisted Reporting) Data Library: Datasets curated by Investigative Reporters and Editors, covering topics like government spending, crime, and elections
    • ProPublica Data Store: Datasets used in ProPublica's investigative reporting, including data on healthcare, criminal justice, and political influence
    • BuzzFeed News GitHub repositories: Data and code used in BuzzFeed News' data journalism projects, covering topics like political campaigns, social media trends, and public opinion polls

Browsing and Searching for Datasets

  • Open data portals typically offer a web-based interface to browse and search available datasets, allowing users to explore the catalog of available data
  • Navigation often includes:
    • Categories: Datasets grouped by broad topics or themes (education, environment, transportation)
    • Tags: Keywords assigned to datasets to indicate specific subtopics or characteristics (air quality, crime statistics)
    • Featured content: Curated selections of popular or timely datasets highlighted by the portal administrators
  • Keyword searches allow narrowing down to specific topical datasets relevant to the user's interests or research questions
    • Search queries can often be filtered by additional criteria, such as file format (CSV, JSON), date range, publishing agency, or other metadata (see the search sketch after this list)
    • Advanced search options may include Boolean operators (AND, OR) or phrase searches to refine results further
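
Many portals also expose their search interface programmatically. Below is a minimal sketch, assuming the portal runs CKAN (as Data.gov does) and exposes CKAN's package_search action; the query terms and the res_format filter field are illustrative rather than a recipe for any particular portal:

    # Keyword search against a CKAN-based portal's Action API (Data.gov runs CKAN).
    # The query terms and filter field below are illustrative.
    import requests

    BASE_URL = "https://catalog.data.gov/api/3/action/package_search"
    params = {
        "q": "air quality",       # keyword query
        "fq": "res_format:CSV",   # filter: only datasets that offer CSV files
        "rows": 5,                # number of results to return
    }

    response = requests.get(BASE_URL, params=params, timeout=30)
    response.raise_for_status()
    result = response.json()["result"]

    print(f"Matching datasets: {result['count']}")
    for dataset in result["results"]:
        print("-", dataset["title"])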

Assessing Dataset Metadata and Previews

  • Data portals provide summary information and metadata about each dataset to help users understand the content and assess its relevance:
    • Description: A brief overview of the dataset's contents, purpose, and potential use cases
    • Time period covered: The date range or specific time points included in the dataset (2010-2020, monthly snapshots)
    • Update frequency: How often the dataset is refreshed with new data (daily, annually)
    • File formats: The available file types for downloading the data (CSV, JSON, XLS)
    • Data dictionary: Definitions and descriptions of each variable or field in the dataset
    • Use limitations: Any restrictions or terms of use governing how the data can be utilized or shared
  • Preview functionality lets users see excerpts or samples of the data before downloading full datasets:
    • Previews can be in the form of tables, charts, or maps, depending on the data type and portal features
    • Assessing previews helps determine if the data's structure, granularity, and content match the intended analysis or visualization needs
  • APIs (application programming interfaces) provided by some portals allow programmatic querying and access to data:
    • APIs enable integration of open data directly into data analysis workflows, scripts, or applications
    • Common protocols include REST (Representational State Transfer) and SOAP (Simple Object Access Protocol)
    • API documentation specifies the available endpoints, query parameters, and authentication requirements for accessing data programmatically, as in the sketch below
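
As a sketch of what programmatic access looks like, the example below queries the U.S. Census Bureau's API for state population estimates. The dataset path and variable code are illustrative, and heavier use may require a free API key, so check the API documentation rather than treating this as the definitive endpoint:

    # Sketch: pull state population estimates from a Census Bureau REST endpoint.
    # The dataset path and variable code are illustrative; consult the API docs.
    import pandas as pd
    import requests

    URL = "https://api.census.gov/data/2022/acs/acs1"
    params = {
        "get": "NAME,B01001_001E",  # state name and total population estimate
        "for": "state:*",           # one row per state
    }

    resp = requests.get(URL, params=params, timeout=30)
    resp.raise_for_status()
    rows = resp.json()              # returned as a list of lists; row 0 is the header

    df = pd.DataFrame(rows[1:], columns=rows[0])
    df["B01001_001E"] = pd.to_numeric(df["B01001_001E"])
    print(df.sort_values("B01001_001E", ascending=False).head())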

Assessing Open Data Quality

Key Dimensions of Data Quality

  • Data quality dimensions provide a framework for evaluating the fitness of open data for specific use cases:
    • Accuracy: The degree to which data values match the real-world entities they represent
    • Completeness: The extent to which all relevant data elements are present and populated
    • Consistency: The absence of contradictions or discrepancies within and across datasets
    • Timeliness: The currency and availability of data relative to the intended analysis or decision-making needs
    • Relevance: The applicability and utility of the data for the specific research questions or journalism project
  • High-quality data is crucial for reliable data journalism: it underpins accurate insights, meaningful visualizations, and credible storytelling

Evaluating Data Reliability and Limitations

  • Metadata documentation provides important context about the data, including:
    • Collection methodology: The procedures, instruments, and sampling strategies used to gather the data
    • Variable definitions: Clear explanations of what each data field represents and how it is measured
    • Data processing steps: Any transformations, aggregations, or derivations applied to the raw data
    • Limitations or caveats: Known issues, biases, or constraints that may affect the interpretation or use of the data
  • The original source and methodology of data collection impact reliability:
    • Data from reputable government agencies (U.S. Census Bureau) or established organizations (World Health Organization) is often more reliable than data from unknown or unverified sources
    • Transparent and well-documented data collection processes increase confidence in the data's accuracy and representativeness
  • The age and update frequency of data affect its timeliness and relevance:
    • Out-of-date data may not accurately reflect current realities or trends, especially for fast-changing phenomena (unemployment rates, public opinion)
    • Data updated on a regular basis (monthly, quarterly) is more suitable for tracking changes over time than one-time or infrequent snapshots
  • Potential sources of error or bias in the data need to be assessed and accounted for:
    • Sampling bias occurs when the data collection process systematically over- or under-represents certain groups or characteristics (online surveys may underrepresent older populations)
    • Measurement error arises from inaccurate or inconsistent data collection instruments or procedures (self-reported income may be subject to recall bias)
    • Missing data, either due to nonresponse or data entry issues, can skew analyses if not properly handled (imputation, weighting); basic checks for issues like these are sketched after this list
  • Usage terms, licenses, or any access restrictions need to be reviewed to ensure compliance and appropriate use of the data:
    • Open data licenses (Creative Commons, Open Data Commons) specify conditions for attribution, modification, and redistribution
    • Some datasets may have use restrictions based on privacy concerns, national security, or commercial interests
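
Several of these quality dimensions can be checked quickly once a dataset is loaded. Here is a minimal sketch in Python, assuming a hypothetical file inspections.csv with a date column; adapt the column names to the dataset at hand:

    # First-pass quality checks on an imported dataset.
    # "inspections.csv" and the "date" column are hypothetical placeholders.
    import pandas as pd

    df = pd.read_csv("inspections.csv")

    # Completeness: how many values are missing in each column?
    print(df.isna().sum())

    # Consistency: any exact duplicate records?
    print("Duplicate rows:", df.duplicated().sum())

    # Timeliness: what period does the data actually cover?
    dates = pd.to_datetime(df["date"], errors="coerce")
    print("Coverage:", dates.min(), "to", dates.max())

    # Accuracy sanity check: summary statistics to spot impossible values
    print(df.describe(include="all"))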

Importing Open Data for Analysis

Handling Diverse File Formats

  • Open data is published in various file formats, each with its own structure and characteristics:
    • CSV (comma-separated values): Tabular data with columns separated by commas and rows by line breaks, widely compatible and easy to parse
    • JSON (JavaScript Object Notation): Hierarchical data format using key-value pairs and arrays, common for web APIs and NoSQL databases
    • XML (eXtensible Markup Language): Structured data format using tags to define elements and attributes, often used for metadata and data exchange
    • Spreadsheet formats like XLS (Microsoft Excel) or ODS (OpenDocument Spreadsheet) store tabular data with additional formatting and formulas
  • Tabular data in CSV or spreadsheet formats can often be directly imported into analysis tools:
    • Excel: Use the "Data" tab to import from text/CSV or open files directly, specifying delimiters and data types
    • R: Use the read.csv() function to import CSV files or packages like readxl for Excel files
    • Python: Use the pandas library and its read_csv() function for CSV files or read_excel() for Excel files (see the sketch after this list)
  • JSON and XML data may require parsing and transformation steps to extract the relevant data elements:
    • R: Use the jsonlite package and its fromJSON() function to parse JSON data or the xml2 package for XML data
    • Python: Use the json module and its json.load() function for JSON parsing or the xml.etree.ElementTree module for XML parsing
    • Extracted data can be restructured into tabular DataFrames for further analysis and manipulation
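
Below is a minimal import sketch in Python covering both cases; the file names (spending.csv, spending.json) and the records key are hypothetical placeholders for whatever a portal actually provides:

    # Import tabular CSV data directly and flatten hierarchical JSON data.
    # File names and the "records" key are hypothetical.
    import json
    import pandas as pd

    # Tabular data: CSV loads straight into a DataFrame
    csv_df = pd.read_csv("spending.csv")

    # Hierarchical data: parse the JSON first, then flatten the relevant elements
    with open("spending.json") as f:
        raw = json.load(f)
    json_df = pd.json_normalize(raw["records"])  # nested key-value pairs become columns

    print(csv_df.head())
    print(json_df.head())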

Data Cleaning and Preprocessing

  • Data cleaning and preprocessing steps are often necessary after importing to ensure data quality and consistency (the sketch at the end of this section illustrates several of them):
    • Handling missing values: Identifying and deciding how to treat missing data points (deletion, imputation)
    • Reformatting data types: Converting data fields to appropriate formats (string to date, integer to float)
    • Merging related data tables: Combining datasets based on common keys or identifiers to enable joint analysis
    • Deduplication: Removing duplicate records or entries that may distort analyses or aggregations
  • Large datasets may require programmatic approaches to efficiently import and process the data:
    • The pandas library in Python provides powerful functions for data loading, cleaning, and transformation (dropna(), fillna(), merge())
    • The dplyr package in R offers similar data manipulation capabilities (filter(), mutate(), left_join())
    • Splitting large datasets into smaller chunks or using database connections can help overcome memory constraints
  • Relational databases like SQLite can be used to import and store data for more complex querying and aggregation needs:
    • SQL (Structured Query Language) allows for flexible data retrieval, filtering, and joining across multiple tables
    • Python's sqlite3 module or R's RSQLite package provides an interface for interacting with SQLite databases
    • Importing data into a database enables efficient querying and integration with other data sources or applications, as in the sketch below
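
Here is a combined sketch of these cleaning steps and a SQLite workflow in Python; the file names, column names, and join key (district_id) are hypothetical stand-ins for whatever the imported datasets contain:

    # Clean two related tables, merge them, and load the result into SQLite.
    # File names, column names, and the join key are hypothetical.
    import sqlite3
    import pandas as pd

    schools = pd.read_csv("schools.csv")
    districts = pd.read_csv("districts.csv")

    # Handle missing values: drop rows missing the join key, fill a numeric gap
    schools = schools.dropna(subset=["district_id"])
    schools["enrollment"] = schools["enrollment"].fillna(0)

    # Reformat data types: parse dates, coerce identifiers to integers
    schools["opened"] = pd.to_datetime(schools["opened"], errors="coerce")
    schools["district_id"] = schools["district_id"].astype(int)

    # Deduplicate, then merge the related tables on the shared key
    schools = schools.drop_duplicates()
    merged = schools.merge(districts, on="district_id", how="left")

    # Store in SQLite so SQL can handle filtering, joining, and aggregation
    con = sqlite3.connect("education.db")
    merged.to_sql("schools", con, if_exists="replace", index=False)

    top = pd.read_sql_query(
        "SELECT district_name, SUM(enrollment) AS total_enrollment "
        "FROM schools GROUP BY district_name ORDER BY total_enrollment DESC LIMIT 5",
        con,
    )
    print(top)
    con.close()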