Data Journalism Unit 3 – Finding Data: Strategies and Sources

Finding data is a crucial skill for data journalists. This unit covers strategies for locating relevant datasets, from government portals to academic repositories, and teaches how to evaluate data quality, apply effective search techniques, and work around common obstacles in data acquisition. It also introduces tools for data discovery and emphasizes the practical application of these skills: students learn to define research questions, prioritize sources, and document their process, and to clean, integrate, and explore the data they acquire in preparation for real-world projects.

What's This Unit About?

  • Explores strategies and sources for finding data to support data journalism projects
  • Covers key concepts and terms related to data discovery and acquisition
  • Identifies various data sources and where to look for relevant datasets
  • Teaches effective search strategies and techniques to locate specific data
  • Provides guidance on evaluating the quality and reliability of data sources
  • Introduces tools and platforms that facilitate data discovery and access
  • Discusses common challenges encountered during data search and potential solutions
  • Emphasizes the practical application of data finding skills in real-world scenarios

Key Concepts and Terms

  • Data sources refer to the origins or providers of datasets (government agencies, research institutions, private companies)
  • Data formats include structured (CSV, JSON) and unstructured (text, images) data types
  • Data portals are centralized platforms that host and provide access to multiple datasets
  • APIs (Application Programming Interfaces) enable programmatic access to data from web services (see the sketch after this list)
  • Data scraping involves extracting data from websites or documents using automated tools
  • Data licensing determines the legal permissions and restrictions for using a dataset
  • Metadata provides descriptive information about a dataset's structure, content, and provenance
  • Data quality assessment evaluates the accuracy, completeness, and consistency of a dataset
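
To make API access concrete, here is a minimal Python sketch that queries the World Bank's public v2 API (one of the sources discussed below) for total-population figures. The indicator code and the two-element response layout follow the World Bank's documented format, but always verify against the provider's documentation before building on a schema.

```python
# Minimal sketch: pulling total-population figures for three countries from
# the World Bank's public v2 API (https://api.worldbank.org/v2/).
import requests

url = "https://api.worldbank.org/v2/country/BR;IN;NG/indicator/SP.POP.TOTL"
resp = requests.get(url, params={"format": "json", "date": "2022"}, timeout=30)
resp.raise_for_status()

# The API returns a two-element array: [paging metadata, list of records].
metadata, records = resp.json()
for rec in records:
    print(rec["country"]["value"], rec["date"], rec["value"])
```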

Data Sources: Where to Look

  • Government open data portals (data.gov, European Data Portal) provide access to public sector datasets
  • International organizations (World Bank, United Nations) publish data on global development and social issues
  • Academic and research institutions offer datasets related to various fields of study
  • Scientific data repositories (NASA, NOAA) host environmental and scientific datasets
  • Business and financial data providers (Bloomberg, Crunchbase) offer company and market data
  • Social media platforms (Twitter, Facebook) can be sources of user-generated data
  • Crowdsourced data initiatives (OpenStreetMap) rely on community contributions
  • Domain-specific data portals cater to specific industries or topics (healthcare, transportation)

Search Strategies and Techniques

  • Use relevant keywords and phrases to narrow down the search scope
    • Combine keywords using Boolean operators (AND, OR, NOT) to refine results
  • Employ advanced search features, such as filters and facets, to specify data characteristics (date range, geographic coverage)
  • Explore subject-specific databases and repositories aligned with the research topic
  • Consult data catalogs and directories that curate and organize datasets by theme or domain (many catalogs expose a search API; see the sketch after this list)
  • Leverage data search engines (Google Dataset Search) that index datasets from various sources
  • Engage with data communities and forums to seek recommendations and discover lesser-known datasets
  • Utilize formal data request mechanisms, such as U.S. Freedom of Information Act (FOIA) requests, to access government-held data
  • Collaborate with domain experts who may have knowledge of specialized data sources
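
Several of these strategies can be scripted. For example, data.gov runs on CKAN, whose action API exposes a package_search endpoint that accepts Solr-style boolean queries, so the keyword techniques above carry over directly into code. A hedged sketch (the response fields follow CKAN's documented format; confirm them against the portal you target):

```python
# Hedged sketch: keyword search against data.gov's CKAN catalog API.
import requests

resp = requests.get(
    "https://catalog.data.gov/api/3/action/package_search",
    params={"q": "traffic AND fatalities", "rows": 5},  # Solr-style boolean query
    timeout=30,
)
resp.raise_for_status()

for dataset in resp.json()["result"]["results"]:
    org = (dataset.get("organization") or {}).get("title", "unknown publisher")
    print(dataset["title"], "/", org)
```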

Evaluating Data Quality

  • Assess the credibility and reputation of the data provider or source
  • Check for the presence and completeness of metadata describing the dataset
  • Examine the data collection methodology and any potential biases or limitations
    • Consider the sample size, representativeness, and potential selection biases
  • Verify the timeliness and update frequency of the dataset to ensure currency
  • Investigate the data cleaning and preprocessing steps applied to the raw data
  • Assess the consistency and integrity of the data across different dimensions or sources (several of these checks can be scripted; see the sketch after this list)
  • Validate the data against other reliable sources or ground truth measurements
  • Review any available documentation or data dictionaries for clarity and comprehensiveness
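
Many of these checks can be automated in a few lines of pandas. Below is a minimal sketch; the file name and columns (inspection_date, score) are hypothetical stand-ins for whatever dataset is under review.

```python
# Hedged quality-assessment sketch; "inspections.csv" and its columns are
# hypothetical stand-ins for the dataset under review.
import pandas as pd

df = pd.read_csv("inspections.csv", parse_dates=["inspection_date"])

# Completeness: share of missing values per column.
print(df.isna().mean().sort_values(ascending=False))

# Consistency: exact duplicate records often signal collection problems.
print("duplicate rows:", df.duplicated().sum())

# Timeliness: does the coverage window match what the provider claims?
print("date range:", df["inspection_date"].min(), "to", df["inspection_date"].max())

# Validity: flag impossible values (here, a hypothetical 0-100 score).
print("out-of-range scores:", ((df["score"] < 0) | (df["score"] > 100)).sum())
```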

Tools for Data Discovery

  • Data catalogs and search engines (Google Dataset Search, Kaggle Datasets) facilitate dataset discovery across multiple sources
  • Web scraping tools (BeautifulSoup, Scrapy) enable automated extraction of data from websites (see the sketch after this list)
  • API clients and libraries (Python's Requests, R's httr) simplify accessing data through APIs
  • Data wrangling and cleaning tools (OpenRefine, Trifacta) help explore and preprocess datasets
  • Visualization tools (Tableau, D3.js) provide interactive exploration of data patterns and insights
  • Statistical software (R, Python libraries) offers powerful data analysis and modeling capabilities
  • Geospatial analysis tools (QGIS, ArcGIS) enable working with location-based datasets
  • Collaborative data platforms (GitHub, Kaggle) foster sharing and collaboration around datasets
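
To make the scraping workflow concrete, here is a hedged sketch using requests and BeautifulSoup. The URL and page structure are hypothetical; a real job starts by inspecting the page and checking the site's terms of service and robots.txt.

```python
# Hedged scraping sketch; the URL and table layout are hypothetical.
import csv
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.gov/reports/spending", timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

table = soup.find("table")  # assumes the data sits in the page's first HTML table
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# Save as CSV for cleaning in OpenRefine or pandas.
with open("spending.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

Scrapy becomes the better fit once a job grows beyond a single page: it adds crawling, throttling, and retry logic that a one-off sketch like this lacks.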

Common Challenges and Solutions

  • Data availability: Some desired datasets may not be publicly accessible or may require permissions
    • Solution: Explore alternative data sources, submit data requests, or consider data scraping techniques
  • Data quality issues: Datasets may contain missing values, inconsistencies, or errors
    • Solution: Apply data cleaning techniques, impute missing values, and validate data integrity
  • Data integration: Combining datasets from different sources can be challenging due to format and schema variations
    • Solution: Use data integration tools, standardize data formats, and develop data mapping strategies (see the sketch after this list)
  • Legal and ethical considerations: Certain datasets may have usage restrictions or raise privacy concerns
    • Solution: Review data licenses, adhere to data protection regulations, and anonymize sensitive information
  • Technical barriers: Accessing and processing large datasets may require specialized skills and infrastructure
    • Solution: Collaborate with technical experts, leverage cloud computing resources, and optimize data processing workflows
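
The integration and quality fixes above often reduce to a few lines of pandas. A hedged sketch, with hypothetical files joined on a county FIPS code:

```python
# Hedged integration sketch: two hypothetical files covering the same
# counties under different schemas, standardized and merged on a FIPS code.
import pandas as pd

health = pd.read_csv("health_outcomes.csv")  # columns: "FIPS", "Mortality Rate"
income = pd.read_csv("median_income.csv")    # columns: "fips_code", "income"

# Standardize schemas: consistent names and a shared, zero-padded join key.
health = health.rename(columns={"FIPS": "fips", "Mortality Rate": "mortality_rate"})
income = income.rename(columns={"fips_code": "fips"})
for df in (health, income):
    df["fips"] = df["fips"].astype(str).str.zfill(5)

merged = health.merge(income, on="fips", how="left", validate="one_to_one")

# Simple median imputation for counties missing income data; whether any
# imputation is defensible depends on the story and should be disclosed.
merged["income"] = merged["income"].fillna(merged["income"].median())
```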

Putting It into Practice

  • Define the research question or problem statement to guide the data discovery process
  • Break down the data requirements into specific variables, granularity, and coverage needed
  • Prioritize data sources based on relevance, reliability, and ease of access
  • Iteratively search and evaluate datasets, refining search strategies as needed
  • Document the data discovery process, including search queries, sources explored, and evaluation criteria
  • Clean and preprocess the acquired datasets to ensure data quality and consistency
  • Integrate and combine datasets from multiple sources to enrich the analysis
  • Perform exploratory data analysis to gain insights and identify potential stories or angles (see the sketch after this list)
  • Communicate findings and insights effectively through data visualizations and narratives
  • Continuously update and expand the data collection as new sources or datasets become available
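
A short exploratory pass might look like the following sketch, continuing the hypothetical county data from the previous example:

```python
# Hedged exploratory pass over the hypothetical merged county data.
import pandas as pd

merged = pd.read_csv("merged_counties.csv")  # hypothetical output of earlier steps

# Summary statistics: ranges, outliers, and obvious data problems.
print(merged[["mortality_rate", "income"]].describe())

# A quick grouped comparison; a steep gradient across income quartiles
# could be the seed of a story worth reporting out.
merged["income_quartile"] = pd.qcut(merged["income"], 4, labels=["Q1", "Q2", "Q3", "Q4"])
print(merged.groupby("income_quartile", observed=True)["mortality_rate"].mean())
```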


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
