
Data collection and acquisition are crucial steps in data science. This section covers various data sources, including databases, files, and web services, as well as modern sources like IoT devices and social media.

Understanding data structures is key to effective analysis. We'll explore structured, semi-structured, and unstructured data types, along with their characteristics and limitations. This knowledge helps in choosing appropriate tools and techniques for data processing.

Data Sources

Database Types and Characteristics

  • Databases organize data for efficient retrieval and manipulation
    • Relational databases (MySQL, PostgreSQL) structure data in tables with predefined schemas
    • NoSQL databases (MongoDB, Cassandra) offer flexible schemas for unstructured or semi-structured data
  • Files store data on computer systems
    • Text files (CSV, JSON) contain human-readable data
    • Binary files (images, audio) store non-textual information
    • Spreadsheets (Excel, Google Sheets) combine features of databases and text files
  • Web services enable machine-to-machine interaction over networks
    • REST APIs use HTTP methods for data operations
    • SOAP services employ XML-based messaging protocols
    • GraphQL APIs allow clients to request specific data structures
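
To make the file-based sources above concrete, here is a minimal sketch that parses CSV and JSON text with Python's standard library. The sample strings and helper names (`load_csv_rows`, `load_json_record`) are hypothetical stand-ins for real files, not part of the original guide.

```python
import csv
import io
import json

# Hypothetical in-memory stand-ins for a CSV file and a JSON file.
CSV_TEXT = "city,temp_c\nOslo,4.5\nCairo,29.0\n"
JSON_TEXT = '{"city": "Lima", "readings": [18.2, 19.1]}'


def load_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json_record(text):
    """Parse JSON text into native Python objects."""
    return json.loads(text)


csv_rows = load_csv_rows(CSV_TEXT)
json_record = load_json_record(JSON_TEXT)

# CSV values arrive as strings; JSON preserves numeric types.
print(type(csv_rows[0]["temp_c"]).__name__)
print(type(json_record["readings"][0]).__name__)
```

One practical difference shows up immediately: CSV carries no type information, so numeric columns must be converted explicitly, while JSON numbers are parsed as numbers.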

Modern Data Sources

  • IoT devices generate real-time data
    • Smart home sensors (temperature, humidity)
    • Wearable fitness trackers (heart rate, steps)
  • Social media platforms provide user-generated content
    • Text posts (tweets, status updates)
    • Visual content (Instagram photos, TikTok videos)
  • Open data initiatives increase data accessibility
    • Government datasets (census data, crime statistics)
    • Scientific datasets (genomic sequences, climate records)

Data Structures

Structured Data

  • Follows predefined schema or model
  • Organized in tables with rows and columns
  • Examples include relational databases and spreadsheets
  • Easily queried using standard query languages (SQL)
  • Supports efficient indexing and searching
  • Ideal for financial records, inventory management, and customer databases
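
As a rough illustration of these properties, the sketch below builds a small inventory table in an in-memory SQLite database, adds an index, and queries it with SQL. The table name, columns, and rows are invented for the example.

```python
import sqlite3

# In-memory database; the schema is declared before any data is inserted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    "sku TEXT PRIMARY KEY, item TEXT NOT NULL, quantity INTEGER NOT NULL)"
)

# Hypothetical rows, one tuple per table row.
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [("A-100", "widget", 12), ("B-200", "gadget", 0), ("C-300", "gizmo", 7)],
)

# An index lets the engine find rows by quantity without a full scan.
conn.execute("CREATE INDEX idx_inventory_quantity ON inventory (quantity)")

in_stock = conn.execute(
    "SELECT item, quantity FROM inventory "
    "WHERE quantity > 0 ORDER BY quantity DESC"
).fetchall()
print(in_stock)
conn.close()
```

Because every row follows the same schema, a single declarative query is enough to filter and sort the data.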

Semi-Structured Data

  • Possesses some organizational properties without rigid structure
  • Common formats include XML and JSON
  • Allows for nested data structures and flexible schemas
  • Supports tags or metadata for improved organization
  • Used in web services, configuration files, and document databases
  • Requires more complex parsing compared to structured data
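
The extra parsing effort can be sketched as follows: nested JSON has to be walked recursively before it fits a flat, table-like shape. The `flatten` helper and the sample document are assumptions made for illustration; dotted keys are just one possible convention.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}{index}."))
    else:
        # Leaf value: drop the trailing dot from the accumulated path.
        flat[prefix.rstrip(".")] = value
    return flat


# Hypothetical semi-structured record with nesting and a list.
DOC = '{"user": {"name": "Ada", "tags": ["admin", "dev"]}, "active": true}'
flat_record = flatten(json.loads(DOC))
print(flat_record)
```

Records with different nesting produce different sets of keys, which is exactly the flexibility (and the cost) of a schema that is not fixed in advance.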

Unstructured Data

  • Lacks predefined data model or organization
  • Includes free-form text, images, audio, and video files
  • Challenging to query and analyze using traditional methods
  • Requires advanced techniques for information extraction
    • Natural language processing (NLP) for text analysis
    • Computer vision for image and video processing
  • Examples include social media posts, email content, and surveillance footage
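
Even a simple extraction step shows why unstructured text needs custom handling. The sketch below pulls hashtags out of free-form posts with a regular expression. This is a deliberately simple stand-in, not a full NLP pipeline, and the posts are made up.

```python
import re
from collections import Counter

# Hypothetical free-form posts with no schema to query against.
posts = [
    "Loving the new #Python release! #coding",
    "Debugging all night... #coding",
]


def extract_hashtags(text):
    """Return lowercase hashtags found in a piece of free text."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]


# Count how often each hashtag appears across all posts.
hashtag_counts = Counter(
    tag for post in posts for tag in extract_hashtags(post)
)
print(hashtag_counts.most_common(1))
```

The structure (a table of tag counts) only exists after the extraction rule is applied; nothing in the raw text declares it.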

Data Types and Limitations

Numeric Data Types

  • Integers represent whole numbers
    • Limited by available bits (32-bit, 64-bit)
  • Floating-point numbers represent decimal values
    • Subject to rounding errors in calculations
  • Complex numbers combine real and imaginary components
    • Used in scientific and engineering applications
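
These limits are easy to observe directly. The following sketch uses `ctypes.c_int32` to mimic a fixed-width integer (Python's own `int` is arbitrary-precision, so the wrapper is an assumption made to show the effect), then demonstrates floating-point rounding and a complex magnitude.

```python
import ctypes
import math
from decimal import Decimal

# A 32-bit signed integer tops out at 2**31 - 1; one past that wraps around.
wrapped = ctypes.c_int32(2**31).value
print(wrapped)

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)              # exact comparison fails
print(math.isclose(float_sum, 0.3))  # tolerance-based comparison succeeds

# Decimal arithmetic avoids this particular rounding error.
exact_sum = Decimal("0.1") + Decimal("0.2")
print(exact_sum == Decimal("0.3"))

# Complex numbers carry a real and an imaginary part.
z = complex(3, 4)
print(abs(z))  # magnitude of 3 + 4j
```

In practice this is why float comparisons use a tolerance and why monetary values are often stored as decimals or integer cents.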

Categorical and Textual Data

  • Categorical data represents discrete categories or labels
    • Nominal data (unordered categories like colors)
    • Ordinal data (ordered categories like education levels)
  • Text data encompasses strings of characters
    • Requires specialized processing techniques (tokenization, stemming)
    • Challenges include handling different languages and encodings
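
A minimal sketch of how these types are commonly prepared for analysis: one-hot encoding for nominal values, rank encoding for ordinal values, and a basic tokenizer for text. The category lists and helper names are hypothetical, and the tokenizer is a simple regular expression rather than a linguistic tool.

```python
import re

# Hypothetical category definitions.
COLORS = ["red", "green", "blue"]  # nominal: no order
EDUCATION_LEVELS = ["high school", "bachelor", "master", "doctorate"]  # ordinal


def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector, one slot per category."""
    return [1 if value == category else 0 for category in categories]


def encode_ordinal(value, ordered_levels):
    """Encode an ordinal value as its rank in the ordered list."""
    return ordered_levels.index(value)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


tokens = tokenize("Café owners love data!")
print(tokens)

# The same string occupies a different number of bytes per encoding.
utf8_len = len("café".encode("utf-8"))
latin1_len = len("café".encode("latin-1"))
print(utf8_len, latin1_len)
```

Rank encoding preserves order for ordinal data, while one-hot encoding avoids implying an order that nominal data does not have.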

Temporal and Geospatial Data

  • Temporal data includes dates and times
    • Requires careful handling of time zones and formats
    • Supports time-series analysis and event sequencing
  • Geospatial data represents geographic locations and shapes
    • Utilizes specialized data structures (points, lines, polygons)
    • Enables spatial analysis and mapping applications
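
Two small examples illustrate the handling these types need: converting a timestamp with a UTC offset to UTC, and computing the great-circle distance between two points. The haversine formula and the 6371 km mean Earth radius are choices made for this sketch, and the timestamp is hypothetical.

```python
import math
from datetime import datetime, timezone

# Parse an ISO 8601 timestamp that carries a +05:30 offset, then normalize.
local_time = datetime.fromisoformat("2024-03-01T09:30:00+05:30")
utc_time = local_time.astimezone(timezone.utc)
print(utc_time.isoformat())

EARTH_RADIUS_KM = 6371.0  # mean radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way around the equator.
quarter_equator = haversine_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter_equator, 1))
```

Storing timestamps in UTC and converting only for display sidesteps most time zone bugs.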

Binary and Specialized Data

  • Binary data represents non-textual information
    • Includes images, audio files, and compiled software
    • Requires specific tools and libraries for manipulation
  • Choice of data type impacts:
    • Storage requirements (space efficiency)
    • Processing speed (computational complexity)
    • Available analysis techniques (statistical methods, machine learning algorithms)
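
The storage impact can be sketched with the standard `array` module, which packs values at a fixed width. The type codes below ("h" for 16-bit integers, "q" for 64-bit integers, "d" for 64-bit floats) have the sizes shown on common platforms; treat the exact byte counts as typical rather than guaranteed.

```python
from array import array

# The same 1,000 small values stored with three different element types.
values = list(range(1000))

sizes = {
    code: len(array(code, values).tobytes())
    for code in ("h", "q", "d")
}

for code, size in sizes.items():
    print(code, size, "bytes")
```

When the values fit, the narrower type stores the same data in a fraction of the space, which adds up quickly for large datasets.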

Data Quality and Reliability

Data Quality Dimensions

  • Accuracy measures correctness of data values
  • Completeness assesses presence of all required data
  • Consistency ensures data aligns across different sources or time periods
  • Timeliness evaluates how up-to-date the data is
  • Validity confirms data adheres to defined rules or constraints
  • Uniqueness prevents duplicate records or redundant information
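
Several of these dimensions can be expressed as simple ratios. The sketch below computes completeness, uniqueness, and validity over a handful of hypothetical records. The metric definitions are one reasonable choice among several, and the age rule is an assumption.

```python
# Hypothetical records with a missing email, a duplicate id, and bad ages.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 2, "email": "b@example.com", "age": -5},
    {"id": 4, "email": "c@example.com", "age": 150},
]


def completeness(rows, field):
    """Share of rows where the field is present (not None)."""
    return sum(row.get(field) is not None for row in rows) / len(rows)


def uniqueness(rows, field):
    """Share of distinct values among all rows for the field."""
    return len({row[field] for row in rows}) / len(rows)


def validity(rows, field, rule):
    """Share of rows whose field value satisfies the rule."""
    return sum(1 for row in rows if rule(row[field])) / len(rows)


email_completeness = completeness(records, "email")
id_uniqueness = uniqueness(records, "id")
age_validity = validity(records, "age", lambda age: 0 <= age <= 120)
print(email_completeness, id_uniqueness, age_validity)
```

Tracking these ratios over time gives an early signal when a data source starts to degrade.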

Evaluating Data Source Reliability

  • Assess reputation of data source
    • Consider credibility of organization or individual providing data
  • Examine data collection methodologies
    • Evaluate sampling techniques, survey design, or measurement processes
  • Review documentation of data provenance
    • Trace data lineage and transformation steps
  • Implement data cleaning and preprocessing techniques
    • Handle missing values through imputation or deletion
    • Detect and address outliers
    • Normalize data to common scales or formats
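
The three preprocessing steps above can be sketched with the standard library. This example assumes median imputation, the 1.5 × IQR rule for outliers, and min-max scaling; each is a common choice rather than the only option, and the input values are invented.

```python
import statistics

# Hypothetical measurements with one missing value and one extreme value.
raw = [10.0, 12.0, None, 11.0, 13.0, 95.0]


def impute_median(values):
    """Replace missing values with the median of the observed values."""
    fill = statistics.median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


imputed = impute_median(raw)
low, high = iqr_bounds(imputed)
outliers = [v for v in imputed if v < low or v > high]
kept = [v for v in imputed if low <= v <= high]
normalized = min_max(kept)
print(outliers, normalized)
```

The median is used here because a mean would be pulled upward by the extreme value before it is detected. Whether to delete, cap, or keep an outlier depends on whether it is an error or a genuine observation.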

Enhancing Data Reliability

  • Employ version control for datasets
    • Track changes and updates over time
  • Maintain data lineage records
    • Document sources, transformations, and usage of data
  • Cross-check data against independent sources
    • Compare values across multiple sources to confirm consistency
  • Consider ethical implications of data usage
    • Ensure compliance with privacy regulations (GDPR, CCPA)
    • Obtain informed consent for data collection when necessary
  • Address potential biases in data collection
    • Recognize sampling biases (selection bias, response bias)
    • Mitigate biases through statistical techniques or diverse data sources
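
Two of these practices lend themselves to a short sketch: fingerprinting a dataset so that any change between versions is detectable, and comparing two sources to flag disagreements. The function names and sample values are hypothetical, and a SHA-256 hash over canonical JSON is only one way to build a fingerprint.

```python
import hashlib
import json


def fingerprint(records):
    """Hash a canonical JSON rendering so equal data gives equal digests."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_mismatches(source_a, source_b):
    """Return the shared keys whose values differ between two sources."""
    shared = source_a.keys() & source_b.keys()
    return sorted(key for key in shared if source_a[key] != source_b[key])


# Hypothetical dataset versions: key order differs in one, a value in another.
version_1 = [{"id": 1, "value": 10}]
version_1_reordered = [{"value": 10, "id": 1}]
version_2 = [{"id": 1, "value": 11}]

same_data = fingerprint(version_1) == fingerprint(version_1_reordered)
changed = fingerprint(version_1) != fingerprint(version_2)
print(same_data, changed)

# Hypothetical figures reported by two independent sources.
source_a = {"north": 120, "south": 95}
source_b = {"north": 118, "south": 95, "east": 40}
mismatches = find_mismatches(source_a, source_b)
print(mismatches)
```

Recording the fingerprint alongside each dataset version makes silent changes visible, and the mismatch list points to the values that need a closer look.
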
© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

