
Data cleansing is crucial for ensuring accurate and reliable information in business intelligence. It involves identifying and correcting errors, handling missing values, and removing duplicates. These techniques are essential for maintaining data integrity and making informed decisions.

Data enrichment and standardization further enhance the value of cleansed data. By adding external information and ensuring consistent formats, businesses can create a more comprehensive and usable dataset. This process improves data quality and enables more effective analysis and reporting.

Data Cleansing Techniques

Data cleansing for error correction

  • Data profiling examines data to identify potential issues
    • Missing values occur when data is not present or recorded
    • Format inconsistencies arise when data is stored in different ways (MM/DD/YYYY vs DD-MM-YYYY)
    • Outliers are data points that significantly deviate from the norm (a customer age of 150 years)
    • Duplicates are repeated records or entries in the dataset
  • Data validation ensures data accuracy and integrity by defining rules and constraints
    • Identifies and corrects data that violates these rules (a negative age value or a future date of birth)
  • Imputation handles missing values through various methods
    • Removes records with missing values to maintain data integrity
    • Fills missing values with a default value or a calculated value (mean age, median income, or mode of transportation)
    • Uses advanced techniques like k-nearest neighbors or regression to estimate missing values based on similar records
  • Outlier detection and treatment identifies and addresses data points that deviate significantly from the norm
    • Decides whether to remove, replace, or keep outliers based on their impact and the business context (removing a customer age of 150 years old as it is likely an error)
  • Deduplication removes duplicate records to avoid data redundancy and inconsistency
    • Exact matching identifies identical records based on all attributes
    • Fuzzy matching accounts for minor variations ("John Doe" vs "John E. Doe")
    • Rule-based matching defines custom criteria for identifying duplicates (matching based on email address or phone number)
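The cleansing steps above (validation, imputation, outlier treatment, and deduplication) can be sketched together in plain Python. This is a minimal illustration, not a production pipeline: the customer records, the 0–120 plausible-age range, and the 0.8 fuzzy-match threshold are all hypothetical examples.

```python
from statistics import mean
from difflib import SequenceMatcher

def cleanse(records, max_age=120):
    """Toy cleansing pass: validate ages, impute missing values, dedupe exact matches."""
    # Validation / outlier treatment: ages outside a plausible range become missing
    for r in records:
        age = r.get("age")
        if age is not None and not (0 <= age <= max_age):
            r["age"] = None

    # Imputation: fill missing ages with the mean of the observed ages
    fill = mean(r["age"] for r in records if r["age"] is not None)
    for r in records:
        if r["age"] is None:
            r["age"] = fill

    # Deduplication: exact matching on all attributes
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

customers = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": None},   # missing value -> imputed
    {"name": "Cid", "age": 150},    # implausible outlier -> treated as missing
    {"name": "Ann", "age": 34},     # exact duplicate -> removed
]
print(cleanse(customers))

# Fuzzy matching handles near-duplicates that exact matching would miss
ratio = SequenceMatcher(None, "john doe", "john e. doe").ratio()
print(round(ratio, 2))   # 0.84
```

A real pipeline would use library support (e.g., dedicated record-linkage tooling) rather than pairwise string ratios, but the order of operations shown here mirrors the list above.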

Data Enrichment and Standardization

Data enrichment with external sources

  • Data enrichment enhances existing data by adding new attributes or features from external sources
    • Appends demographic data to customer records (age, income, or education level)
    • Incorporates weather data to analyze sales trends (temperature, precipitation, or humidity)
  • Data integration combines data from multiple sources to create a unified view
    • Ensures data consistency and compatibility across sources (merging customer data from different databases)
  • Data transformation converts data from one format or structure to another
    • Enables data to be used effectively across different systems or applications (converting XML to JSON or vice versa)
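The enrichment and integration ideas above can be sketched as follows. The `demographics` lookup table, the customer ids, and the field names are hypothetical stand-ins for a real external source.

```python
import json

# Hypothetical external demographic source, keyed by customer id
demographics = {
    101: {"income": 52000, "education": "BS"},
    102: {"income": 67000, "education": "MS"},
}

customers = [
    {"id": 101, "name": "Ann"},
    {"id": 102, "name": "Bob"},
    {"id": 103, "name": "Cid"},   # no match in the external source
]

def enrich(records, external):
    """Append external attributes to each record; unmatched ids are left as-is."""
    return [{**r, **external.get(r["id"], {})} for r in records]

enriched = enrich(customers, demographics)
print(enriched[0])   # {'id': 101, 'name': 'Ann', 'income': 52000, 'education': 'BS'}

# Transformation sketch: serialize the unified records as JSON for another system
print(json.dumps(enriched[2]))   # {"id": 103, "name": "Cid"}
```

Joining on a shared key like this is the same idea as a database or DataFrame merge; the JSON step stands in for the format conversion (e.g., XML to JSON) mentioned above.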

Standardization and normalization of datasets

  • Standardization ensures consistent representation of data values
    • Converts all date formats to a single, standard format (YYYY-MM-DD)
    • Standardizes units of measurement (converting all lengths to meters or temperatures to Celsius)
  • Normalization organizes data to minimize redundancy and dependency
    • Applies normal forms (1NF, 2NF, 3NF) to database tables to reduce data anomalies and improve data integrity
      1. 1NF: Each column contains atomic values, and each record has a unique identifier
      2. 2NF: No non-prime attribute is functionally dependent on a part of a composite key
      3. 3NF: No transitive dependencies between non-prime attributes
  • Naming conventions establish consistent naming rules for variables, columns, and tables
    • Improves data readability and maintainability (using snake_case or camelCase consistently)
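A minimal sketch of the standardization steps above. It assumes dates arrive only in the two formats mentioned earlier (MM/DD/YYYY or DD-MM-YYYY); in practice the source format of each field must be known, since values like 03/04/2024 are ambiguous.

```python
import re
from datetime import datetime

def to_iso(date_str):
    """Normalize MM/DD/YYYY or DD-MM-YYYY dates to YYYY-MM-DD.
    Illustrative simplification: assumes only these two formats occur."""
    for fmt in ("%m/%d/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {date_str}")

def to_celsius(fahrenheit):
    """Standardize temperatures to a single unit (Celsius)."""
    return (fahrenheit - 32) * 5 / 9

def to_snake_case(name):
    """Apply a snake_case naming convention to camelCase/PascalCase names."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()

print(to_iso("03/14/2024"))                   # 2024-03-14
print(to_iso("14-03-2024"))                   # 2024-03-14
print(round(to_celsius(212)))                 # 100
print(to_snake_case("customerEmailAddress"))  # customer_email_address
```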

Evaluating Data Cleansing and Enrichment Effectiveness

Evaluation of data quality techniques

  • Data quality dimensions assess various aspects of data quality
    • Accuracy measures the extent to which data correctly represents reality (a customer's recorded age matches their actual age)
    • Completeness indicates the proportion of data that is present and not missing (all required fields in a form are filled out)
    • Consistency ensures data is free from contradictions and adheres to a defined format (all phone numbers follow the same structure)
    • Timeliness refers to the freshness and availability of data when needed (real-time stock prices for investment decisions)
    • Validity confirms that data conforms to defined business rules and constraints (email addresses contain an "@" symbol and a domain name)
  • Data quality metrics define quantitative measures for each data quality dimension
    • Accuracy: Percentage of correct values (95% of addresses are verified as correct)
    • Completeness: Percentage of non-missing values (98% of customer records have a valid email address)
    • Consistency: Number of data inconsistencies or anomalies (0.5% of records have conflicting information)
    • Timeliness: Average data latency or age (sales data is updated every 15 minutes)
  • Business requirements alignment assesses the impact of data cleansing and enrichment on business objectives
    • Measures improvements in decision-making accuracy, operational efficiency, customer satisfaction, and regulatory compliance
  • Continuous monitoring and improvement regularly assesses data quality to identify areas for improvement
    • Adapts data cleansing and enrichment processes based on changing business needs and data landscapes (updating data validation rules as new products are introduced)
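The metrics above can be computed directly once the dimension's rule is pinned down. A small sketch using the email example from the text; the sample values and the simple "@ plus domain" rule are illustrative, not a full email validator.

```python
def completeness(values):
    """Completeness metric: percentage of non-missing values."""
    present = sum(1 for v in values if v not in (None, ""))
    return 100 * present / len(values)

def validity(values):
    """Validity metric: percentage of values passing a simple email rule
    ('@' symbol plus a domain), per the example rule in the text."""
    valid = sum(1 for v in values
                if v and "@" in v and "." in v.split("@")[-1])
    return 100 * valid / len(values)

emails = ["a@x.com", "b@x.com", None, "bad-address"]
print(f"completeness: {completeness(emails):.0f}%")  # completeness: 75%
print(f"validity: {validity(emails):.0f}%")          # validity: 50%
```

Tracking these percentages over time is what the continuous-monitoring bullet amounts to: recompute the metrics on each load and flag regressions against a target threshold.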
© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

