You have 3 free guides left 😟
Unlock your guides
You have 3 free guides left 😟
Unlock your guides

3.3 Data Transformation and Cleansing

3 min readjuly 18, 2024

and cleansing are vital steps in the ETL process, ensuring data quality and usability for business intelligence. These processes convert data from various sources into a unified format, resolving issues like missing values, outliers, and inconsistencies.

Techniques like , normalization, and aggregation transform data for analysis. Cleansing strategies include , treating outliers, and resolving inconsistencies. These steps are crucial for accurate decision-making based on clean, trustworthy data.

Data Transformation and Cleansing in ETL

Purpose of data transformation and cleansing

Top images from around the web for Purpose of data transformation and cleansing
Top images from around the web for Purpose of data transformation and cleansing
  • Critical steps in the Extract, Transform, Load (ETL) process ensure data quality, consistency, and usability for business intelligence and analytics
  • Data transformation converts and standardizes data from various sources (databases, files, APIs) into a unified format enabling seamless integration and analysis
  • identifies and resolves data quality issues (missing values, outliers, inconsistencies) improving , completeness, and reliability for enhanced decision-making and reporting based on clean and trustworthy data

Impact of data quality issues

  • Missing values occur when records or fields lack information leading to inaccurate analysis and skewed results (null values, blank cells)
  • Outliers are data points that significantly deviate from the normal range or distribution (extreme values, anomalies) and may distort statistical measures and affect decision-making
  • Inconsistencies arise from contradictory or conflicting data across different sources or within the same dataset (formatting variations, data entry errors) hindering data integration and leading to confusion and misinterpretation
  • Duplicates are redundant records or entries representing the same entity (repeated customer records, identical transactions) inflating data volume and leading to incorrect aggregations and analyses
  • Incorrect data types occur when there are mismatched or inappropriate data types for specific fields (storing dates as text, numbers as strings) causing errors in calculations, filtering, and reporting

Techniques for data transformation

  • Data type conversion ensures consistency and compatibility by converting data types (string to numeric, date/time formats)
  • restructures data to eliminate redundancy and improve by splitting data into separate tables based on dependencies and relationships (1NF, 2NF, 3NF)
  • combines and summarizes data at different levels of granularity enabling analysis and reporting at various dimensions (daily sales, monthly revenue, yearly growth)
  • divides data into smaller, manageable subsets based on specific criteria (customer segments, product categories) facilitating parallel processing and targeted analysis
  • enhances data with additional information from external sources (demographic data, weather information) providing more context and insights for analysis and decision-making

Strategies for data cleansing

  • :
    1. Deletion: Removing records with missing values
    2. : Estimating missing values based on available data using mean, median, or mode substitution or regression and machine learning techniques
  • and treatment:
    1. : ,
    2. Domain knowledge: Defining acceptable ranges based on business rules
    3. : Replacing extreme values with the nearest valid value
  • Resolving inconsistencies:
    1. : Applying consistent formats, units, and conventions for date formats (YYYY-MM-DD), address structures (street, city, state, zip), and naming conventions (lowercase, camel case)
    2. : Enforcing predefined rules and constraints through range checks (age between 18 and 100), pattern matching (email format), and cross-field validation (end date after start date)
  • involves identifying and removing duplicate records and merging or consolidating duplicates into a single representative record (customer records, product entries)
  • analyzes data characteristics, patterns, and quality metrics helping identify and prioritize data quality issues for cleansing efforts (data distribution, value frequencies, data types)
© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
Glossary
Glossary