📊 Business Intelligence Unit 2 – Data Warehousing: Concepts & Architecture
Data warehousing is a critical component of modern business intelligence. It centralizes an organization's historical data from various sources, optimizing it for complex queries and analysis. This enables informed decision-making and long-term trend identification.
The data warehouse architecture typically follows a three-tier model, separating data sources, the central repository, and user access tools. Key components include source systems, staging areas, ETL processes, and query tools. Understanding these elements is crucial for effective data management and analysis.
What's Data Warehousing?
Centralized repository for all of an organization's historical data, used for reporting and analysis
Integrates data from multiple heterogeneous sources (transactional databases, operational data stores, external sources) into a single unified schema
Optimized for querying and analysis rather than transaction processing, which enables complex queries and data mining
Stores large volumes of historical data (terabytes to petabytes) over extended periods (years to decades) for long-term trend analysis
Typically updated in batches (daily, weekly) rather than in real time to provide a stable, consistent view of data over time
Follows a subject-oriented design organizing data around key business subjects (customers, products, sales) rather than operational processes
Employs a non-volatile storage approach: data is read-only once loaded and is never changed, only added to over time
Why Do We Need It?
Enables informed strategic decision-making by providing a comprehensive, integrated view of enterprise-wide data
Overcomes the limitations of operational databases, which are not optimized for complex queries and analytics
Improves data quality and consistency through data integration and cleansing during the ETL process
Enhances performance of analytical queries by separating them from transactional workloads
Provides a historical record of business data over time for identifying long-term trends and patterns
Facilitates data consolidation from multiple systems and business units into an enterprise-wide view
Supports ad hoc queries, data mining, and advanced analytics not feasible on operational databases
Reduces IT costs and complexity by offloading analytical workloads from transactional systems
Key Components of a Data Warehouse
Source systems: the operational databases and external data sources that feed the warehouse
Staging area: temporary storage where data is extracted, transformed, and cleansed before loading into the warehouse
ETL (Extract, Transform, Load): tools and processes for extracting data from sources, converting it to the warehouse format, and loading it
Central repository: the core data warehouse database, storing integrated historical data in a denormalized, subject-oriented schema
Fact tables store quantitative metrics (sales amount, quantity) about business events and processes
Dimension tables store descriptive attributes (customer name, product category) providing context for facts
Data marts: subsets of warehouse data focused on specific business functions (marketing, finance) or departments
Metadata repository stores information about the structure, meaning, and lineage of data in the warehouse
Query and reporting tools allow users to access, analyze, and visualize warehouse data (OLAP, data mining, dashboards)
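To make the fact/dimension split and the role of query tools concrete, here is a minimal, hypothetical sketch in Python using pandas: a toy sales fact table is joined to a product dimension and rolled up by category, the basic pattern behind most OLAP-style reports and dashboards. The table and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: a sales fact table and a product dimension table.
fact_sales = pd.DataFrame({
    "product_key":  [1, 1, 2, 3, 3, 3],
    "date_key":     [20240101, 20240102, 20240101, 20240101, 20240102, 20240103],
    "sales_amount": [120.0, 80.0, 35.5, 210.0, 190.0, 205.0],
    "quantity":     [3, 2, 1, 7, 6, 7],
})

dim_product = pd.DataFrame({
    "product_key":  [1, 2, 3],
    "product_name": ["Laptop", "Mouse", "Monitor"],
    "category":     ["Computers", "Accessories", "Displays"],
})

# Join facts to their descriptive context, then roll up by category --
# the kind of query a reporting or OLAP tool runs against the warehouse.
report = (
    fact_sales.merge(dim_product, on="product_key")
    .groupby("category", as_index=False)
    .agg(total_sales=("sales_amount", "sum"),
         total_quantity=("quantity", "sum"))
)
print(report)
```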
Data Warehouse Architecture
Three-tier architecture separates data sources, the central warehouse repository, and end-user access and analysis tools
Bottom tier: source systems and the staging area for ETL
Middle tier: the central data warehouse repository and data marts
Top tier: end-user query, reporting, and analysis tools
Hub-and-spoke architecture uses a central data warehouse (hub) to feed multiple dependent data marts (spokes)
Bus architecture uses conformed dimensions to create a standardized, reusable data model across data marts
Federated architecture leaves data in source systems and uses a virtual data warehouse to provide a unified view (see the sketch after this list)
Hybrid architecture combines elements of different architectures (e.g., hub-and-spoke fed by a dimensional bus) to meet specific needs
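As a rough illustration of the federated idea, the sketch below simulates two source systems as separate SQLite files and exposes them through a single virtual view, so queries see one unified schema without the data being copied into a central store. All database, table, and column names are hypothetical.

```python
import sqlite3

# Two stand-in "source systems", kept as separate SQLite files to mimic
# data that stays where it lives (hypothetical names and schemas).
for name, rows in [("sales_eu.db", [("EU-1", 500.0)]),
                   ("sales_us.db", [("US-1", 900.0)])]:
    con = sqlite3.connect(name)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    con.execute("DELETE FROM orders")
    con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    con.commit()
    con.close()

# The "virtual warehouse": attach both sources and define a view that
# unions them, so analysts query one schema without moving the data.
hub = sqlite3.connect(":memory:")
hub.execute("ATTACH DATABASE 'sales_eu.db' AS eu")
hub.execute("ATTACH DATABASE 'sales_us.db' AS us")
hub.execute("""
    CREATE TEMP VIEW all_orders AS
    SELECT 'EU' AS region, order_id, amount FROM eu.orders
    UNION ALL
    SELECT 'US' AS region, order_id, amount FROM us.orders
""")
print(hub.execute(
    "SELECT region, SUM(amount) FROM all_orders GROUP BY region").fetchall())
```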
ETL Process Explained
Extraction retrieves data from source systems into the staging area; it can be full or incremental, depending on data volumes and update frequency
Transformation converts extracted data into the warehouse format; it resolves data quality issues, applies business rules, and derives new calculated fields
Data cleansing corrects errors, removes duplicates, and standardizes formats
Data integration combines data from multiple sources into a unified schema
Data aggregation summarizes detailed data into higher-level totals for analysis
Loading moves transformed data into the warehouse; tables and indexes are created or refreshed to optimize query performance
ETL workflows are scheduled, logged, and monitored to ensure data freshness, quality, and availability
Change Data Capture (CDC) identifies and extracts only new or modified data to streamline incremental loads (a simplified version appears in the sketch after this list)
Data lineage tracks the sources, transformations, and dependencies of warehouse data for auditing and troubleshooting
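A minimal sketch of the extract → transform → load flow, assuming a hypothetical orders.csv source with order_id, customer, amount, and updated_at columns, and a SQLite file standing in for the warehouse. The timestamp watermark is a simplified stand-in for CDC; a real pipeline would persist it between runs and add scheduling, logging, and error handling.

```python
import csv
import sqlite3

def extract(source_csv: str, last_loaded: str) -> list[dict]:
    """Incremental extraction: keep only rows changed since the last load
    (a simple timestamp watermark standing in for Change Data Capture)."""
    with open(source_csv, newline="") as f:
        return [row for row in csv.DictReader(f)
                if row["updated_at"] > last_loaded]

def transform(rows: list[dict]) -> list[tuple]:
    """Cleansing and shaping: standardize text, cast types, derive a field."""
    out = []
    for row in rows:
        customer = row["customer"].strip().title()    # standardize format
        amount = round(float(row["amount"]), 2)        # enforce numeric type
        order_date = row["updated_at"][:10]            # derive a date field
        out.append((row["order_id"], customer, amount, order_date))
    return out

def load(rows: list[tuple], warehouse_db: str) -> None:
    """Append-only load into the warehouse table (non-volatile: no updates)."""
    con = sqlite3.connect(warehouse_db)
    con.execute("""CREATE TABLE IF NOT EXISTS fact_orders
                   (order_id TEXT, customer TEXT, amount REAL, order_date TEXT)""")
    con.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    # Hypothetical file names; the watermark would normally be read from
    # metadata rather than hard-coded.
    staged = extract("orders.csv", last_loaded="2024-01-01T00:00:00")
    load(transform(staged), "warehouse.db")
```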
Types of Data Warehouses
Enterprise Data Warehouse (EDW) provides a centralized, comprehensive view of data across the entire organization
Operational Data Store (ODS) contains near real-time data from operational systems for tactical reporting and analysis
Data Mart focuses on a specific business function (marketing) or department (sales) and contains a subset of EDW data
Dependent data marts source data from the EDW for a unified, consistent view
Independent data marts source data directly from operational systems for faster, more flexible deployment
Cloud Data Warehouse uses cloud computing and storage to provide scalability, elasticity, and cost-efficiency (Amazon Redshift, Google BigQuery)
Real-time Data Warehouse supports near real-time data ingestion and analysis for operational reporting and decision-making
Data Modeling for Warehouses
Dimensional modeling organizes data into facts (quantitative) and dimensions (qualitative) for intuitive querying and analysis
Star schema: a central fact table connected directly to surrounding dimension tables, so the diagram resembles a star (see the sketch after this list)
Snowflake schema: normalizes dimension tables into multiple related tables, so the diagram resembles a snowflake
Fact tables store measurements, metrics, or events (sales amount, quantity sold) related to a business process
A fact table's composite primary key consists of the foreign keys from its connected dimension tables
Factless fact tables have no measures, only foreign keys; they are used to analyze many-to-many relationships and events
Dimension tables store descriptive attributes (product name, category) that provide context for fact data
Denormalized and redundant to simplify queries and improve performance
Slowly Changing Dimensions (SCDs) handle changes to dimension attributes over time
Surrogate keys replace natural keys to simplify joins, improve performance, and handle changes
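The sketch below builds a tiny star schema in SQLite: two dimension tables with surrogate integer keys, a fact table whose composite primary key is made up of the dimension foreign keys, and a typical join-and-aggregate query. Table and column names are illustrative only.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Dimension tables carry surrogate integer keys plus descriptive attributes;
# the fact table stores measures and foreign keys to each dimension.
con.executescript("""
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,   -- surrogate key
    product_id    TEXT,                  -- natural/business key
    product_name  TEXT,
    category      TEXT
);

CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,   -- e.g. 20240115
    full_date     TEXT,
    month         INTEGER,
    year          INTEGER
);

-- Composite primary key made up of the dimension foreign keys.
CREATE TABLE fact_sales (
    product_key   INTEGER REFERENCES dim_product(product_key),
    date_key      INTEGER REFERENCES dim_date(date_key),
    sales_amount  REAL,
    quantity      INTEGER,
    PRIMARY KEY (product_key, date_key)
);
""")

con.execute("INSERT INTO dim_product VALUES (1, 'SKU-42', 'Laptop', 'Computers')")
con.execute("INSERT INTO dim_date VALUES (20240115, '2024-01-15', 1, 2024)")
con.execute("INSERT INTO fact_sales VALUES (1, 20240115, 1200.0, 3)")

# A typical star-schema query: join the fact to its dimensions and aggregate.
print(con.execute("""
    SELECT d.year, p.category, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d    ON d.date_key    = f.date_key
    GROUP BY d.year, p.category
""").fetchall())
```

Using the surrogate product_key (rather than the natural SKU) for joins is what lets a dimension row be versioned over time, which is how Slowly Changing Dimension handling is typically layered on top of this structure.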
Real-World Applications
Retail analyzes sales, inventory, and customer data to optimize product assortment, pricing, and promotions
Healthcare integrates patient, clinical, and financial data to improve care quality, outcomes, and cost-effectiveness
Financial Services analyzes customer, account, and transaction data for fraud detection, risk management, and cross-selling
Telecommunications mines call detail records, network data, and customer data for churn analysis, network optimization, and marketing
Manufacturing integrates supply chain, production, and quality data to optimize inventory, throughput, and product quality
Government agencies integrate data from multiple departments to improve services, reduce fraud, and enhance security
Higher Education analyzes student, course, and alumni data to improve enrollment, retention, and fundraising
Challenges and Limitations
Data quality issues from source systems (inconsistencies, errors, duplicates) can propagate into the warehouse if not resolved during ETL
Data integration challenges arise when combining data from disparate sources with different schemas, formats, and semantics
Scalability and performance issues can occur as data volumes grow, requiring careful design and optimization of warehouse architecture and queries
High implementation and maintenance costs for hardware, software, and personnel can be a barrier for smaller organizations
Lack of real-time data can limit the warehouse's usefulness for operational decision-making and reporting
Resistance to change from business users accustomed to existing reporting and analysis tools can slow adoption
Data governance and security concerns arise from centralizing sensitive data from multiple sources into one repository
Keeping up with evolving data sources and business requirements requires ongoing warehouse maintenance and updates