Data Warehouse Concepts - Interview Q&A
Schema Design & Architecture
1. What are Facts and Dimensions? Explain with examples.
Answer:
• Facts: Quantitative, measurable data representing business metrics
  – Examples: Sales amount, quantity sold, profit, revenue
  – Characteristics: Numeric, additive, time-variant
• Dimensions: Descriptive attributes that provide context to facts
  – Examples: Customer, Product, Time, Location
  – Characteristics: Textual, categorical, hierarchical
Real-world scenario: In a retail sales fact table, sales_amount is a fact, while customer_name, product_category, and sale_date come from the Customer, Product, and Date dimensions. A minimal schema sketch follows below.
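A minimal sketch of how this looks as tables. The names are illustrative only and not tied to a specific example elsewhere in this document:
-- Dimension: descriptive context
CREATE TABLE customer_dim (
    customer_key  INTEGER PRIMARY KEY,   -- surrogate key
    customer_name VARCHAR(100),
    customer_city VARCHAR(50)
);
-- Fact: measurable events, linked to dimensions by foreign keys
CREATE TABLE retail_sales_fact (
    date_key      INTEGER,
    customer_key  INTEGER REFERENCES customer_dim(customer_key),
    product_key   INTEGER,
    sales_amount  NUMBER(12,2),          -- additive measure
    quantity_sold NUMBER                 -- additive measure
);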
2. Are Fact and Dimension tables related to each other?
Answer: Yes, through foreign key relationships:
• Fact tables contain foreign keys referencing dimension tables
• This creates a star or snowflake schema
• Enables drilling down and slicing/dicing of data
• Supports OLAP operations and analytics
3. When loading data into Fact and Dimension tables, which one will you load first?
Answer: Dimensions first, then Facts, because:
• Facts contain foreign keys referencing dimensions
• Referential integrity must be maintained
• Dimension surrogate keys are needed before fact loading
• Loading sequence: Dimensions → Bridge tables → Facts
Best Practice: Use surrogate keys and implement proper error handling for missing dimension references, as in the sketch below.
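A minimal sketch of the surrogate-key lookup during the fact load, assuming staging and dimension tables like those defined later in this document (the names and the -1 default key are illustrative):
INSERT INTO sales_fact (date_key, customer_key, product_key, sales_amount)
SELECT COALESCE(d.date_key, -1),
       COALESCE(c.customer_key, -1),    -- default key for missing dimension rows
       COALESCE(p.product_key, -1),
       s.sales_amount
FROM stg_sales s
LEFT JOIN date_dim     d ON d.full_date   = s.sale_date
LEFT JOIN customer_dim c ON c.customer_id = s.customer_id AND c.current_flag = 'Y'
LEFT JOIN product_dim  p ON p.product_id  = s.product_id  AND p.current_flag = 'Y';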
4. Difference between Star and Snowflake schemas?
Answer:
Aspect | Star Schema | Snowflake Schema
Structure | Denormalized dimensions | Normalized dimensions
Complexity | Simple, flat structure | Complex, hierarchical
Storage | More storage space | Less storage space
Query Performance | Faster queries | Slower due to joins
Maintenance | Easier to maintain | More complex maintenance
Data Redundancy | Higher redundancy | Lower redundancy
When to use Star: OLAP, reporting, simple analytics
When to use Snowflake: Data integrity critical, storage optimization
5. What is SCD (Slowly Changing Dimensions)?
Answer: SCD handles changes in dimension data over time:
SCD Type 0: No changes allowed
SCD Type 1: Overwrite old values
SCD Type 2: Create new records with versioning
SCD Type 3: Add new columns for changes
SCD Type 4: Use separate history table
SCD Type 6: Combination of 1, 2, and 3
6. SCD Type 2 Scenario and Implementation
Scenario: Customer changes address from “Bangalore” to “Hyderabad”
Before Change:
Cust_Key | Cust_ID | Name | Address | Start_Date | End_Date | Current_Flag
1 | 101 | John | Bangalore | 2023-01-01 | NULL | Y
After Change:
Cust_Key | Cust_ID | Name | Address | Start_Date | End_Date | Current_Flag
1 | 101 | John | Bangalore | 2023-01-01 | 2024-01-15 | N
2 | 101 | John | Hyderabad | 2024-01-16 | NULL | Y
Implementation Flags:
• Current_Flag: Y/N
• Effective_Date: Start date
• Expiry_Date: End date
• Version_Number: 1, 2, 3…
A sketch of the expire-and-insert flow follows below.
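A minimal sketch of the expire-and-insert logic for this scenario, assuming a customer_dim table shaped like the rows above and a hypothetical sequence customer_key_seq for surrogate keys:
-- Step 1: expire the current record
UPDATE customer_dim
SET end_date = DATE '2024-01-15',
    current_flag = 'N'
WHERE cust_id = 101
  AND current_flag = 'Y';

-- Step 2: insert the new version with a fresh surrogate key
INSERT INTO customer_dim (cust_key, cust_id, name, address, start_date, end_date, current_flag)
VALUES (customer_key_seq.NEXTVAL, 101, 'John', 'Hyderabad', DATE '2024-01-16', NULL, 'Y');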
7. What is a Data Mart?
Answer:
• Subset of the data warehouse focused on a specific business area
• Subject-oriented: Sales, Finance, HR marts
• Faster query performance due to smaller size
• Departmental access with relevant data only
• Types: Dependent (sourced from the DW) vs Independent (loaded directly from sources); see the sketch below
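One way a dependent Sales mart can be carved out of the warehouse, sketched with illustrative table and column names:
CREATE TABLE sales_mart_monthly AS
SELECT d.year,
       d.month_name,
       p.product_category,
       SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN date_dim d    ON f.date_key = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY d.year, d.month_name, p.product_category;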
8. Explain OLTP vs OLAP
Answer:
Aspect | OLTP | OLAP
Purpose | Transaction processing | Analytical processing
Data | Current, operational | Historical, summarized
Schema | Normalized (3NF) | Denormalized (Star/Snowflake)
Queries | Simple, frequent | Complex, ad-hoc
Users | Many concurrent | Few analytical
Response Time | Milliseconds | Seconds to minutes
9. What is Data Lineage?
Answer:
• Tracks data flow from source to target
• Documents transformations applied to data
• Enables impact analysis for changes
• Supports compliance and auditing
• Tools: Informatica Metadata Manager, Collibra
10. Explain Conformed Dimensions
Answer:
• Shared dimensions across multiple fact tables
• Same structure and values in all uses
• Enables drill-across queries between facts
• Example: Time dimension used by Sales and Inventory facts
• Benefits: Consistency, reusability, integrated reporting
Advanced DWH Concepts
11. What is a Surrogate Key and why use it?
Answer:
• System-generated unique identifier (usually an integer)
• Independent of business data changes
• Benefits (see the sketch below):
  – Performance (integer joins are faster)
  – SCD Type 2 support
  – Handles source system changes
  – Smaller storage footprint
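A minimal sketch of surrogate key generation with an Oracle-style sequence (sequence and table names are illustrative):
CREATE SEQUENCE customer_key_seq START WITH 1 INCREMENT BY 1;

INSERT INTO customer_dim (customer_key, customer_id, customer_name)
SELECT customer_key_seq.NEXTVAL,    -- surrogate key, independent of the source system
       s.customer_id,               -- natural/business key
       s.customer_name
FROM stg_customers s;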
12. Explain Type 2 Dimension Handling Strategies
Answer:
Version Flag Method: Current_Flag (Y/N) indicating the active record
Effective Date Method: Start_Date and End_Date for the validity period
Version Number Method: Version 1, 2, 3... for each change
Best Practice: Combine multiple methods for robustness, as in the sketch below.
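A sketch of a Type 2 dimension that combines all three tracking methods (table and column names are illustrative):
CREATE TABLE product_dim (
    product_key    INTEGER PRIMARY KEY,  -- surrogate key
    product_id     VARCHAR(20),          -- natural key
    product_name   VARCHAR(100),
    start_date     DATE,                 -- effective date method
    end_date       DATE,
    current_flag   CHAR(1),              -- version flag method
    version_number INTEGER               -- version number method
);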
13. What are Aggregate Tables and when to use them?
Answer:
• Pre-calculated summary tables for faster queries
• When to use:
  – Frequently accessed summary data
  – Complex calculations taking a long time
  – Large fact tables with millions of records
• Types: Daily, Monthly, Yearly aggregates
• Maintenance: Refresh strategies (incremental/full); see the sketch below
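One possible implementation, sketched as an Oracle-style materialized view so the refresh is handled by the database (names are illustrative):
CREATE MATERIALIZED VIEW daily_sales_agg
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT date_key,
       product_key,
       SUM(sales_amount)  AS total_sales,
       SUM(quantity_sold) AS total_quantity
FROM sales_fact
GROUP BY date_key, product_key;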
14. Explain Factless Fact Tables
Answer:
• Fact tables without measures
• Capture events or relationships
• Examples:
  – Student course enrollment
  – Product promotions
  – Employee training attendance
• Contain only foreign keys to dimensions
• Used for counting occurrences of events
15. What is Change Data Capture (CDC)?
Answer:
• Identifies and captures changes in source systems
• Types:
  – Timestamp-based: Using modification dates
  – Trigger-based: Database triggers capture changes
  – Log-based: Reading transaction logs
  – Snapshot comparison: Compare current vs previous
• Benefits: Efficient incremental loading, reduced processing time (see the timestamp-based sketch below)
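A minimal sketch of the timestamp-based approach, assuming a last_modified_date column on the source and a hypothetical etl_control watermark table:
-- Extract only rows changed since the last successful run
INSERT INTO stg_customers (customer_id, customer_name, address, load_date)
SELECT c.customer_id, c.customer_name, c.address, SYSDATE
FROM source_customers c
WHERE c.last_modified_date > (SELECT last_extract_date
                              FROM etl_control
                              WHERE table_name = 'SOURCE_CUSTOMERS');

-- Advance the watermark after the load succeeds
UPDATE etl_control
SET last_extract_date = SYSDATE
WHERE table_name = 'SOURCE_CUSTOMERS';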
16. What is a Staging Area and its purpose?
Answer: Purpose of Staging Area:
• Temporary storage for raw source data
• Data validation and cleansing before loading
• Transformation processing area
• Error handling and data quality checks
• Recovery point for failed loads
Staging Area Characteristics:
• Minimal transformation - raw data storage
• Short retention period (typically 24-48 hours)
• No complex business rules applied
• Same structure as source or a flat structure
Implementation:
-- Staging table structure
CREATE TABLE stg_customers (
customer_id NUMBER,
customer_name VARCHAR2(100),
address VARCHAR2(200),
phone VARCHAR2(15),
email VARCHAR2(100),
load_date DATE DEFAULT SYSDATE,
source_system VARCHAR2(20),
record_status VARCHAR2(10) DEFAULT 'NEW'
);
17. Explain Kimball vs Inmon Methodology
Answer:
Aspect | Kimball (Bottom-Up) | Inmon (Top-Down)
Approach | Dimensional modeling first | Enterprise data model first
Start Point | Data marts | Enterprise data warehouse
Implementation | Faster, iterative | Longer, comprehensive
Data Storage | Denormalized (star schema) | Normalized (3NF)
Cost | Lower initial cost | Higher initial investment
Business Value | Quick wins, faster ROI | Long-term enterprise solution
Complexity | Simpler to understand | More complex architecture
Kimball Approach:
1. Identify the business process
2. Declare the grain
3. Identify the dimensions
4. Identify the facts
Inmon Approach:
1. Build the enterprise data model
2. Create the integrated data warehouse
3. Build data marts from the EDW
4. Apply business rules consistently
18. What is Grain in Dimensional Modeling?
Answer: Definition: The most atomic level of detail captured in a fact table
Grain Examples:
• Sales Fact: One row per line item per transaction
• Inventory Fact: One row per product per location per day
• Web Analytics: One row per page view per session
Grain Declaration Process:
1. Business Requirements: What questions need answers?
2. Source System Analysis: What detail is available?
3. Performance Considerations: Storage and query performance
4. Future Flexibility: Ability to drill down
Impact on Design:
• Determines fact table size and performance
• Affects dimension relationships
• Influences aggregation strategies
• Drives ETL complexity
-- Example: Sales fact table grain
CREATE TABLE sales_fact (
    -- Grain: One row per product sold per transaction
date_key INTEGER,
customer_key INTEGER,
product_key INTEGER,
store_key INTEGER,
transaction_id VARCHAR(20),
line_number INTEGER,
-- Facts/Measures
quantity_sold NUMBER,
unit_price NUMBER(10,2),
extended_amount NUMBER(12,2),
discount_amount NUMBER(10,2)
);
19. What are Conformed Dimensions and their benefits?
Answer: Definition: Shared dimensions across multiple fact tables with identical structure and content
Benefits:
• Drill-across queries between different facts
• Consistent reporting across business processes
• Reduced storage through dimension reuse
• Simplified maintenance - single version of truth
Implementation:
-- Shared time dimension
CREATE TABLE date_dim (
date_key INTEGER PRIMARY KEY,
full_date DATE,
day_of_week VARCHAR(10),
month_name VARCHAR(10),
quarter VARCHAR(10),
fiscal_year INTEGER,
is_holiday CHAR(1)
);
-- Used by multiple fact tables
CREATE TABLE sales_fact (
date_key INTEGER REFERENCES date_dim(date_key),
    -- other dimensions and facts
);
CREATE TABLE inventory_fact (
date_key INTEGER REFERENCES date_dim(date_key),
    -- other dimensions and facts
);
Drill-Across Query Example:
-- Compare sales and inventory by month
SELECT
d.month_name,
s.total_sales,
i.avg_inventory
FROM date_dim d
LEFT JOIN (
SELECT date_key, SUM(extended_amount) AS total_sales
FROM sales_fact GROUP BY date_key
) s ON d.date_key = s.date_key
LEFT JOIN (
SELECT date_key, AVG(quantity_on_hand) AS avg_inventory
FROM inventory_fact GROUP BY date_key
) i ON d.date_key = i.date_key
WHERE d.fiscal_year = 2024;
20. Explain Type 4 and Type 6 SCD implementations
Answer:
SCD Type 4 - History Table:
• Current table: Latest values only
• History table: All historical versions
• Benefits: Fast current data access, complete history
-- Current customer table
CREATE TABLE customer_current (
customer_id INTEGER PRIMARY KEY,
customer_name VARCHAR(100),
address VARCHAR(200),
phone VARCHAR(15),
email VARCHAR(100),
last_updated TIMESTAMP
);
-- Customer history table
CREATE TABLE customer_history (
customer_id INTEGER,
customer_name VARCHAR(100),
address VARCHAR(200),
phone VARCHAR(15),
email VARCHAR(100),
effective_date DATE,
end_date DATE,
change_reason VARCHAR(50),
    PRIMARY KEY (customer_id, effective_date)
);
SCD Type 6 - Hybrid Approach (1+2+3):
• Type 1: Current value always updated
• Type 2: Historical versions maintained
• Type 3: Previous value stored in a separate column
CREATE TABLE customer_type6 (
    customer_key INTEGER PRIMARY KEY,       -- Surrogate key
    customer_id INTEGER,                    -- Natural key
    customer_name_current VARCHAR(100),     -- Type 1: Always current
    customer_name_previous VARCHAR(100),    -- Type 3: Previous value
    customer_name_original VARCHAR(100),    -- Type 3: Original value
    address_current VARCHAR(200),           -- Type 1: Always current
    address_historical VARCHAR(200),        -- Type 2: Historical value
    effective_date DATE,                    -- Type 2: Effective date
    expiry_date DATE,                       -- Type 2: Expiry date
    current_flag CHAR(1),                   -- Type 2: Current indicator
    version_number INTEGER                  -- Type 2: Version tracking
);
21. What is a Bus Matrix and its importance?
Answer: Definition: Enterprise-wide matrix showing business processes and their associated dimensions
Components:
• Rows: Business processes (Order Management, Inventory, HR)
• Columns: Dimensions (Customer, Product, Time, Location)
• Intersections: Mark which dimensions apply to each process
Benefits:
• Enterprise planning roadmap
• Dimension standardization across projects
• Integration opportunities identification
• Resource allocation planning
Example Bus Matrix:
Business Process | Customer | Product | Time | Location | Employee | Vendor
Sales | X | X | X | X | | 
Inventory | | X | X | X | | 
Purchasing | | X | X | X | | X
HR | | | X | X | X | 
Finance | X | | X | X | X | X
22. Explain Late Arriving Dimensions and handling strategies
Answer: Problem: Fact records arrive before their corresponding dimension records
Handling Strategies:
1. Unknown Member Technique:
-- Create an unknown/default member
INSERT INTO customer_dim VALUES (
    -1, 'UNKNOWN', 'Unknown Customer', 'TBD', 'TBD', 'TBD',
    '1900-01-01', '9999-12-31', 'Y', 1
);
-- Use the unknown key for late arriving facts
INSERT INTO sales_fact
SELECT
date_key,
COALESCE(c.customer_key, -1) AS customer_key,
product_key,
sales_amount
FROM staging_sales s
LEFT JOIN customer_dim c ON s.customer_id = c.customer_id;
2. Inferred Member Technique:
-- Create a minimal dimension record when the fact arrives
INSERT INTO customer_dim (
customer_key, customer_id, customer_name,
start_date, end_date, current_flag,
inferred_flag
)
SELECT
customer_seq.NEXTVAL, customer_id, 'Inferred Member',
SYSDATE, '9999-12-31', 'Y', 'Y'
FROM staging_sales s
WHERE NOT EXISTS (
SELECT 1 FROM customer_dim c
WHERE c.customer_id = s.customer_id
);
-- Update when the actual dimension data arrives
UPDATE customer_dim
SET customer_name = actual_name,
address = actual_address,
inferred_flag = 'N'
WHERE customer_id = incoming_customer_id
  AND inferred_flag = 'Y';
23. What is a Factless Fact Table?
Answer: Definition: Fact table containing only foreign keys to dimensions, no numeric measures
Types:
1. Event Factless Facts:
• Record events or relationships
• Example: Student course enrollment, employee training
CREATE TABLE student_enrollment_fact (
    date_key INTEGER,
    student_key INTEGER,
    course_key INTEGER,
    instructor_key INTEGER,
    classroom_key INTEGER,
    -- No measures, just the relationship
    PRIMARY KEY (date_key, student_key, course_key)
);
2. Coverage/Condition Factless Facts:
• Record states or conditions at points in time
• Example: Promotion eligibility, product-store assignments
CREATE TABLE promotion_coverage_fact (
    date_key INTEGER,
    customer_key INTEGER,
    product_key INTEGER,
    promotion_key INTEGER,
    store_key INTEGER,
    -- Records which customers were eligible for which promotions
    PRIMARY KEY (date_key, customer_key, product_key, promotion_key)
);
Query Examples:
-- Count students enrolled in each course
SELECT
c.course_name,
COUNT(*) AS enrollment_count
FROM
student_enrollment_fact sef
JOIN course_dim c ON sef.course_key = c.course_key
JOIN date_dim d ON sef.date_key = d.date_key
WHERE d.semester = 'Fall 2024'
GROUP BY c.course_name;
-- Find customers NOT in a promotion
SELECT customer_key
FROM customer_dim
WHERE customer_key NOT IN (
SELECT DISTINCT customer_key
FROM
promotion_coverage_fact
WHERE promotion_key = 123
);
24. Explain Data Vault Methodology
Answer: Components:
1. Hubs: Business keys and metadata
CREATE TABLE hub_customer (
customer_hash_key CHAR(32) PRIMARY KEY,
customer_id VARCHAR(20),
load_date TIMESTAMP,
record_source VARCHAR(50)
);
2. Links: Relationships between business keys
CREATE TABLE link_customer_order (
customer_order_hash_key CHAR(32) PRIMARY KEY,
customer_hash_key CHAR(32),
order_hash_key CHAR(32),
load_date TIMESTAMP,
record_source VARCHAR(50)
);
3. Satellites: Descriptive attributes
CREATE TABLE sat_customer_details (
customer_hash_key CHAR(32),
load_date TIMESTAMP,
customer_name VARCHAR(100),
address VARCHAR(200),
phone VARCHAR(15),
email VARCHAR(100),
hash_diff CHAR(32),
record_source VARCHAR(50),
    PRIMARY KEY (customer_hash_key, load_date)
);
Benefits:
• Audit trail by design
• Parallel loading capabilities
• Source system agnostic
• Highly scalable architecture
25. What is OLAP and what are the types of OLAP operations?
Answer: OLAP Operations:
1. Drill Down: Navigate from summary to detail
-- From yearly to monthly to daily
SELECT year, month, day, SUM(sales_amount)
FROM sales_fact sf
JOIN date_dim d ON sf.date_key = d.date_key
GROUP BY year, month, day;
2. Drill Up (Roll Up): Navigate from detail to summary
-- From daily to monthly
SELECT year, month, SUM(sales_amount)
FROM sales_fact sf
JOIN date_dim d ON sf.date_key = d.date_key
GROUP BY year, month;
3. Slice: Fix one dimension, analyze others
-- Slice by a specific product category
SELECT customer_region, time_period, SUM(sales_amount)
FROM sales_fact sf
JOIN product_dim p ON sf.product_key = p.product_key
WHERE p.category = 'Electronics'
GROUP BY customer_region, time_period;
4. Dice: Filter multiple dimensions
-- Dice by region, time period, and product
SELECT SUM(sales_amount)
FROM sales_fact sf
JOIN customer_dim c ON sf.customer_key = c.customer_key
JOIN date_dim d ON sf.date_key = d.date_key
JOIN product_dim p ON sf.product_key = p.product_key
WHERE c.region IN ('North', 'South')
AND d.year = 2024
AND p.category = 'Electronics';
5. Pivot (Rotation): Swap rows and columns
-- Pivot months to columns
SELECT * FROM (
    SELECT region, month_name, sales_amount
FROM sales_summary
) PIVOT (
SUM(sales_amount)
FOR month_name IN ('Jan', 'Feb', 'Mar', 'Apr')
);
26-50. Additional DWH Concepts
26. Dimensional Modeling Best Practices
Answer: Design Guidelines:
• Bus Matrix for enterprise planning
• Conformed dimensions for integration
• Consistent grain declaration
• Descriptive attribute naming
• Surrogate keys for all dimensions
• Graceful handling of data quality issues
Naming Conventions:
-- Dimension tables: [subject]_dim
customer_dim, product_dim, date_dim
-- Fact tables: [process]_fact
sales_fact, inventory_fact, shipment_fact
-- Keys: [table]_key
customer_key, product_key, date_key
-- Measures: descriptive names
quantity_sold, extended_amount, discount_amount
27. Data Quality Framework
Answer: Data Quality Dimensions:
• Completeness: No missing values
• Accuracy: Correct values
• Consistency: Same values across systems
• Validity: Values within acceptable ranges
• Timeliness: Data available when needed
• Uniqueness: No duplicates
Implementation:
-- Data quality checks
CREATE TABLE data_quality_rules (
rule_id INTEGER,
table_name VARCHAR(50),
column_name VARCHAR(50),
rule_type VARCHAR(20),
rule_definition TEXT,
threshold_value NUMBER
);
-- Quality monitoring
CREATE TABLE data_quality_results (
execution_date DATE,
rule_id INTEGER,
records_checked INTEGER,
records_failed INTEGER,
failure_rate NUMBER,
status VARCHAR(10)
);
28. Master Data Management (MDM)
Answer: MDM Components:
• Golden Record: Single version of truth
• Data Governance: Policies and procedures
• Data Stewardship: Ownership and accountability
• Match and Merge: Duplicate identification
• Data Lineage: Source tracking
Implementation Approach:
-- Customer master with confidence scoring
CREATE TABLE customer_master (
master_customer_id INTEGER,
source_system VARCHAR(20),
source_customer_id VARCHAR(50),
customer_name VARCHAR(100),
confidence_score NUMBER(3,2),
match_rule VARCHAR(50),
created_date TIMESTAMP,
last_updated TIMESTAMP
);
29. Real-time Data Warehousing
Answer: Architecture Components:
• Change Data Capture (CDC)
• Message Queues (Kafka, MQ)
• Stream Processing (Storm, Spark)
• In-Memory Databases
• Operational Data Store (ODS)
Implementation Patterns:
-- Near real-time CDC using triggers
CREATE OR REPLACE TRIGGER customer_cdc_trigger
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW
BEGIN
INSERT INTO
cdc_customer_changes VALUES (
CASE
WHEN INSERTING THEN 'I'
WHEN UPDATING THEN 'U'
WHEN DELETING THEN 'D'
END,
COALESCE(:NEW.customer_id, :OLD.customer_id),
SYSDATE
);
END;
30. Data Vault 2.0 Enhancements
Answer: New Features:
• Hash Keys: For performance and distribution
• Business Vault: Calculated and derived data
• NoSQL Integration: Handle unstructured data
• Parallel Processing: Optimized for big data
-- Hash key generation
CREATE OR REPLACE FUNCTION generate_hash_key(
    p_business_key VARCHAR2
) RETURN VARCHAR2 IS
BEGIN
RETURN UPPER(
DBMS_CRYPTO.HASH(
UTL_RAW.CAST_TO_RAW(UPPER(TRIM(p_business_key))),
DBMS_CRYPTO.HASH_MD5
)
);
END;
31. Lambda Architecture for Data Warehousing
Answer: Three Layers:
• Batch Layer: Historical data processing
• Speed Layer: Real-time stream processing
• Serving Layer: Query interface
Implementation:
-- Batch view (historical)
CREATE TABLE batch_customer_metrics AS
SELECT
customer_id,
date_key,
SUM(purchase_amount) AS total_purchases,
COUNT(*) AS transaction_count
FROM sales_history
GROUP BY customer_id,
date_key;
-- Real-time view (current)
CREATE TABLE realtime_customer_metrics AS
SELECT
customer_id,
CURRENT_DATE AS date_key,
SUM(purchase_amount) AS total_purchases,
COUNT(*) AS transaction_count
FROM sales_stream
WHERE event_time >= CURRENT_DATE
GROUP BY customer_id;
32. Data Warehouse Automation
Answer: Automation Areas:
• Schema generation from source systems
• ETL code generation
• Data lineage documentation
• Test case generation
• Deployment automation
Tools and Techniques:
-- Metadata-driven ETL
CREATE TABLE etl_metadata (
    source_table VARCHAR(50),
    target_table VARCHAR(50),
source_column VARCHAR(50),
target_column VARCHAR(50),
transformation_rule TEXT,
data_type VARCHAR(20),
is_key CHAR(1)
);
-- Generate ETL code from metadata
SELECT
    'INSERT INTO ' || target_table ||
    ' SELECT ' ||
    LISTAGG(target_column, ', ') WITHIN GROUP (ORDER BY target_column) ||
    ' FROM ' || source_table AS generated_sql
FROM etl_metadata
GROUP BY target_table, source_table;
33. Cloud Data Warehousing
Answer: Cloud DW Characteristics:
• Elastic scaling based on demand
• Pay-per-use pricing models
• Managed services reduce maintenance
• Global distribution capabilities
• Integration with cloud ecosystem
Snowflake Example:
-- Virtual warehouses for different workloads
CREATE WAREHOUSE etl_warehouse
WITH WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
CREATE WAREHOUSE
reporting_warehouse
WITH WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
-- Time travel for historical queries
SELECT * FROM customer_dim
AT (TIMESTAMP => '2024-01-01 09:00:00'::TIMESTAMP);
34. Data Lake Integration
Answer: Data Lake vs Data Warehouse:
Aspect | Data Lake | Data Warehouse
Schema | Schema-on-read | Schema-on-write
Data Types | Structured, semi-structured, unstructured | Structured
Processing | ELT (Extract, Load, Transform) | ETL (Extract, Transform, Load)
Cost | Lower storage cost | Higher storage cost
Agility | High agility | Lower agility
Integration Architecture:
-- External table over data lake files
CREATE EXTERNAL TABLE raw_customer_data (
customer_json STRING
)
LOCATION 's3://datalake/customers/'
TBLPROPERTIES ('has_encrypted_data'='false');
-- Transform and load to the DW
INSERT INTO customer_dim
SELECT
customer_seq.NEXTVAL,
json_extract_scalar(customer_json, '$.id'),
json_extract_scalar(customer_json, '$.name'),
json_extract_scalar(customer_json, '$.address')
FROM raw_customer_data;
35. Temporal Data Warehousing
Answer: Temporal Aspects:
• Valid Time: When the fact was true in the real world
• Transaction Time: When the fact was stored in the database
• Bitemporal: Both valid and transaction time
-- Bitemporal dimension table
CREATE TABLE customer_bitemporal (
customer_key INTEGER,
customer_id INTEGER,
customer_name VARCHAR(100),
address VARCHAR(200),
-- Valid time
valid_from DATE,
valid_to DATE,
-- Transaction time
transaction_from TIMESTAMP,
transaction_to TIMESTAMP,
current_flag CHAR(1)
);
-- Query as of a specific valid and transaction time
SELECT * FROM customer_bitemporal
WHERE customer_id = 123
AND valid_from <= DATE '2024-01-01'
AND valid_to > DATE '2024-01-01'
AND transaction_from <= TIMESTAMP '2024-01-01 10:00:00'
AND transaction_to > TIMESTAMP '2024-01-01 10:00:00';
36. Data Warehouse Security
Answer: Security Layers:
• Network Security: VPNs, firewalls
• Authentication: User identity verification
• Authorization: Access control
• Encryption: Data at rest and in transit
• Auditing: Access logging
Implementation:
-- Row-level security
CREATE POLICY customer_region_policy ON customer_dim
FOR SELECT TO business_user
USING (region = sys_context('APP_CTX', 'USER_REGION'));
-- Column-level encryption
ALTER TABLE customer_dim
MODIFY (ssn ENCRYPT USING 'AES256');
-- Audit trail
CREATE TABLE access_audit (
user_name VARCHAR(50),
object_name VARCHAR(100),
action VARCHAR(20),
timestamp TIMESTAMP,
ip_address VARCHAR(15)
);
37. Data Warehouse Testing
Answer: Testing Types:
• Unit Testing: Individual transformations
• Integration Testing: End-to-end data flow
• Performance Testing: Load and stress testing
• Data Quality Testing: Completeness, accuracy
• User Acceptance Testing: Business validation
Automated Testing Framework:
-- Test case definitions
CREATE TABLE test_cases (
test_id INTEGER,
test_name VARCHAR(100),
test_sql TEXT,
expected_result TEXT,
test_category VARCHAR(50)
);
-- Test results tracking
CREATE TABLE test_results (
execution_id INTEGER,
test_id INTEGER,
execution_date TIMESTAMP,
actual_result TEXT,
test_status VARCHAR(10),
execution_time_ms INTEGER
);
38. Data Warehouse Optimization
Answer: Optimization Strategies:
• Partitioning: Improve query performance
• Indexing: Accelerate data access
• Materialized Views: Pre-aggregate data
• Compression: Reduce storage requirements
• Statistics: Optimize query plans
-- Partition pruning example
CREATE TABLE sales_fact (
sale_date DATE,
customer_id NUMBER,
product_id NUMBER,
sales_amount NUMBER
)
PARTITION BY RANGE (sale_date) (
PARTITION sales_2023 VALUES LESS THAN (DATE '2024-01-01'),
PARTITION sales_2024_q1 VALUES LESS THAN (DATE '2024-04-01'),
PARTITION sales_2024_q2 VALUES LESS THAN (DATE '2024-07-01')
);
-- Bitmap index for low-cardinality columns
CREATE BITMAP INDEX idx_customer_gender
ON customer_dim(gender);
-- Aggregate table for performance
CREATE TABLE monthly_sales_summary AS
SELECT
DATE_TRUNC('month', sale_date) AS month,
customer_region,
product_category,
SUM(sales_amount) AS monthly_sales
FROM sales_fact_detailed
GROUP BY DATE_TRUNC('month', sale_date),
customer_region, product_category;
39. Data Lineage and Impact Analysis
Answer: Data Lineage Components:
• Source-to-Target mapping
• Transformation Logic documentation
• Dependency Relationships
• Impact Analysis capabilities
• Audit Trail maintenance
-- Data lineage metadata
CREATE TABLE data_lineage (
lineage_id INTEGER,
source_system VARCHAR(50),
source_table VARCHAR(100),
source_column VARCHAR(100),
target_system VARCHAR(50),
target_table VARCHAR(100),
target_column VARCHAR(100),
transformation_rule TEXT,
created_date TIMESTAMP
);
-- Impact analysis query
WITH RECURSIVE impact_chain AS (
    SELECT source_table, target_table, 1 AS level
FROM data_lineage
WHERE source_table = 'customer_source'
UNION ALL
SELECT dl.source_table,
dl.target_table, ic.level + 1
FROM data_lineage dl
JOIN impact_chain ic ON dl.source_table = ic.target_table
WHERE ic.level < 10
)
SELECT DISTINCT target_table, level
FROM impact_chain
ORDER BY level;
40. Data Governance Framework
Answer: Governance Components:
• Data Stewardship: Ownership and accountability
• Data Standards: Naming, quality, format
• Data Policies: Usage and access rules
• Data Catalog: Metadata repository
• Compliance: Regulatory requirements
-- Data stewardship tracking
CREATE TABLE data_stewards (
steward_id INTEGER,
steward_name VARCHAR(100),
department VARCHAR(50),
contact_email VARCHAR(100),
responsibilities TEXT
);
CREATE TABLE data_assets (
asset_id INTEGER,
asset_name VARCHAR(100),
asset_type VARCHAR(50),
steward_id INTEGER,
classification VARCHAR(20),
sensitivity_level VARCHAR(20)
);
41. Multi-dimensional OLAP Cubes
Answer: OLAP Cube Architecture:
• Dimensions: Business perspectives (Time, Product, Customer, Geography)
• Measures: Quantitative data (Sales, Profit, Quantity)
• Hierarchies: Drill-down paths within dimensions
• Aggregations: Pre-calculated summaries at various levels
Implementation Approaches:
1. MOLAP (Multidimensional OLAP):
-- Conceptual cube definition (illustrative syntax; exact DDL varies by OLAP engine)
CREATE CUBE sales_cube (
DIMENSION time_dim (
HIERARCHY time_hierarchy (
LEVEL year,
LEVEL quarter CHILD OF year,
LEVEL month CHILD OF quarter,
LEVEL day CHILD OF month
)
),
DIMENSION product_dim (
HIERARCHY product_hierarchy (
LEVEL category,
LEVEL subcategory CHILD OF category,
LEVEL product CHILD OF subcategory
)
),
DIMENSION customer_dim (
HIERARCHY customer_hierarchy
(
LEVEL region,
LEVEL country CHILD OF region,
LEVEL state CHILD OF country,
LEVEL city CHILD OF state
)
),
MEASURE sales_amount SUM,
MEASURE quantity_sold SUM,
MEASURE profit_margin AVG
);
2. ROLAP (Relational OLAP) Implementation:
-- Cube view using the star schema
CREATE VIEW sales_cube_view AS
SELECT
d.year, d.quarter, d.month,
p.category, p.subcategory,
p.product_name,
c.region, c.country, c.state, c.city,
SUM(f.sales_amount) AS total_sales,
SUM(f.quantity) AS total_quantity,
AVG(f.profit_margin) AS avg_profit_margin,
COUNT(*) AS transaction_count
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
JOIN customer_dim c ON f.customer_key = c.customer_key
GROUP BY CUBE(
d.year, d.quarter, d.month,
p.category, p.subcategory,
p.product_name,
c.region, c.country, c.state, c.city
);
3. Aggregation Design:
-- Pre-computed aggregation tables
CREATE TABLE agg_sales_monthly AS
SELECT
date_key, product_category,
customer_region,
SUM(sales_amount) AS monthly_sales,
SUM(quantity) AS monthly_quantity,
COUNT(*) AS transaction_count
FROM sales_fact
GROUP BY date_key,
product_category, customer_region;
-- Quarterly aggregations
CREATE TABLE agg_sales_quarterly AS
SELECT
quarter_key, product_category,
SUM(sales_amount) AS quarterly_sales,
AVG(monthly_sales) AS avg_monthly_sales
FROM agg_sales_monthly
GROUP BY quarter_key,
product_category;
MDX Query Examples:
-- MDX query for sales analysis
SELECT
{[Time].[2024].[Q1].[January],
[Time].[2024].[Q1].[February]} ON COLUMNS,
{[Product].[Electronics],
[Product].[Clothing]} ON ROWS
FROM [Sales_Cube]
WHERE
[Customer].[Region].[North America]
-- Drill-down query
SELECT
[Time].[Year].MEMBERS ON COLUMNS,
DESCENDANTS([Product].[Electronics],
[Product].[Product], SELF) ON ROWS
FROM [Sales_Cube]
Performance Optimization:
• Aggregation Strategy: 80/20 rule - pre-compute the 20% of aggregations that serve 80% of queries
• Partitioning: Partition cubes by time periods
• Caching: Implement intelligent caching for frequently accessed data
• Incremental Processing: Update only changed partitions
42. Data Warehouse Backup and Recovery
Answer: Backup Strategy Components:
1. Backup Types:
-- Full backup strategy
BACKUP DATABASE datawarehouse
TO DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH COMPRESSION,
CHECKSUM, STATS = 10;
-- Differential backup
BACKUP DATABASE datawarehouse
TO DISK = 'C:\Backup\DW_Diff_20240315.bak'
WITH DIFFERENTIAL,
COMPRESSION, CHECKSUM;
-- Transaction log backup
BACKUP LOG datawarehouse
TO DISK = 'C:\Backup\DW_Log_20240315.trn'
WITH COMPRESSION,
CHECKSUM;
2. Backup Scheduling:
#!/bin/bash
# Automated backup schedule (crontab entries)
# Daily differential backups
0 2 * * * /scripts/backup_differential.sh
# Weekly full backups
0 1 * * 0 /scripts/backup_full.sh
# Hourly transaction log backups
0 * * * * /scripts/backup_transaction_log.sh
# Monthly archive to remote storage
0 3 1 * * /scripts/backup_archive.sh
3. Recovery Scenarios:
Point-in-Time Recovery:
-- Restore to a specific point in time
RESTORE DATABASE datawarehouse
FROM DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH NORECOVERY, REPLACE;
RESTORE DATABASE datawarehouse
FROM DISK = 'C:\Backup\DW_Diff_20240315.bak'
WITH NORECOVERY;
RESTORE LOG datawarehouse
FROM DISK = 'C:\Backup\DW_Log_20240315.trn'
WITH RECOVERY, STOPAT = '2024-03-15 14:30:00';
Partial Database Recovery:
-- Restore specific filegroups
RESTORE DATABASE datawarehouse
FILEGROUP = 'FACTS_2024'
FROM DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH PARTIAL,
NORECOVERY;
RESTORE DATABASE datawarehouse
FILEGROUP = 'DIMENSIONS'
FROM DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH NORECOVERY;
4. Backup Validation:
-- Verify backup integrity
RESTORE VERIFYONLY
FROM DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH CHECKSUM;
-- Test restore to verify recoverability
RESTORE DATABASE datawarehouse_test
FROM DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH MOVE 'DW_Data' TO 'C:\Test\DW_Test.mdf',
MOVE 'DW_Log' TO 'C:\Test\DW_Test.ldf',
REPLACE;
5. Cloud Backup Integration:
#!/bin/bash
# Cloud backup integration
# Backup to Azure Blob Storage
sqlcmd -Q "BACKUP DATABASE datawarehouse
TO URL = 'https://account.blob.core.windows.net/backups/DW_$(date +%Y%m%d).bak'
WITH CREDENTIAL = 'AzureCredential', COMPRESSION;"
# Backup to AWS S3
aws s3 cp /backup/DW_Full_20240315.bak s3://dw-backups/$(date +%Y/%m/%d)/
43. Change Management in Data Warehouse
Answer: Change Management Framework:
1. Change Categories:
• Schema Changes: New tables, columns, indexes
• ETL Changes: Mapping modifications, new data sources
• Business Logic Changes: Calculation rules, transformations
• Infrastructure Changes: Hardware, software upgrades
2. Change Control Process:
-- Change tracking table
CREATE TABLE dw_change_log (
change_id NUMBER PRIMARY KEY,
change_type VARCHAR2(50),
object_name VARCHAR2(100),
change_description CLOB,
requested_by VARCHAR2(50),
approved_by VARCHAR2(50),
implemented_by VARCHAR2(50),
request_date DATE,
approval_date DATE,
implementation_date DATE,
rollback_plan CLOB,
status VARCHAR2(20),
environment VARCHAR2(20)
);
3. Version Control Integration:
#!/bin/bash
# Database version control
# Schema versioning
git add database/schemas/
git commit -m "Schema change: Added customer_preference table"
git tag -a v2.1.0 -m "Release 2.1.0 - Customer preferences"
# ETL code versioning
git add informatica/mappings/
git commit -m "ETL change: Updated customer transformation logic"
# Deployment script versioning
git add deployment/scripts/
git commit -m "Deployment: Added migration script for v2.1.0"
4. Environment Promotion:
-- Development to Test promotion (PL/SQL-style pseudocode)
BEGIN
    -- Validate changes in development
    IF validate_schema_changes() = 'PASS' THEN
        -- Deploy to the test environment
        deploy_to_test();
        -- Run regression tests
        IF run_regression_tests() = 'PASS' THEN
            UPDATE dw_change_log
            SET status = 'READY_FOR_PROD'
            WHERE change_id = :change_id;
        END IF;
    END IF;
END;
5. Rollback Procedures:
-- Automated rollback capability
CREATE OR REPLACE PROCEDURE rollback_change(p_change_id NUMBER) AS
v_rollback_script CLOB;
BEGIN
SELECT rollback_plan INTO v_rollback_script
FROM dw_change_log
WHERE change_id = p_change_id;
-- Execute rollback
EXECUTE IMMEDIATE v_rollback_script;
-- Log rollback
UPDATE dw_change_log
SET status = 'ROLLED_BACK',
rollback_date = SYSDATE
WHERE change_id = p_change_id;
END;
6. Impact Analysis:
-- Data lineage impact analysis
WITH RECURSIVE impact_analysis AS (
SELECT source_object,
target_object, 1 as level
FROM data_lineage
WHERE source_object = 'CHANGED_TABLE'
UNION ALL
SELECT dl.source_object,
dl.target_object, ia.level + 1
FROM data_lineage dl
JOIN impact_analysis ia ON dl.source_object = ia.target_object
WHERE ia.level < 5
)
SELECT target_object, level
FROM impact_analysis
ORDER BY level;
44. Performance Monitoring and Tuning
Answer: Performance Monitoring Framework:
1. Key Performance Indicators:
-- ETL performance tracking
CREATE TABLE etl_performance_metrics (
execution_id NUMBER,
workflow_name VARCHAR2(100),
start_time TIMESTAMP,
end_time TIMESTAMP,
duration_minutes NUMBER,
records_processed NUMBER,
throughput_per_minute NUMBER,
cpu_utilization_pct NUMBER,
memory_usage_mb NUMBER,
io_read_mb NUMBER,
io_write_mb NUMBER,
error_count NUMBER
);
-- Query performance tracking
CREATE TABLE query_performance_log (
query_id NUMBER,
sql_text CLOB,
execution_time_ms NUMBER,
logical_reads NUMBER,
physical_reads NUMBER,
cpu_time_ms NUMBER,
execution_plan_hash NUMBER,
execution_date TIMESTAMP
);
2. Automated Performance Monitoring:
#!/bin/bash
# Performance monitoring script
monitor_dw_performance() {
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    # Database server performance metrics
    local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
    local memory_usage=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
    local disk_io=$(iostat -x 1 2 | tail -1 | awk '{print $4, $5}')
    # ETL process monitoring
    local running_workflows=$(ps aux | grep -c informatica)
    local failed_sessions=$(grep -c "FAILED" /logs/informatica/*.log)
    # Database connection monitoring
    local active_connections=$(netstat -an | grep :1521 | grep ESTABLISHED | wc -l)
    # Log metrics
    echo "$timestamp,$cpu_usage,$memory_usage,$disk_io,$running_workflows,$failed_sessions,$active_connections" \
        >> /logs/performance_metrics.csv
    # Alert on thresholds
    if (( $(echo "$cpu_usage > 80" | bc -l) )); then
        send_alert "High CPU usage: $cpu_usage%"
    fi
    if [ "$failed_sessions" -gt 0 ]; then
        send_alert "ETL failures detected: $failed_sessions failed sessions"
    fi
}
3. Query Performance Analysis:
-- Identify slow queries
SELECT
sql_text,
execution_count,
total_elapsed_time,
avg_elapsed_time,
total_logical_reads,
avg_logical_reads
FROM (
SELECT
sql_text,
COUNT(*) as execution_count,
SUM(execution_time_ms) as total_elapsed_time,
AVG(execution_time_ms) as avg_elapsed_time,
SUM(logical_reads) as
total_logical_reads,
AVG(logical_reads) as avg_logical_reads,
RANK() OVER (ORDER BY AVG(execution_time_ms) DESC) as performance_rank
FROM
query_performance_log
WHERE execution_date >= TRUNC(SYSDATE) - 7
GROUP BY sql_text
)
WHERE performance_rank <= 10;
4. Index Usage Analysis:
-- Index effectiveness analysis
SELECT
i.index_name,
i.table_name,
s.user_seeks,
s.user_scans,
s.user_lookups,
s.user_updates,
s.last_user_seek,
s.last_user_scan,
CASE
WHEN s.user_seeks + s.user_scans + s.user_lookups = 0
THEN 'UNUSED'
WHEN s.user_updates > (s.user_seeks + s.user_scans + s.user_lookups) * 10
THEN 'OVER_MAINTAINED'
ELSE 'ACTIVE'
END as index_status
FROM sys.indexes i
LEFT JOIN
sys.dm_db_index_usage_stats s
ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE i.index_id > 0;
5. Performance Tuning Recommendations:
-- Automatic tuning recommendations
WITH performance_analysis AS (
SELECT
table_name,
AVG(scan_time_ms) as avg_scan_time,
SUM(scan_count) as total_scans,
MAX(last_scan_date) as last_activity
FROM table_scan_stats
GROUP BY table_name
)
SELECT
table_name,
    CASE
        WHEN avg_scan_time > 1000 AND total_scans > 100
            THEN 'Consider partitioning or indexing'
        WHEN total_scans = 0 AND last_activity < SYSDATE - 90
            THEN 'Consider archiving - unused for 90+ days'
        WHEN avg_scan_time > 5000
            THEN 'Requires immediate optimization'
        ELSE 'Performance acceptable'
    END as recommendation
FROM performance_analysis;
45. Data Retention and Archival Strategies
Answer: Data Lifecycle Management:
1. Retention Policy Framework:
-- Data retention policies
CREATE TABLE data_retention_policies (
policy_id NUMBER PRIMARY KEY,
table_name VARCHAR2(100),
retention_period_months NUMBER,
archive_criteria VARCHAR2(500),
purge_criteria VARCHAR2(500),
compliance_requirement VARCHAR2(100),
policy_effective_date DATE,
created_by VARCHAR2(50)
);
-- Example policies
INSERT INTO data_retention_policies VALUES
(1, 'SALES_FACT', 84, 'age > 36 months', 'age > 84 months', 'SOX', DATE '2024-01-01', 'DBA_TEAM');
INSERT INTO data_retention_policies VALUES
(2, 'CUSTOMER_INTERACTIONS', 36, 'age > 12 months', 'age > 36 months', 'GDPR', DATE '2024-01-01', 'DBA_TEAM');
2. Automated Archival Process:
-- Archive procedure
CREATE OR REPLACE PROCEDURE archive_old_data(p_table_name VARCHAR2) AS
v_archive_criteria VARCHAR2(500);
v_archive_table VARCHAR2(100);
v_records_archived NUMBER;
BEGIN
-- Get archival criteria
SELECT archive_criteria INTO v_archive_criteria
FROM
data_retention_policies
WHERE table_name = p_table_name;
v_archive_table := p_table_name || '_ARCHIVE';
    -- Create the archive table if it does not already exist (ignore ORA-00955)
    BEGIN
        EXECUTE IMMEDIATE 'CREATE TABLE ' || v_archive_table ||
            ' AS SELECT * FROM ' || p_table_name || ' WHERE 1=0';
    EXCEPTION
        WHEN OTHERS THEN
            IF SQLCODE != -955 THEN RAISE; END IF;
    END;
-- Move data to archive
EXECUTE IMMEDIATE
'INSERT INTO ' || v_archive_table ||
' SELECT * FROM ' || p_table_name ||
' WHERE ' || v_archive_criteria;
v_records_archived := SQL%ROWCOUNT;
    -- Delete archived data from the main table
EXECUTE IMMEDIATE
'DELETE FROM ' || p_table_name ||
' WHERE ' || v_archive_criteria;
-- Log archival activity
INSERT INTO archival_log VALUES (
SYSDATE, p_table_name,
v_records_archived, 'SUCCESS'
);
COMMIT;
END;
3. Hierarchical Storage Management:
#!/bin/bash
# Tiered storage implementation
# Tier 1: High-performance SSD (current data)
# Tier 2: Standard SSD (recent data - 1 year)
# Tier 3: HDD storage (archive data - 1-7 years)
# Tier 4: Tape/Cloud storage (compliance data - 7+ years)
manage_storage_tiers() {
    local current_date=$(date +%Y-%m-%d)
    # Move data older than 1 year to Tier 2
    find /data/tier1 -name "*.dat" -mtime +365 -exec mv {} /data/tier2/ \;
    # Move data older than 3 years to Tier 3
    find /data/tier2 -name "*.dat" -mtime +1095 -exec mv {} /data/tier3/ \;
    # Compress and move to Tier 4
    find /data/tier3 -name "*.dat" -mtime +2555 | while read file; do
        gzip "$file"
        aws s3 cp "${file}.gz" s3://dw-archive-bucket/$(date +%Y/%m/%d)/
        rm "${file}.gz"
    done
}
4. Data Compression Strategies:
-- Table compression for archival
ALTER TABLE sales_fact_2020 COMPRESS FOR ARCHIVE HIGH;
ALTER TABLE sales_fact_2021 COMPRESS FOR QUERY HIGH;
ALTER TABLE sales_fact_2022 COMPRESS FOR OLTP;
-- Partition-level compression
ALTER TABLE sales_fact
MODIFY PARTITION sales_2020_q1 COMPRESS FOR ARCHIVE HIGH;
-- Columnstore compression for analytics
CREATE COLUMNSTORE INDEX ix_sales_fact_archive
ON sales_fact_archive
(date_key, customer_key, product_key, sales_amount);
5. Compliance and Legal Hold:
-- Legal hold management
CREATE TABLE legal_holds (
hold_id NUMBER PRIMARY KEY,
case_number VARCHAR2(50),
table_name VARCHAR2(100),
hold_criteria VARCHAR2(500),
hold_start_date DATE,
hold_end_date DATE,
status VARCHAR2(20)
);
-- Compliance reporting
CREATE OR REPLACE FUNCTION check_retention_compliance
RETURN VARCHAR2 AS
v_violations NUMBER := 0;
v_status VARCHAR2(20);
BEGIN
    -- Check for data exceeding retention periods
SELECT COUNT(*) INTO v_violations
FROM (
SELECT p.table_name,
p.retention_period_months
FROM
data_retention_policies p
WHERE EXISTS (
SELECT 1 FROM user_tables t
WHERE t.table_name = p.table_name
AND
get_table_oldest_record_age(t.table_name) >
p.retention_period_months
)
);
IF v_violations = 0 THEN
v_status := 'COMPLIANT';
ELSE
v_status := 'VIOLATIONS_FOUND';
END IF;
RETURN v_status;
END;
46. Disaster Recovery Planning
Answer: DR Planning Framework:
1. Business Impact Analysis:
-- Critical system classification
CREATE TABLE dr_system_classification (
system_id NUMBER PRIMARY KEY,
system_name VARCHAR2(100),
    criticality_level VARCHAR2(20),   -- CRITICAL, HIGH, MEDIUM, LOW
    rto_hours NUMBER,                 -- Recovery Time Objective
    rpo_hours NUMBER,                 -- Recovery Point Objective
    business_impact_cost NUMBER,      -- Cost per hour of downtime
dependencies VARCHAR2(500),
recovery_priority NUMBER
);
-- Example classifications
INSERT INTO dr_system_classification VALUES
(1, 'SALES_DW', 'CRITICAL', 4, 1, 50000, 'CRM, ERP, Reporting', 1);
INSERT INTO dr_system_classification VALUES
(2, 'FINANCIAL_DW', 'CRITICAL', 2, 0.5, 100000, 'GL, AP, AR', 1);
2. DR Site Configuration:
#!/bin/bash
# DR site setup
and synchronization
setup_dr_environment() {
local primary_site="$1"
local dr_site="$2"
echo "Setting up DR
environment..."
# Database replication
setup
setup_database_replication "$primary_site" "$dr_site"
    # Application server configuration
setup_application_servers "$dr_site"
# Network configuration
configure_dr_network "$dr_site"
# Storage replication
setup_storage_replication "$primary_site" "$dr_site"
echo "DR environment
setup completed"
}
setup_database_replication() {
local primary="$1"
local standby="$2"
# Oracle Data Guard setup
sqlplus sys/password@$primary as sysdba << EOF
ALTER DATABASE ADD STANDBY LOGFILE GROUP 10
SIZE 100M;
ALTER DATABASE ADD STANDBY LOGFILE GROUP 11
SIZE 100M;
ALTER DATABASE FORCE LOGGING;
-- Create standby controlfile
ALTER DATABASE CREATE STANDBY CONTROLFILE
AS '/backup/standby.ctl';
EOF
    # Copy standby controlfile to DR site
    scp /backup/standby.ctl oracle@$standby:/u01/app/oracle/oradata/
# Start managed recovery
sqlplus sys/password@$standby as sysdba << EOF
STARTUP NOMOUNT;
ALTER DATABASE MOUNT STANDBY DATABASE;
ALTER DATABASE RECOVER MANAGED STANDBY
DATABASE
USING CURRENT LOGFILE DISCONNECT FROM
SESSION;
EOF
}
3. Failover Procedures:
-- Automated failover logic (PL/SQL-style pseudocode: the ALTER/SHUTDOWN steps
-- are shown inline for clarity and would normally run via scripts or dynamic SQL)
CREATE OR REPLACE PROCEDURE execute_failover AS
    v_lag_minutes NUMBER;
    v_primary_status VARCHAR2(20);
BEGIN
    -- Check primary site status
    v_primary_status := check_primary_availability();
    IF v_primary_status = 'UNAVAILABLE' THEN
        -- Log failover initiation
        INSERT INTO dr_log VALUES (
            SYSDATE, 'FAILOVER_INITIATED', 'Primary site unavailable'
        );
        -- Stop log apply on the standby
        ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
        -- Activate the standby database
        ALTER DATABASE ACTIVATE STANDBY DATABASE;
        -- Open the database for read/write
        SHUTDOWN IMMEDIATE;
        STARTUP;
        -- Update application connection strings
        update_connection_strings();
        -- Notify the operations team
        send_notification('DR_FAILOVER_COMPLETED');
        -- Log completion
        INSERT INTO dr_log VALUES (
            SYSDATE, 'FAILOVER_COMPLETED', 'Database activated successfully'
        );
    END IF;
END;
4. DR Testing Procedures:
#!/bin/bash
# DR testing framework
conduct_dr_test() {
    local test_type="$1"   # PLANNED, UNPLANNED, PARTIAL
    local test_id="DR_TEST_$(date +%Y%m%d_%H%M%S)"
    echo "Starting DR test: $test_id ($test_type)"
# Pre-test validations
validate_dr_environment
case "$test_type" in
"PLANNED")
planned_failover_test "$test_id"
;;
"UNPLANNED")
unplanned_failover_test "$test_id"
;;
"PARTIAL")
partial_system_test "$test_id"
;;
esac
# Post-test cleanup
cleanup_test_environment "$test_id"
# Generate test report
generate_test_report "$test_id"
}
planned_failover_test() {
    local test_id="$1"
    # Step 1: Verify replication lag
    local lag=$(check_replication_lag)
    log_test_step "$test_id" "Replication lag check: $lag minutes"
    # Step 2: Perform controlled failover
    execute_controlled_failover
    log_test_step "$test_id" "Controlled failover executed"
    # Step 3: Validate application connectivity
    test_application_connectivity
    log_test_step "$test_id" "Application connectivity tested"
    # Step 4: Run data validation tests
    run_data_validation_tests
    log_test_step "$test_id" "Data validation completed"
    # Step 5: Fail back to primary
    execute_failback_procedure
    log_test_step "$test_id" "Failback to primary completed"
}
5. Recovery Monitoring:
-- DR monitoring dashboard
CREATE VIEW dr_status_dashboard AS
SELECT
'Primary Database' as component,
CASE WHEN database_status = 'OPEN' THEN 'AVAILABLE' ELSE 'UNAVAILABLE' END as status,
last_check_time
FROM primary_db_status
UNION ALL
SELECT
'Standby Database' as component,
CASE WHEN apply_lag < 5 THEN 'SYNCHRONIZED' ELSE 'LAG_DETECTED' END as status,
last_check_time
FROM standby_db_status
UNION ALL
SELECT
'Network
Connectivity' as component,
CASE WHEN ping_response < 100 THEN 'OPTIMAL' ELSE 'DEGRADED' END as status,
last_check_time
FROM network_status;
47. Data Warehouse Documentation
Answer: Documentation Framework:
1. Automated Metadata Documentation:
-- Comprehensive metadata repository
CREATE TABLE dw_documentation (
    doc_id NUMBER PRIMARY KEY,
    object_type VARCHAR2(50),           -- TABLE, VIEW, PROCEDURE, MAPPING
    object_name VARCHAR2(100),
    documentation_type VARCHAR2(50),    -- DESCRIPTION, BUSINESS_RULES, TECHNICAL_SPECS
content CLOB,
created_by VARCHAR2(50),
created_date DATE,
last_updated DATE,
version VARCHAR2(10)
);
-- Auto-generate table documentation
CREATE OR REPLACE PROCEDURE generate_table_documentation AS
BEGIN
FOR table_rec IN (
SELECT table_name,
comments
FROM user_tab_comments
WHERE table_name LIKE '%_FACT' OR table_name LIKE '%_DIM'
) LOOP
-- Generate table overview
INSERT INTO dw_documentation VALUES (
doc_seq.NEXTVAL,
'TABLE',
table_rec.table_name,
'OVERVIEW',
generate_table_overview(table_rec.table_name),
USER,
SYSDATE,
SYSDATE,
'1.0'
);
        -- Generate column specifications
INSERT INTO dw_documentation VALUES (
doc_seq.NEXTVAL,
'TABLE',
table_rec.table_name,
'COLUMN_SPECS',
generate_column_documentation(table_rec.table_name),
USER,
SYSDATE,
SYSDATE,
'1.0'
);
END LOOP;
END;
2. Data Lineage Documentation:
-- Data lineage tracking
CREATE TABLE data_lineage_documentation (
lineage_id NUMBER PRIMARY KEY,
source_system VARCHAR2(100),
source_object VARCHAR2(100),
transformation_logic CLOB,
target_object VARCHAR2(100),
business_rules CLOB,
data_quality_rules CLOB,
created_by VARCHAR2(50),
documentation_date DATE
);
-- Generate lineage documentation
CREATE OR REPLACE FUNCTION generate_lineage_doc(p_target_table VARCHAR2)
RETURN CLOB AS
v_lineage_doc CLOB;
BEGIN
SELECT
'SOURCE: ' || source_system || CHR(10) ||
'TRANSFORMATION: ' ||
transformation_logic || CHR(10) ||
'BUSINESS RULES: ' || business_rules || CHR(10) ||
'DATA QUALITY: ' || data_quality_rules
INTO v_lineage_doc
FROM
data_lineage_documentation
WHERE target_object = p_target_table;
RETURN v_lineage_doc;
END;
3. Business Glossary:
-- Business terminology management
CREATE TABLE business_glossary (
term_id NUMBER PRIMARY KEY,
business_term VARCHAR2(100),
definition CLOB,
business_context VARCHAR2(200),
technical_implementation VARCHAR2(200),
related_tables VARCHAR2(500),
business_owner VARCHAR2(50),
technical_owner VARCHAR2(50),
approval_status VARCHAR2(20),
effective_date DATE
);
-- Example entries
INSERT INTO business_glossary VALUES (
    1,
    'Customer Lifetime Value',
    'Total predicted revenue from a customer over their entire relationship',
    'Marketing and Sales Analytics',
    'Calculated in customer_metrics_fact table using predictive algorithms',
    'customer_dim, sales_fact, customer_metrics_fact',
    'VP Marketing',
    'Data Team Lead',
    'APPROVED',
    DATE '2024-01-01'
);
4. Technical Documentation Generation:
#!/bin/bash
# Automated documentation generation
generate_technical_docs() {
    local output_dir="/docs/datawarehouse"
    local timestamp=$(date +%Y%m%d)
    mkdir -p "$output_dir"
    # Database schema documentation
    generate_schema_docs "$output_dir"
    # ETL process documentation
    generate_etl_docs "$output_dir"
    # Data quality documentation
    generate_dq_docs "$output_dir"
    # Performance metrics documentation
    generate_performance_docs "$output_dir"
    # Create master document
    create_master_document "$output_dir" "$timestamp"
    echo "Documentation generated in $output_dir"
}
generate_schema_docs() {
local output_dir="$1"
    # Extract schema information
    sqlplus -s user/pass@db << 'EOF' > "$output_dir/schema_documentation.html"
SET PAGESIZE 0
SET MARKUP HTML ON
SELECT
    '<h2>Table: ' || table_name || '</h2>' ||
    '<p>Comments: ' || NVL(comments, 'No comments') || '</p>' ||
    '<table border="1">' ||
    '<tr><th>Column</th><th>Data Type</th><th>Nullable</th><th>Comments</th></tr>'
FROM user_tab_comments
WHERE table_name LIKE '%_FACT' OR table_name LIKE '%_DIM';
SELECT
    '<tr><td>' || column_name || '</td>' ||
    '<td>' || data_type ||
        CASE WHEN data_length IS NOT NULL THEN '(' || data_length || ')' END ||
    '</td><td>' || nullable || '</td>' ||
    '<td>' || NVL(comments, '') || '</td></tr>'
FROM user_col_comments c
JOIN user_tab_columns t ON c.table_name = t.table_name AND c.column_name = t.column_name
WHERE c.table_name LIKE '%_FACT' OR c.table_name LIKE '%_DIM'
ORDER BY c.table_name, t.column_id;
SELECT '</table><br/>' FROM dual;
EOF
}
5. Documentation Portal:
<!-- Data Warehouse Documentation Portal -->
<!DOCTYPE html>
<html>
<head>
<title>Data Warehouse Documentation</title>
<style>
.nav-menu { background: #f0f0f0; padding: 10px; }
.content-area { padding: 20px; }
.metadata-table { border-collapse: collapse; width: 100%; }
.metadata-table th, .metadata-table td { border: 1px solid #ddd; padding: 8px; }
</style>
</head>
<body>
<div class="nav-menu">
<a href="#schema">Schema Documentation</a> |
<a href="#etl">ETL Processes</a> |
<a href="#glossary">Business Glossary</a> |
<a href="#lineage">Data Lineage</a> |
<a href="#quality">Data Quality</a>
</div>
<div class="content-area">
<h1>Enterprise Data Warehouse Documentation</h1>
<p>Last Updated: <span id="lastUpdated">{{TIMESTAMP}}</span></p>
<div id="schema">
<h2>Schema Documentation</h2>
{{SCHEMA_CONTENT}}
</div>
<div id="etl">
<h2>ETL Process Documentation</h2>
{{ETL_CONTENT}}
</div>
<div id="glossary">
<h2>Business Glossary</h2>
{{GLOSSARY_CONTENT}}
</div>
</div>
</body>
</html>
48. Capacity Planning and Scaling
Answer: Capacity Planning Framework:
1. Growth Analysis and Forecasting:
-- Historical growth analysis
CREATE VIEW data_growth_analysis AS
SELECT
table_name,
EXTRACT(YEAR FROM sample_date) as year,
EXTRACT(MONTH FROM sample_date) as month,
row_count,
size_mb,
LAG(row_count) OVER (PARTITION BY table_name ORDER BY sample_date) as prev_row_count,
LAG(size_mb) OVER (PARTITION BY table_name ORDER BY sample_date) as prev_size_mb,
(row_count - LAG(row_count) OVER (PARTITION BY table_name ORDER BY sample_date)) /
LAG(row_count) OVER (PARTITION BY table_name ORDER BY sample_date) * 100 as growth_rate_pct
FROM table_size_history
WHERE sample_date >= ADD_MONTHS(SYSDATE, -24);
-- Predictive growth modeling
WITH monthly_growth AS (
SELECT
table_name,
AVG(growth_rate_pct) as avg_monthly_growth,
STDDEV(growth_rate_pct) as growth_volatility
FROM
data_growth_analysis
WHERE growth_rate_pct IS NOT NULL
GROUP BY table_name
),
current_size AS (
SELECT
table_name,
row_count as current_rows,
size_mb as current_size_mb
FROM table_size_history
WHERE sample_date = (SELECT MAX(sample_date) FROM table_size_history)
)
SELECT
c.table_name,
c.current_size_mb,
ROUND(c.current_size_mb * POWER(1 +
g.avg_monthly_growth/100, 12), 2) as
projected_size_12_months,
ROUND(c.current_size_mb * POWER(1 +
g.avg_monthly_growth/100, 24), 2) as
projected_size_24_months,
CASE
WHEN
g.avg_monthly_growth > 10 THEN 'HIGH_GROWTH'
WHEN
g.avg_monthly_growth > 5 THEN 'MODERATE_GROWTH'
ELSE 'STABLE'
END as
growth_classification
FROM current_size c
JOIN monthly_growth g ON c.table_name = g.table_name;
2. Resource Utilization Monitoring:
#!/bin/bash
# Comprehensive resource monitoring
monitor_system_resources() {
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    local log_file="/logs/capacity_monitoring.csv"
    # CPU utilization
    local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
    # Memory utilization
    local memory_total=$(free -m | awk 'NR==2{printf "%.0f", $2}')
    local memory_used=$(free -m | awk 'NR==2{printf "%.0f", $3}')
    local memory_pct=$(echo "scale=2; $memory_used * 100 / $memory_total" | bc)
    # Disk utilization
    local disk_usage=$(df -h / | awk 'NR==2{print $5}' | cut -d'%' -f1)
    # Database connections
    local db_connections=$(netstat -an | grep :1521 | grep ESTABLISHED | wc -l)
    # ETL process count
    local etl_processes=$(ps aux | grep -c informatica)
    # Network throughput
    local network_rx=$(cat /proc/net/dev | grep eth0 | awk '{print $2}')
    local network_tx=$(cat /proc/net/dev | grep eth0 | awk '{print $10}')
    # Log all metrics
    echo "$timestamp,$cpu_usage,$memory_pct,$disk_usage,$db_connections,$etl_processes,$network_rx,$network_tx" >> "$log_file"
    # Check thresholds and alert
    check_capacity_thresholds "$cpu_usage" "$memory_pct" "$disk_usage"
}
check_capacity_thresholds() {
local cpu="$1"
local memory="$2"
local disk="$3"
# Define thresholds
local cpu_threshold=80
local memory_threshold=85
local disk_threshold=85
# Check and alert
if (( $(echo "$cpu > $cpu_threshold" | bc -l) )); then
send_capacity_alert "CPU" "$cpu" "$cpu_threshold"
fi
if (( $(echo "$memory > $memory_threshold" | bc -l) )); then
send_capacity_alert "Memory" "$memory" "$memory_threshold"
fi
if (( $(echo "$disk > $disk_threshold" | bc -l) )); then
send_capacity_alert "Disk" "$disk" "$disk_threshold"
fi
}
3. Scaling Strategies:
-- Horizontal scaling implementation
CREATE OR REPLACE PROCEDURE implement_horizontal_scaling AS
BEGIN
    -- Partition large tables by date
FOR table_rec IN (
SELECT table_name,
estimated_size_gb
FROM capacity_analysis
WHERE estimated_size_gb > 100
AND partitioning_status
= 'NOT_PARTITIONED'
) LOOP
        -- Create partitioned version
create_partitioned_table(table_rec.table_name);
-- Migrate data
migrate_to_partitioned_table(table_rec.table_name);
-- Update statistics
gather_partition_statistics(table_rec.table_name);
END LOOP;
END;
-- Vertical scaling assessment
CREATE OR REPLACE FUNCTION assess_vertical_scaling_needs
RETURN VARCHAR2 AS
v_recommendation VARCHAR2(1000);
v_cpu_avg NUMBER;
v_memory_avg NUMBER;
v_io_wait_avg NUMBER;
BEGIN
    -- Analyze resource utilization trends
SELECT
AVG(cpu_utilization),
AVG(memory_utilization),
AVG(io_wait_pct)
INTO v_cpu_avg,
v_memory_avg, v_io_wait_avg
FROM
system_performance_log
WHERE log_date >= SYSDATE - 30;
    -- Generate recommendations
    IF v_cpu_avg > 80 THEN
        v_recommendation := v_recommendation || 'CPU upgrade recommended. ';
    END IF;
    IF v_memory_avg > 85 THEN
        v_recommendation := v_recommendation || 'Memory expansion needed. ';
    END IF;
    IF v_io_wait_avg > 20 THEN
        v_recommendation := v_recommendation || 'Storage performance upgrade required. ';
    END IF;
    RETURN NVL(v_recommendation, 'Current capacity adequate');
END;
4. Cloud Scaling Automation:
#!/bin/bash
# Cloud auto-scaling implementation
implement_cloud_autoscaling() {
    local cloud_provider="$1"   # AWS, Azure, GCP
case "$cloud_provider" in
"AWS")
setup_aws_autoscaling
;;
"Azure")
setup_azure_autoscaling
;;
"GCP")
setup_gcp_autoscaling
;;
esac
}
setup_aws_autoscaling() {
    # Create launch template
    aws ec2 create-launch-template \
        --launch-template-name dw-template \
        --launch-template-data '{
            "ImageId": "ami-12345678",
            "InstanceType": "r5.xlarge",
            "SecurityGroupIds": ["sg-12345678"],
            "UserData": "'"$(base64 < /scripts/dw-setup.sh)"'"
        }'
    # Create auto scaling group
    aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name dw-asg \
        --launch-template "LaunchTemplateName=dw-template,Version=1" \
        --min-size 2 \
        --max-size 10 \
        --desired-capacity 4 \
        --target-group-arns "arn:aws:elasticloadbalancing:region:account:targetgroup/dw-targets"
    # Create scaling policies
    aws autoscaling put-scaling-policy \
        --auto-scaling-group-name dw-asg \
        --policy-name scale-up \
        --policy-type TargetTrackingScaling \
        --target-tracking-configuration '{
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            }
        }'
}
5. Capacity Planning Dashboard:
-- Capacity planning metrics view
CREATE VIEW capacity_planning_dashboard AS
SELECT
    'Current Storage Usage' AS metric,
    ROUND(SUM(size_gb), 2) AS current_value,
    'GB' AS unit,
    ROUND(SUM(projected_size_12m), 2) AS projected_12m,
    ROUND((SUM(projected_size_12m) - SUM(size_gb)) / SUM(size_gb) * 100, 1) AS growth_pct
FROM table_capacity_analysis
UNION ALL
SELECT
    'Database Connections',
    AVG(connection_count),
    'connections',
    AVG(connection_count) * 1.5,
    50.0
FROM connection_usage_stats
WHERE sample_date >= SYSDATE - 7
UNION ALL
SELECT
    'ETL Processing Time',
    AVG(duration_hours),
    'hours',
    AVG(duration_hours) * 1.2,
    20.0
FROM etl_performance_stats
WHERE execution_date >= SYSDATE - 30;
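Reporting tools can query the view directly. A minimal ad-hoc sketch that flags metrics projected to grow faster than 30% (the threshold is illustrative):
SELECT metric, current_value, unit, projected_12m, growth_pct
FROM capacity_planning_dashboard
WHERE growth_pct > 30
ORDER BY growth_pct DESC;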
49. Data Warehouse Migration Strategies
Answer: Migration Planning Framework:
1. Migration Assessment:
-- Source system analysis
CREATE TABLE migration_assessment (
    assessment_id          NUMBER PRIMARY KEY,
    source_system          VARCHAR2(100),
    object_type            VARCHAR2(50),
    object_name            VARCHAR2(100),
    object_size_gb         NUMBER,
    complexity_score       NUMBER,        -- 1-10 scale
    business_criticality   VARCHAR2(20),
    migration_priority     NUMBER,
    estimated_effort_hours NUMBER,
    dependencies           VARCHAR2(500),
    risk_factors           VARCHAR2(500)
);
-- Complexity scoring
CREATE OR REPLACE FUNCTION calculate_migration_complexity(
    p_object_name VARCHAR2,
    p_object_type VARCHAR2
) RETURN NUMBER AS
    v_complexity       NUMBER := 1;
    v_column_count     NUMBER;
    v_index_count      NUMBER;
    v_constraint_count NUMBER;
BEGIN
    IF p_object_type = 'TABLE' THEN
        -- Factor in table characteristics
        SELECT COUNT(*) INTO v_column_count
        FROM user_tab_columns WHERE table_name = p_object_name;
        SELECT COUNT(*) INTO v_index_count
        FROM user_indexes WHERE table_name = p_object_name;
        SELECT COUNT(*) INTO v_constraint_count
        FROM user_constraints WHERE table_name = p_object_name;
        -- Calculate complexity score (capped at 10)
        v_complexity := LEAST(10,
            1 +
            CASE WHEN v_column_count > 50    THEN 2 ELSE 0 END +
            CASE WHEN v_index_count > 10     THEN 2 ELSE 0 END +
            CASE WHEN v_constraint_count > 5 THEN 2 ELSE 0 END
        );
    END IF;
    RETURN v_complexity;
END;
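The scoring function can seed the assessment table in bulk from the data dictionary. A minimal sketch, assuming a migration_assessment_seq sequence exists and using 'LEGACY_DW' as an illustrative source system name:
-- Seed the assessment inventory; sizes come from user_segments
INSERT INTO migration_assessment (
    assessment_id, source_system, object_type, object_name,
    object_size_gb, complexity_score
)
SELECT migration_assessment_seq.NEXTVAL,
       'LEGACY_DW',
       'TABLE',
       s.segment_name,
       ROUND(s.bytes / POWER(1024, 3), 2),
       calculate_migration_complexity(s.segment_name, 'TABLE')
FROM user_segments s
WHERE s.segment_type = 'TABLE';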
2. Migration Strategies:
Big Bang Migration:
#!/bin/bash
# Complete cutover migration
execute_bigbang_migration() {
    local cutover_date="$1"
    local source_env="$2"
    local target_env="$3"
    echo "Starting Big Bang migration on $cutover_date"
    # Phase 1: Final data sync
    echo "Phase 1: Final data synchronization"
    sync_final_data "$source_env" "$target_env"
    # Phase 2: Switch off source systems
    echo "Phase 2: Source system shutdown"
    shutdown_source_systems "$source_env"
    # Phase 3: Final validation
    echo "Phase 3: Data validation"
    validate_migrated_data "$target_env"
    # Phase 4: Cutover to target
    echo "Phase 4: Application cutover"
    cutover_applications "$target_env"
    # Phase 5: Go-live verification
    echo "Phase 5: Go-live verification"
    verify_golive_success "$target_env"
    echo "Big Bang migration completed"
}
Phased Migration:
#!/bin/bash
# Gradual migration approach
execute_phased_migration() {
    local migration_phases=(
        "reference_data:PHASE1"
        "historical_data:PHASE2"
        "current_transactions:PHASE3"
        "real_time_processes:PHASE4"
    )
    for phase in "${migration_phases[@]}"; do
        local data_type=$(echo "$phase" | cut -d':' -f1)
        local phase_name=$(echo "$phase" | cut -d':' -f2)
        echo "Starting $phase_name: $data_type migration"
        # Execute phase-specific migration
        migrate_data_category "$data_type"
        # Validate phase completion
        validate_phase_completion "$data_type"
        # User acceptance testing
        conduct_uat "$phase_name"
        echo "$phase_name completed successfully"
        sleep 3600   # 1 hour between phases
    done
}
3. Data Migration Tools:
-- Migration monitoring and control
CREATE TABLE migration_control (
    migration_id      NUMBER PRIMARY KEY,
    source_object     VARCHAR2(100),
    target_object     VARCHAR2(100),
    migration_method  VARCHAR2(50),   -- BULK_COPY, INCREMENTAL, STREAMING
    start_time        TIMESTAMP,
    end_time          TIMESTAMP,
    records_migrated  NUMBER,
    migration_status  VARCHAR2(20),
    error_count       NUMBER,
    validation_status VARCHAR2(20)
);
-- Migration execution procedure
CREATE OR REPLACE PROCEDURE execute_table_migration(
    p_source_table     VARCHAR2,
    p_target_table     VARCHAR2,
    p_migration_method VARCHAR2
) AS
    v_migration_id     NUMBER;
    v_start_time       TIMESTAMP := SYSTIMESTAMP;
    v_records_migrated NUMBER := 0;
BEGIN
    -- Initialize migration tracking
    INSERT INTO migration_control (
        migration_id, source_object, target_object,
        migration_method, start_time, migration_status
    ) VALUES (
        migration_seq.NEXTVAL, p_source_table, p_target_table,
        p_migration_method, v_start_time, 'RUNNING'
    ) RETURNING migration_id INTO v_migration_id;
    -- Execute migration based on method; each copy routine hands the row
    -- count back through an OUT parameter
    CASE p_migration_method
        WHEN 'BULK_COPY' THEN
            execute_bulk_copy(p_source_table, p_target_table, v_records_migrated);
        WHEN 'INCREMENTAL' THEN
            execute_incremental_copy(p_source_table, p_target_table, v_records_migrated);
        WHEN 'STREAMING' THEN
            execute_streaming_copy(p_source_table, p_target_table, v_records_migrated);
    END CASE;
    -- Update completion status
    UPDATE migration_control
       SET end_time = SYSTIMESTAMP,
           migration_status = 'COMPLETED',
           records_migrated = v_records_migrated
     WHERE migration_id = v_migration_id;
EXCEPTION
    WHEN OTHERS THEN
        UPDATE migration_control
           SET end_time = SYSTIMESTAMP,
               migration_status = 'FAILED',
               error_count = 1
         WHERE migration_id = v_migration_id;
        RAISE;
END;
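The copy routines referenced above are not shown in the original; a minimal sketch of execute_bulk_copy under the same assumptions (direct-path insert, row count returned through the OUT parameter):
CREATE OR REPLACE PROCEDURE execute_bulk_copy(
    p_source_table VARCHAR2,
    p_target_table VARCHAR2,
    p_rows_copied  OUT NUMBER
) AS
BEGIN
    -- Direct-path insert of the full table contents
    EXECUTE IMMEDIATE
        'INSERT /*+ APPEND */ INTO ' || p_target_table ||
        ' SELECT * FROM ' || p_source_table;
    p_rows_copied := SQL%ROWCOUNT;
    COMMIT;
END;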
4. Cloud Migration:
#!/bin/bash
# Cloud-specific migration procedures
migrate_to_cloud() {
    local cloud_provider="$1"
    local migration_type="$2"   # lift_and_shift, re_architect, hybrid
    case "$cloud_provider" in
        "AWS")
            migrate_to_aws "$migration_type"
            ;;
        "Azure")
            migrate_to_azure "$migration_type"
            ;;
        "GCP")
            migrate_to_gcp "$migration_type"
            ;;
    esac
}
migrate_to_aws() {
    local migration_type="$1"
    case "$migration_type" in
        "lift_and_shift")
            # Direct migration to EC2
            aws dms create-replication-instance \
                --replication-instance-identifier dw-migration \
                --replication-instance-class dms.r5.xlarge
            aws dms create-endpoint \
                --endpoint-identifier source-endpoint \
                --endpoint-type source \
                --engine-name oracle
            aws dms create-endpoint \
                --endpoint-identifier target-endpoint \
                --endpoint-type target \
                --engine-name aurora-postgresql
            ;;
        "re_architect")
            # Migration to cloud-native services
            migrate_to_redshift
            setup_glue_etl_jobs
            configure_quicksight_reporting
            ;;
    esac
}
migrate_to_redshift() {
    # Create Redshift cluster
    aws redshift create-cluster \
        --cluster-identifier dw-redshift \
        --node-type ra3.xlplus \
        --number-of-nodes 3 \
        --master-username dwadmin \
        --master-user-password SecurePassword123
    # Setup data pipeline
    aws datapipeline create-pipeline \
        --name dw-migration-pipeline \
        --unique-id dw-migration-$(date +%s)
}
5. Rollback and Contingency:
-- Rollback procedures
CREATE OR REPLACE PROCEDURE execute_migration_rollback(
    p_migration_id NUMBER
) AS
    v_source_object VARCHAR2(100);
    v_target_object VARCHAR2(100);
    v_backup_table  VARCHAR2(200);
BEGIN
    -- Get migration details
    SELECT source_object, target_object
    INTO v_source_object, v_target_object
    FROM migration_control
    WHERE migration_id = p_migration_id;
    -- Pre-migration backup copy kept in the MIGRATION_BKP schema
    v_backup_table := 'migration_bkp.' || v_source_object || '_' || p_migration_id;
    -- Drop the migrated target table
    EXECUTE IMMEDIATE 'DROP TABLE ' || v_target_object;
    -- Restore the original data from the backup copy
    EXECUTE IMMEDIATE 'CREATE TABLE ' || v_target_object ||
                      ' AS SELECT * FROM ' || v_backup_table;
    -- Update rollback status
    UPDATE migration_control
       SET migration_status = 'ROLLED_BACK',
           end_time = SYSTIMESTAMP
     WHERE migration_id = p_migration_id;
    -- Log rollback
    INSERT INTO migration_log VALUES (
        SYSDATE, p_migration_id, 'ROLLBACK_COMPLETED',
        'Migration rolled back successfully'
    );
END;
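The rollback relies on a pre-migration snapshot being taken. A minimal sketch of that backup step, assuming a dedicated MIGRATION_BKP schema (the schema name and naming pattern are illustrative):
CREATE OR REPLACE PROCEDURE create_premigration_backup(
    p_migration_id NUMBER,
    p_source_table VARCHAR2
) AS
BEGIN
    -- Snapshot the source table into the backup schema, keyed by migration_id
    EXECUTE IMMEDIATE
        'CREATE TABLE migration_bkp.' || p_source_table || '_' || p_migration_id ||
        ' AS SELECT * FROM ' || p_source_table;
END;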
50. Future Trends in Data Warehousing
Answer: Emerging Technologies and Trends:
1. Cloud-Native Data Warehousing:
-- Serverless data warehouse example (Snowflake-style)
CREATE WAREHOUSE compute_wh
WITH
    WAREHOUSE_SIZE = 'X-SMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE
    COMMENT = 'Auto-scaling compute for ETL workloads';
-- Elastic scaling
ALTER WAREHOUSE compute_wh SET WAREHOUSE_SIZE = 'LARGE';
-- Multi-cluster warehouses for concurrency
CREATE WAREHOUSE reporting_wh
WITH
    WAREHOUSE_SIZE = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 10
    SCALING_POLICY = 'STANDARD'
    AUTO_SUSPEND = 300;
2. Real-Time Analytics and Streaming:
-- Streaming data integration (Snowflake-style change stream)
CREATE STREAM customer_changes_stream
ON TABLE customers;
-- Real-time materialized views
CREATE MATERIALIZED VIEW real_time_sales_summary
BUILD IMMEDIATE
REFRESH FAST ON COMMIT
AS
SELECT
    product_category,
    sale_date,
    SUM(sales_amount) AS daily_sales,
    COUNT(*) AS transaction_count
FROM sales_fact
WHERE sale_date >= TRUNC(SYSDATE)
GROUP BY product_category, sale_date;
-- Change data capture integration
CREATE OR REPLACE PROCEDURE process_streaming_changes AS
BEGIN
    -- Process changes from the stream
    INSERT INTO sales_fact_realtime
    SELECT * FROM sales_changes_stream
    WHERE change_type = 'INSERT';
    -- Update real-time aggregates
    DBMS_MVIEW.REFRESH('real_time_sales_summary');
END;
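Consuming a change stream is typically a MERGE that reads the pending changes and advances the stream offset. A minimal Snowflake-style sketch, assuming a customer_dim target with customer_id, customer_name and address columns (names are illustrative):
MERGE INTO customer_dim d
USING customer_changes_stream s
    ON d.customer_id = s.customer_id
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    UPDATE SET d.customer_name = s.customer_name,
               d.address       = s.address
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    INSERT (customer_id, customer_name, address)
    VALUES (s.customer_id, s.customer_name, s.address);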
3. Machine Learning Integration:
-- ML-powered data quality
CREATE OR REPLACE FUNCTION ml_data_quality_check(
    p_table_name  VARCHAR2,
    p_column_name VARCHAR2
) RETURN VARCHAR2 AS
    v_anomaly_score  NUMBER;
    v_quality_status VARCHAR2(20);
BEGIN
    -- Use an ML scoring function to detect anomalies; dynamic SQL is needed
    -- because the table and column names arrive as parameters
    EXECUTE IMMEDIATE
        'SELECT ml_anomaly_detection(' || p_column_name || ') FROM ' || p_table_name
        INTO v_anomaly_score;
    v_quality_status := CASE
        WHEN v_anomaly_score > 0.8 THEN 'ANOMALY_DETECTED'
        WHEN v_anomaly_score > 0.6 THEN 'QUALITY_WARNING'
        ELSE 'QUALITY_OK'
    END;
    RETURN v_quality_status;
END;
-- Predictive analytics in DW
CREATE TABLE ml_predictions (
    prediction_id    NUMBER PRIMARY KEY,
    model_name       VARCHAR2(100),
    input_features   CLOB,
    prediction_value NUMBER,
    confidence_score NUMBER,
    prediction_date  TIMESTAMP,
    actual_value     NUMBER,
    model_accuracy   NUMBER
);
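Once actual values are written back into ml_predictions, model drift can be monitored with plain SQL; a minimal sketch computing mean absolute percentage error (MAPE) per model over the last 30 days:
SELECT model_name,
       COUNT(*) AS scored_predictions,
       ROUND(AVG(ABS(prediction_value - actual_value) /
                 NULLIF(ABS(actual_value), 0)) * 100, 2) AS mape_pct,
       ROUND(AVG(confidence_score), 2) AS avg_confidence
FROM ml_predictions
WHERE prediction_date >= SYSDATE - 30
  AND actual_value IS NOT NULL
GROUP BY model_name
ORDER BY mape_pct;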
4. Data Mesh Architecture:
#!/bin/bash
# Data mesh implementation
setup_data_mesh() {
    local domains=("sales" "marketing" "finance" "operations")
    for domain in "${domains[@]}"; do
        echo "Setting up data mesh for $domain domain"
        # Create domain-specific data products
        create_data_products "$domain"
        # Setup domain ownership
        assign_domain_ownership "$domain"
        # Implement self-serve data platform
        setup_selfserve_platform "$domain"
        # Configure federated governance
        configure_federated_governance "$domain"
    done
}
create_data_products() {
    local domain="$1"
    # Create domain-specific schemas
    sqlplus -s admin/pass@db << EOF
CREATE SCHEMA ${domain}_data_products;
-- Domain-specific fact tables
CREATE TABLE ${domain}_data_products.${domain}_facts (
    fact_id NUMBER PRIMARY KEY,
    date_key NUMBER,
    domain_specific_metrics JSON,
    created_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Domain-specific APIs
CREATE OR REPLACE PACKAGE ${domain}_data_api AS
    FUNCTION get_metrics(p_date_range VARCHAR2) RETURN SYS_REFCURSOR;
    PROCEDURE publish_data_product(p_product_spec CLOB);
END;
/
EOF
}
5. Graph Analytics and Knowledge Graphs:
-- Graph-based data relationships
CREATE TABLE entity_relationships (
    relationship_id       NUMBER PRIMARY KEY,
    source_entity_type    VARCHAR2(50),
    source_entity_id      VARCHAR2(100),
    relationship_type     VARCHAR2(50),
    target_entity_type    VARCHAR2(50),
    target_entity_id      VARCHAR2(100),
    relationship_strength NUMBER,
    created_date          TIMESTAMP
);
-- Graph traversal queries (ANSI-style recursive CTE)
WITH RECURSIVE entity_network AS (
    -- Starting entities
    SELECT
        source_entity_id AS entity_id,
        source_entity_type AS entity_type,
        0 AS depth,
        CAST(source_entity_id AS VARCHAR(1000)) AS path
    FROM entity_relationships
    WHERE source_entity_id = 'CUSTOMER_12345'
    UNION ALL
    -- Connected entities up to three hops away
    SELECT
        er.target_entity_id,
        er.target_entity_type,
        en.depth + 1,
        en.path || ' -> ' || er.target_entity_id
    FROM entity_relationships er
    JOIN entity_network en ON er.source_entity_id = en.entity_id
    WHERE en.depth < 3
)
SELECT * FROM entity_network;
6. Quantum Computing Preparation:
-- Quantum-ready data structures
CREATE TABLE quantum_optimized_data (
    quantum_state_id            NUMBER PRIMARY KEY,
    classical_data              CLOB,
    quantum_encoding            BLOB,
    entanglement_metadata       JSON,
    quantum_algorithm_hint      VARCHAR2(100),
    processing_complexity_class VARCHAR2(50)
);
-- Prepare for quantum speedup in specific algorithms
CREATE OR REPLACE PACKAGE quantum_algorithms AS
    -- Quantum database search preparation
    FUNCTION prepare_grover_search(p_dataset_id NUMBER) RETURN CLOB;
    -- Quantum machine learning data prep
    FUNCTION prepare_qml_dataset(p_features CLOB) RETURN BLOB;
END;
7. Autonomous Data Warehouse:
-- Self-tuning capabilities
CREATE TABLE autonomous_tuning_log (
    tuning_id                   NUMBER PRIMARY KEY,
    detected_issue              VARCHAR2(200),
    recommended_action          CLOB,
    action_taken                CLOB,
    performance_improvement_pct NUMBER,
    tuning_timestamp            TIMESTAMP,
    auto_applied                CHAR(1)
);
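A simple review query keeps the autonomous tuning auditable; a minimal sketch reporting average improvement by issue type and how often fixes were auto-applied:
SELECT detected_issue,
       COUNT(*) AS tuning_events,
       ROUND(AVG(performance_improvement_pct), 1) AS avg_improvement_pct,
       SUM(CASE WHEN auto_applied = 'Y' THEN 1 ELSE 0 END) AS auto_applied_count
FROM autonomous_tuning_log
WHERE tuning_timestamp >= SYSDATE - 30
GROUP BY detected_issue
ORDER BY avg_improvement_pct DESC;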
-- Self-healing procedures
CREATE OR REPLACE PROCEDURE autonomous_optimization AS
BEGIN
    -- Identify performance issues and apply automatic fixes
    FOR query_rec IN (
        SELECT sql_text, avg_execution_time
        FROM query_performance_log
        WHERE avg_execution_time > 10000   -- 10 seconds (in milliseconds)
    ) LOOP
        -- Auto-create missing indexes
        auto_create_recommended_indexes(query_rec.sql_text);
        -- Auto-update statistics
        auto_update_table_statistics(query_rec.sql_text);
        -- Auto-tune SQL
        auto_tune_sql_statement(query_rec.sql_text);
    END LOOP;
END;
8. Sustainability and Green Computing:
#!/bin/bash
# Green data warehouse practices
optimize_energy_efficiency() {
    echo "Implementing green DW practices..."
    # Workload scheduling during off-peak hours
    schedule_low_priority_jobs_off_peak
    # Auto-shutdown of non-critical resources
    implement_auto_shutdown_policies
    # Optimize query efficiency to reduce compute
    analyze_and_optimize_resource_intensive_queries
    # Carbon footprint tracking
    track_carbon_footprint
    echo "Green optimization completed"
}
track_carbon_footprint() {
    local cpu_hours=$(get_monthly_cpu_hours)
    local storage_gb=$(get_monthly_storage_usage)
    local network_gb=$(get_monthly_network_transfer)
    # Calculate carbon footprint (example formula with illustrative emission factors)
    local carbon_kg=$(echo "scale=2; ($cpu_hours * 0.5) + ($storage_gb * 0.001) + ($network_gb * 0.01)" | bc)
    echo "Monthly carbon footprint: ${carbon_kg} kg CO2"
    # Log for sustainability reporting
    log_sustainability_metrics "$carbon_kg"
}
Future Technology Integration Roadmap:
• 2024-2025: Cloud-native adoption, real-time analytics
• 2026-2027: AI/ML integration, autonomous optimization
• 2028-2029: Quantum computing pilot projects
• 2030+: Fully autonomous, sustainable data ecosystems
Each topic represents the evolution of data warehousing toward more intelligent, autonomous, and sustainable platforms.