Data Warehouse Concepts - Interview Q&A
Schema Design & Architecture
1. What are Facts and Dimensions? Explain with examples.
Answer:
• Facts: Quantitative, measurable data representing business metrics
  – Examples: Sales amount, quantity sold, profit, revenue
  – Characteristics: Numeric, additive, time-variant
• Dimensions: Descriptive attributes that provide context to facts
  – Examples: Customer, Product, Time, Location
  – Characteristics: Textual, categorical, hierarchical
Real-world scenario: In a retail sales fact table, sales_amount is a fact, while customer_name, product_category, and sale_date come from the Customer, Product, and Date dimensions. A minimal schema sketch follows below.
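A minimal sketch of how this looks as tables. The names are illustrative only and not tied to a specific example elsewhere in this document:
-- Dimension: descriptive context
CREATE TABLE customer_dim (
    customer_key  INTEGER PRIMARY KEY,   -- surrogate key
    customer_name VARCHAR(100),
    customer_city VARCHAR(50)
);
-- Fact: measurable events, linked to dimensions by foreign keys
CREATE TABLE retail_sales_fact (
    date_key      INTEGER,
    customer_key  INTEGER REFERENCES customer_dim(customer_key),
    product_key   INTEGER,
    sales_amount  NUMBER(12,2),          -- additive measure
    quantity_sold NUMBER                 -- additive measure
);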
2. Are Fact and Dimension tables related to each other?
Answer: Yes, through foreign key relationships:
• Fact tables contain foreign keys referencing dimension tables
• This creates a star or snowflake schema
• Enables drilling down and slicing/dicing of data
• Supports OLAP operations and analytics
3. When loading data into Fact and Dimension tables, which one will you load first?
Answer: Dimensions first, then Facts, because:
• Facts contain foreign keys referencing dimensions
• Referential integrity must be maintained
• Dimension surrogate keys are needed before fact loading
• Loading sequence: Dimensions → Bridge tables → Facts
Best Practice: Use surrogate keys and implement proper error handling for missing dimension references, as in the sketch below.
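A minimal sketch of the surrogate-key lookup during the fact load, assuming staging and dimension tables like those defined later in this document (the names and the -1 default key are illustrative):
INSERT INTO sales_fact (date_key, customer_key, product_key, sales_amount)
SELECT COALESCE(d.date_key, -1),
       COALESCE(c.customer_key, -1),    -- default key for missing dimension rows
       COALESCE(p.product_key, -1),
       s.sales_amount
FROM stg_sales s
LEFT JOIN date_dim     d ON d.full_date   = s.sale_date
LEFT JOIN customer_dim c ON c.customer_id = s.customer_id AND c.current_flag = 'Y'
LEFT JOIN product_dim  p ON p.product_id  = s.product_id  AND p.current_flag = 'Y';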
4. Difference between Star and Snowflake schemas?
Answer:
Aspect | Star Schema | Snowflake Schema
Structure | Denormalized dimensions | Normalized dimensions
Complexity | Simple, flat structure | Complex, hierarchical
Storage | More storage space | Less storage space
Query Performance | Faster queries | Slower due to joins
Maintenance | Easier to maintain | More complex maintenance
Data Redundancy | Higher redundancy | Lower redundancy
When to use Star: OLAP, reporting, simple analytics
When to use Snowflake: Data integrity critical, storage optimization
5. What is SCD (Slowly Changing Dimensions)?
Answer: SCD handles changes in dimension data over time:
SCD Type 0: No changes allowed
SCD Type 1: Overwrite old values
SCD Type 2: Create new records with versioning
SCD Type 3: Add new columns for changes
SCD Type 4: Use separate history table
SCD Type 6: Combination of 1, 2, and 3
6. SCD Type 2 Scenario and Implementation
Scenario: Customer changes address from “Bangalore” to “Hyderabad”
Before Change:
Cust_Key | Cust_ID | Name | Address | Start_Date | End_Date | Current_Flag
1 | 101 | John | Bangalore | 2023-01-01 | NULL | Y
After Change:
Cust_Key | Cust_ID | Name | Address | Start_Date | End_Date | Current_Flag
1 | 101 | John | Bangalore | 2023-01-01 | 2024-01-15 | N
2 | 101 | John | Hyderabad | 2024-01-16 | NULL | Y
Implementation Flags:
• Current_Flag: Y/N
• Effective_Date: Start date
• Expiry_Date: End date
• Version_Number: 1, 2, 3…
A sketch of the expire-and-insert flow follows below.
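A minimal sketch of the expire-and-insert logic for this scenario, assuming a customer_dim table shaped like the rows above and a hypothetical sequence customer_key_seq for surrogate keys:
-- Step 1: expire the current record
UPDATE customer_dim
SET end_date = DATE '2024-01-15',
    current_flag = 'N'
WHERE cust_id = 101
  AND current_flag = 'Y';

-- Step 2: insert the new version with a fresh surrogate key
INSERT INTO customer_dim (cust_key, cust_id, name, address, start_date, end_date, current_flag)
VALUES (customer_key_seq.NEXTVAL, 101, 'John', 'Hyderabad', DATE '2024-01-16', NULL, 'Y');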
7. What is a Data Mart?
Answer:
• Subset of the data warehouse focused on a specific business area
• Subject-oriented: Sales, Finance, HR marts
• Faster query performance due to smaller size
• Departmental access with relevant data only
• Types: Dependent (sourced from the DW) vs Independent (loaded directly from sources); see the sketch below
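One way a dependent Sales mart can be carved out of the warehouse, sketched with illustrative table and column names:
CREATE TABLE sales_mart_monthly AS
SELECT d.year,
       d.month_name,
       p.product_category,
       SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN date_dim d    ON f.date_key = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY d.year, d.month_name, p.product_category;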
8. Explain OLTP vs OLAP
Answer:
Aspect | OLTP | OLAP
Purpose | Transaction processing | Analytical processing
Data | Current, operational | Historical, summarized
Schema | Normalized (3NF) | Denormalized (Star/Snowflake)
Queries | Simple, frequent | Complex, ad-hoc
Users | Many concurrent | Few analytical
Response Time | Milliseconds | Seconds to minutes
9. What is Data Lineage?
Answer:
• Tracks data flow from source to target
• Documents transformations applied to data
• Enables impact analysis for changes
• Supports compliance and auditing
• Tools: Informatica Metadata Manager, Collibra
10. Explain Conformed Dimensions
Answer:
• Shared dimensions across multiple fact tables
• Same structure and values in all uses
• Enables drill-across queries between facts
• Example: Time dimension used by Sales and Inventory facts
• Benefits: Consistency, reusability, integrated reporting
Advanced DWH Concepts
11. What is a Surrogate Key and why use it?
Answer:
• System-generated unique identifier (usually an integer)
• Independent of business data changes
• Benefits (see the sketch below):
  – Performance (integer joins are faster)
  – SCD Type 2 support
  – Handles source system changes
  – Smaller storage footprint
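A minimal sketch of surrogate key generation with an Oracle-style sequence (sequence and table names are illustrative):
CREATE SEQUENCE customer_key_seq START WITH 1 INCREMENT BY 1;

INSERT INTO customer_dim (customer_key, customer_id, customer_name)
SELECT customer_key_seq.NEXTVAL,    -- surrogate key, independent of the source system
       s.customer_id,               -- natural/business key
       s.customer_name
FROM stg_customers s;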
12. Explain Type 2 Dimension Handling Strategies
Answer:
Version Flag Method: Current_Flag (Y/N) indicating the active record
Effective Date Method: Start_Date and End_Date for the validity period
Version Number Method: Version 1, 2, 3... for each change
Best Practice: Combine multiple methods for robustness, as in the sketch below.
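A sketch of a Type 2 dimension that combines all three tracking methods (table and column names are illustrative):
CREATE TABLE product_dim (
    product_key    INTEGER PRIMARY KEY,  -- surrogate key
    product_id     VARCHAR(20),          -- natural key
    product_name   VARCHAR(100),
    start_date     DATE,                 -- effective date method
    end_date       DATE,
    current_flag   CHAR(1),              -- version flag method
    version_number INTEGER               -- version number method
);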
13. What are Aggregate Tables and when to use them?
Answer:
• Pre-calculated summary tables for faster queries
• When to use:
  – Frequently accessed summary data
  – Complex calculations taking a long time
  – Large fact tables with millions of records
• Types: Daily, Monthly, Yearly aggregates
• Maintenance: Refresh strategies (incremental/full); see the sketch below
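One possible implementation, sketched as an Oracle-style materialized view so the refresh is handled by the database (names are illustrative):
CREATE MATERIALIZED VIEW daily_sales_agg
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT date_key,
       product_key,
       SUM(sales_amount)  AS total_sales,
       SUM(quantity_sold) AS total_quantity
FROM sales_fact
GROUP BY date_key, product_key;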
14. Explain Factless Fact Tables
Answer:
• Fact tables without measures
• Capture events or relationships
• Examples:
  – Student course enrollment
  – Product promotions
  – Employee training attendance
• Contain only foreign keys to dimensions
• Used for counting occurrences of events
15. What is Change Data Capture (CDC)?
Answer:
• Identifies and captures changes in source systems
• Types:
  – Timestamp-based: Using modification dates
  – Trigger-based: Database triggers capture changes
  – Log-based: Reading transaction logs
  – Snapshot comparison: Compare current vs previous
• Benefits: Efficient incremental loading, reduced processing time (see the timestamp-based sketch below)
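A minimal sketch of the timestamp-based approach, assuming a last_modified_date column on the source and a hypothetical etl_control watermark table:
-- Extract only rows changed since the last successful run
INSERT INTO stg_customers (customer_id, customer_name, address, load_date)
SELECT c.customer_id, c.customer_name, c.address, SYSDATE
FROM source_customers c
WHERE c.last_modified_date > (SELECT last_extract_date
                              FROM etl_control
                              WHERE table_name = 'SOURCE_CUSTOMERS');

-- Advance the watermark after the load succeeds
UPDATE etl_control
SET last_extract_date = SYSDATE
WHERE table_name = 'SOURCE_CUSTOMERS';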
16. What is a Staging Area and its purpose?
Answer: Purpose of Staging Area:
• Temporary storage for raw source data
• Data validation and cleansing before loading
• Transformation processing area
• Error handling and data quality checks
• Recovery point for failed loads
Staging Area Characteristics:
• Minimal transformation - raw data storage
• Short retention period (typically 24-48 hours)
• No complex business rules applied
• Same structure as source or a flat structure
Implementation:
-- Staging table structure
CREATE TABLE stg_customers (
customer_id NUMBER,
customer_name VARCHAR2(100),
address VARCHAR2(200),
phone VARCHAR2(15),
email VARCHAR2(100),
load_date DATE DEFAULT SYSDATE,
source_system VARCHAR2(20),
record_status VARCHAR2(10) DEFAULT 'NEW'
);
17. Explain Kimball vs Inmon Methodology
Answer:
Aspect | Kimball (Bottom-Up) | Inmon (Top-Down)
Approach | Dimensional modeling first | Enterprise data model first
Start Point | Data marts | Enterprise data warehouse
Implementation | Faster, iterative | Longer, comprehensive
Data Storage | Denormalized (star schema) | Normalized (3NF)
Cost | Lower initial cost | Higher initial investment
Business Value | Quick wins, faster ROI | Long-term enterprise solution
Complexity | Simpler to understand | More complex architecture
Kimball Approach:
1. Identify the business process
2. Declare the grain
3. Identify the dimensions
4. Identify the facts
Inmon Approach:
1. Build the enterprise data model
2. Create the integrated data warehouse
3. Build data marts from the EDW
4. Apply business rules consistently
18. What is Grain in Dimensional Modeling?
Answer: Definition: The most atomic level of detail captured in a fact table
Grain Examples:
• Sales Fact: One row per line item per transaction
• Inventory Fact: One row per product per location per day
• Web Analytics: One row per page view per session
Grain Declaration Process:
1. Business Requirements: What questions need answers?
2. Source System Analysis: What detail is available?
3. Performance Considerations: Storage and query performance
4. Future Flexibility: Ability to drill down
Impact on Design:
• Determines fact table size and performance
• Affects dimension relationships
• Influences aggregation strategies
• Drives ETL complexity
-- Example: Sales fact table grain
CREATE TABLE sales_fact (
    -- Grain: One row per product sold per transaction
date_key INTEGER,
customer_key INTEGER,
product_key INTEGER,
store_key INTEGER,
transaction_id VARCHAR(20),
line_number INTEGER,
-- Facts/Measures
quantity_sold NUMBER,
unit_price NUMBER(10,2),
extended_amount NUMBER(12,2),
discount_amount NUMBER(10,2)
);
19. What are Conformed Dimensions and their benefits?
Answer: Definition: Shared dimensions across multiple fact tables with identical structure and content
Benefits:
• Drill-across queries between different facts
• Consistent reporting across business processes
• Reduced storage through dimension reuse
• Simplified maintenance - single version of truth
Implementation:
-- Shared time dimension
CREATE TABLE date_dim (
date_key INTEGER PRIMARY KEY,
full_date DATE,
day_of_week VARCHAR(10),
month_name VARCHAR(10),
quarter VARCHAR(10),
fiscal_year INTEGER,
is_holiday CHAR(1)
);
-- Used by multiple fact tables
CREATE TABLE sales_fact (
date_key INTEGER REFERENCES date_dim(date_key),
    -- other dimensions and facts
);
CREATE TABLE inventory_fact (
date_key INTEGER REFERENCES date_dim(date_key),
    -- other dimensions and facts
);
Drill-Across Query Example:
-- Compare sales and inventory by month
SELECT
d.month_name,
s.total_sales,
i.avg_inventory
FROM date_dim d
LEFT JOIN (
SELECT date_key, SUM(extended_amount) AS total_sales
FROM sales_fact GROUP BY date_key
) s ON d.date_key = s.date_key
LEFT JOIN (
SELECT date_key, AVG(quantity_on_hand) AS avg_inventory
FROM inventory_fact GROUP BY date_key
) i ON d.date_key = i.date_key
WHERE d.fiscal_year = 2024;
20. Explain Type 4 and Type 6 SCD implementations
Answer:
SCD Type 4 - History Table:
• Current table: Latest values only
• History table: All historical versions
• Benefits: Fast current data access, complete history
-- Current customer table
CREATE TABLE customer_current (
customer_id INTEGER PRIMARY KEY,
customer_name VARCHAR(100),
address VARCHAR(200),
phone VARCHAR(15),
email VARCHAR(100),
last_updated TIMESTAMP
);
-- Customer history table
CREATE TABLE customer_history (
customer_id INTEGER,
customer_name VARCHAR(100),
address VARCHAR(200),
phone VARCHAR(15),
email VARCHAR(100),
effective_date DATE,
end_date DATE,
change_reason VARCHAR(50),
    PRIMARY KEY (customer_id, effective_date)
);
SCD Type 6 - Hybrid Approach (1+2+3):
• Type 1: Current value always updated
• Type 2: Historical versions maintained
• Type 3: Previous value stored in a separate column
CREATE TABLE customer_type6 (
    customer_key INTEGER PRIMARY KEY,       -- Surrogate key
    customer_id INTEGER,                    -- Natural key
    customer_name_current VARCHAR(100),     -- Type 1: Always current
    customer_name_previous VARCHAR(100),    -- Type 3: Previous value
    customer_name_original VARCHAR(100),    -- Type 3: Original value
    address_current VARCHAR(200),           -- Type 1: Always current
    address_historical VARCHAR(200),        -- Type 2: Historical value
    effective_date DATE,                    -- Type 2: Effective date
    expiry_date DATE,                       -- Type 2: Expiry date
    current_flag CHAR(1),                   -- Type 2: Current indicator
    version_number INTEGER                  -- Type 2: Version tracking
);
21. What is a Bus Matrix and its importance?
Answer: Definition: Enterprise-wide matrix showing business processes and their associated dimensions
Components:
• Rows: Business processes (Order Management, Inventory, HR)
• Columns: Dimensions (Customer, Product, Time, Location)
• Intersections: Mark which dimensions apply to each process
Benefits:
• Enterprise planning roadmap
• Dimension standardization across projects
• Integration opportunities identification
• Resource allocation planning
Example Bus Matrix:
Business Process | Customer | Product | Time | Location | Employee | Vendor
Sales | X | X | X | X | | 
Inventory | | X | X | X | | 
Purchasing | | X | X | X | | X
HR | | | X | X | X | 
Finance | X | | X | X | X | X
22. Explain Late Arriving Dimensions and handling strategies
Answer: Problem: Fact records arrive before their corresponding dimension records
Handling Strategies:
1. Unknown Member Technique:
-- Create an unknown/default member
INSERT INTO customer_dim VALUES (
    -1, 'UNKNOWN', 'Unknown Customer', 'TBD', 'TBD', 'TBD',
    '1900-01-01', '9999-12-31', 'Y', 1
);
-- Use the unknown key for late arriving facts
INSERT INTO sales_fact
SELECT
date_key,
COALESCE(c.customer_key, -1) AS customer_key,
product_key,
sales_amount
FROM staging_sales s
LEFT JOIN customer_dim c ON s.customer_id = c.customer_id;
2. Inferred Member Technique:
-- Create a minimal dimension record when the fact arrives
INSERT INTO customer_dim (
customer_key, customer_id, customer_name,
start_date, end_date, current_flag,
inferred_flag
)
SELECT
customer_seq.NEXTVAL, customer_id, 'Inferred Member',
SYSDATE, '9999-12-31', 'Y', 'Y'
FROM staging_sales s
WHERE NOT EXISTS (
SELECT 1 FROM customer_dim c
WHERE c.customer_id = s.customer_id
);
-- Update when the actual dimension data arrives
UPDATE customer_dim
SET customer_name = actual_name,
address = actual_address,
inferred_flag = 'N'
WHERE customer_id = incoming_customer_id
  AND inferred_flag = 'Y';
23. What is a Factless Fact Table?
Answer: Definition: Fact table containing only foreign keys to dimensions, no numeric measures
Types:
1. Event Factless Facts:
• Record events or relationships
• Example: Student course enrollment, employee training
CREATE TABLE student_enrollment_fact (
    date_key INTEGER,
    student_key INTEGER,
    course_key INTEGER,
    instructor_key INTEGER,
    classroom_key INTEGER,
    -- No measures, just the relationship
    PRIMARY KEY (date_key, student_key, course_key)
);
2. Coverage/Condition Factless Facts:
• Record states or conditions at points in time
• Example: Promotion eligibility, product-store assignments
CREATE TABLE promotion_coverage_fact (
    date_key INTEGER,
    customer_key INTEGER,
    product_key INTEGER,
    promotion_key INTEGER,
    store_key INTEGER,
    -- Records which customers were eligible for which promotions
    PRIMARY KEY (date_key, customer_key, product_key, promotion_key)
);
Query Examples:
-- Count students enrolled in each course
SELECT
c.course_name,
COUNT(*) AS enrollment_count
FROM
student_enrollment_fact sef
JOIN course_dim c ON sef.course_key = c.course_key
JOIN date_dim d ON sef.date_key = d.date_key
WHERE d.semester = 'Fall 2024'
GROUP BY c.course_name;
-- Find customers NOT in a promotion
SELECT customer_key
FROM customer_dim
WHERE customer_key NOT IN (
SELECT DISTINCT customer_key
FROM
promotion_coverage_fact
WHERE promotion_key = 123
);
24. Explain Data Vault Methodology
Answer: Components:
1. Hubs: Business keys and metadata
CREATE TABLE hub_customer (
customer_hash_key CHAR(32) PRIMARY KEY,
customer_id VARCHAR(20),
load_date TIMESTAMP,
record_source VARCHAR(50)
);
2. Links: Relationships between business keys
CREATE TABLE link_customer_order (
customer_order_hash_key CHAR(32) PRIMARY KEY,
customer_hash_key CHAR(32),
order_hash_key CHAR(32),
load_date TIMESTAMP,
record_source VARCHAR(50)
);
3. Satellites: Descriptive attributes
CREATE TABLE sat_customer_details (
customer_hash_key CHAR(32),
load_date TIMESTAMP,
customer_name VARCHAR(100),
address VARCHAR(200),
phone VARCHAR(15),
email VARCHAR(100),
hash_diff CHAR(32),
record_source VARCHAR(50),
    PRIMARY KEY (customer_hash_key, load_date)
);
Benefits:
• Audit trail by design
• Parallel loading capabilities
• Source system agnostic
• Highly scalable architecture
25. What is OLAP and what are the types of OLAP operations?
Answer: OLAP Operations:
1. Drill Down: Navigate from summary to detail
-- From yearly to monthly to daily
SELECT year, month, day, SUM(sales_amount)
FROM sales_fact sf
JOIN date_dim d ON sf.date_key = d.date_key
GROUP BY year, month, day;
2. Drill Up (Roll Up): Navigate from detail to summary
-- From daily to monthly
SELECT year, month, SUM(sales_amount)
FROM sales_fact sf
JOIN date_dim d ON sf.date_key = d.date_key
GROUP BY year, month;
3. Slice: Fix one dimension, analyze others
-- Slice by a specific product category
SELECT customer_region, time_period, SUM(sales_amount)
FROM sales_fact sf
JOIN product_dim p ON sf.product_key = p.product_key
WHERE p.category = 'Electronics'
GROUP BY customer_region, time_period;
4. Dice: Filter multiple dimensions
-- Dice by region, time period, and product
SELECT SUM(sales_amount)
FROM sales_fact sf
JOIN customer_dim c ON sf.customer_key = c.customer_key
JOIN date_dim d ON sf.date_key = d.date_key
JOIN product_dim p ON sf.product_key = p.product_key
WHERE c.region IN ('North', 'South')
AND d.year = 2024
AND p.category = 'Electronics';
5. Pivot (Rotation): Swap rows and columns
-- Pivot months to columns
SELECT * FROM (
    SELECT region, month_name, sales_amount
FROM sales_summary
) PIVOT (
SUM(sales_amount)
FOR month_name IN ('Jan', 'Feb', 'Mar', 'Apr')
);
26-50. Additional DWH Concepts
26. Dimensional Modeling Best Practices
Answer: Design Guidelines:
• Bus Matrix for enterprise planning
• Conformed dimensions for integration
• Consistent grain declaration
• Descriptive attribute naming
• Surrogate keys for all dimensions
• Graceful handling of data quality issues
Naming Conventions:
-- Dimension tables: [subject]_dim
customer_dim, product_dim, date_dim
-- Fact tables: [process]_fact
sales_fact, inventory_fact, shipment_fact
-- Keys: [table]_key
customer_key, product_key, date_key
-- Measures: descriptive names
quantity_sold, extended_amount, discount_amount
27. Data Quality Framework
Answer: Data Quality Dimensions:
• Completeness: No missing values
• Accuracy: Correct values
• Consistency: Same values across systems
• Validity: Values within acceptable ranges
• Timeliness: Data available when needed
• Uniqueness: No duplicates
Implementation:
-- Data quality checks
CREATE TABLE data_quality_rules (
rule_id INTEGER,
table_name VARCHAR(50),
column_name VARCHAR(50),
rule_type VARCHAR(20),
rule_definition TEXT,
threshold_value NUMBER
);
-- Quality monitoring
CREATE TABLE data_quality_results (
execution_date DATE,
rule_id INTEGER,
records_checked INTEGER,
records_failed INTEGER,
failure_rate NUMBER,
status VARCHAR(10)
);
28. Master Data Management (MDM)
Answer: MDM Components:
• Golden Record: Single version of truth
• Data Governance: Policies and procedures
• Data Stewardship: Ownership and accountability
• Match and Merge: Duplicate identification
• Data Lineage: Source tracking
Implementation Approach:
-- Customer master with confidence scoring
CREATE TABLE customer_master (
master_customer_id INTEGER,
source_system VARCHAR(20),
source_customer_id VARCHAR(50),
customer_name VARCHAR(100),
confidence_score NUMBER(3,2),
match_rule VARCHAR(50),
created_date TIMESTAMP,
last_updated TIMESTAMP
);
29. Real-time Data Warehousing
Answer: Architecture Components:
• Change Data Capture (CDC)
• Message Queues (Kafka, MQ)
• Stream Processing (Storm, Spark)
• In-Memory Databases
• Operational Data Store (ODS)
Implementation Patterns:
-- Near real-time CDC using triggers
CREATE OR REPLACE TRIGGER customer_cdc_trigger
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW
BEGIN
INSERT INTO
cdc_customer_changes VALUES (
CASE
WHEN INSERTING THEN 'I'
WHEN UPDATING THEN 'U'
WHEN DELETING THEN 'D'
END,
COALESCE(:NEW.customer_id, :OLD.customer_id),
SYSDATE
);
END;
30. Data Vault 2.0 Enhancements
Answer: New Features:
• Hash Keys: For performance and distribution
• Business Vault: Calculated and derived data
• NoSQL Integration: Handle unstructured data
• Parallel Processing: Optimized for big data
-- Hash key generation
CREATE OR REPLACE FUNCTION generate_hash_key(
    p_business_key VARCHAR2
) RETURN VARCHAR2 IS
BEGIN
RETURN UPPER(
DBMS_CRYPTO.HASH(
UTL_RAW.CAST_TO_RAW(UPPER(TRIM(p_business_key))),
DBMS_CRYPTO.HASH_MD5
)
);
END;
31. Lambda Architecture for Data Warehousing
Answer: Three Layers:
• Batch Layer: Historical data processing
• Speed Layer: Real-time stream processing
• Serving Layer: Query interface
Implementation:
-- Batch view (historical)
CREATE TABLE batch_customer_metrics AS
SELECT
customer_id,
date_key,
SUM(purchase_amount) AS total_purchases,
COUNT(*) AS transaction_count
FROM sales_history
GROUP BY customer_id,
date_key;
-- Real-time view (current)
CREATE TABLE realtime_customer_metrics AS
SELECT
customer_id,
CURRENT_DATE AS date_key,
SUM(purchase_amount) AS total_purchases,
COUNT(*) AS transaction_count
FROM sales_stream
WHERE event_time >= CURRENT_DATE
GROUP BY customer_id;
32. Data Warehouse Automation
Answer: Automation Areas:
• Schema generation from source systems
• ETL code generation
• Data lineage documentation
• Test case generation
• Deployment automation
Tools and Techniques:
-- Metadata-driven ETL
CREATE TABLE etl_metadata (
    source_table VARCHAR(50),
    target_table VARCHAR(50),
source_column VARCHAR(50),
target_column VARCHAR(50),
transformation_rule TEXT,
data_type VARCHAR(20),
is_key CHAR(1)
);
-- Generate ETL code from metadata
SELECT
    'INSERT INTO ' || target_table ||
    ' SELECT ' ||
    LISTAGG(target_column, ', ') WITHIN GROUP (ORDER BY target_column) ||
    ' FROM ' || source_table AS generated_sql
FROM etl_metadata
GROUP BY target_table, source_table;
33. Cloud Data Warehousing
Answer: Cloud DW Characteristics:
• Elastic scaling based on demand
• Pay-per-use pricing models
• Managed services reduce maintenance
• Global distribution capabilities
• Integration with cloud ecosystem
Snowflake Example:
-- Virtual warehouses for different workloads
CREATE WAREHOUSE etl_warehouse
WITH WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
CREATE WAREHOUSE
reporting_warehouse
WITH WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
-- Time travel for historical queries
SELECT * FROM customer_dim
AT (TIMESTAMP => '2024-01-01 09:00:00'::TIMESTAMP);
34. Data Lake Integration
Answer: Data Lake vs Data Warehouse:
Aspect | Data Lake | Data Warehouse
Schema | Schema-on-read | Schema-on-write
Data Types | Structured, semi-structured, unstructured | Structured
Processing | ELT (Extract, Load, Transform) | ETL (Extract, Transform, Load)
Cost | Lower storage cost | Higher storage cost
Agility | High agility | Lower agility
Integration Architecture:
-- External table over data lake files
CREATE EXTERNAL TABLE raw_customer_data (
customer_json STRING
)
LOCATION 's3://datalake/customers/'
TBLPROPERTIES ('has_encrypted_data'='false');
-- Transform and load to the DW
INSERT INTO customer_dim
SELECT
customer_seq.NEXTVAL,
json_extract_scalar(customer_json, '$.id'),
json_extract_scalar(customer_json, '$.name'),
json_extract_scalar(customer_json, '$.address')
FROM raw_customer_data;
35. Temporal Data Warehousing
Answer: Temporal Aspects:
• Valid Time: When the fact was true in the real world
• Transaction Time: When the fact was stored in the database
• Bitemporal: Both valid and transaction time
-- Bitemporal dimension table
CREATE TABLE customer_bitemporal (
customer_key INTEGER,
customer_id INTEGER,
customer_name VARCHAR(100),
address VARCHAR(200),
-- Valid time
valid_from DATE,
valid_to DATE,
-- Transaction time
transaction_from TIMESTAMP,
transaction_to TIMESTAMP,
current_flag CHAR(1)
);
-- Query as of a specific valid and transaction time
SELECT * FROM customer_bitemporal
WHERE customer_id = 123
AND valid_from <= DATE '2024-01-01'
AND valid_to > DATE '2024-01-01'
AND transaction_from <= TIMESTAMP '2024-01-01 10:00:00'
AND transaction_to > TIMESTAMP '2024-01-01 10:00:00';
36. Data Warehouse Security
Answer: Security Layers:
• Network Security: VPNs, firewalls
• Authentication: User identity verification
• Authorization: Access control
• Encryption: Data at rest and in transit
• Auditing: Access logging
Implementation:
-- Row-level security
CREATE POLICY customer_region_policy ON customer_dim
FOR SELECT TO business_user
USING (region = sys_context('APP_CTX', 'USER_REGION'));
-- Column-level encryption
ALTER TABLE customer_dim
MODIFY (ssn ENCRYPT USING 'AES256');
-- Audit trail
CREATE TABLE access_audit (
user_name VARCHAR(50),
object_name VARCHAR(100),
action VARCHAR(20),
timestamp TIMESTAMP,
ip_address VARCHAR(15)
);
37. Data Warehouse Testing
Answer: Testing Types:
• Unit Testing: Individual transformations
• Integration Testing: End-to-end data flow
• Performance Testing: Load and stress testing
• Data Quality Testing: Completeness, accuracy
• User Acceptance Testing: Business validation
Automated Testing Framework:
-- Test case definitions
CREATE TABLE test_cases (
test_id INTEGER,
test_name VARCHAR(100),
test_sql TEXT,
expected_result TEXT,
test_category VARCHAR(50)
);
-- Test results tracking
CREATE TABLE test_results (
execution_id INTEGER,
test_id INTEGER,
execution_date TIMESTAMP,
actual_result TEXT,
test_status VARCHAR(10),
execution_time_ms INTEGER
);
38. Data Warehouse Optimization
Answer: Optimization Strategies:
• Partitioning: Improve query performance
• Indexing: Accelerate data access
• Materialized Views: Pre-aggregate data
• Compression: Reduce storage requirements
• Statistics: Optimize query plans
-- Partition pruning example
CREATE TABLE sales_fact (
sale_date DATE,
customer_id NUMBER,
product_id NUMBER,
sales_amount NUMBER
)
PARTITION BY RANGE (sale_date) (
PARTITION sales_2023 VALUES LESS THAN (DATE '2024-01-01'),
PARTITION sales_2024_q1 VALUES LESS THAN (DATE '2024-04-01'),
PARTITION sales_2024_q2 VALUES LESS THAN (DATE '2024-07-01')
);
-- Bitmap index for low-cardinality columns
CREATE BITMAP INDEX idx_customer_gender
ON customer_dim(gender);
-- Aggregate table for performance
CREATE TABLE monthly_sales_summary AS
SELECT
DATE_TRUNC('month', sale_date) AS month,
customer_region,
product_category,
SUM(sales_amount) AS monthly_sales
FROM sales_fact_detailed
GROUP BY DATE_TRUNC('month', sale_date),
customer_region, product_category;
39. Data Lineage and Impact Analysis
Answer: Data Lineage Components:
• Source-to-Target mapping
• Transformation Logic documentation
• Dependency Relationships
• Impact Analysis capabilities
• Audit Trail maintenance
-- Data lineage metadata
CREATE TABLE data_lineage (
lineage_id INTEGER,
source_system VARCHAR(50),
source_table VARCHAR(100),
source_column VARCHAR(100),
target_system VARCHAR(50),
target_table VARCHAR(100),
target_column VARCHAR(100),
transformation_rule TEXT,
created_date TIMESTAMP
);
-- Impact analysis query
WITH RECURSIVE impact_chain AS (
    SELECT source_table, target_table, 1 AS level
FROM data_lineage
WHERE source_table = 'customer_source'
UNION ALL
SELECT dl.source_table,
dl.target_table, ic.level + 1
FROM data_lineage dl
JOIN impact_chain ic ON dl.source_table = ic.target_table
WHERE ic.level < 10
)
SELECT DISTINCT target_table, level
FROM impact_chain
ORDER BY level;
40. Data Governance Framework
Answer: Governance Components:
• Data Stewardship: Ownership and accountability
• Data Standards: Naming, quality, format
• Data Policies: Usage and access rules
• Data Catalog: Metadata repository
• Compliance: Regulatory requirements
-- Data stewardship tracking
CREATE TABLE data_stewards (
steward_id INTEGER,
steward_name VARCHAR(100),
department VARCHAR(50),
contact_email VARCHAR(100),
responsibilities TEXT
);
CREATE TABLE data_assets (
asset_id INTEGER,
asset_name VARCHAR(100),
asset_type VARCHAR(50),
steward_id INTEGER,
classification VARCHAR(20),
sensitivity_level VARCHAR(20)
);
41. Multi-dimensional OLAP Cubes
Answer: OLAP Cube Architecture:
• Dimensions: Business perspectives (Time, Product, Customer, Geography)
• Measures: Quantitative data (Sales, Profit, Quantity)
• Hierarchies: Drill-down paths within dimensions
• Aggregations: Pre-calculated summaries at various levels
Implementation Approaches:
1. MOLAP (Multidimensional OLAP):
-- Conceptual cube definition (illustrative syntax; exact DDL varies by OLAP engine)
CREATE CUBE sales_cube (
DIMENSION time_dim (
HIERARCHY time_hierarchy (
LEVEL year,
LEVEL quarter CHILD OF year,
LEVEL month CHILD OF quarter,
LEVEL day CHILD OF month
)
),
DIMENSION product_dim (
HIERARCHY product_hierarchy (
LEVEL category,
LEVEL subcategory CHILD OF category,
LEVEL product CHILD OF subcategory
)
),
DIMENSION customer_dim (
HIERARCHY customer_hierarchy
(
LEVEL region,
LEVEL country CHILD OF region,
LEVEL state CHILD OF country,
LEVEL city CHILD OF state
)
),
MEASURE sales_amount SUM,
MEASURE quantity_sold SUM,
MEASURE profit_margin AVG
);
2. ROLAP (Relational OLAP) Implementation:
-- Cube view using the star schema
CREATE VIEW sales_cube_view AS
SELECT
d.year, d.quarter, d.month,
p.category, p.subcategory,
p.product_name,
c.region, c.country, c.state, c.city,
SUM(f.sales_amount) AS total_sales,
SUM(f.quantity) AS total_quantity,
AVG(f.profit_margin) AS avg_profit_margin,
COUNT(*) AS transaction_count
FROM sales_fact f
JOIN date_dim d ON f.date_key = d.date_key
JOIN product_dim p ON f.product_key = p.product_key
JOIN customer_dim c ON f.customer_key = c.customer_key
GROUP BY CUBE(
d.year, d.quarter, d.month,
p.category, p.subcategory,
p.product_name,
c.region, c.country, c.state, c.city
);
3. Aggregation Design:
-- Pre-computed aggregation tables
CREATE TABLE agg_sales_monthly AS
SELECT
date_key, product_category,
customer_region,
SUM(sales_amount) AS monthly_sales,
SUM(quantity) AS monthly_quantity,
COUNT(*) AS transaction_count
FROM sales_fact
GROUP BY date_key,
product_category, customer_region;
-- Quarterly aggregations
CREATE TABLE agg_sales_quarterly AS
SELECT
quarter_key, product_category,
SUM(sales_amount) AS quarterly_sales,
AVG(monthly_sales) AS avg_monthly_sales
FROM agg_sales_monthly
GROUP BY quarter_key,
product_category;
MDX Query Examples:
-- MDX query for sales analysis
SELECT
{[Time].[2024].[Q1].[January],
[Time].[2024].[Q1].[February]} ON COLUMNS,
{[Product].[Electronics],
[Product].[Clothing]} ON ROWS
FROM [Sales_Cube]
WHERE
[Customer].[Region].[North America]
-- Drill-down query
SELECT
[Time].[Year].MEMBERS ON COLUMNS,
DESCENDANTS([Product].[Electronics],
[Product].[Product], SELF) ON ROWS
FROM [Sales_Cube]
Performance Optimization:
• Aggregation Strategy: 80/20 rule - pre-compute the 20% of aggregations that serve 80% of queries
• Partitioning: Partition cubes by time periods
• Caching: Implement intelligent caching for frequently accessed data
• Incremental Processing: Update only changed partitions
42. Data Warehouse Backup and Recovery
Answer: Backup Strategy Components:
1. Backup Types:
-- Full backup strategy
BACKUP DATABASE datawarehouse
TO DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH COMPRESSION,
CHECKSUM, STATS = 10;
-- Differential backup
BACKUP DATABASE datawarehouse
TO DISK = 'C:\Backup\DW_Diff_20240315.bak'
WITH DIFFERENTIAL,
COMPRESSION, CHECKSUM;
-- Transaction log backup
BACKUP LOG datawarehouse
TO DISK = 'C:\Backup\DW_Log_20240315.trn'
WITH COMPRESSION,
CHECKSUM;
2. Backup Scheduling:
#!/bin/bash
# Automated backup schedule (crontab entries)
# Daily differential backups
0 2 * * * /scripts/backup_differential.sh
# Weekly full backups
0 1 * * 0 /scripts/backup_full.sh
# Hourly transaction log backups
0 * * * * /scripts/backup_transaction_log.sh
# Monthly archive to remote storage
0 3 1 * * /scripts/backup_archive.sh
3. Recovery Scenarios:
Point-in-Time Recovery:
-- Restore to a specific point in time
RESTORE DATABASE datawarehouse
FROM DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH NORECOVERY, REPLACE;
RESTORE DATABASE datawarehouse
FROM DISK = 'C:\Backup\DW_Diff_20240315.bak'
WITH NORECOVERY;
RESTORE LOG datawarehouse
FROM DISK = 'C:\Backup\DW_Log_20240315.trn'
WITH RECOVERY, STOPAT = '2024-03-15 14:30:00';
Partial Database Recovery:
-- Restore specific filegroups
RESTORE DATABASE datawarehouse
FILEGROUP = 'FACTS_2024'
FROM DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH PARTIAL,
NORECOVERY;
RESTORE DATABASE datawarehouse
FILEGROUP = 'DIMENSIONS'
FROM DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH NORECOVERY;
4. Backup Validation:
-- Verify backup integrity
RESTORE VERIFYONLY
FROM DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH CHECKSUM;
-- Test restore to verify recoverability
RESTORE DATABASE datawarehouse_test
FROM DISK = 'C:\Backup\DW_Full_20240315.bak'
WITH MOVE 'DW_Data' TO 'C:\Test\DW_Test.mdf',
MOVE 'DW_Log' TO 'C:\Test\DW_Test.ldf',
REPLACE;
5. Cloud Backup Integration:
#!/bin/bash
# Cloud backup integration
# Backup to Azure Blob Storage
sqlcmd -Q "BACKUP DATABASE datawarehouse
TO URL = 'https://account.blob.core.windows.net/backups/DW_$(date +%Y%m%d).bak'
WITH CREDENTIAL = 'AzureCredential', COMPRESSION;"
# Backup to AWS S3
aws s3 cp /backup/DW_Full_20240315.bak s3://dw-backups/$(date +%Y/%m/%d)/
43. Change Management in Data Warehouse
Answer: Change Management Framework:
1. Change Categories:
• Schema Changes: New tables, columns, indexes
• ETL Changes: Mapping modifications, new data sources
• Business Logic Changes: Calculation rules, transformations
• Infrastructure Changes: Hardware, software upgrades
2. Change Control Process:
-- Change tracking table
CREATE TABLE dw_change_log (
change_id NUMBER PRIMARY KEY,
change_type VARCHAR2(50),
object_name VARCHAR2(100),
change_description CLOB,
requested_by VARCHAR2(50),
approved_by VARCHAR2(50),
implemented_by VARCHAR2(50),
request_date DATE,
approval_date DATE,
implementation_date DATE,
rollback_plan CLOB,
status VARCHAR2(20),
environment VARCHAR2(20)
);
3. Version Control Integration:
#!/bin/bash
# Database version control
# Schema versioning
git add database/schemas/
git commit -m "Schema change: Added customer_preference table"
git tag -a v2.1.0 -m "Release 2.1.0 - Customer preferences"
# ETL code versioning
git add informatica/mappings/
git commit -m "ETL change: Updated customer transformation logic"
# Deployment script versioning
git add deployment/scripts/
git commit -m "Deployment: Added migration script for v2.1.0"
4. Environment Promotion:
-- Development to Test promotion (PL/SQL-style pseudocode)
BEGIN
    -- Validate changes in development
    IF validate_schema_changes() = 'PASS' THEN
        -- Deploy to the test environment
        deploy_to_test();
        -- Run regression tests
        IF run_regression_tests() = 'PASS' THEN
            UPDATE dw_change_log
            SET status = 'READY_FOR_PROD'
            WHERE change_id = :change_id;
        END IF;
    END IF;
END;
5. Rollback Procedures:
-- Automated rollback capability
CREATE OR REPLACE PROCEDURE rollback_change(p_change_id NUMBER) AS
v_rollback_script CLOB;
BEGIN
SELECT rollback_plan INTO v_rollback_script
FROM dw_change_log
WHERE change_id = p_change_id;
-- Execute rollback
EXECUTE IMMEDIATE v_rollback_script;
-- Log rollback
UPDATE dw_change_log
SET status = 'ROLLED_BACK',
rollback_date = SYSDATE
WHERE change_id = p_change_id;
END;
6. Impact Analysis:
-- Data lineage impact analysis
WITH RECURSIVE impact_analysis AS (
SELECT source_object,
target_object, 1 as level
FROM data_lineage
WHERE source_object = 'CHANGED_TABLE'
UNION ALL
SELECT dl.source_object,
dl.target_object, ia.level + 1
FROM data_lineage dl
JOIN impact_analysis ia ON dl.source_object = ia.target_object
WHERE ia.level < 5
)
SELECT target_object, level
FROM impact_analysis
ORDER BY level;
44. Performance Monitoring and Tuning
Answer: Performance Monitoring Framework:
1. Key Performance Indicators:
-- ETL performance tracking
CREATE TABLE etl_performance_metrics (
execution_id NUMBER,
workflow_name VARCHAR2(100),
start_time TIMESTAMP,
end_time TIMESTAMP,
duration_minutes NUMBER,
records_processed NUMBER,
throughput_per_minute NUMBER,
cpu_utilization_pct NUMBER,
memory_usage_mb NUMBER,
io_read_mb NUMBER,
io_write_mb NUMBER,
error_count NUMBER
);
-- Query performance tracking
CREATE TABLE query_performance_log (
query_id NUMBER,
sql_text CLOB,
execution_time_ms NUMBER,
logical_reads NUMBER,
physical_reads NUMBER,
cpu_time_ms NUMBER,
execution_plan_hash NUMBER,
execution_date TIMESTAMP
);
2. Automated Performance Monitoring:
#!/bin/bash
# Performance monitoring script
monitor_dw_performance() {
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    # Database server performance metrics
    local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
    local memory_usage=$(free | grep Mem | awk '{printf("%.2f", $3/$2 * 100.0)}')
    local disk_io=$(iostat -x 1 2 | tail -1 | awk '{print $4, $5}')
    # ETL process monitoring
    local running_workflows=$(ps aux | grep -c informatica)
    local failed_sessions=$(grep -c "FAILED" /logs/informatica/*.log)
    # Database connection monitoring
    local active_connections=$(netstat -an | grep :1521 | grep ESTABLISHED | wc -l)
    # Log metrics
    echo "$timestamp,$cpu_usage,$memory_usage,$disk_io,$running_workflows,$failed_sessions,$active_connections" \
        >> /logs/performance_metrics.csv
    # Alert on thresholds
    if (( $(echo "$cpu_usage > 80" | bc -l) )); then
        send_alert "High CPU usage: $cpu_usage%"
    fi
    if [ "$failed_sessions" -gt 0 ]; then
        send_alert "ETL failures detected: $failed_sessions failed sessions"
    fi
}
3. Query Performance Analysis:
-- Identify slow queries
SELECT
sql_text,
execution_count,
total_elapsed_time,
avg_elapsed_time,
total_logical_reads,
avg_logical_reads
FROM (
SELECT
sql_text,
COUNT(*) as execution_count,
SUM(execution_time_ms) as total_elapsed_time,
AVG(execution_time_ms) as avg_elapsed_time,
SUM(logical_reads) as
total_logical_reads,
AVG(logical_reads) as avg_logical_reads,
RANK() OVER (ORDER BY AVG(execution_time_ms) DESC) as performance_rank
FROM
query_performance_log
WHERE execution_date >= TRUNC(SYSDATE) - 7
GROUP BY sql_text
)
WHERE performance_rank <= 10;
4. Index Usage Analysis:
-- Index effectiveness analysis
SELECT
i.index_name,
i.table_name,
s.user_seeks,
s.user_scans,
s.user_lookups,
s.user_updates,
s.last_user_seek,
s.last_user_scan,
CASE
WHEN s.user_seeks + s.user_scans + s.user_lookups = 0
THEN 'UNUSED'
WHEN s.user_updates > (s.user_seeks + s.user_scans + s.user_lookups) * 10
THEN 'OVER_MAINTAINED'
ELSE 'ACTIVE'
END as index_status
FROM sys.indexes i
LEFT JOIN
sys.dm_db_index_usage_stats s
ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE i.index_id > 0;
5. Performance Tuning Recommendations:
-- Automatic tuning recommendations
WITH performance_analysis AS (
SELECT
table_name,
AVG(scan_time_ms) as avg_scan_time,
SUM(scan_count) as total_scans,
MAX(last_scan_date) as last_activity
FROM table_scan_stats
GROUP BY table_name
)
SELECT
table_name,
    CASE
        WHEN avg_scan_time > 1000 AND total_scans > 100
            THEN 'Consider partitioning or indexing'
        WHEN total_scans = 0 AND last_activity < SYSDATE - 90
            THEN 'Consider archiving - unused for 90+ days'
        WHEN avg_scan_time > 5000
            THEN 'Requires immediate optimization'
        ELSE 'Performance acceptable'
    END as recommendation
FROM performance_analysis;
45. Data Retention and Archival Strategies
Answer: Data Lifecycle Management:
1. Retention Policy Framework:
-- Data retention policies
CREATE TABLE data_retention_policies (
policy_id NUMBER PRIMARY KEY,
table_name VARCHAR2(100),
retention_period_months NUMBER,
archive_criteria VARCHAR2(500),
purge_criteria VARCHAR2(500),
compliance_requirement VARCHAR2(100),
policy_effective_date DATE,
created_by VARCHAR2(50)
);
-- Example policies
INSERT INTO data_retention_policies VALUES
(1, 'SALES_FACT', 84, 'age > 36 months', 'age > 84 months', 'SOX', DATE '2024-01-01', 'DBA_TEAM');
INSERT INTO data_retention_policies VALUES
(2, 'CUSTOMER_INTERACTIONS', 36, 'age > 12 months', 'age > 36 months', 'GDPR', DATE '2024-01-01', 'DBA_TEAM');
2. Automated Archival Process:
-- Archive procedure
CREATE OR REPLACE PROCEDURE archive_old_data(p_table_name VARCHAR2) AS
v_archive_criteria VARCHAR2(500);
v_archive_table VARCHAR2(100);
v_records_archived NUMBER;
BEGIN
-- Get archival criteria
SELECT archive_criteria INTO v_archive_criteria
FROM
data_retention_policies
WHERE table_name = p_table_name;
v_archive_table := p_table_name || '_ARCHIVE';
    -- Create the archive table if it does not already exist (ignore ORA-00955)
    BEGIN
        EXECUTE IMMEDIATE 'CREATE TABLE ' || v_archive_table ||
            ' AS SELECT * FROM ' || p_table_name || ' WHERE 1=0';
    EXCEPTION
        WHEN OTHERS THEN
            IF SQLCODE != -955 THEN RAISE; END IF;
    END;
-- Move data to archive
EXECUTE IMMEDIATE
'INSERT INTO ' || v_archive_table ||
' SELECT * FROM ' || p_table_name ||
' WHERE ' || v_archive_criteria;
v_records_archived := SQL%ROWCOUNT;
    -- Delete archived data from the main table
EXECUTE IMMEDIATE
'DELETE FROM ' || p_table_name ||
' WHERE ' || v_archive_criteria;
-- Log archival activity
INSERT INTO archival_log VALUES (
SYSDATE, p_table_name,
v_records_archived, 'SUCCESS'
);
COMMIT;
END;
3. Hierarchical Storage Management:
#!/bin/bash
# Tiered storage implementation
# Tier 1: High-performance SSD (current data)
# Tier 2: Standard SSD (recent data - 1 year)
# Tier 3: HDD storage (archive data - 1-7 years)
# Tier 4: Tape/Cloud storage (compliance data - 7+ years)
manage_storage_tiers() {
    local current_date=$(date +%Y-%m-%d)
    # Move data older than 1 year to Tier 2
    find /data/tier1 -name "*.dat" -mtime +365 -exec mv {} /data/tier2/ \;
    # Move data older than 3 years to Tier 3
    find /data/tier2 -name "*.dat" -mtime +1095 -exec mv {} /data/tier3/ \;
    # Compress and move to Tier 4
    find /data/tier3 -name "*.dat" -mtime +2555 | while read file; do
        gzip "$file"
        aws s3 cp "${file}.gz" s3://dw-archive-bucket/$(date +%Y/%m/%d)/
        rm "${file}.gz"
    done
}
4. Data Compression Strategies:
-- Table compression for archival
ALTER TABLE sales_fact_2020 COMPRESS FOR ARCHIVE HIGH;
ALTER TABLE sales_fact_2021 COMPRESS FOR QUERY HIGH;
ALTER TABLE sales_fact_2022 COMPRESS FOR OLTP;
-- Partition-level compression
ALTER TABLE sales_fact
MODIFY PARTITION sales_2020_q1 COMPRESS FOR ARCHIVE HIGH;
-- Columnstore compression for analytics
CREATE COLUMNSTORE INDEX ix_sales_fact_archive
ON sales_fact_archive
(date_key, customer_key, product_key, sales_amount);
5. Compliance and Legal Hold:
-- Legal hold management
CREATE TABLE legal_holds (
hold_id NUMBER PRIMARY KEY,
case_number VARCHAR2(50),
table_name VARCHAR2(100),
hold_criteria VARCHAR2(500),
hold_start_date DATE,
hold_end_date DATE,
status VARCHAR2(20)
);
-- Compliance reporting
CREATE OR REPLACE FUNCTION check_retention_compliance
RETURN VARCHAR2 AS
v_violations NUMBER := 0;
v_status VARCHAR2(20);
BEGIN
    -- Check for data exceeding retention periods
SELECT COUNT(*) INTO v_violations
FROM (
SELECT p.table_name,
p.retention_period_months
FROM
data_retention_policies p
WHERE EXISTS (
SELECT 1 FROM user_tables t
WHERE t.table_name = p.table_name
AND
get_table_oldest_record_age(t.table_name) >
p.retention_period_months
)
);
IF v_violations = 0 THEN
v_status := 'COMPLIANT';
ELSE
v_status := 'VIOLATIONS_FOUND';
END IF;
RETURN v_status;
END;
46. Disaster Recovery Planning
Answer: DR Planning Framework:
1. Business Impact Analysis:
-- Critical system classification
CREATE TABLE dr_system_classification (
system_id NUMBER PRIMARY KEY,
system_name VARCHAR2(100),
    criticality_level VARCHAR2(20),   -- CRITICAL, HIGH, MEDIUM, LOW
    rto_hours NUMBER,                 -- Recovery Time Objective
    rpo_hours NUMBER,                 -- Recovery Point Objective
    business_impact_cost NUMBER,      -- Cost per hour of downtime
dependencies VARCHAR2(500),
recovery_priority NUMBER
);
-- Example classifications
INSERT INTO dr_system_classification VALUES
(1, 'SALES_DW', 'CRITICAL', 4, 1, 50000, 'CRM, ERP, Reporting', 1);
INSERT INTO dr_system_classification VALUES
(2, 'FINANCIAL_DW', 'CRITICAL', 2, 0.5, 100000, 'GL, AP, AR', 1);
2. DR Site Configuration:
#!/bin/bash
# DR site setup
and synchronization
setup_dr_environment() {
local primary_site="$1"
local dr_site="$2"
echo "Setting up DR
environment..."
# Database replication
setup
setup_database_replication "$primary_site" "$dr_site"
    # Application server configuration
setup_application_servers "$dr_site"
# Network configuration
configure_dr_network "$dr_site"
# Storage replication
setup_storage_replication "$primary_site" "$dr_site"
echo "DR environment
setup completed"
}
setup_database_replication() {
local primary="$1"
local standby="$2"
# Oracle Data Guard setup
sqlplus sys/password@$primary as sysdba << EOF
ALTER DATABASE ADD STANDBY LOGFILE GROUP 10
SIZE 100M;
ALTER DATABASE ADD STANDBY LOGFILE GROUP 11
SIZE 100M;
ALTER DATABASE FORCE LOGGING;
-- Create standby controlfile
ALTER DATABASE CREATE STANDBY CONTROLFILE
AS '/backup/standby.ctl';
EOF
    # Copy standby controlfile to DR site
    scp /backup/standby.ctl oracle@$standby:/u01/app/oracle/oradata/
# Start managed recovery
sqlplus sys/password@$standby as sysdba << EOF
STARTUP NOMOUNT;
ALTER DATABASE MOUNT STANDBY DATABASE;
ALTER DATABASE RECOVER MANAGED STANDBY
DATABASE
USING CURRENT LOGFILE DISCONNECT FROM
SESSION;
EOF
}
3. Failover Procedures:
-- Automated failover logic (PL/SQL-style pseudocode: the ALTER/SHUTDOWN steps
-- are shown inline for clarity and would normally run via scripts or dynamic SQL)
CREATE OR REPLACE PROCEDURE execute_failover AS
    v_lag_minutes NUMBER;
    v_primary_status VARCHAR2(20);
BEGIN
    -- Check primary site status
    v_primary_status := check_primary_availability();
    IF v_primary_status = 'UNAVAILABLE' THEN
        -- Log failover initiation
        INSERT INTO dr_log VALUES (
            SYSDATE, 'FAILOVER_INITIATED', 'Primary site unavailable'
        );
        -- Stop log apply on the standby
        ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
        -- Activate the standby database
        ALTER DATABASE ACTIVATE STANDBY DATABASE;
        -- Open the database for read/write
        SHUTDOWN IMMEDIATE;
        STARTUP;
        -- Update application connection strings
        update_connection_strings();
        -- Notify the operations team
        send_notification('DR_FAILOVER_COMPLETED');
        -- Log completion
        INSERT INTO dr_log VALUES (
            SYSDATE, 'FAILOVER_COMPLETED', 'Database activated successfully'
        );
    END IF;
END;
4. DR Testing Procedures:
#!/bin/bash
# DR testing framework
conduct_dr_test() {
    local test_type="$1"   # PLANNED, UNPLANNED, PARTIAL
    local test_id="DR_TEST_$(date +%Y%m%d_%H%M%S)"
    echo "Starting DR test: $test_id ($test_type)"
# Pre-test validations
validate_dr_environment
case "$test_type" in
"PLANNED")
planned_failover_test "$test_id"
;;
"UNPLANNED")
unplanned_failover_test "$test_id"
;;
"PARTIAL")
partial_system_test "$test_id"
;;
esac
# Post-test cleanup
cleanup_test_environment "$test_id"
# Generate test report
generate_test_report "$test_id"
}
planned_failover_test() {
    local test_id="$1"
    # Step 1: Verify replication lag
    local lag=$(check_replication_lag)
    log_test_step "$test_id" "Replication lag check: $lag minutes"
    # Step 2: Perform controlled failover
    execute_controlled_failover
    log_test_step "$test_id" "Controlled failover executed"
    # Step 3: Validate application connectivity
    test_application_connectivity
    log_test_step "$test_id" "Application connectivity tested"
    # Step 4: Run data validation tests
    run_data_validation_tests
    log_test_step "$test_id" "Data validation completed"
    # Step 5: Fail back to primary
    execute_failback_procedure
    log_test_step "$test_id" "Failback to primary completed"
}
5. Recovery Monitoring:
-- DR monitoring dashboard
CREATE VIEW dr_status_dashboard AS
SELECT
'Primary Database' as component,
CASE WHEN database_status = 'OPEN' THEN 'AVAILABLE' ELSE 'UNAVAILABLE' END as status,
last_check_time
FROM primary_db_status
UNION ALL
SELECT
'Standby Database' as component,
CASE WHEN apply_lag < 5 THEN 'SYNCHRONIZED' ELSE 'LAG_DETECTED' END as status,
last_check_time
FROM standby_db_status
UNION ALL
SELECT
'Network
Connectivity' as component,
CASE WHEN ping_response < 100 THEN 'OPTIMAL' ELSE 'DEGRADED' END as status,
last_check_time
FROM network_status;
47. Data Warehouse Documentation
Answer: Documentation Framework:
1. Automated Metadata Documentation:
-- Comprehensive metadata repository
CREATE TABLE dw_documentation (
    doc_id NUMBER PRIMARY KEY,
    object_type VARCHAR2(50),           -- TABLE, VIEW, PROCEDURE, MAPPING
    object_name VARCHAR2(100),
    documentation_type VARCHAR2(50),    -- DESCRIPTION, BUSINESS_RULES, TECHNICAL_SPECS
content CLOB,
created_by VARCHAR2(50),
created_date DATE,
last_updated DATE,
version VARCHAR2(10)
);
-- Auto-generate table documentation
CREATE OR REPLACE PROCEDURE generate_table_documentation AS
BEGIN
FOR table_rec IN (
SELECT table_name,
comments
FROM user_tab_comments
WHERE table_name LIKE '%_FACT' OR table_name LIKE '%_DIM'
) LOOP
-- Generate table overview
INSERT INTO dw_documentation VALUES (
doc_seq.NEXTVAL,
'TABLE',
table_rec.table_name,
'OVERVIEW',
generate_table_overview(table_rec.table_name),
USER,
SYSDATE,
SYSDATE,
'1.0'
);
        -- Generate column specifications
INSERT INTO dw_documentation VALUES (
doc_seq.NEXTVAL,
'TABLE',
table_rec.table_name,
'COLUMN_SPECS',
generate_column_documentation(table_rec.table_name),
USER,
SYSDATE,
SYSDATE,
'1.0'
);
END LOOP;
END;
2. Data Lineage Documentation:
-- Data lineage tracking
CREATE TABLE data_lineage_documentation (
lineage_id NUMBER PRIMARY KEY,
source_system VARCHAR2(100),
source_object VARCHAR2(100),
transformation_logic CLOB,
target_object VARCHAR2(100),
business_rules CLOB,
data_quality_rules CLOB,
created_by VARCHAR2(50),
documentation_date DATE
);
-- Generate lineage documentation
CREATE OR REPLACE FUNCTION generate_lineage_doc(p_target_table VARCHAR2)
RETURN CLOB AS
v_lineage_doc CLOB;
BEGIN
SELECT
'SOURCE: ' || source_system || CHR(10) ||
'TRANSFORMATION: ' ||
transformation_logic || CHR(10) ||
'BUSINESS RULES: ' || business_rules || CHR(10) ||
'DATA QUALITY: ' || data_quality_rules
INTO v_lineage_doc
FROM
data_lineage_documentation
WHERE target_object = p_target_table;
RETURN v_lineage_doc;
END;
3. Business Glossary:
-- Business terminology management
CREATE TABLE business_glossary (
term_id NUMBER PRIMARY KEY,
business_term VARCHAR2(100),
definition CLOB,
business_context VARCHAR2(200),
technical_implementation VARCHAR2(200),
related_tables VARCHAR2(500),
business_owner VARCHAR2(50),
technical_owner VARCHAR2(50),
approval_status VARCHAR2(20),
effective_date DATE
);
-- Example entries
INSERT INTO business_glossary VALUES (
    1,
    'Customer Lifetime Value',
    'Total predicted revenue from a customer over their entire relationship',
    'Marketing and Sales Analytics',
    'Calculated in customer_metrics_fact table using predictive algorithms',
    'customer_dim, sales_fact, customer_metrics_fact',
    'VP Marketing',
    'Data Team Lead',
    'APPROVED',
    DATE '2024-01-01'
);
4. Technical Documentation Generation:
#!/bin/bash
# Automated documentation generation
generate_technical_docs() {
    local output_dir="/docs/datawarehouse"
    local timestamp=$(date +%Y%m%d)
    mkdir -p "$output_dir"
    # Database schema documentation
    generate_schema_docs "$output_dir"
    # ETL process documentation
    generate_etl_docs "$output_dir"
    # Data quality documentation
    generate_dq_docs "$output_dir"
    # Performance metrics documentation
    generate_performance_docs "$output_dir"
    # Create master document
    create_master_document "$output_dir" "$timestamp"
    echo "Documentation generated in $output_dir"
}
generate_schema_docs() {
local output_dir="$1"
    # Extract schema information
    sqlplus -s user/pass@db << 'EOF' > "$output_dir/schema_documentation.html"
SET PAGESIZE 0
SET MARKUP HTML ON
SELECT
    '<h2>Table: ' || table_name || '</h2>' ||
    '<p>Comments: ' || NVL(comments, 'No comments') || '</p>' ||
    '<table border="1">' ||
    '<tr><th>Column</th><th>Data Type</th><th>Nullable</th><th>Comments</th></tr>'
FROM user_tab_comments
WHERE table_name LIKE '%_FACT' OR table_name LIKE '%_DIM';
SELECT
    '<tr><td>' || column_name || '</td>' ||
    '<td>' || data_type ||
        CASE WHEN data_length IS NOT NULL THEN '(' || data_length || ')' END ||
    '</td><td>' || nullable || '</td>' ||
    '<td>' || NVL(comments, '') || '</td></tr>'
FROM user_col_comments c
JOIN user_tab_columns t ON c.table_name = t.table_name AND c.column_name = t.column_name
WHERE c.table_name LIKE '%_FACT' OR c.table_name LIKE '%_DIM'
ORDER BY c.table_name, t.column_id;
SELECT '</table><br/>' FROM dual;
EOF
}
5. Documentation Portal:
<!-- Data Warehouse Documentation Portal -->
<!DOCTYPE html>
<html>
<head>
<title>Data Warehouse Documentation</title>
<style>
.nav-menu { background: #f0f0f0; padding: 10px; }
.content-area { padding: 20px; }
.metadata-table { border-collapse: collapse; width: 100%; }
.metadata-table th, .metadata-table td { border: 1px solid #ddd; padding: 8px; }
</style>
</head>
<body>
<div class="nav-menu">
<a href="#schema">Schema Documentation</a> |
<a href="#etl">ETL Processes</a> |
<a href="#glossary">Business Glossary</a> |
<a href="#lineage">Data Lineage</a> |
<a href="#quality">Data Quality</a>
</div>
<div class="content-area">
<h1>Enterprise Data Warehouse Documentation</h1>
<p>Last Updated: <span id="lastUpdated">{{TIMESTAMP}}</span></p>
<div id="schema">
<h2>Schema Documentation</h2>
{{SCHEMA_CONTENT}}
</div>
<div id="etl">
<h2>ETL Process Documentation</h2>
{{ETL_CONTENT}}
</div>
<div id="glossary">
<h2>Business Glossary</h2>
{{GLOSSARY_CONTENT}}
</div>
</div>
</body>
</html>
48. Capacity Planning and Scaling
Answer: Capacity Planning Framework:
1. Growth Analysis and Forecasting:
-- Historical growth analysis
CREATE VIEW data_growth_analysis AS
SELECT
table_name,
EXTRACT(YEAR FROM sample_date) as year,
EXTRACT(MONTH FROM sample_date) as month,
row_count,
size_mb,
LAG(row_count) OVER (PARTITION BY table_name ORDER BY sample_date) as prev_row_count,
LAG(size_mb) OVER (PARTITION BY table_name ORDER BY sample_date) as prev_size_mb,
(row_count - LAG(row_count) OVER (PARTITION BY table_name ORDER BY sample_date)) /
LAG(row_count) OVER (PARTITION BY table_name ORDER BY sample_date) * 100 as growth_rate_pct
FROM table_size_history
WHERE sample_date >= ADD_MONTHS(SYSDATE, -24);
-- Predictive growth modeling
WITH monthly_growth AS (
SELECT
table_name,
AVG(growth_rate_pct) as avg_monthly_growth,
STDDEV(growth_rate_pct) as growth_volatility
FROM
data_growth_analysis
WHERE growth_rate_pct IS NOT NULL
GROUP BY table_name
),
current_size AS (
SELECT
table_name,
row_count as current_rows,
size_mb as current_size_mb
FROM table_size_history
WHERE sample_date = (SELECT MAX(sample_date) FROM table_size_history)
)
SELECT
c.table_name,
c.current_size_mb,
ROUND(c.current_size_mb * POWER(1 +
g.avg_monthly_growth/100, 12), 2) as
projected_size_12_months,
ROUND(c.current_size_mb * POWER(1 +
g.avg_monthly_growth/100, 24), 2) as
projected_size_24_months,
CASE
WHEN
g.avg_monthly_growth > 10 THEN 'HIGH_GROWTH'
WHEN
g.avg_monthly_growth > 5 THEN 'MODERATE_GROWTH'
ELSE 'STABLE'
END as
growth_classification
FROM current_size c
JOIN monthly_growth g ON c.table_name = g.table_name;
2. Resource Utilization Monitoring:
#!/bin/bash
# Comprehensive resource monitoring
monitor_system_resources() {
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    local log_file="/logs/capacity_monitoring.csv"
    # CPU utilization
    local cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
    # Memory utilization
    local memory_total=$(free -m | awk 'NR==2{printf "%.0f", $2}')
    local memory_used=$(free -m | awk 'NR==2{printf "%.0f", $3}')
    local memory_pct=$(echo "scale=2; $memory_used * 100 / $memory_total" | bc)
    # Disk utilization
    local disk_usage=$(df -h / | awk 'NR==2{print $5}' | cut -d'%' -f1)
    # Database connections
    local db_connections=$(netstat -an | grep :1521 | grep ESTABLISHED | wc -l)
    # ETL process count
    local etl_processes=$(ps aux | grep -c informatica)
    # Network throughput
    local network_rx=$(cat /proc/net/dev | grep eth0 | awk '{print $2}')
    local network_tx=$(cat /proc/net/dev | grep eth0 | awk '{print $10}')
    # Log all metrics
    echo "$timestamp,$cpu_usage,$memory_pct,$disk_usage,$db_connections,$etl_processes,$network_rx,$network_tx" >> "$log_file"
    # Check thresholds and alert
    check_capacity_thresholds "$cpu_usage" "$memory_pct" "$disk_usage"
}
check_capacity_thresholds() {
local cpu="$1"
local memory="$2"
local disk="$3"
# Define thresholds
local cpu_threshold=80
local memory_threshold=85
local disk_threshold=85
# Check and alert
if (( $(echo "$cpu > $cpu_threshold" | bc -l) )); then
send_capacity_alert "CPU" "$cpu" "$cpu_threshold"
fi
if (( $(echo "$memory > $memory_threshold" | bc -l) )); then
send_capacity_alert "Memory" "$memory" "$memory_threshold"
fi
if (( $(echo "$disk > $disk_threshold" | bc -l) )); then
send_capacity_alert "Disk" "$disk" "$disk_threshold"
fi
}
3. Scaling Strategies:
-- Horizontal scaling implementation
CREATE OR REPLACE PROCEDURE implement_horizontal_scaling AS
BEGIN
    -- Partition large tables by date
FOR table_rec IN (
SELECT table_name,
estimated_size_gb
FROM capacity_analysis
WHERE estimated_size_gb > 100
AND partitioning_status
= 'NOT_PARTITIONED'
) LOOP
        -- Create partitioned version
create_partitioned_table(table_rec.table_name);
-- Migrate data
migrate_to_partitioned_table(table_rec.table_name);
-- Update statistics
gather_partition_statistics(table_rec.table_name);
END LOOP;
END;
-- Vertical scaling assessment
CREATE OR REPLACE FUNCTION assess_vertical_scaling_needs
RETURN VARCHAR2 AS
v_recommendation VARCHAR2(1000);
v_cpu_avg NUMBER;
v_memory_avg NUMBER;
v_io_wait_avg NUMBER;
BEGIN
    -- Analyze resource utilization trends
SELECT
AVG(cpu_utilization),
AVG(memory_utilization),
AVG(io_wait_pct)
INTO v_cpu_avg,
v_memory_avg, v_io_wait_avg
FROM
system_performance_log
WHERE log_date >= SYSDATE - 30;
    -- Generate recommendations
    IF v_cpu_avg > 80 THEN
        v_recommendation := v_recommendation || 'CPU upgrade recommended. ';
    END IF;
    IF v_memory_avg > 85 THEN
        v_recommendation := v_recommendation || 'Memory expansion needed. ';
    END IF;
    IF v_io_wait_avg > 20 THEN
        v_recommendation := v_recommendation || 'Storage performance upgrade required. ';
    END IF;
    RETURN NVL(v_recommendation, 'Current capacity adequate');
END;
4. Cloud Scaling Automation:
#!/bin/bash
# Cloud auto-scaling implementation
implement_cloud_autoscaling() {
    local cloud_provider="$1"   # AWS, Azure, GCP
case "$cloud_provider" in
"AWS")
setup_aws_autoscaling
;;
"Azure")
setup_azure_autoscaling
;;
"GCP")
setup_gcp_autoscaling
;;
esac
}
setup_aws_autoscaling() {
    # Create launch template
    aws ec2 create-launch-template \
        --launch-template-name dw-template \
        --launch-template-data '{
            "ImageId": "ami-12345678",
            "InstanceType": "r5.xlarge",
            "SecurityGroupIds": ["sg-12345678"],
            "UserData": "'"$(base64 < /scripts/dw-setup.sh)"'"
        }'
    # Create auto scaling group
    aws autoscaling create-auto-scaling-group \
        --auto-scaling-group-name dw-asg \
        --launch-template "LaunchTemplateName=dw-template,Version=1" \
        --min-size 2 \
        --max-size 10 \
        --desired-capacity 4 \
        --target-group-arns "arn:aws:elasticloadbalancing:region:account:targetgroup/dw-targets"
    # Create scaling policies
    aws autoscaling put-scaling-policy \
        --auto-scaling-group-name dw-asg \
        --policy-name scale-up \
        --policy-type TargetTrackingScaling \
        --target-tracking-configuration '{
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            }
        }'
}
5. Capacity Planning Dashboard:
-- Capacity planning metrics view
CREATE VIEW capacity_planning_dashboard AS
SELECT
    'Current Storage Usage' AS metric,
    ROUND(SUM(size_gb), 2) AS current_value,
    'GB' AS unit,
    ROUND(SUM(projected_size_12m), 2) AS projected_12m,
    ROUND((SUM(projected_size_12m) - SUM(size_gb)) / SUM(size_gb) * 100, 1) AS growth_pct
FROM table_capacity_analysis
UNION ALL
SELECT
    'Database Connections',
    AVG(connection_count),
    'connections',
    AVG(connection_count) * 1.5,
    50.0
FROM connection_usage_stats
WHERE sample_date >= SYSDATE - 7
UNION ALL
SELECT
    'ETL Processing Time',
    AVG(duration_hours),
    'hours',
    AVG(duration_hours) * 1.2,
    20.0
FROM etl_performance_stats
WHERE execution_date >= SYSDATE - 30;
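Reporting tools can query the view directly. A minimal ad-hoc sketch that flags metrics projected to grow faster than 30% (the threshold is illustrative):
SELECT metric, current_value, unit, projected_12m, growth_pct
FROM capacity_planning_dashboard
WHERE growth_pct > 30
ORDER BY growth_pct DESC;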
49. Data Warehouse Migration Strategies
Answer: Migration Planning Framework:
1. Migration Assessment:
-- Source system analysis
CREATE TABLE migration_assessment (
    assessment_id          NUMBER PRIMARY KEY,
    source_system          VARCHAR2(100),
    object_type            VARCHAR2(50),
    object_name            VARCHAR2(100),
    object_size_gb         NUMBER,
    complexity_score       NUMBER,        -- 1-10 scale
    business_criticality   VARCHAR2(20),
    migration_priority     NUMBER,
    estimated_effort_hours NUMBER,
    dependencies           VARCHAR2(500),
    risk_factors           VARCHAR2(500)
);
-- Complexity scoring
CREATE OR REPLACE FUNCTION calculate_migration_complexity(
    p_object_name VARCHAR2,
    p_object_type VARCHAR2
) RETURN NUMBER AS
    v_complexity       NUMBER := 1;
    v_column_count     NUMBER;
    v_index_count      NUMBER;
    v_constraint_count NUMBER;
BEGIN
    IF p_object_type = 'TABLE' THEN
        -- Factor in table characteristics
        SELECT COUNT(*) INTO v_column_count
        FROM user_tab_columns WHERE table_name = p_object_name;
        SELECT COUNT(*) INTO v_index_count
        FROM user_indexes WHERE table_name = p_object_name;
        SELECT COUNT(*) INTO v_constraint_count
        FROM user_constraints WHERE table_name = p_object_name;
        -- Calculate complexity score (capped at 10)
        v_complexity := LEAST(10,
            1 +
            CASE WHEN v_column_count > 50    THEN 2 ELSE 0 END +
            CASE WHEN v_index_count > 10     THEN 2 ELSE 0 END +
            CASE WHEN v_constraint_count > 5 THEN 2 ELSE 0 END
        );
    END IF;
    RETURN v_complexity;
END;
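The scoring function can seed the assessment table in bulk from the data dictionary. A minimal sketch, assuming a migration_assessment_seq sequence exists and using 'LEGACY_DW' as an illustrative source system name:
-- Seed the assessment inventory; sizes come from user_segments
INSERT INTO migration_assessment (
    assessment_id, source_system, object_type, object_name,
    object_size_gb, complexity_score
)
SELECT migration_assessment_seq.NEXTVAL,
       'LEGACY_DW',
       'TABLE',
       s.segment_name,
       ROUND(s.bytes / POWER(1024, 3), 2),
       calculate_migration_complexity(s.segment_name, 'TABLE')
FROM user_segments s
WHERE s.segment_type = 'TABLE';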
2. Migration Strategies:
Big Bang Migration:
#!/bin/bash
# Complete cutover migration
execute_bigbang_migration() {
    local cutover_date="$1"
    local source_env="$2"
    local target_env="$3"
    echo "Starting Big Bang migration on $cutover_date"
    # Phase 1: Final data sync
    echo "Phase 1: Final data synchronization"
    sync_final_data "$source_env" "$target_env"
    # Phase 2: Switch off source systems
    echo "Phase 2: Source system shutdown"
    shutdown_source_systems "$source_env"
    # Phase 3: Final validation
    echo "Phase 3: Data validation"
    validate_migrated_data "$target_env"
    # Phase 4: Cutover to target
    echo "Phase 4: Application cutover"
    cutover_applications "$target_env"
    # Phase 5: Go-live verification
    echo "Phase 5: Go-live verification"
    verify_golive_success "$target_env"
    echo "Big Bang migration completed"
}
Phased Migration:
#!/bin/bash
# Gradual migration approach
execute_phased_migration() {
    local migration_phases=(
        "reference_data:PHASE1"
        "historical_data:PHASE2"
        "current_transactions:PHASE3"
        "real_time_processes:PHASE4"
    )
    for phase in "${migration_phases[@]}"; do
        local data_type=$(echo "$phase" | cut -d':' -f1)
        local phase_name=$(echo "$phase" | cut -d':' -f2)
        echo "Starting $phase_name: $data_type migration"
        # Execute phase-specific migration
        migrate_data_category "$data_type"
        # Validate phase completion
        validate_phase_completion "$data_type"
        # User acceptance testing
        conduct_uat "$phase_name"
        echo "$phase_name completed successfully"
        sleep 3600   # 1 hour between phases
    done
}
3. Data Migration Tools:
-- Migration monitoring and control
CREATE TABLE migration_control (
    migration_id      NUMBER PRIMARY KEY,
    source_object     VARCHAR2(100),
    target_object     VARCHAR2(100),
    migration_method  VARCHAR2(50),   -- BULK_COPY, INCREMENTAL, STREAMING
    start_time        TIMESTAMP,
    end_time          TIMESTAMP,
    records_migrated  NUMBER,
    migration_status  VARCHAR2(20),
    error_count       NUMBER,
    validation_status VARCHAR2(20)
);
-- Migration execution procedure
CREATE OR REPLACE PROCEDURE execute_table_migration(
    p_source_table     VARCHAR2,
    p_target_table     VARCHAR2,
    p_migration_method VARCHAR2
) AS
    v_migration_id     NUMBER;
    v_start_time       TIMESTAMP := SYSTIMESTAMP;
    v_records_migrated NUMBER := 0;
BEGIN
    -- Initialize migration tracking
    INSERT INTO migration_control (
        migration_id, source_object, target_object,
        migration_method, start_time, migration_status
    ) VALUES (
        migration_seq.NEXTVAL, p_source_table, p_target_table,
        p_migration_method, v_start_time, 'RUNNING'
    ) RETURNING migration_id INTO v_migration_id;
    -- Execute migration based on method; each copy routine hands the row
    -- count back through an OUT parameter
    CASE p_migration_method
        WHEN 'BULK_COPY' THEN
            execute_bulk_copy(p_source_table, p_target_table, v_records_migrated);
        WHEN 'INCREMENTAL' THEN
            execute_incremental_copy(p_source_table, p_target_table, v_records_migrated);
        WHEN 'STREAMING' THEN
            execute_streaming_copy(p_source_table, p_target_table, v_records_migrated);
    END CASE;
    -- Update completion status
    UPDATE migration_control
       SET end_time = SYSTIMESTAMP,
           migration_status = 'COMPLETED',
           records_migrated = v_records_migrated
     WHERE migration_id = v_migration_id;
EXCEPTION
    WHEN OTHERS THEN
        UPDATE migration_control
           SET end_time = SYSTIMESTAMP,
               migration_status = 'FAILED',
               error_count = 1
         WHERE migration_id = v_migration_id;
        RAISE;
END;
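The copy routines referenced above are not shown in the original; a minimal sketch of execute_bulk_copy under the same assumptions (direct-path insert, row count returned through the OUT parameter):
CREATE OR REPLACE PROCEDURE execute_bulk_copy(
    p_source_table VARCHAR2,
    p_target_table VARCHAR2,
    p_rows_copied  OUT NUMBER
) AS
BEGIN
    -- Direct-path insert of the full table contents
    EXECUTE IMMEDIATE
        'INSERT /*+ APPEND */ INTO ' || p_target_table ||
        ' SELECT * FROM ' || p_source_table;
    p_rows_copied := SQL%ROWCOUNT;
    COMMIT;
END;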
4. Cloud Migration:
#!/bin/bash
# Cloud-specific migration procedures
migrate_to_cloud() {
    local cloud_provider="$1"
    local migration_type="$2"   # lift_and_shift, re_architect, hybrid
    case "$cloud_provider" in
        "AWS")
            migrate_to_aws "$migration_type"
            ;;
        "Azure")
            migrate_to_azure "$migration_type"
            ;;
        "GCP")
            migrate_to_gcp "$migration_type"
            ;;
    esac
}
migrate_to_aws() {
    local migration_type="$1"
    case "$migration_type" in
        "lift_and_shift")
            # Direct migration to EC2
            aws dms create-replication-instance \
                --replication-instance-identifier dw-migration \
                --replication-instance-class dms.r5.xlarge
            aws dms create-endpoint \
                --endpoint-identifier source-endpoint \
                --endpoint-type source \
                --engine-name oracle
            aws dms create-endpoint \
                --endpoint-identifier target-endpoint \
                --endpoint-type target \
                --engine-name aurora-postgresql
            ;;
        "re_architect")
            # Migration to cloud-native services
            migrate_to_redshift
            setup_glue_etl_jobs
            configure_quicksight_reporting
            ;;
    esac
}
migrate_to_redshift() {
    # Create Redshift cluster
    aws redshift create-cluster \
        --cluster-identifier dw-redshift \
        --node-type ra3.xlplus \
        --number-of-nodes 3 \
        --master-username dwadmin \
        --master-user-password SecurePassword123
    # Setup data pipeline
    aws datapipeline create-pipeline \
        --name dw-migration-pipeline \
        --unique-id dw-migration-$(date +%s)
}
5. Rollback and Contingency:
-- Rollback procedures
CREATE OR REPLACE PROCEDURE execute_migration_rollback(
    p_migration_id NUMBER
) AS
    v_source_object VARCHAR2(100);
    v_target_object VARCHAR2(100);
    v_backup_table  VARCHAR2(200);
BEGIN
    -- Get migration details
    SELECT source_object, target_object
    INTO v_source_object, v_target_object
    FROM migration_control
    WHERE migration_id = p_migration_id;
    -- Pre-migration backup copy kept in the MIGRATION_BKP schema
    v_backup_table := 'migration_bkp.' || v_source_object || '_' || p_migration_id;
    -- Drop the migrated target table
    EXECUTE IMMEDIATE 'DROP TABLE ' || v_target_object;
    -- Restore the original data from the backup copy
    EXECUTE IMMEDIATE 'CREATE TABLE ' || v_target_object ||
                      ' AS SELECT * FROM ' || v_backup_table;
    -- Update rollback status
    UPDATE migration_control
       SET migration_status = 'ROLLED_BACK',
           end_time = SYSTIMESTAMP
     WHERE migration_id = p_migration_id;
    -- Log rollback
    INSERT INTO migration_log VALUES (
        SYSDATE, p_migration_id, 'ROLLBACK_COMPLETED',
        'Migration rolled back successfully'
    );
END;
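The rollback relies on a pre-migration snapshot being taken. A minimal sketch of that backup step, assuming a dedicated MIGRATION_BKP schema (the schema name and naming pattern are illustrative):
CREATE OR REPLACE PROCEDURE create_premigration_backup(
    p_migration_id NUMBER,
    p_source_table VARCHAR2
) AS
BEGIN
    -- Snapshot the source table into the backup schema, keyed by migration_id
    EXECUTE IMMEDIATE
        'CREATE TABLE migration_bkp.' || p_source_table || '_' || p_migration_id ||
        ' AS SELECT * FROM ' || p_source_table;
END;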
50. Future Trends in Data Warehousing
Answer: Emerging Technologies and Trends:
1. Cloud-Native Data Warehousing:
-- Serverless data warehouse example (Snowflake-style)
CREATE WAREHOUSE compute_wh
WITH
    WAREHOUSE_SIZE = 'X-SMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE
    COMMENT = 'Auto-scaling compute for ETL workloads';
-- Elastic scaling
ALTER WAREHOUSE compute_wh SET WAREHOUSE_SIZE = 'LARGE';
-- Multi-cluster warehouses for concurrency
CREATE WAREHOUSE reporting_wh
WITH
    WAREHOUSE_SIZE = 'MEDIUM'
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 10
    SCALING_POLICY = 'STANDARD'
    AUTO_SUSPEND = 300;
2. Real-Time Analytics and Streaming:
-- Streaming data integration (Snowflake-style change stream)
CREATE STREAM customer_changes_stream
ON TABLE customers;
-- Real-time materialized views
CREATE MATERIALIZED VIEW real_time_sales_summary
BUILD IMMEDIATE
REFRESH FAST ON COMMIT
AS
SELECT
    product_category,
    sale_date,
    SUM(sales_amount) AS daily_sales,
    COUNT(*) AS transaction_count
FROM sales_fact
WHERE sale_date >= TRUNC(SYSDATE)
GROUP BY product_category, sale_date;
-- Change data capture integration
CREATE OR REPLACE PROCEDURE process_streaming_changes AS
BEGIN
    -- Process changes from the stream
    INSERT INTO sales_fact_realtime
    SELECT * FROM sales_changes_stream
    WHERE change_type = 'INSERT';
    -- Update real-time aggregates
    DBMS_MVIEW.REFRESH('real_time_sales_summary');
END;
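Consuming a change stream is typically a MERGE that reads the pending changes and advances the stream offset. A minimal Snowflake-style sketch, assuming a customer_dim target with customer_id, customer_name and address columns (names are illustrative):
MERGE INTO customer_dim d
USING customer_changes_stream s
    ON d.customer_id = s.customer_id
WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    UPDATE SET d.customer_name = s.customer_name,
               d.address       = s.address
WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    INSERT (customer_id, customer_name, address)
    VALUES (s.customer_id, s.customer_name, s.address);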
3. Machine Learning Integration:
-- ML-powered data quality
CREATE OR REPLACE FUNCTION ml_data_quality_check(
    p_table_name  VARCHAR2,
    p_column_name VARCHAR2
) RETURN VARCHAR2 AS
    v_anomaly_score  NUMBER;
    v_quality_status VARCHAR2(20);
BEGIN
    -- Use an ML scoring function to detect anomalies; dynamic SQL is needed
    -- because the table and column names arrive as parameters
    EXECUTE IMMEDIATE
        'SELECT ml_anomaly_detection(' || p_column_name || ') FROM ' || p_table_name
        INTO v_anomaly_score;
    v_quality_status := CASE
        WHEN v_anomaly_score > 0.8 THEN 'ANOMALY_DETECTED'
        WHEN v_anomaly_score > 0.6 THEN 'QUALITY_WARNING'
        ELSE 'QUALITY_OK'
    END;
    RETURN v_quality_status;
END;
-- Predictive analytics in DW
CREATE TABLE ml_predictions (
    prediction_id    NUMBER PRIMARY KEY,
    model_name       VARCHAR2(100),
    input_features   CLOB,
    prediction_value NUMBER,
    confidence_score NUMBER,
    prediction_date  TIMESTAMP,
    actual_value     NUMBER,
    model_accuracy   NUMBER
);
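Once actual values are written back into ml_predictions, model drift can be monitored with plain SQL; a minimal sketch computing mean absolute percentage error (MAPE) per model over the last 30 days:
SELECT model_name,
       COUNT(*) AS scored_predictions,
       ROUND(AVG(ABS(prediction_value - actual_value) /
                 NULLIF(ABS(actual_value), 0)) * 100, 2) AS mape_pct,
       ROUND(AVG(confidence_score), 2) AS avg_confidence
FROM ml_predictions
WHERE prediction_date >= SYSDATE - 30
  AND actual_value IS NOT NULL
GROUP BY model_name
ORDER BY mape_pct;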
4. Data Mesh Architecture:
#!/bin/bash
# Data mesh implementation
setup_data_mesh() {
    local domains=("sales" "marketing" "finance" "operations")
    for domain in "${domains[@]}"; do
        echo "Setting up data mesh for $domain domain"
        # Create domain-specific data products
        create_data_products "$domain"
        # Setup domain ownership
        assign_domain_ownership "$domain"
        # Implement self-serve data platform
        setup_selfserve_platform "$domain"
        # Configure federated governance
        configure_federated_governance "$domain"
    done
}
create_data_products() {
    local domain="$1"
    # Create domain-specific schemas
    sqlplus -s admin/pass@db << EOF
CREATE SCHEMA ${domain}_data_products;
-- Domain-specific fact tables
CREATE TABLE ${domain}_data_products.${domain}_facts (
    fact_id NUMBER PRIMARY KEY,
    date_key NUMBER,
    domain_specific_metrics JSON,
    created_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Domain-specific APIs
CREATE OR REPLACE PACKAGE ${domain}_data_api AS
    FUNCTION get_metrics(p_date_range VARCHAR2) RETURN SYS_REFCURSOR;
    PROCEDURE publish_data_product(p_product_spec CLOB);
END;
/
EOF
}
5. Graph Analytics and Knowledge Graphs:
-- Graph-based data relationships
CREATE TABLE entity_relationships (
    relationship_id       NUMBER PRIMARY KEY,
    source_entity_type    VARCHAR2(50),
    source_entity_id      VARCHAR2(100),
    relationship_type     VARCHAR2(50),
    target_entity_type    VARCHAR2(50),
    target_entity_id      VARCHAR2(100),
    relationship_strength NUMBER,
    created_date          TIMESTAMP
);
-- Graph traversal queries (ANSI-style recursive CTE)
WITH RECURSIVE entity_network AS (
    -- Starting entities
    SELECT
        source_entity_id AS entity_id,
        source_entity_type AS entity_type,
        0 AS depth,
        CAST(source_entity_id AS VARCHAR(1000)) AS path
    FROM entity_relationships
    WHERE source_entity_id = 'CUSTOMER_12345'
    UNION ALL
    -- Connected entities up to three hops away
    SELECT
        er.target_entity_id,
        er.target_entity_type,
        en.depth + 1,
        en.path || ' -> ' || er.target_entity_id
    FROM entity_relationships er
    JOIN entity_network en ON er.source_entity_id = en.entity_id
    WHERE en.depth < 3
)
SELECT * FROM entity_network;
6. Quantum Computing Preparation:
-- Quantum-ready data structures
CREATE TABLE quantum_optimized_data (
    quantum_state_id            NUMBER PRIMARY KEY,
    classical_data              CLOB,
    quantum_encoding            BLOB,
    entanglement_metadata       JSON,
    quantum_algorithm_hint      VARCHAR2(100),
    processing_complexity_class VARCHAR2(50)
);
-- Prepare for quantum speedup in specific algorithms
CREATE OR REPLACE PACKAGE quantum_algorithms AS
    -- Quantum database search preparation
    FUNCTION prepare_grover_search(p_dataset_id NUMBER) RETURN CLOB;
    -- Quantum machine learning data prep
    FUNCTION prepare_qml_dataset(p_features CLOB) RETURN BLOB;
END;
7. Autonomous Data Warehouse:
-- Self-tuning capabilities
CREATE TABLE autonomous_tuning_log (
    tuning_id                   NUMBER PRIMARY KEY,
    detected_issue              VARCHAR2(200),
    recommended_action          CLOB,
    action_taken                CLOB,
    performance_improvement_pct NUMBER,
    tuning_timestamp            TIMESTAMP,
    auto_applied                CHAR(1)
);
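A simple review query keeps the autonomous tuning auditable; a minimal sketch reporting average improvement by issue type and how often fixes were auto-applied:
SELECT detected_issue,
       COUNT(*) AS tuning_events,
       ROUND(AVG(performance_improvement_pct), 1) AS avg_improvement_pct,
       SUM(CASE WHEN auto_applied = 'Y' THEN 1 ELSE 0 END) AS auto_applied_count
FROM autonomous_tuning_log
WHERE tuning_timestamp >= SYSDATE - 30
GROUP BY detected_issue
ORDER BY avg_improvement_pct DESC;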
-- Self-healing procedures
CREATE OR REPLACE PROCEDURE autonomous_optimization AS
BEGIN
    -- Identify performance issues and apply automatic fixes
    FOR query_rec IN (
        SELECT sql_text, avg_execution_time
        FROM query_performance_log
        WHERE avg_execution_time > 10000   -- 10 seconds (in milliseconds)
    ) LOOP
        -- Auto-create missing indexes
        auto_create_recommended_indexes(query_rec.sql_text);
        -- Auto-update statistics
        auto_update_table_statistics(query_rec.sql_text);
        -- Auto-tune SQL
        auto_tune_sql_statement(query_rec.sql_text);
    END LOOP;
END;
8. Sustainability and Green Computing:
#!/bin/bash
# Green data warehouse practices
optimize_energy_efficiency() {
    echo "Implementing green DW practices..."
    # Workload scheduling during off-peak hours
    schedule_low_priority_jobs_off_peak
    # Auto-shutdown of non-critical resources
    implement_auto_shutdown_policies
    # Optimize query efficiency to reduce compute
    analyze_and_optimize_resource_intensive_queries
    # Carbon footprint tracking
    track_carbon_footprint
    echo "Green optimization completed"
}
track_carbon_footprint() {
    local cpu_hours=$(get_monthly_cpu_hours)
    local storage_gb=$(get_monthly_storage_usage)
    local network_gb=$(get_monthly_network_transfer)
    # Calculate carbon footprint (example formula with illustrative emission factors)
    local carbon_kg=$(echo "scale=2; ($cpu_hours * 0.5) + ($storage_gb * 0.001) + ($network_gb * 0.01)" | bc)
    echo "Monthly carbon footprint: ${carbon_kg} kg CO2"
    # Log for sustainability reporting
    log_sustainability_metrics "$carbon_kg"
}
Future Technology Integration Roadmap:
• 2024-2025: Cloud-native adoption, real-time analytics
• 2026-2027: AI/ML integration, autonomous optimization
• 2028-2029: Quantum computing pilot projects
• 2030+: Fully autonomous, sustainable data ecosystems
Each topic represents the evolution of data warehousing toward more intelligent, autonomous, and sustainable platforms.