Informatica Q&A

Informatica - Comprehensive Q&A

Transformations & Core Concepts

1. What are the transformations you have worked on?

Answer: Active Transformations (can change the number of rows):

           Source Qualifier: Data extraction and basic filtering

           Filter: Remove unwanted rows based on conditions

           Router: Route data to multiple targets based on conditions

           Aggregator: Perform calculations (SUM, COUNT, AVG, etc.)

           Rank: Rank data and return top/bottom records

           Sorter: Sort data before processing (active because it can also remove duplicates)

           Joiner: Join data from multiple sources

           Union: Combine data from multiple pipelines

           Normalizer: Convert denormalized data into normalized rows

           Update Strategy: Flag rows for insert/update/delete/reject (active because DD_REJECT rows are dropped)

           SQL: Execute SQL queries within the mapping (active or passive depending on configuration)

Passive Transformations (same number of rows):

           Expression: Perform calculations and data manipulation

           Lookup: Reference data from tables/files (passive by default; can be configured as active)

           Sequence Generator: Generate unique sequence numbers

           Stored Procedure: Call database stored procedures

           External Procedure: Call external programs

2. Difference between Transaction Control and Hierarchy Parsing

Answer:

Transaction Control:

           Purpose: Controls commit/rollback of transactions

           Usage: Batch processing, data integrity

           Commands: TC_COMMIT_BEFORE, TC_COMMIT_AFTER, TC_ROLLBACK_BEFORE, TC_ROLLBACK_AFTER

           Scope: Database transaction management

           Performance: Controls transaction size to manage memory

Hierarchy Parsing:

           Purpose: Parse hierarchical data structures

           Usage: XML, JSON, nested data processing

           Functions: XML parsing, path expressions

           Scope: Data structure transformation

           Performance: Handles complex nested data

Transaction Control Example:

-- Commit every 1,000 rows; v_row_count is a running counter maintained in an upstream Expression
IIF(MOD(v_row_count, 1000) = 0, TC_COMMIT_AFTER, TC_CONTINUE_TRANSACTION)

3. SCD Type 2 Implementation Scenarios

Scenario: Customer address change tracking

Implementation Steps:

1.         Lookup existing records using business key

2.         Compare current vs incoming data

3.         Insert new record for changes

4.         Update old record with end date and current flag

Mapping Components:

           Source: Customer updates

           Lookup: Existing customer dimension

           Expression: Compare and flag changes

           Router: Separate inserts vs updates

           Update Strategy: Set operation flags

           Target: Customer dimension table

Expression Transformation Logic:

-- Change detection
v_changed = IIF(
    CUSTOMER_NAME != lkp_CUSTOMER_NAME OR
    ADDRESS != lkp_ADDRESS OR
    PHONE != lkp_PHONE, 1, 0
)

-- Set operation type
IIF(ISNULL(lkp_CUSTOMER_KEY), DD_INSERT,
    IIF(v_changed = 1, DD_INSERT, DD_REJECT))
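
For reconciliation, the same change detection can be expressed in SQL against the dimension. A minimal sketch, assuming staging and dimension tables named STAGING_CUSTOMERS and CUSTOMER_DIM (names are examples, not from the mapping above):

-- Customers whose incoming attributes differ from the current dimension row
-- NOTE: wrap nullable columns in NVL() so NULL-to-value changes are also detected
SELECT s.customer_id
FROM   staging_customers s
JOIN   customer_dim d
  ON   d.customer_id  = s.customer_id
 AND   d.current_flag = 'Y'
WHERE  s.customer_name <> d.customer_name
   OR  s.address       <> d.address
   OR  s.phone         <> d.phone;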

4. List of transformations where you have used SCD Type 2

Answer:

1.         Lookup Transformation: To find existing records

2.         Expression Transformation: Date calculations and flagging

3.         Router Transformation: Separate new vs changed records

4.         Update Strategy: Set DD_INSERT, DD_UPDATE flags

5.         Sequence Generator: Generate surrogate keys

6.         Union Transformation: Combine insert and update flows

Advanced SCD Type 2 with Multiple Mappings:

           Mapping 1: Handle inserts for new records

           Mapping 2: Handle updates for existing records

           Mapping 3: Close expired records

5. Target table does not contain primary keys? How will you handle?

Answer: Challenges:

           Duplicate record insertion

           No unique identification

           Data integrity issues

Solutions:

1.         Create composite unique index on business key columns

2.         Use UPSERT operations in session properties

3.         Target Update Override option:

      -- Custom update logic with WHERE clause
UPDATE target_table
SET column1 = :TU.column1,
    column2 = :TU.column2,
    last_updated = SYSDATE
WHERE business_key = :TU.business_key
  AND record_status = 'ACTIVE'
  AND NOT EXISTS (
      SELECT 1 FROM target_table t2
      WHERE t2.business_key = :TU.business_key
      AND t2.load_date > :TU.load_date
  )

4.         Implement custom duplicate checking:

      -- Pre-session SQL
DELETE FROM target_table a
WHERE ROWID > (
    SELECT MIN(ROWID)
    FROM target_table b
    WHERE a.business_key = b.business_key
)

5.         Use staging table with primary key, then load to final target

6.         Session-level configuration:

      Target Properties:
- Treat source rows as: Insert
- Insert Else Update: Yes
- Update as Insert: No

6. What is the use of Hash Key?

Answer: Hash Key Benefits:

           Fast comparison of large composite keys

           Reduced memory usage in lookups

           Improved performance for SCD detection

           Efficient change detection

Implementation:

-- Create hash of multiple columns
MD5(CUSTOMER_ID || '|' || CUSTOMER_NAME || '|' || ADDRESS || '|' || PHONE)

-- Use in SCD Type 2
v_source_hash = MD5(concatenated_source_fields)
v_target_hash = lkp_hash_key

IIF(v_source_hash != v_target_hash, 'CHANGED', 'UNCHANGED')

Hash Functions Available:

           MD5() - 128-bit hash

           SHA1() - 160-bit hash

           SHA2() - 256-bit hash

7. Can MD5 function be used in SCD Type 2?

Answer: Yes, highly recommended for SCD Type 2:

Advantages:

           Single comparison instead of multiple column comparisons

           Performance improvement especially with wide tables

           Memory efficient for caching

           Consistent results across different data types

Implementation Pattern:

-- Source hash
v_src_hash = MD5(CUSTOMER_NAME||'~'||ADDRESS||'~'||PHONE||'~'||EMAIL)

-- Lookup target hash
v_tgt_hash = lkp_hash_key

-- Change detection
v_is_changed = IIF(v_src_hash != v_tgt_hash, 1, 0)

Best Practices:

           Use consistent delimiter (avoid data conflicts)

           Handle NULLs appropriately

           Include all tracking columns in hash
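
To illustrate the delimiter and NULL handling points above, a database-side sketch that builds a NULL-safe MD5 per row (Oracle STANDARD_HASH assumed; the table and column names are examples):

-- NVL each tracked column and keep a fixed delimiter so NULLs and shifted values
-- cannot collide into the same hash string
SELECT customer_id,
       STANDARD_HASH(
           NVL(customer_name, '~NULL~') || '|' ||
           NVL(address,       '~NULL~') || '|' ||
           NVL(phone,         '~NULL~') || '|' ||
           NVL(email,         '~NULL~'), 'MD5') AS row_hash
FROM   staging_customers;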

8. Without Update Strategy, how will you implement?

Answer: Alternative Approaches:

1. Using Target Table Properties:

           Set target to “Update as Insert”

           Configure “Insert Else Update” mode

           Define key columns for update detection

2. Pre/Post SQL Commands:

-- Pre-session SQL: Delete existing records
DELETE FROM target_table
WHERE business_key IN (
    SELECT DISTINCT business_key FROM staging_table
)

-- Then insert all records as new

3. Separate Mappings:

           Insert Mapping: Filter for new records

           Update Mapping: Filter for existing records with changes

4. Database Merge Statements:

-- Post-session SQL
MERGE INTO target_table t
USING staging_table s ON (t.key = s.key)
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...

9. Can I do Update as Insert in SCD?

Answer: Yes, this is actually the standard approach for SCD Type 2:

Concept: Instead of updating existing records, insert new versions

Configuration:

1.         Target Table Properties: Set “Insert Else Update” to “Insert”

2.         Update Strategy: Use DD_INSERT for all changed records

3.         Separate Process: Handle record closure separately

Benefits:

           Preserves history automatically

           Simpler logic in transformations

           Better performance (inserts vs updates)

Complete SCD Type 2 Process:

Step 1: Insert new versions (Update Strategy = DD_INSERT)
Step 2: Close old versions (separate mapping with DD_UPDATE)

10. What is Target Update Override option?

Answer: Target Update Override allows custom SQL for update operations:

Purpose:

           Custom update logic beyond standard column mapping

           Complex join conditions for updates

           Performance optimization with hints

           Conditional updates

Example:

UPDATE target_table
SET
    column1 = :TU.column1,
    column2 = :TU.column2,
    last_updated = SYSDATE
WHERE
    business_key = :TU.business_key
    AND status = 'ACTIVE'
    AND last_updated < :TU.source_timestamp

When to Use:

           Complex update conditions

           Performance optimization

           Custom business logic

           Bulk update operations

Lookup Transformations & Caching

11. Lookup Cache Details Explain

Answer: Types of Lookup Cache:

1. Static Cache (Default):

           Built once at session start

           No updates during session

           Best for reference data

2. Dynamic Cache:

           Updates during session execution

           Handles data changes in real-time

           Higher memory usage

3. Persistent Cache:

           Saved to disk after session

           Reused across sessions

           Faster subsequent runs

4. Shared Cache:

           Shared across multiple mappings

           Memory optimization

           Consistent data view

Cache Configuration:

Cache Size: Based on available memory
Cache File Directory: High-performance disk
Index Cache Size: 1/8 of data cache

12. Scenario where you have used Dynamic Cache

Answer: Scenario: Real-time customer data loading where same customer can appear multiple times in source

Problem: Static cache won’t reflect intra-session changes

Solution - Dynamic Cache:

1.         Enable Dynamic Cache in lookup properties

2.         Configure NewLookupRow port

3.         Handle cache updates during session

Implementation:

-- In Expression after Lookup
v_customer_key = IIF(ISNULL(lkp_customer_key),
                    NEXTVAL,
                    lkp_customer_key)

-- NewLookupRow is an output port set by the dynamic lookup itself:
-- 0 = no change, 1 = row inserted into cache, 2 = row updated in cache
-- Typical downstream use (e.g. feeding an Update Strategy):
IIF(NewLookupRow = 1, DD_INSERT,
    IIF(NewLookupRow = 2, DD_UPDATE, DD_REJECT))

Benefits:

           Handles duplicate processing

           Maintains data consistency

           Eliminates lookup failures

13. Will Dynamic Cache handle duplicate records? Have you implemented?

Answer: Yes, Dynamic Cache specifically handles duplicates:

How it works:

1.         First occurrence: Cache miss, inserts into cache

2.         Subsequent occurrences: Cache hit, returns existing key

3.         Automatic updates: Cache reflects new data

Implementation Example:

Lookup Transformation:
- Dynamic Cache = True
- Insert Else Update = True
- Cache File Name Prefix = 'CUSTOMER_CACHE'

NewLookupRow Output (set by the lookup):
0 = no change, 1 = row inserted into cache, 2 = row updated in cache

Associated Expression:
v_final_key = IIF(ISNULL(lkp_customer_key), NEXTVAL, lkp_customer_key)

Real Project Scenario: Processing daily customer files where same customer appears multiple times with address updates - dynamic cache ensures latest address is captured for all occurrences.

14. Connected vs Unconnected Lookup - Which is faster?

Answer: Connected Lookup is generally faster:

Connected Lookup Advantages:

           Single cache build for entire mapping

           Pipeline processing with data flow

           Better memory utilization

           Participates in pushdown optimization

Unconnected Lookup Advantages:

           Call only when needed (conditional)

           Multiple calls possible in same transformation

           Flexible parameter passing

Performance Comparison:

Connected: O(n) - linear processing
Unconnected: O(n*m) - depends on call frequency

Best Practices:

           Use Connected for mandatory lookups

           Use Unconnected for conditional/complex lookups

           Consider cache type based on data volume

15. Delete records in Informatica - flat file case

Answer: Challenge: Flat files don’t support DELETE operations

Solutions:

1. Full Refresh Approach:

Step 1: Backup existing file
Step 2: Create new file without deleted records
Step 3: Replace original file

2. Staging Table Approach:

Step 1: Load flat file to staging table
Step 2: Perform DELETE operations on staging
Step 3: Export cleaned data back to flat file
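
A minimal SQL sketch of this staging-table approach, assuming the file has already been loaded into a staging table named STG_CUSTOMER_FILE and the keys to remove sit in DELETED_CUSTOMERS (both names are assumptions):

-- Step 2: apply the "delete" in the database
DELETE FROM stg_customer_file
WHERE  customer_id IN (SELECT customer_id FROM deleted_customers);
COMMIT;

-- Step 3: the remaining rows are exported back to the flat file
-- (SQL*Plus spool, UTL_FILE, or a simple pass-through Informatica mapping)
SELECT customer_id || ',' || customer_name || ',' || city AS csv_row
FROM   stg_customer_file;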

3. Filter Transformation:

-- Mark records for deletion
v_delete_flag = IIF(delete_condition, 'D', 'K')

-- Filter out deleted records
Filter Condition: v_delete_flag != 'D'

4. Router Transformation:

Group 1 (Keep): delete_flag != 'Y'
Group 2 (Delete): delete_flag = 'Y' (route to dummy target)

Performance & Optimization

16. Aggregate function without GROUP BY

Answer: Without GROUP BY, Aggregate works on entire dataset:

Example:

Source Data:
DEPT   SALARY
HR     50000
IT     60000
HR     55000
IT     65000

Aggregate (without GROUP BY):
COUNT(*)    SUM(SALARY)    AVG(SALARY)
4           230000         57500

Use Cases:

           Overall statistics calculation

           Data profiling activities

           Validation totals for reconciliation

Implementation:

Aggregate Transformation:
- No GROUP BY ports
- Only aggregate expressions:
  - TOTAL_RECORDS = COUNT(*)
  - TOTAL_AMOUNT = SUM(AMOUNT)
  - AVG_AMOUNT = AVG(AMOUNT)
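
The SQL equivalent, assuming an EMPLOYEES table with the DEPT/SALARY columns shown above, is simply an aggregate query with no GROUP BY clause; it likewise returns a single row:

SELECT COUNT(*)    AS total_records,
       SUM(salary) AS total_salary,
       AVG(salary) AS avg_salary
FROM   employees;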

17. Partitioning Techniques Used

Answer: Partition Types:

1. Round-Robin Partitioning:

           Equal distribution across partitions

           Best for: Uniform data processing

           Use when: No natural partition key

2. Hash Partitioning:

           Based on hash function of key columns

           Best for: Related data grouping

           Use when: Need data locality

3. Key Range Partitioning:

           Based on key value ranges

           Best for: Ordered data processing

           Use when: Sequential processing needed

4. Pass-Through Partitioning:

           Maintains source partitioning

           Best for: Parallel source processing

           Use when: Source already partitioned

Implementation Example:

Session Properties:
Partition Type: Hash
Partition Key: CUSTOMER_ID
Number of Partitions: 4
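
Before settling on a hash key, it can help to check how evenly it would spread rows. A rough database-side sketch (Oracle ORA_HASH assumed; the SALES table name is an example):

-- Row counts per simulated partition for a 4-way hash on CUSTOMER_ID
SELECT MOD(ORA_HASH(customer_id), 4) AS partition_no,
       COUNT(*)                      AS row_count
FROM   sales
GROUP  BY MOD(ORA_HASH(customer_id), 4)
ORDER  BY partition_no;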

18. Key Range Partition Example

Answer: Scenario: Process sales data by date ranges

Configuration:

Partition 1: DATE_RANGE = '2024-01-01' TO '2024-03-31' (Q1)
Partition 2: DATE_RANGE = '2024-04-01' TO '2024-06-30' (Q2)
Partition 3: DATE_RANGE = '2024-07-01' TO '2024-09-30' (Q3)
Partition 4: DATE_RANGE = '2024-10-01' TO '2024-12-31' (Q4)

Benefits:

           Parallel processing by quarters

           Reduced data volume per partition

           Better performance for date-based operations

Session Properties:

Partition Type: Key Range
Partition Key: SALE_DATE
Pass Through: No

19. Load only first and last record

Answer: Multiple Approaches:

Method 1 - Using Rank Transformation:

Rank Transformation:
- Group By: CUSTOMER_ID (if per customer)
- Order By: ORDER_DATE
- Top/Bottom: Top 1 and Bottom 1

Router after Rank:
Group 1: RANK = 1 (First Record)
Group 2: RANK = 1 (Last Record) - separate rank transformation with DESC order

Method 2 - Using Aggregator + Joiner:

Aggregator:
- Group By: CUSTOMER_ID
- MIN(ORDER_DATE) as FIRST_DATE
- MAX(ORDER_DATE) as LAST_DATE

Joiner:
Source LEFT JOIN Aggregator
WHERE ORDER_DATE = FIRST_DATE OR ORDER_DATE = LAST_DATE

Method 3 - SQL Override:

SELECT * FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) as rn_first,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn_last
    FROM orders o
) WHERE rn_first = 1 OR rn_last = 1

20. Bulk Load with Constraints and Indexes

Answer: Challenges:

           Constraint validation slows inserts

           Index maintenance overhead

           Lock contention issues

Solutions:

1. Disable Constraints/Indexes:

-- Pre-session SQL
ALTER TABLE target_table DISABLE CONSTRAINT constraint_name;
DROP INDEX index_name;

-- Post-session SQL
ALTER TABLE target_table ENABLE CONSTRAINT constraint_name;
CREATE INDEX index_name ON target_table(columns);

2. Bulk Loading Session Properties:

Target Properties:
- Truncate Target Table = Yes
- Bulk Load = Yes
- Enable Constraint Based Loading = No

Database Properties:
- Commit Interval = 10000
- Enable High Precision = No

3. Staging Approach:

Step 1: Bulk load to staging table (no constraints)
Step 2: Validate and cleanse data
Step 3: Insert into final table with constraints

4. Parallel Loading:

- Partition data by natural key
- Use multiple sessions
- Load different partitions simultaneously

21. Normalizer Transformation Usage

Answer: Purpose: Convert single row into multiple rows

Common Scenarios:

1. Column to Row Conversion:

Source (Single Row):
EMP_ID | JAN_SALES | FEB_SALES | MAR_SALES
101    | 10000     | 15000     | 12000

After Normalizer:
EMP_ID | MONTH | SALES
101    | JAN   | 10000
101    | FEB   | 15000
101    | MAR   | 12000

2. Multi-valued Field Processing:

Source:
CUSTOMER_ID | PHONE_NUMBERS
101         | 9876543210,9987654321,9123456789

After Normalizer:
CUSTOMER_ID | PHONE_NUMBER
101         | 9876543210
101         | 9987654321
101         | 9123456789

Configuration:

Normalizer Transformation:
- Number of Occurrences: 3 (for 3 months)
- Generated Column ID (GK): GCID_MONTH
- Generated Instance ID (GI): GI_MONTH
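
For comparison or for validating the Normalizer output, the same column-to-row conversion can be done in SQL. A sketch assuming an Oracle table named MONTHLY_SALES with the columns shown above:

SELECT emp_id, month, sales
FROM   monthly_sales
UNPIVOT (sales FOR month IN (jan_sales AS 'JAN',
                             feb_sales AS 'FEB',
                             mar_sales AS 'MAR'));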

22. Target Load Plan

Answer: Target Load Plan (also called Target Load Order) defines the order in which the Integration Service loads the targets fed by different source qualifiers in a mapping, together with the session-level loading strategy:

Components:

1.         Load Type: Insert, Update, Delete, Upsert

2.         Integration Service: Which service executes

3.         Source Connection: Database connection info

4.         Target Connection: Target database info

5.         Commit Settings: Transaction control

Load Strategies:

1. Normal Load Plan:

- Session-based execution
- Standard commit intervals
- Error handling through session logs

2. Bulk Load Plan:

- Direct path loading
- Minimal logging
- Faster performance

3. Incremental Load Plan:

- Delta identification
- Change data capture
- Merge/Upsert operations

Configuration Example (illustrative pseudo-configuration):

<TargetLoadPlan>
    <LoadType>Insert</LoadType>
    <BulkLoad>true</BulkLoad>
    <CommitInterval>10000</CommitInterval>
    <ErrorThreshold>100</ErrorThreshold>
</TargetLoadPlan>

Advanced Scenarios

23. Load distinct data to Table1 and duplicates to Table2

Answer: Implementation using Router:

Step 1 - Identify Duplicates:

Aggregator Transformation:
- Group By: Business_Key_Columns
- COUNT(*) as RECORD_COUNT

Joiner:
Source INNER JOIN Aggregator ON Business_Key

Step 2 - Route Data:

Router Transformation:
Group 1 (Distinct): RECORD_COUNT = 1 → Table1
Group 2 (Duplicates): RECORD_COUNT > 1 → Table2

Alternative Method - Using Rank:

Rank Transformation:
- Group By: Business_Key
- Order By: Load_Date DESC
- Rank Output: RANK_VALUE

Router:
Group 1: RANK_VALUE = 1 → Table1 (Latest record)
Group 2: RANK_VALUE > 1 → Table2 (Older duplicates)
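
The same split can be pushed to the database with an analytic count and a conditional multi-table insert. A sketch assuming Oracle syntax and example table/column names:

INSERT ALL
    WHEN key_count = 1 THEN INTO table1 (business_key, load_date) VALUES (business_key, load_date)
    WHEN key_count > 1 THEN INTO table2 (business_key, load_date) VALUES (business_key, load_date)
SELECT business_key,
       load_date,
       COUNT(*) OVER (PARTITION BY business_key) AS key_count
FROM   staging_table;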

24. Sequence Generator for 100 rows with Row ID

Answer: Standard Approach:

Sequence Generator Properties:
- Start Value: 1
- Increment By: 1
- End Value: 100
- Cycle: No
- Reset: No

Connect NEXTVAL to target ROW_ID column

Alternative - Expression Transformation:

Variable Port:
v_ROW_COUNT = v_ROW_COUNT + 1

Output Port:
ROW_ID = v_ROW_COUNT

(Variable ports start at 0 for each session run; use a mapping variable such as $$ROW_COUNT only if the count must persist across runs)

Advanced - Custom Sequence:

Expression with Custom Logic:
v_sequence =
    IIF(ISNULL(v_prev_sequence), 1, v_prev_sequence + 1)

ROW_ID = 'ROW_' || LPAD(v_sequence, 3, '0')
-- Output: ROW_001, ROW_002, etc.

25. 100 records with 10 cities - City-wise separate data

Answer: Method 1 - Router Transformation:

Router Groups:
Group 1: CITY = 'Mumbai' → Mumbai_Target
Group 2: CITY = 'Delhi' → Delhi_Target
Group 3: CITY = 'Bangalore' → Bangalore_Target
...
Default Group: Other cities → Other_Cities_Target

Method 2 - Dynamic File Creation:

Expression Transformation:
v_filename = 'City_' || CITY || '_' || TO_CHAR(SYSDATE, 'YYYYMMDD') || '.csv'

File Target Properties:
- Output File Name: $PMTargetFileDir/$v_filename
- Create target file at runtime

Method 3 - Using Normalizer:

If cities are in columns:
Source: RECORD_ID, MUMBAI_DATA, DELHI_DATA, BLR_DATA

Normalizer converts to:
RECORD_ID, CITY, DATA

Then route by CITY

26. Shell Script to Transfer Data Between Tables

Answer: Without Informatica - Direct Database Methods:

Method 1 - Using sqlplus (Oracle):

#!/bin/bash

# Set environment variables
export ORACLE_HOME=/path/to/oracle
export PATH=$ORACLE_HOME/bin:$PATH

# Database connection details
DB_USER="username"
DB_PASS="password"
DB_SID="database_sid"

# Execute data transfer
sqlplus -s $DB_USER/$DB_PASS@$DB_SID << EOF
INSERT INTO target_table
SELECT * FROM source_table
WHERE last_updated >= TRUNC(SYSDATE);

COMMIT;
EXIT;
EOF

echo "Data transfer completed at $(date)"

Method 2 - Using Data Pump:

#!/bin/bash

# Export data
expdp username/password@database \
    tables=source_table \
    query=source_table:\"WHERE last_updated >= TRUNC\(SYSDATE\)\" \
    dumpfile=daily_export.dmp \
    directory=DATA_PUMP_DIR

# Import to target
impdp username/password@target_database \
    tables=target_table \
    dumpfile=daily_export.dmp \
    directory=DATA_PUMP_DIR \
    table_exists_action=truncate

Method 3 - Using Database Link:

-- Create database link
CREATE DATABASE LINK target_db_link
CONNECT TO username IDENTIFIED BY password
USING 'target_database_tns';

-- Transfer data
INSERT INTO target_table@target_db_link
SELECT * FROM source_table
WHERE condition;

Error Handling & Debugging

27. File deleted while Informatica job is running

Answer: What Happens:

           Session fails with file not found error

           Partial data processing may occur

           Transaction rollback if configured

Error Messages:

[ERROR] Source file not found: /path/to/file.csv
[ERROR] Failed to read from source
[ERROR] Session terminated due to source error

Prevention Strategies:

1. File Locking:

# In source system
exec 200>/path/to/file.csv.lock
flock -n 200 || exit 1
# Process file
rm /path/to/file.csv.lock

2. File Validation:

# Pre-session validation script
if [ ! -f "$SOURCE_FILE" ]; then
    echo "Source file not found: $SOURCE_FILE"
    exit 1
fi

3. Staging Approach:

1. Copy file to staging directory
2. Process from staging
3. Archive after processing

4. Session Configuration:

Session Properties:
- Stop on Error: No
- Error Threshold: 0 (to capture all errors)
- Recovery Strategy: Restart from beginning

Recovery Process:

1. Check session logs for exact failure point
2. Verify file availability
3. Restart session or mapping
4. Implement monitoring alerts

28. Primary Key vs Unique Key

Answer:

Primary Key:

           Null Values: Not allowed

           Count: Only one per table

           Index: Unique index created automatically (clustered by default in SQL Server)

           Purpose: Uniquely identify each row

           Modification: Values should remain stable; changing them breaks referencing foreign keys

Unique Key:

           Null Values: Allowed (SQL Server permits one NULL; Oracle permits multiple)

           Count: Multiple unique keys allowed per table

           Index: Non-clustered unique index by default

           Purpose: Enforce uniqueness on alternate/business keys

           Modification: Can be altered

Example:

CREATE TABLE employees (
    emp_id NUMBER PRIMARY KEY,          -- Primary Key
    email VARCHAR2(100) UNIQUE,         -- Unique Key
    phone VARCHAR2(15) UNIQUE,          -- Another Unique Key
    ssn VARCHAR2(11) UNIQUE NOT NULL    -- Unique Key with NOT NULL
);

29. Can I make Unique Key without NULL as constraint instead of Primary Key?

Answer: Yes, you can use UNIQUE NOT NULL as alternative to PRIMARY KEY:

-- Instead of PRIMARY KEY
CREATE TABLE customers (
    customer_id NUMBER UNIQUE NOT NULL,  -- Acts like Primary Key
    customer_name VARCHAR2(100),
    email VARCHAR2(100) UNIQUE
);

-- Add constraint later
ALTER TABLE customers
ADD CONSTRAINT uk_customer_id UNIQUE (customer_id);

ALTER TABLE customers
MODIFY customer_id NOT NULL;

Differences:

           Functionally equivalent for uniqueness and non-null

           No automatic clustering (depends on database)

           Multiple UNIQUE NOT NULL constraints possible

           Referential integrity - can still be referenced by foreign keys

Use Cases:

           When business rules require multiple unique identifiers

           Legacy systems without formal primary keys

           Composite unique constraints

30. What is the use of Primary Key?

Answer: Primary Functions:

1.         Unique Identification: Ensures each row is uniquely identifiable

2.         Referential Integrity: Serves as target for foreign key relationships

3.         Performance Optimization: Automatic index creation for fast access

4.         Data Consistency: Prevents duplicate data entry

5.         Replication: Used in database replication for change tracking

Database-Specific Benefits:

Oracle:

-- Enables efficient ROWID access
-- Supports partition pruning
-- Required for certain replication features

SQL Server:

-- Creates clustered index by default
-- Enables efficient page splits
-- Required for merge replication

In ETL Context:

           SCD Type 2: Surrogate keys for historical tracking

           Incremental Loading: Identify existing records for updates

           Data Quality: Ensure data uniqueness across loads

31. To connect Table A and B, what constraints are needed?

Answer: Primary Constraint - Foreign Key Relationship:

-- Table A (Parent)
CREATE TABLE departments (
    dept_id NUMBER PRIMARY KEY,        -- Primary Key
    dept_name VARCHAR2(100)
);

-- Table B (Child) 
CREATE TABLE employees (
    emp_id NUMBER PRIMARY KEY,         -- Primary Key
    emp_name VARCHAR2(100),
    dept_id NUMBER,                    -- Foreign Key
    CONSTRAINT fk_emp_dept
        FOREIGN KEY (dept_id)
        REFERENCES departments(dept_id)
);

Additional Constraints for Data Integrity:

1.         Check Constraints:

ALTER TABLE employees
ADD CONSTRAINT chk_emp_salary
CHECK (salary > 0);

2.         Not Null Constraints:

ALTER TABLE employees
MODIFY emp_name NOT NULL;

3.         Unique Constraints:

ALTER TABLE employees
ADD CONSTRAINT uk_emp_email
UNIQUE (email);

In Informatica Context:

           Lookup Transformations: Use join keys

           Joiner Transformations: Define master-detail relationships

           Referential Integrity: Ensure valid relationships during ETL

32. Can you implement SCD Type 2 on target flat file?

Answer: Challenges with Flat Files:

           No built-in SCD support

           No update capability

           Sequential processing only

Workaround Solutions:

Method 1 - Multiple File Approach:

Current_Records.csv  - Active records only
History_Records.csv  - All historical versions
Archive_YYYYMMDD.csv - Daily snapshots

Method 2 - Single File with Flags:

CustKey,CustID,Name,Address,StartDate,EndDate,CurrentFlag,Version
1,101,John,Bangalore,2023-01-01,2024-01-15,N,1
2,101,John,Hyderabad,2024-01-16,9999-12-31,Y,2
3,102,Jane,Mumbai,2023-06-01,9999-12-31,Y,1

Implementation Process:

1. Read existing file into staging table
2. Apply SCD Type 2 logic in database
3. Export complete dataset back to file
4. Replace original file

Alternative - Database Approach:

1. Load flat file to temporary table
2. Implement SCD in database using SQL
3. Export final result if flat file needed

33. Update data into Excel and combine with inserted data in SCD2 - Complications?

Answer: Major Complications:

1.         Excel Limitations:

          Row limit: 1,048,576 rows

          No transaction support

          No concurrent access

          No indexing capabilities

2.         SCD Type 2 Challenges:

          Cannot update existing rows in place

          Must rewrite entire file for changes

          Version tracking becomes complex

          Date calculations limited

3.         Performance Issues:

          Large file handling problems

          Memory consumption for big datasets

          Slow processing for complex operations

Solutions:

Approach 1 - Database Staging:

1. Import Excel to staging table
2. Perform SCD operations in database 
3. Export results back to Excel
4. Handle file versioning externally

Approach 2 - Separate Worksheets:

Sheet1: Current_Data (active records)
Sheet2: Historical_Data (all versions)
Sheet3: Change_Log (audit trail)

Implementation:

Expression Transformation:
v_excel_row =
    IIF(SCD_ACTION = 'INSERT',
        CUSTOMER_KEY||','||CUSTOMER_NAME||','||ADDRESS||','||START_DATE||',9999-12-31,Y',
        CUSTOMER_KEY||','||CUSTOMER_NAME||','||ADDRESS||','||START_DATE||','||END_DATE||',N')

34. Format same, 4 files today, 1 file tomorrow - How to take file list in CSV?

Answer: Dynamic File Processing:

Method 1 - Parameter File:

FileDate,FileName,ProcessFlag
2024-01-15,Sales_20240115_File1.csv,Y
2024-01-15,Sales_20240115_File2.csv,Y 
2024-01-15,Sales_20240115_File3.csv,Y
2024-01-15,Sales_20240115_File4.csv,Y
2024-01-16,Sales_20240116_File1.csv,Y

Method 2 - Session Parameters:

Session Properties:
Parameter: $FileList = 'file1.csv,file2.csv,file3.csv,file4.csv'

Source Configuration:
Split($FileList, ',') for multiple sources

Method 3 - Workflow Variables:

Pre-session Assignment (pseudo-logic, typically computed by a command task or shell script):
$FileCount = Get_File_Count('/data/input/', '*.csv')
$FileList = Get_File_Names('/data/input/', 'Sales_' + TO_CHAR(SYSDATE, 'YYYYMMDD'))

Decision Task:
IIF($FileCount > 0, 'PROCESS', 'SKIP')

Implementation Using Unix:

#!/bin/bash
# Generate file list CSV
INPUT_DIR="/data/incoming"
OUTPUT_CSV="/config/file_list.csv"
PROCESS_DATE=$(date +%Y%m%d)

echo "ProcessDate,FileName,FileSize,Status" > $OUTPUT_CSV

find $INPUT_DIR -name "*${PROCESS_DATE}*.csv" -type f | while read file; do
    filename=$(basename "$file")
    filesize=$(stat -c%s "$file")
    echo "${PROCESS_DATE},${filename},${filesize},PENDING" >> $OUTPUT_CSV
done

35. SCD2 Order: Insert, Update as Insert, Update as Update

Answer: Correct Processing Order:

Step 1: Process Updates (Close old records)

-- Update existing records to close them
UPDATE customer_dim
SET end_date = CURRENT_DATE - 1,
    current_flag = 'N'
WHERE customer_id IN (SELECT customer_id FROM staging_changes)
  AND current_flag = 'Y';

Step 2: Insert New Versions (Update as Insert)

-- Insert new versions for changed records
INSERT INTO customer_dim (
    customer_key, customer_id, customer_name, address,
    start_date, end_date, current_flag, version_num
)
SELECT
    customer_dim_seq.NEXTVAL,
    s.customer_id, s.customer_name, s.address,
    CURRENT_DATE, DATE '9999-12-31', 'Y',
    (SELECT NVL(MAX(d.version_num), 0) + 1
     FROM customer_dim d
     WHERE d.customer_id = s.customer_id)
FROM staging_changes s;

Step 3: Insert New Records

-- Insert completely new customers
INSERT INTO customer_dim (
    customer_key, customer_id, customer_name, address,
    start_date, end_date, current_flag, version_num
)
SELECT
    customer_dim_seq.NEXTVAL,
    customer_id, customer_name, address,
    CURRENT_DATE, DATE '9999-12-31', 'Y', 1
FROM staging_new_customers;

Informatica Implementation Order:

Mapping 1: Close existing records (DD_UPDATE)
Mapping 2: Insert new versions (DD_INSERT)
Mapping 3: Insert new records (DD_INSERT)

Or single mapping with Router:
Group 1: Updates → Update Strategy (DD_UPDATE)
Group 2: New Versions → Update Strategy (DD_INSERT)
Group 3: New Records → Update Strategy (DD_INSERT)

36. Bulk Load with Constraints and Indexes - How to handle?

Answer: Challenges:

           Constraint validation overhead

           Index maintenance during load

           Lock contention issues

           Performance degradation

Solutions:

1. Disable/Enable Approach:

-- Pre-session SQL
ALTER TABLE target_table DISABLE CONSTRAINT constraint_name;
ALTER INDEX index_name UNUSABLE;

-- Post-session SQL 
ALTER TABLE target_table ENABLE CONSTRAINT constraint_name;
ALTER INDEX index_name REBUILD;

2. Bulk Mode Configuration:

Session Properties:
- Enable Bulk Mode: Yes
- Disable Constraints: Yes (if supported)
- Commit Interval: Large value (50,000)
- Enable High Precision: No

Target Properties:
- Truncate Target Table: Yes (if full refresh)
- Output File Type: Binary (for better performance)

3. Staging Table Approach:

-- Create staging table without constraints
CREATE TABLE staging_target AS
SELECT * FROM target_table WHERE 1=0;

-- Bulk load to staging
-- Validate data
-- Insert to main table with constraints
INSERT /*+ APPEND */ INTO target_table
SELECT * FROM staging_target
WHERE validation_rules_pass;

4. Partitioned Loading:

-- Load different partitions separately
-- Enables parallel constraint checking
INSERT INTO target_table PARTITION (p_2024_01)
SELECT * FROM source_data
WHERE date_column >= DATE'2024-01-01'
  AND date_column < DATE'2024-02-01';

37. Key Range Partitioning Example

Answer: Scenario: Sales data partitioned by date ranges for parallel processing

Partition Configuration:

Partition 1: JAN-MAR 2024 (Q1)
  Key Range: '2024-01-01' to '2024-03-31'
 
Partition 2: APR-JUN 2024 (Q2) 
  Key Range: '2024-04-01' to '2024-06-30'
 
Partition 3: JUL-SEP 2024 (Q3)
  Key Range: '2024-07-01' to '2024-09-30'
 
Partition 4: OCT-DEC 2024 (Q4)
  Key Range: '2024-10-01' to '2024-12-31'

Implementation:

Session Properties:
- Partition Type: Key Range
- Partition Key: SALE_DATE
- Number of Partitions: 4

Partition Details:
Partition 1: SALE_DATE >= '2024-01-01' AND SALE_DATE <= '2024-03-31'
Partition 2: SALE_DATE >= '2024-04-01' AND SALE_DATE <= '2024-06-30'
Partition 3: SALE_DATE >= '2024-07-01' AND SALE_DATE <= '2024-09-30'
Partition 4: SALE_DATE >= '2024-10-01' AND SALE_DATE <= '2024-12-31'
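
If the target is an Oracle table, a matching range-partitioned structure lets each session partition land in its own database partition. A sketch with assumed table and column names:

CREATE TABLE sales_2024 (
    sale_id   NUMBER,
    sale_date DATE,
    amount    NUMBER
)
PARTITION BY RANGE (sale_date) (
    PARTITION p_2024_q1 VALUES LESS THAN (DATE '2024-04-01'),
    PARTITION p_2024_q2 VALUES LESS THAN (DATE '2024-07-01'),
    PARTITION p_2024_q3 VALUES LESS THAN (DATE '2024-10-01'),
    PARTITION p_2024_q4 VALUES LESS THAN (DATE '2025-01-01')
);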

Benefits:

           Parallel processing by quarters

           Reduced memory usage per partition

           Better resource utilization

           Improved performance for date-based operations

Use Cases:

           Large historical data processing

           Date-based incremental loads

           Archival data processing

           Report generation by time periods

38. Load only first and last record scenarios

Answer: Multiple Implementation Methods:

Method 1 - Using Rank Transformation:

Source → Sorter (ORDER BY date_column) →
Rank (Group By: customer_id, Top 1) → Router →
Group 1: RANK = 1 (First Record)

Second pipeline:
Source → Sorter (ORDER BY date_column DESC) →
Rank (Group By: customer_id, Top 1) → Router →
Group 1: RANK = 1 (Last Record)

Method 2 - SQL Override in Source Qualifier:

SELECT * FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) as rn_first,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn_last
    FROM orders o
) WHERE rn_first = 1 OR rn_last = 1

Method 3 - Using Aggregator + Joiner:

Pipeline 1: Source → Aggregator (Group By customer_id, MIN/MAX dates)
Pipeline 2: Source (all records)
Joiner: Join on customer_id and (date = min_date OR date = max_date)

Oracle SQL for First Row (example scenario):

-- Get first row for ID=1234
SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY start_date ASC) as rn
    FROM customer_history
    WHERE ID = 1234
) WHERE rn = 1;

-- Alternative using FIRST_VALUE
SELECT DISTINCT ID,
       FIRST_VALUE(location) OVER (PARTITION BY ID ORDER BY start_date) as first_location,
       FIRST_VALUE(end_date) OVER (PARTITION BY ID ORDER BY start_date) as first_end_date
FROM customer_history
WHERE ID = 1234;

39. Incremental Load Implementation in Informatica

Answer: Common Strategies:

1. Timestamp-Based Incremental:

Source Qualifier SQL:
SELECT * FROM source_table
WHERE last_modified_date > $LAST_EXTRACT_DATE
   OR created_date > $LAST_EXTRACT_DATE

Workflow Variables:
$LAST_EXTRACT_DATE = GET_MAX_DATE('target_table', 'load_date')
$CURRENT_EXTRACT_DATE = SYSDATE

2. Flag-Based Incremental:

Source Qualifier SQL:
SELECT * FROM source_table
WHERE processed_flag = 'N'

Post-session SQL:
UPDATE source_table
SET processed_flag = 'Y',
    processed_date = SYSDATE
WHERE processed_flag = 'N'

3. Sequence-Based CDC:

Source Qualifier SQL:
SELECT * FROM source_table
WHERE sequence_id > $LAST_SEQUENCE_ID
ORDER BY sequence_id

Session Parameters:
$LAST_SEQUENCE_ID = GET_MAX_VALUE('control_table', 'last_sequence')

Implementation Components:

           Control Table: Store extraction metadata

           Workflow Variables: Dynamic parameter passing

           Error Handling: Recovery and restart capabilities

           Validation: Data completeness checks
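
The GET_MAX_DATE / GET_MAX_VALUE calls above are pseudo-logic; in practice the high-water mark usually lives in a control table. A minimal sketch, assuming a control table named ETL_CONTROL and the $LAST_EXTRACT_DATE parameter referenced above (both names are assumptions):

-- 1) Read the high-water mark captured by the previous run
SELECT last_extract_date FROM etl_control WHERE job_name = 'CUSTOMER_LOAD';

-- 2) Extract only the delta (value supplied to the mapping as $LAST_EXTRACT_DATE)
SELECT *
FROM   source_table
WHERE  last_modified_date > TO_DATE('$LAST_EXTRACT_DATE', 'YYYY-MM-DD HH24:MI:SS');

-- 3) After a successful load, advance the high-water mark
UPDATE etl_control
SET    last_extract_date = SYSDATE
WHERE  job_name = 'CUSTOMER_LOAD';
COMMIT;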

40. Active vs Passive Transformations with Scenarios

Answer:

Active Transformations (Change number of rows):

1.         Filter: Remove rows based on conditions

      Scenario: Filter out cancelled orders
Condition: ORDER_STATUS != 'CANCELLED'
Input: 1000 rows → Output: 850 rows

2.         Router: Route rows to multiple groups

      Scenario: Route customers by region
Group 1: REGION = 'NORTH' → 300 rows
Group 2: REGION = 'SOUTH' → 400 rows 
Group 3: Default → 300 rows

3.         Aggregator: Group and summarize data

      Scenario: Monthly sales summary
Input: 10,000 daily transactions
Output: 12 monthly summaries

4.         Rank: Return top/bottom records

      Scenario: Top 10 customers by sales
Input: 5,000 customers → Output: 10 customers

Passive Transformations (Same number of rows):

1.         Expression: Calculate derived values

      Scenario: Calculate age from birth_date
v_age = TRUNC(DATE_DIFF(SYSDATE, BIRTH_DATE, 'YY'))
Input: 1000 rows → Output: 1000 rows

2.         Lookup: Retrieve reference data

      Scenario: Get customer name from customer_id
Each input row gets enriched with lookup data
Input: 1000 rows → Output: 1000 rows

3.         Update Strategy: Determine operation type (note: Informatica formally classifies this as active, since rows flagged DD_REJECT are dropped)

      Scenario: Flag records for insert/update
DD_INSERT for new records, DD_UPDATE for changes
Input: 1000 rows → Output: 1000 rows with flags (fewer only if rows are rejected)

41. Sequence Generator for 100 rows with Row ID

Answer: Implementation Options:

Method 1 - Standard Sequence Generator:

Sequence Generator Properties:
- Start Value: 1
- Increment By: 1 
- End Value: 100
- Number of Cached Values: 1000
- Cycle: No
- Reset: No

Connection: NEXTVAL → ROW_ID (target column)

Method 2 - Expression with Variable:

Expression Transformation:
Variable Port: v_ROW_COUNT = v_ROW_COUNT + 1
Output Port: ROW_ID = v_ROW_COUNT

(Variable ports start at 0 for each session run, so no session-level initialization is required)

Method 3 - Custom Formatting:

Expression Transformation:
v_sequence = NEXTVAL (from Sequence Generator)
ROW_ID = 'ROW_' || LPAD(v_sequence, 5, '0')

Output: ROW_00001, ROW_00002, ROW_00003, etc.

Method 4 - Date-based Sequence:

Expression Transformation:
v_date_part = TO_CHAR(SYSDATE, 'YYYYMMDD')
v_sequence = NEXTVAL
ROW_ID = v_date_part || '_' || LPAD(v_sequence, 4, '0')

Output: 20240115_0001, 20240115_0002, etc.

42. 100 records, 10 cities, city-wise separate data

Answer: Multiple Approaches:

Method 1 - Router Transformation:

Source (100 records) → Router Transformation

Router Groups:
Group 1: CITY = 'Mumbai' → Mumbai_Target (10 records)
Group 2: CITY = 'Delhi' → Delhi_Target (10 records)
Group 3: CITY = 'Bangalore' → Bangalore_Target (10 records)
Group 4: CITY = 'Chennai' → Chennai_Target (10 records)
Group 5: CITY = 'Hyderabad' → Hyderabad_Target (10 records)
Group 6: CITY = 'Kolkata' → Kolkata_Target (10 records)
Group 7: CITY = 'Pune' → Pune_Target (10 records)
Group 8: CITY = 'Ahmedabad' → Ahmedabad_Target (10 records)
Group 9: CITY = 'Jaipur' → Jaipur_Target (10 records)
Group 10: CITY = 'Lucknow' → Lucknow_Target (10 records)

Method 2 - Dynamic File Target:

Expression Transformation:
v_filename = 'City_' || CITY || '_' || TO_CHAR(SYSDATE, 'YYYYMMDD') || '.csv'

File Target Properties:
- File Name: $PMTargetFileDir/$v_filename
- Create target file at runtime: Yes

Output Files:
City_Mumbai_20240115.csv
City_Delhi_20240115.csv
City_Bangalore_20240115.csv, etc.

Method 3 - Database Tables with City Suffix:

Expression Transformation:
v_table_name = 'CUSTOMER_' || UPPER(CITY)

Target Properties:
- Table Name: $v_table_name
- Create target table at runtime: Yes

Tables Created:
CUSTOMER_MUMBAI, CUSTOMER_DELHI, CUSTOMER_BANGALORE, etc.

43. What happens if aggregate function has no GROUP BY?

Answer: Behavior: Processes entire dataset as single group

Example Scenario:

Source Data (Sales Table):
PRODUCT    AMOUNT    QUANTITY
Laptop     50000     1
Mobile     25000     2 
Tablet     30000     1
Laptop     45000     1
Mobile     20000     1

Aggregator (No GROUP BY):
- SUM(AMOUNT) = 170000
- COUNT(*) = 5
- AVG(AMOUNT) = 34000
- MAX(AMOUNT) = 50000
- MIN(AMOUNT) = 20000

Output: Single row with aggregate values

Use Cases:

1.         Overall Statistics: Total sales, record counts

2.         Data Validation: Check data completeness

3.         Control Totals: For reconciliation purposes

4.         Summary Reports: High-level dashboard metrics

Implementation:

Aggregator Transformation:
- No Group By ports
- Aggregate Expressions only:
  - TOTAL_SALES = SUM(AMOUNT) 
  - RECORD_COUNT = COUNT(*)
  - AVERAGE_SALE = AVG(AMOUNT)
  - MAX_SALE = MAX(AMOUNT)
  - MIN_SALE = MIN(AMOUNT)

Practical Application:

Source: Daily transaction file (100,000 records)
Aggregator: No GROUP BY
Output: Single summary record for dashboard

LOAD_DATE | TOTAL_AMOUNT | RECORD_COUNT | AVG_AMOUNT
2024-01-15| 1,250,000    | 100,000      | 12.50

44. Questions on Partitions and Explain Plan

Answer: Partition Types in Informatica:

1. Round-Robin Partitioning:

Use Case: Equal distribution of data
Configuration: No partition key required
Example: 1000 records → 4 partitions = 250 records each

2. Hash Partitioning:

Use Case: Related data grouping
Configuration: Partition key = CUSTOMER_ID
Example: All records for same customer in same partition

3. Key Range Partitioning:

Use Case: Sequential data processing 
Configuration: Partition key = ORDER_DATE
Range 1: 2024-Q1, Range 2: 2024-Q2, etc.

4. Pass-Through Partitioning:

Use Case: Maintain source partitioning
Configuration: Inherit from upstream transformation
Example: Source already partitioned by region

Database Explain Plan Analysis:

-- Oracle Explain Plan
EXPLAIN PLAN FOR
SELECT /*+ USE_HASH(a,b) */
       a.customer_id, b.customer_name, SUM(a.amount)
FROM sales a, customers b
WHERE a.customer_id = b.customer_id
GROUP BY a.customer_id, b.customer_name;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Key metrics to analyze:
-- 1. Cost values
-- 2. Cardinality estimates 
-- 3. Access methods (Index/Full Table Scan)
-- 4. Join methods (Hash/Nested Loop/Sort Merge)
-- 5. Sort operations

Informatica Session Performance Analysis:

Session Log Analysis:
- Partition-wise throughput
- Memory usage per partition
- CPU utilization
- I/O wait times
- Buffer pool statistics

45. Shell Script: Transfer data between tables without Informatica

Answer: Multiple Database Methods:

Method 1 - Direct SQL (Oracle):

#!/bin/bash

# Database connection details
DB_USER="etl_user"
DB_PASS="password" 
DB_SID="PRODDB"

# Transfer data using sqlplus
sqlplus -s $DB_USER/$DB_PASS@$DB_SID << EOF
SET SERVEROUTPUT ON
-- Direct table to table transfer inside a PL/SQL block so SQL%ROWCOUNT is available
BEGIN
    INSERT INTO target_schema.customer_target
    SELECT customer_id, customer_name, address, phone, email, SYSDATE
    FROM source_schema.customer_source
    WHERE last_modified_date >= TRUNC(SYSDATE);

    -- Show transfer statistics
    DBMS_OUTPUT.PUT_LINE('Transferred ' || SQL%ROWCOUNT || ' records');

    COMMIT;
END;
/
EXIT;
EOF

echo "Data transfer completed at $(date)"

Method 2 - Using Database Link:

#!/bin/bash

sqlplus -s $DB_USER/$DB_PASS@$TARGET_DB << EOF
-- Create database link (one-time setup; the statement fails if the link already exists)
CREATE DATABASE LINK source_db_link
CONNECT TO source_user IDENTIFIED BY source_pass
USING 'source_database_tns';

-- Transfer data via database link
INSERT INTO local_target_table
SELECT * FROM source_table@source_db_link
WHERE process_date = TRUNC(SYSDATE);

COMMIT;
EXIT;
EOF

Method 3 - Export/Import Approach:

#!/bin/bash

# Export from source
expdp source_user/password@source_db \
    tables=source_schema.sales_data \
    query=sales_data:\"WHERE sale_date >= TRUNC\(SYSDATE\)\" \
    dumpfile=daily_sales.dmp \
    directory=DATA_PUMP_DIR \
    logfile=export_sales.log

# Import to target 
impdp target_user/password@target_db \
    tables=target_schema.sales_data \
    dumpfile=daily_sales.dmp \
    directory=DATA_PUMP_DIR \
    table_exists_action=append \
    logfile=import_sales.log

echo "Export/Import completed successfully"

Method 4 - Using MERGE Statement:

-- Advanced MERGE for UPSERT operations
MERGE INTO target_table t
USING (
    SELECT customer_id, customer_name, address, phone, email
    FROM source_table
    WHERE last_modified_date >= TRUNC(SYSDATE)
) s ON (t.customer_id = s.customer_id)
WHEN MATCHED THEN
    UPDATE SET
        t.customer_name = s.customer_name,
        t.address = s.address,
        t.phone = s.phone,
        t.email = s.email,
        t.last_updated = SYSDATE
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, address, phone, email, created_date)
    VALUES (s.customer_id, s.customer_name, s.address, s.phone, s.email, SYSDATE);

46. Have you used Normalizer? Provide scenarios

Answer: Normalizer Use Cases:

Scenario 1 - Column to Row Conversion:

Source Data (Wide Format):
EMP_ID | JAN_SALES | FEB_SALES | MAR_SALES | APR_SALES
101    | 10000     | 15000     | 12000     | 18000
102    | 8000      | 9000      | 11000     | 14000

Normalizer Configuration:
- Occurs: 4 (for 4 months)
- Generated Column ID: GCID_MONTH
- Generated Instance ID: GI_MONTH

Output (Normalized):
EMP_ID | MONTH | SALES | GCID_MONTH | GI_MONTH
101    | JAN   | 10000 | 1          | 1
101    | FEB   | 15000 | 1          | 2 
101    | MAR   | 12000 | 1          | 3
101    | APR   | 18000 | 1          | 4
102    | JAN   | 8000  | 1          | 1
102    | FEB   | 9000  | 1          | 2
102    | MAR   | 11000 | 1          | 3 
102    | APR   | 14000 | 1          | 4

Scenario 2 - Multi-valued Field Processing:

Source Data:
CUSTOMER_ID | PHONE_NUMBERS
101         | 9876543210;9987654321;9123456789
102         | 9555666777;9444333222

Normalizer Configuration:
- Occurs: 3 (maximum phones per customer)
- Level: 0 (for variable occurrences)

Output:
CUSTOMER_ID | PHONE_NUMBER | GI_PHONE
101         | 9876543210   | 1
101         | 9987654321   | 2
101         | 9123456789   | 3
102         | 9555666777   | 1
102         | 9444333222   | 2
102         | NULL         | 3

Scenario 3 - COBOL Copybook Processing:

Source: COBOL file with repeating groups
CUSTOMER-RECORD
    CUSTOMER-ID
    CUSTOMER-NAME 
    ORDER-DETAILS OCCURS 5 TIMES
        ORDER-ID
        ORDER-DATE
        ORDER-AMOUNT

Normalizer extracts each ORDER-DETAILS occurrence as separate row

Implementation Tips:

           Set appropriate OCCURS value based on maximum repetitions

           Handle NULL values for unused occurrences

           Use GI (Generated Instance ID) for sequence tracking

           Combine with Expression for data cleanup

47. Connected vs Unconnected Lookup - Performance comparison

Answer: Performance Analysis:

Connected Lookup:

Advantages:
✓ Single cache build for entire mapping
✓ Pipeline processing with data flow
✓ Better memory utilization
✓ Automatic caching optimization
✓ Supports multiple return values efficiently

Performance Characteristics:
- Cache built once at session start
- O(n) processing complexity
- Memory usage: Single cache instance
- Best for: Mandatory lookups, multiple return columns

Unconnected Lookup:

Advantages:
✓ Called only when needed (conditional)
✓ Can be called multiple times with different parameters
✓ Flexible parameter passing
✓ Selective cache usage

Performance Characteristics: 
- Cache built on first call
- O(n*m) where m = number of calls
- Memory usage: Separate cache per lookup
- Best for: Optional lookups, single return values

Performance Comparison Test:

Test Scenario: 1 million source records, 100,000 lookup records

Connected Lookup:
- Cache build time: 30 seconds
- Processing time: 5 minutes
- Total time: 5.5 minutes
- Memory usage: 512 MB

Unconnected Lookup (called for every record):
- Cache build time: 30 seconds 
- Processing time: 12 minutes
- Total time: 12.5 minutes
- Memory usage: 512 MB per lookup instance

Unconnected Lookup (conditional - 30% records):
- Cache build time: 30 seconds
- Processing time: 4 minutes 
- Total time: 4.5 minutes
- Memory usage: 512 MB

Recommendation:

           Use Connected for lookups needed by all/most records

           Use Unconnected for conditional or selective lookups

           Consider cache size and available memory

           Test with actual data volumes for best decision

48. Data Quality and Validation Transformations

Answer: Data Quality Scenarios:

1. Address Standardization:

Data Quality Transformation:
Input: "123 main st apt 4b"
Rule: US Address Standardization
Output: "123 Main Street Apartment 4B"

Configuration:
- Reference Data: USPS database
- Standardization Level: Parsing + Correction
- Output Ports: Standardized address components

2. Name Matching and Deduplication:

Scenario: Customer deduplication across systems

Data Quality Rule:
- Phonetic matching (Soundex, Double Metaphone)
- Fuzzy string matching
- Composite matching score

Implementation:
Input: John Smith, Jon Smyth, J. Smith
Process: Calculate similarity scores
Output: Matched groups with confidence levels

3. Email Validation:

Expression Transformation:
v_email_valid =
    IIF(INSTR(EMAIL, '@') > 0 AND
        INSTR(EMAIL, '.', INSTR(EMAIL, '@')) > 0 AND
        LENGTH(EMAIL) > 5 AND
        INSTR(EMAIL, ' ') = 0, 'VALID', 'INVALID')

Router Groups:
- Valid Emails → Main processing
- Invalid Emails → Error handling
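
The same check can also be pushed to the database. A hedged sketch using Oracle REGEXP_LIKE (the CUSTOMERS table name and the pattern are examples; tighten the pattern as needed):

SELECT email,
       CASE WHEN REGEXP_LIKE(email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')
            THEN 'VALID' ELSE 'INVALID'
       END AS email_status
FROM   customers;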

4. Phone Number Standardization:

Expression Transformation:
-- Remove all non-numeric characters (REG_REPLACE handles regular expressions; REPLACESTR only replaces literal strings)
v_clean_phone = REG_REPLACE(PHONE_NUMBER, '[^0-9]', '')

-- Format based on length
v_formatted_phone =
    IIF(LENGTH(v_clean_phone) = 10,
        '(' || SUBSTR(v_clean_phone, 1, 3) || ') ' ||
        SUBSTR(v_clean_phone, 4, 3) || '-' ||
        SUBSTR(v_clean_phone, 7, 4),
        'INVALID')

49. Workflow Design Patterns and Best Practices

Answer: Common Workflow Patterns:

1. Master-Detail Loading Pattern:

Workflow: WF_CUSTOMER_ORDER_LOAD
├── Session: S_LOAD_CUSTOMERS (Master)
├── Decision: Check customer load success 
├── Session: S_LOAD_ORDERS (Detail)
└── Session: S_UPDATE_CONTROL_TABLE

Dependencies:
- Orders load only if customers succeed
- Foreign key integrity maintained
- Error handling at each level

2. Incremental with Fallback Pattern:

Workflow: WF_INCREMENTAL_LOAD
├── Decision: D_CHECK_INCREMENTAL_FLAG
├── Branch 1: Incremental Load
   ├── Session: S_INCREMENTAL_EXTRACT
   └── Session: S_INCREMENTAL_TRANSFORM
├── Branch 2: Full Load (if incremental fails)
   ├── Session: S_FULL_EXTRACT 
   └── Session: S_FULL_TRANSFORM
└── Session: S_UPDATE_STATISTICS

3. Parallel Processing Pattern:

Workflow: WF_PARALLEL_REGIONS
├── Start
├── Concurrent Sessions:
   ├── S_LOAD_REGION_NORTH
   ├── S_LOAD_REGION_SOUTH 
   ├── S_LOAD_REGION_EAST
   └── S_LOAD_REGION_WEST
├── Wait (for all sessions to complete)
└── S_CONSOLIDATE_RESULTS

Benefits:
- Reduced total processing time
- Better resource utilization
- Independent error handling per region

4. Data Validation Framework:

Workflow: WF_DATA_VALIDATION
├── Session: S_SOURCE_PROFILING
├── Decision: D_DATA_QUALITY_CHECK
├── Branch 1: Quality Passed
   └── Session: S_MAIN_PROCESSING
├── Branch 2: Quality Failed
   ├── Session: S_ERROR_LOGGING
   └── Session: S_SEND_ALERT
└── Session: S_FINAL_REPORTING

50. Advanced PowerExchange and Real-time Processing

Answer: PowerExchange Capabilities:

1. Real-time CDC (Change Data Capture):

Configuration:
Source: Oracle Database with PowerExchange Agent
Capture: Log-based CDC using archive logs
Target: Real-time warehouse updates

PowerExchange Settings:
- Capture Mode: Continuous
- Log Mining: Automatic
- Checkpoint Frequency: Every 100 changes
- Recovery: Automatic restart from checkpoint

2. Mainframe Data Integration:

Scenario: Extract VSAM files from mainframe

PowerExchange Configuration:
- Platform: z/OS mainframe
- Data Format: COBOL copybook
- Transfer Method: TCP/IP or FTP
- Scheduling: Event-driven or time-based

Mapping Components:
Source: PowerExchange Source
Transformation: COBOL data type conversions
Target: Relational database

3. SAP Integration:

PowerExchange for SAP:
- Source: SAP R/3 tables and views
- Extraction: Real-time or batch
- Delta Processing: Built-in change capture
- Authentication: SAP user credentials

Implementation:
1. Install PowerExchange Agent on SAP system
2. Configure RFC connections
3. Create source definitions for SAP objects
4. Design mappings with SAP-specific transformations

4. Message Queue Integration:

Scenario: Process messages from IBM MQ

PowerExchange Configuration:
- Queue Manager: Connect to MQ queue manager
- Message Format: XML, JSON, or fixed-width
- Processing Mode: Continuous monitoring
- Error Handling: Dead letter queue for failures

Real-time Processing:
Source: PowerExchange Message Queue
Transform: Parse message content
Target: Database table or file
Monitoring: Queue depth and processing latency

Benefits of PowerExchange:

           Native connectivity to various platforms

           Real-time processing capabilities

           Automatic metadata import

           Built-in data type conversions

           High performance for large volumes

           Minimal impact on source systems

