Informatica - Comprehensive Q&A
Transformations & Core Concepts
1. What are the transformations you have worked on?
Answer: Active Transformations (change the number of rows):
• Source Qualifier: Data extraction and basic filtering
• Filter: Remove unwanted rows based on conditions
• Router: Route data to multiple targets based on conditions
• Aggregator: Perform calculations (SUM, COUNT, AVG, etc.)
• Rank: Rank data and return top/bottom records
• Sorter: Sort data before processing (active because the distinct option can drop rows)
• Joiner: Join data from multiple sources
• Union: Combine data from multiple pipelines
• Normalizer: Convert denormalized data into normalized rows
• SQL: Execute SQL queries within a mapping
• Update Strategy: Determine insert/update/delete operations (active because DD_REJECT can drop rows)
Passive Transformations (same number of rows):
• Expression: Perform calculations and data manipulation
• Lookup: Reference data from tables/files (passive by default)
• Sequence Generator: Generate unique sequence numbers
• Stored Procedure: Call database stored procedures
• External Procedure: Call external programs
2. Difference between Transaction Control and Hierarchy Parsing
Answer:
Transaction Control | Hierarchy Parsing
Purpose: Controls commit/rollback of transactions | Purpose: Parse hierarchical data structures
Usage: Batch processing, data integrity | Usage: XML, JSON, nested data processing
Commands: TC_COMMIT_BEFORE, TC_COMMIT_AFTER, TC_ROLLBACK_BEFORE, TC_ROLLBACK_AFTER | Functions: XML parsing, path expressions
Scope: Database transaction management | Scope: Data structure transformation
Performance: Controls transaction size for memory | Performance: Handles complex nested data
Transaction Control Example (commit every 1,000 rows, where v_row_count is a counter variable port):
IIF(MOD(v_row_count, 1000) = 0, TC_COMMIT_AFTER, TC_CONTINUE_TRANSACTION)
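For the Hierarchy Parsing side, a hedged SQL-side illustration of flattening hierarchical (XML) data outside Informatica, assuming a hypothetical orders_xml table with an XMLTYPE column xml_doc; inside Informatica the Hierarchy Parser / XML Parser transformation plays this role.
-- Flatten one row per /orders/order element (illustrative names)
SELECT x.order_id, x.amount
FROM orders_xml t,
     XMLTABLE('/orders/order'
              PASSING t.xml_doc
              COLUMNS order_id NUMBER PATH '@id',
                      amount   NUMBER PATH 'amount') x;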
3. SCD Type 2 Implementation Scenarios
Scenario: Customer address change tracking
Implementation Steps:
1. Look up existing records using the business key
2. Compare current vs. incoming data
3. Insert a new record for changes
4. Update the old record with an end date and current flag
Mapping Components:
• Source: Customer updates
• Lookup: Existing customer dimension
• Expression: Compare and flag changes
• Router: Separate inserts vs. updates
• Update Strategy: Set operation flags
• Target: Customer dimension table
Expression Transformation Logic:
-- Change detection
v_changed = IIF(
    CUSTOMER_NAME != lkp_CUSTOMER_NAME OR
    ADDRESS != lkp_ADDRESS OR
    PHONE != lkp_PHONE, 1, 0)
-- Set operation type
IIF(ISNULL(lkp_CUSTOMER_KEY), DD_INSERT,
    IIF(v_changed = 1, DD_INSERT, DD_REJECT))
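For reference, a minimal sketch of the kind of target dimension this mapping loads, with hypothetical column names consistent with the flags and dates used in this answer:
-- Illustrative SCD Type 2 customer dimension
CREATE TABLE customer_dim (
    customer_key   NUMBER PRIMARY KEY,              -- surrogate key from the Sequence Generator
    customer_id    NUMBER NOT NULL,                 -- business key from the source
    customer_name  VARCHAR2(100),
    address        VARCHAR2(200),
    phone          VARCHAR2(20),
    start_date     DATE NOT NULL,                   -- when this version became effective
    end_date       DATE DEFAULT DATE '9999-12-31',  -- closed when a new version arrives
    current_flag   CHAR(1) DEFAULT 'Y'              -- 'Y' marks the active version
);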
4. List of transformations where you have used SCD Type 2
Answer:
1. Lookup Transformation: To find existing records
2. Expression Transformation: Date calculations and flagging
3. Router Transformation: Separate new vs. changed records
4. Update Strategy: Set DD_INSERT, DD_UPDATE flags
5. Sequence Generator: Generate surrogate keys
6. Union Transformation: Combine insert and update flows
Advanced SCD Type 2 with Multiple Mappings:
• Mapping 1: Handle inserts for new records
• Mapping 2: Handle updates for existing records
• Mapping 3: Close expired records
5. Target table does not contain primary keys? How will you handle?
Answer: Challenges:
• Duplicate record insertion
• No unique identification
• Data integrity issues
Solutions:
1. Create a composite unique index on business key columns
2. Use UPSERT operations in session properties
3. Target Update Override option:
-- Custom update logic with WHERE clause
UPDATE target_table
SET column1 = :TU.column1,
    column2 = :TU.column2,
    last_updated = SYSDATE
WHERE business_key = :TU.business_key
  AND record_status = 'ACTIVE'
  AND NOT EXISTS (
      SELECT 1 FROM target_table t2
      WHERE t2.business_key = :TU.business_key
        AND t2.load_date > :TU.load_date
  )
4. Implement custom duplicate checking:
-- Pre-session SQL
DELETE FROM target_table a
WHERE ROWID > (
    SELECT MIN(ROWID)
    FROM target_table b
    WHERE a.business_key = b.business_key
)
5. Use a staging table with a primary key, then load to the final target
6. Session-level configuration:
Target Properties:
- Treat source rows as: Insert
- Insert Else Update: Yes
- Update as Insert: No
6. What is the use of Hash Key?
Answer: Hash Key Benefits:
• Fast comparison of large composite keys
• Reduced memory usage in lookups
• Improved performance for SCD change detection
• Efficient change detection
Implementation:
-- Create a hash of multiple columns
MD5(CUSTOMER_ID || '|' || CUSTOMER_NAME || '|' || ADDRESS || '|' || PHONE)
-- Use in SCD Type 2
v_source_hash = MD5(concatenated_source_fields)
v_target_hash = lkp_hash_key
IIF(v_source_hash != v_target_hash, 'CHANGED', 'UNCHANGED')
Hash Functions Available:
• MD5() - 128-bit hash
• CRC32() - 32-bit checksum
• SHA-based functions (e.g., SHA-256) - availability depends on the Informatica version/platform
7. Can MD5 function be used in SCD Type 2?
Answer: Yes, it is highly recommended for SCD Type 2.
Advantages:
• Single comparison instead of multiple column comparisons
• Performance improvement, especially with wide tables
• Memory efficient for caching
• Consistent results across different data types
Implementation Pattern:
-- Source hash
v_src_hash = MD5(CUSTOMER_NAME||'~'||ADDRESS||'~'||PHONE||'~'||EMAIL)
-- Lookup target hash
v_tgt_hash = lkp_hash_key
-- Change detection
v_is_changed = IIF(v_src_hash != v_tgt_hash, 1, 0)
Best Practices:
• Use a consistent delimiter (avoid data conflicts)
• Handle NULLs appropriately (see the sketch below)
• Include all tracked columns in the hash
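A minimal SQL-side sketch of the NULL-handling point, assuming an Oracle 12c+ database where the hash can alternatively be computed with STANDARD_HASH; the NVL wrapping is the part that carries over to the Informatica expression (where IIF/ISNULL would play the same role):
-- Wrap each tracked column so NULLs hash consistently instead of breaking the concatenation
SELECT customer_id,
       STANDARD_HASH(
           NVL(customer_name, '~') || '|' ||
           NVL(address, '~')       || '|' ||
           NVL(phone, '~')         || '|' ||
           NVL(email, '~'), 'MD5') AS row_hash
FROM customer_staging;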
8. Without Update Strategy, how will you implement?
Answer: Alternative Approaches:
1. Using Target/Session Properties:
• Set the target to "Update as Insert"
• Configure "Insert Else Update" mode
• Define key columns for update detection
2. Pre/Post SQL Commands:
-- Pre-session SQL: Delete existing records
DELETE FROM target_table
WHERE business_key IN (
    SELECT DISTINCT business_key FROM staging_table
)
-- Then insert all records as new
3. Separate Mappings:
• Insert Mapping: Filter for new records
• Update Mapping: Filter for existing records with changes
4. Database Merge Statements:
-- Post-session SQL
MERGE INTO target_table t
USING staging_table s ON (t.key = s.key)
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
9. Can I do Update as Insert in SCD?
Answer: Yes, this is actually the standard approach for SCD Type 2.
Concept: Instead of updating existing records, insert new versions.
Configuration:
1. Target Table Properties: Set "Insert Else Update" to "Insert"
2. Update Strategy: Use DD_INSERT for all changed records
3. Separate Process: Handle record closure separately
Benefits:
• Preserves history automatically
• Simpler logic in transformations
• Better performance (inserts vs. updates)
Complete SCD Type 2 Process:
Step 1: Insert new versions (Update Strategy = DD_INSERT)
Step 2: Close old versions (separate mapping with DD_UPDATE)
10. What is Target Update Override option?
Answer: Target Update Override allows custom SQL for update operations.
Purpose:
• Custom update logic beyond standard column mapping
• Complex join conditions for updates
• Performance optimization with hints
• Conditional updates
Example:
UPDATE target_table
SET column1 = :TU.column1,
    column2 = :TU.column2,
    last_updated = SYSDATE
WHERE business_key = :TU.business_key
  AND status = 'ACTIVE'
  AND last_updated < :TU.source_timestamp
When to Use:
• Complex update conditions
• Performance optimization
• Custom business logic
• Bulk update operations
Lookup Transformations & Caching
11. Explain Lookup Cache details
Answer: Types of Lookup Cache:
1. Static Cache (Default):
• Built once at session start
• No updates during the session
• Best for reference data
2. Dynamic Cache:
• Updates during session execution
• Handles data changes in real time
• Higher memory usage
3. Persistent Cache:
• Saved to disk after the session
• Reused across sessions
• Faster subsequent runs
4. Shared Cache:
• Shared across multiple mappings
• Memory optimization
• Consistent data view
Cache Configuration:
Cache Size: Based on available memory
Cache File Directory: High-performance disk
Index Cache Size: Roughly 1/8 of the data cache
12. Scenario where you have used Dynamic Cache
Answer: Scenario: Real-time customer data loading where the same customer can appear multiple times in the source.
Problem: A static cache won't reflect intra-session changes.
Solution - Dynamic Cache:
1. Enable Dynamic Lookup Cache in the lookup properties
2. Configure the NewLookupRow port
3. Handle cache updates during the session
Implementation:
-- In an Expression after the Lookup: assign the surrogate key
v_customer_key = IIF(ISNULL(lkp_customer_key), NEXTVAL, lkp_customer_key)
-- NewLookupRow is an output port of the dynamic lookup:
-- 0 = cache unchanged, 1 = row inserted into cache, 2 = row updated in cache
-- Route rows with NewLookupRow = 1 to the insert flow
Benefits:
• Handles duplicate processing
• Maintains data consistency
• Eliminates lookup failures
13. Will Dynamic Cache handle duplicate records? Have you implemented?
Answer: Yes, Dynamic Cache specifically handles duplicates.
How it works:
1. First occurrence: Cache miss, the row is inserted into the cache
2. Subsequent occurrences: Cache hit, the existing key is returned
3. Automatic updates: The cache reflects new data as the session runs
Implementation Example:
Lookup Transformation:
- Dynamic Lookup Cache = True
- Insert Else Update = True
- Cache File Name Prefix = 'CUSTOMER_CACHE'
Associated Expression:
v_final_key = IIF(ISNULL(lkp_customer_key), NEXTVAL, lkp_customer_key)
(The NewLookupRow output port indicates whether each row inserted into, updated, or left the cache unchanged.)
Real Project Scenario: Processing daily customer files where the same customer appears multiple times with address updates - the dynamic cache ensures the latest address is captured for all occurrences.
14. Connected vs Unconnected Lookup - Which is faster?
Answer: Connected Lookup is generally faster.
Connected Lookup Advantages:
• Single cache build for the entire mapping
• Pipeline processing with the data flow
• Better memory utilization
• Participates in pushdown optimization
Unconnected Lookup Advantages:
• Called only when needed (conditional)
• Multiple calls possible in the same transformation
• Flexible parameter passing
Performance Comparison:
Connected: O(n) - linear processing
Unconnected: O(n*m) - depends on call frequency
Best Practices:
• Use Connected for mandatory lookups
• Use Unconnected for conditional/complex lookups
• Consider the cache type based on data volume
15. Delete records in Informatica - flat file case
Answer: Challenge: Flat files don't support DELETE operations.
Solutions:
1. Full Refresh Approach:
Step 1: Back up the existing file
Step 2: Create a new file without the deleted records
Step 3: Replace the original file
2. Staging Table Approach:
Step 1: Load the flat file to a staging table
Step 2: Perform DELETE operations on staging
Step 3: Export the cleaned data back to a flat file
3. Filter Transformation:
-- Mark records for deletion
v_delete_flag = IIF(delete_condition, 'D', 'K')
-- Filter out deleted records
Filter Condition: v_delete_flag != 'D'
4. Router Transformation:
Group 1 (Keep): delete_flag != 'Y'
Group 2 (Delete): delete_flag = 'Y' (route to a dummy target)
Performance & Optimization
16. Aggregate function without GROUP BY
Answer: Without GROUP BY, the Aggregator works on the entire dataset.
Example:
Source Data:
DEPT  SALARY
HR    50000
IT    60000
HR    55000
IT    65000
Aggregator (without GROUP BY):
COUNT(*)  SUM(SALARY)  AVG(SALARY)
4         230000       57500
Use Cases:
• Overall statistics calculation
• Data profiling activities
• Validation totals for reconciliation
Implementation:
Aggregator Transformation:
- No GROUP BY ports
- Only aggregate expressions:
  - TOTAL_RECORDS = COUNT(*)
  - TOTAL_AMOUNT = SUM(AMOUNT)
  - AVG_AMOUNT = AVG(AMOUNT)
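For comparison, the same behavior in plain SQL, as a sketch against a hypothetical employees table matching the DEPT/SALARY example above: an aggregate query with no GROUP BY collapses the whole table into a single row.
-- One summary row for the entire table, mirroring the Aggregator example
SELECT COUNT(*)    AS total_records,
       SUM(salary) AS total_salary,
       AVG(salary) AS avg_salary
FROM employees;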
17. Partitioning Techniques Used
Answer: Partition Types:
1. Round-Robin Partitioning:
• Equal distribution across partitions
• Best for: Uniform data processing
• Use when: No natural partition key
2. Hash Partitioning:
• Based on a hash function of key columns
• Best for: Related data grouping
• Use when: You need data locality
3. Key Range Partitioning:
• Based on key value ranges
• Best for: Ordered data processing
• Use when: Sequential processing is needed
4. Pass-Through Partitioning:
• Maintains source partitioning
• Best for: Parallel source processing
• Use when: The source is already partitioned
Implementation Example:
Session Properties:
Partition Type: Hash
Partition Key: CUSTOMER_ID
Number of Partitions: 4
18. Key Range Partition Example
Answer: Scenario: Process sales data by date ranges.
Configuration:
Partition 1: DATE_RANGE = '2024-01-01' TO '2024-03-31' (Q1)
Partition 2: DATE_RANGE = '2024-04-01' TO '2024-06-30' (Q2)
Partition 3: DATE_RANGE = '2024-07-01' TO '2024-09-30' (Q3)
Partition 4: DATE_RANGE = '2024-10-01' TO '2024-12-31' (Q4)
Benefits:
• Parallel processing by quarters
• Reduced data volume per partition
• Better performance for date-based operations
Session Properties:
Partition Type: Key Range
Partition Key: SALE_DATE
Pass Through: No
19. Load only first and last record
Answer: Multiple Approaches:
Method 1 - Using Rank Transformation:
Rank Transformation:
- Group By: CUSTOMER_ID (if per customer)
- Order By: ORDER_DATE
- Top/Bottom: Top 1 and Bottom 1
Router after Rank:
Group 1: RANK = 1 (First Record)
Group 2: RANK = 1 (Last Record) - separate Rank transformation with DESC order
Method 2 - Using Aggregator + Joiner:
Aggregator:
- Group By: CUSTOMER_ID
- MIN(ORDER_DATE) as FIRST_DATE
- MAX(ORDER_DATE) as LAST_DATE
Joiner: Source LEFT JOIN Aggregator on CUSTOMER_ID
Filter: ORDER_DATE = FIRST_DATE OR ORDER_DATE = LAST_DATE
Method 3 - SQL Override:
SELECT * FROM (
    SELECT o.*,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) as rn_first,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn_last
    FROM orders o
) WHERE rn_first = 1 OR rn_last = 1
20. Bulk Load with Constraints and Indexes
Answer: Challenges:
• Constraint validation slows inserts
• Index maintenance overhead
• Lock contention issues
Solutions:
1. Disable Constraints/Indexes:
-- Pre-session SQL
ALTER TABLE target_table DISABLE CONSTRAINT constraint_name;
DROP INDEX index_name;
-- Post-session SQL
ALTER TABLE target_table ENABLE CONSTRAINT constraint_name;
CREATE INDEX index_name ON target_table(columns);
2. Bulk Loading Session Properties:
Target Properties:
- Truncate Target Table = Yes
- Bulk Load = Yes
- Enable Constraint Based Loading = No
Database Properties:
- Commit Interval = 10000
- Enable High Precision = No
3. Staging Approach:
Step 1: Bulk load to a staging table (no constraints)
Step 2: Validate and cleanse the data
Step 3: Insert into the final table with constraints
4. Parallel Loading:
- Partition data by a natural key
- Use multiple sessions
- Load different partitions simultaneously
21. Normalizer Transformation Usage
Answer: Purpose: Convert a single row into multiple rows.
Common Scenarios:
1. Column to Row Conversion:
Source (Single Row):
EMP_ID | JAN_SALES | FEB_SALES | MAR_SALES
101    | 10000     | 15000     | 12000
After Normalizer:
EMP_ID | MONTH | SALES
101    | JAN   | 10000
101    | FEB   | 15000
101    | MAR   | 12000
2. Multi-valued Field Processing:
Source:
CUSTOMER_ID | PHONE_NUMBERS
101         | 9876543210,9987654321,9123456789
After Normalizer:
CUSTOMER_ID | PHONE_NUMBER
101         | 9876543210
101         | 9987654321
101         | 9123456789
Configuration:
Normalizer Transformation:
- Number of Occurrences: 3 (for 3 months)
- Generated Column ID port (GCID_SALES): gives the occurrence number of each output row
- Generated Key port (GK): provides a sequence value per source row
22. Target Load Plan
Answer: In PowerCenter, the Target Load Plan (Mappings > Target Load Plan) sets the order in which the target load order groups of a mapping are loaded; more broadly, a load plan defines the loading strategy:
Components:
1. Load Type: Insert, Update, Delete, Upsert
2. Integration Service: Which service executes
3. Source Connection: Database connection info
4. Target Connection: Target database info
5. Commit Settings: Transaction control
Load Strategies:
1. Normal Load Plan:
- Session-based execution
- Standard commit intervals
- Error handling through session logs
2. Bulk Load Plan:
- Direct path loading
- Minimal logging
- Faster performance
3. Incremental Load Plan:
- Delta identification
- Change data capture
- Merge/Upsert operations
Configuration Example:
<TargetLoadPlan>
    <LoadType>Insert</LoadType>
    <BulkLoad>true</BulkLoad>
    <CommitInterval>10000</CommitInterval>
    <ErrorThreshold>100</ErrorThreshold>
</TargetLoadPlan>
Advanced Scenarios
23. Load distinct data to Table1 and duplicates to Table2
Answer: Implementation using Router:
Step 1 - Identify Duplicates:
Aggregator Transformation:
- Group By: Business_Key_Columns
- COUNT(*) as RECORD_COUNT
Joiner: Source INNER JOIN Aggregator ON Business_Key
Step 2 - Route Data:
Router Transformation:
Group 1 (Distinct): RECORD_COUNT = 1 → Table1
Group 2 (Duplicates): RECORD_COUNT > 1 → Table2
Alternative Method - Using Rank:
Rank Transformation:
- Group By: Business_Key
- Order By: Load_Date DESC
- Rank Output: RANK_VALUE
Router:
Group 1: RANK_VALUE = 1 → Table1 (Latest record)
Group 2: RANK_VALUE > 1 → Table2 (Older duplicates)
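If the split is done in the database instead (for example in a Source Qualifier override), a hedged sketch using an analytic count, assuming a hypothetical business_key column:
-- Each row carries the number of rows sharing its business key;
-- dup_count = 1 rows route to Table1, dup_count > 1 rows route to Table2
SELECT s.*,
       COUNT(*) OVER (PARTITION BY business_key) AS dup_count
FROM source_table s;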
24. Sequence Generator for 100 rows with Row ID
Answer: Standard Approach:
Sequence Generator Properties:
- Start Value: 1
- Increment By: 1
- End Value: 100
- Cycle: No
- Reset: No
Connect NEXTVAL to the target ROW_ID column.
Alternative - Expression Transformation:
Variable Port: v_row_count = v_row_count + 1 (numeric variable ports initialize to 0)
Output Port: ROW_ID = v_row_count
Advanced - Custom Sequence:
Expression with Custom Logic:
v_sequence = IIF(ISNULL(v_prev_sequence), 1, v_prev_sequence + 1)
ROW_ID = 'ROW_' || LPAD(TO_CHAR(v_sequence), 3, '0')
-- Output: ROW_001, ROW_002, etc.
25. 100 records with 10 cities - City-wise separate data
Answer: Method 1 - Router Transformation:
Router Groups:
Group 1: CITY = 'Mumbai' → Mumbai_Target
Group 2: CITY = 'Delhi' → Delhi_Target
Group 3: CITY = 'Bangalore' → Bangalore_Target
...
Default Group: Other cities → Other_Cities_Target
Method 2 - Dynamic File Creation:
Expression Transformation:
v_filename = 'City_' || CITY || '_' || TO_CHAR(SYSDATE, 'YYYYMMDD') || '.csv'
File Target Properties:
- Output File Name: $PMTargetFileDir/$v_filename
- Create target file at runtime
Method 3 - Using Normalizer:
If cities are in columns:
Source: RECORD_ID, MUMBAI_DATA, DELHI_DATA, BLR_DATA
Normalizer converts to: RECORD_ID, CITY, DATA
Then route by CITY
26. Shell Script to Transfer Data Between Tables
Answer: Without Informatica - Direct Database Methods:
Method 1 - Using sqlplus (Oracle):
#!/bin/bash
# Set environment variables
export ORACLE_HOME=/path/to/oracle
export PATH=$ORACLE_HOME/bin:$PATH
# Database connection details
DB_USER="username"
DB_PASS="password"
DB_SID="database_sid"
# Execute data transfer
sqlplus -s $DB_USER/$DB_PASS@$DB_SID << EOF
INSERT INTO target_table
SELECT * FROM source_table
WHERE last_updated >= TRUNC(SYSDATE);
COMMIT;
EXIT;
EOF
echo "Data transfer completed at $(date)"
Method 2 - Using Data Pump:
#!/bin/bash
# Export data
expdp username/password@database \
  tables=source_table \
  query=source_table:\"WHERE last_updated >= TRUNC\(SYSDATE\)\" \
  dumpfile=daily_export.dmp \
  directory=DATA_PUMP_DIR
# Import to target (remap so the data lands in target_table)
impdp username/password@target_database \
  remap_table=source_table:target_table \
  dumpfile=daily_export.dmp \
  directory=DATA_PUMP_DIR \
  table_exists_action=truncate
Method 3 - Using Database Link:
-- Create database link
CREATE DATABASE LINK target_db_link
  CONNECT TO username IDENTIFIED BY password
  USING 'target_database_tns';
-- Transfer data
INSERT INTO target_table@target_db_link
SELECT * FROM source_table
WHERE condition;
Error Handling & Debugging
27. File deleted while Informatica job is running
Answer: What Happens:
• Session fails with a "file not found" error
• Partial data processing may occur
• Transaction rollback, if configured
Error Messages:
[ERROR] Source file not found: /path/to/file.csv
[ERROR] Failed to read from source
[ERROR] Session terminated due to source error
Prevention Strategies:
1. File Locking:
# In the source system
exec 200>/path/to/file.csv.lock
flock -n 200 || exit 1
# Process file
rm /path/to/file.csv.lock
2. File Validation:
# Pre-session validation script
if [ ! -f "$SOURCE_FILE" ]; then
    echo "Source file not found: $SOURCE_FILE"
    exit 1
fi
3. Staging Approach:
1. Copy the file to a staging directory
2. Process from staging
3. Archive after processing
4. Session Configuration:
Session Properties:
- Stop on Error: No
- Error Threshold: 0 (to capture all errors)
- Recovery Strategy: Restart from beginning
Recovery Process:
1. Check session logs for the exact failure point
2. Verify file availability
3. Restart the session or mapping
4. Implement monitoring alerts
28. Primary Key vs Unique Key
Answer:
Primary Key | Unique Key
Null Values: Not allowed | Null Values: NULLs allowed (one in SQL Server; multiple in Oracle)
Count: Only one per table | Count: Multiple allowed
Index: Clustered index by default (SQL Server) | Index: Non-clustered index by default
Purpose: Uniquely identify rows | Purpose: Ensure uniqueness of a column or column set
Modification: Should not be changed | Modification: Can be altered
Example:
CREATE TABLE employees (
    emp_id NUMBER PRIMARY KEY,           -- Primary Key
    email VARCHAR2(100) UNIQUE,          -- Unique Key
    phone VARCHAR2(15) UNIQUE,           -- Another Unique Key
    ssn VARCHAR2(11) UNIQUE NOT NULL     -- Unique Key with NOT NULL
);
29. Can I make Unique Key without NULL as constraint instead of Primary Key?
Answer: Yes, you can use UNIQUE NOT NULL as an alternative to PRIMARY KEY:
-- Instead of PRIMARY KEY
CREATE TABLE customers (
    customer_id NUMBER UNIQUE NOT NULL,   -- Acts like a Primary Key
    customer_name VARCHAR2(100),
    email VARCHAR2(100) UNIQUE
);
-- Or add the constraints later
ALTER TABLE customers ADD CONSTRAINT uk_customer_id UNIQUE (customer_id);
ALTER TABLE customers MODIFY customer_id NOT NULL;
Differences:
• Functionally equivalent for uniqueness and non-null
• No automatic clustering (depends on the database)
• Multiple UNIQUE NOT NULL constraints are possible
• Referential integrity - it can still be referenced by foreign keys (see the sketch below)
Use Cases:
• When business rules require multiple unique identifiers
• Legacy systems without formal primary keys
• Composite unique constraints
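A minimal sketch of the referential-integrity point: a foreign key can reference a UNIQUE constraint, not only a primary key (table and column names are illustrative and follow the customers example above).
-- customers.customer_id is UNIQUE NOT NULL; no PRIMARY KEY was declared
CREATE TABLE orders (
    order_id    NUMBER PRIMARY KEY,
    customer_id NUMBER,
    CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id)
        REFERENCES customers (customer_id)   -- references the unique key
);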
30. What is the use of Primary Key?
Answer: Primary Functions:
1. Unique Identification: Ensures each row is uniquely identifiable
2. Referential Integrity: Serves as the target for foreign key relationships
3. Performance Optimization: Automatic index creation for fast access
4. Data Consistency: Prevents duplicate data entry
5. Replication: Used in database replication for change tracking
Database-Specific Benefits:
Oracle:
-- Enables efficient ROWID access
-- Supports partition pruning
-- Required for certain replication features
SQL Server:
-- Creates a clustered index by default
-- Enables efficient page splits
-- Required for merge replication
In ETL Context:
• SCD Type 2: Surrogate keys for historical tracking
• Incremental Loading: Identify existing records for updates
• Data Quality: Ensure data uniqueness across loads
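A quick illustration of the "prevents duplicate data entry" point, with hypothetical names; the second insert is rejected by the primary key constraint.
CREATE TABLE dim_customer (
    customer_key NUMBER PRIMARY KEY,
    customer_id  NUMBER NOT NULL
);
INSERT INTO dim_customer VALUES (1, 101);   -- succeeds
INSERT INTO dim_customer VALUES (1, 102);   -- fails: ORA-00001 unique constraint violated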
31. To connect Table A and B, what constraints are needed?
Answer: Primary Constraint - Foreign Key Relationship:
-- Table A (Parent)
CREATE TABLE departments (
    dept_id NUMBER PRIMARY KEY,      -- Primary Key
    dept_name VARCHAR2(100)
);
-- Table B (Child)
CREATE TABLE employees (
    emp_id NUMBER PRIMARY KEY,       -- Primary Key
    emp_name VARCHAR2(100),
    dept_id NUMBER,                  -- Foreign Key
    CONSTRAINT fk_emp_dept
        FOREIGN KEY (dept_id)
        REFERENCES departments(dept_id)
);
Additional Constraints for Data Integrity:
1. Check Constraints:
ALTER TABLE employees
ADD CONSTRAINT chk_emp_salary CHECK (salary > 0);
2. Not Null Constraints:
ALTER TABLE employees MODIFY emp_name NOT NULL;
3. Unique Constraints:
ALTER TABLE employees
ADD CONSTRAINT uk_emp_email UNIQUE (email);
In Informatica Context:
• Lookup Transformations: Use join keys
• Joiner Transformations: Define master-detail relationships
• Referential Integrity: Ensure valid relationships during ETL
32. Can you implement SCD Type 2 on target flat file?
Answer: Challenges with Flat Files:
• No built-in SCD support
• No update capability
• Sequential processing only
Workaround Solutions:
Method 1 - Multiple File Approach:
Current_Records.csv  - Active records only
History_Records.csv  - All historical versions
Archive_YYYYMMDD.csv - Daily snapshots
Method 2 - Single File with Flags:
CustKey,CustID,Name,Address,StartDate,EndDate,CurrentFlag,Version
1,101,John,Bangalore,2023-01-01,2024-01-15,N,1
2,101,John,Hyderabad,2024-01-16,9999-12-31,Y,2
3,102,Jane,Mumbai,2023-06-01,9999-12-31,Y,1
Implementation Process:
1. Read the existing file into a staging table
2. Apply SCD Type 2 logic in the database
3. Export the complete dataset back to the file
4. Replace the original file
Alternative - Database Approach:
1. Load the flat file to a temporary table
2. Implement SCD in the database using SQL
3. Export the final result if a flat file is needed
33. Update data into Excel and combine with inserted data in SCD2 - Complications?
Answer: Major Complications:
1. Excel Limitations:
   - Row limit: 1,048,576 rows
   - No transaction support
   - No concurrent access
   - No indexing capabilities
2. SCD Type 2 Challenges:
   - Cannot update existing rows in place
   - Must rewrite the entire file for changes
   - Version tracking becomes complex
   - Date calculations are limited
3. Performance Issues:
   - Large file handling problems
   - Memory consumption for big datasets
   - Slow processing for complex operations
Solutions:
Approach 1 - Database Staging:
1. Import Excel to a staging table
2. Perform SCD operations in the database
3. Export results back to Excel
4. Handle file versioning externally
Approach 2 - Separate Worksheets:
Sheet1: Current_Data (active records)
Sheet2: Historical_Data (all versions)
Sheet3: Change_Log (audit trail)
Implementation:
Expression Transformation:
v_excel_row = IIF(SCD_ACTION = 'INSERT',
    CUSTOMER_KEY||','||CUSTOMER_NAME||','||ADDRESS||','||START_DATE||',9999-12-31,Y',
    CUSTOMER_KEY||','||CUSTOMER_NAME||','||ADDRESS||','||START_DATE||','||END_DATE||',N')
34. Format same, 4 files today, 1 file tomorrow - How to take file list in CSV?
Answer: Dynamic File Processing:
Method 1 - Parameter File:
FileDate,FileName,ProcessFlag
2024-01-15,Sales_20240115_File1.csv,Y
2024-01-15,Sales_20240115_File2.csv,Y
2024-01-15,Sales_20240115_File3.csv,Y
2024-01-15,Sales_20240115_File4.csv,Y
2024-01-16,Sales_20240116_File1.csv,Y
Method 2 - Indirect File Loading:
Session source properties:
Source filetype: Indirect
Source filename: file_list.txt (a file containing one source file name per line)
The session then reads every file named in the list, however many arrive that day.
Method 3 - Workflow Variables (pseudo-logic):
Pre-session command/assignment task:
$$FileCount = number of files matching 'Sales_' plus today's date in /data/input/
Decision Task:
IIF($$FileCount > 0, 'PROCESS', 'SKIP')
Implementation Using Unix (generates the file list CSV):
#!/bin/bash
# Generate file list CSV
INPUT_DIR="/data/incoming"
OUTPUT_CSV="/config/file_list.csv"
PROCESS_DATE=$(date +%Y%m%d)
echo "ProcessDate,FileName,FileSize,Status" > $OUTPUT_CSV
find $INPUT_DIR -name "*${PROCESS_DATE}*.csv" -type f | while read file; do
    filename=$(basename "$file")
    filesize=$(stat -c%s "$file")
    echo "${PROCESS_DATE},${filename},${filesize},PENDING" >> $OUTPUT_CSV
done
35. SCD2 Order: Insert, Update as Insert, Update as Update
Answer: Correct Processing Order:
Step 1: Process Updates (Close old records)
-- Update existing records to close them
UPDATE customer_dim
SET end_date = CURRENT_DATE - 1,
    current_flag = 'N'
WHERE customer_id IN (SELECT customer_id FROM staging_changes)
  AND current_flag = 'Y';
Step 2: Insert New Versions (Update as Insert)
-- Insert new versions for changed records.
-- NEXTVAL cannot be combined with GROUP BY, so the next version number
-- is taken from the dimension with a correlated subquery.
INSERT INTO customer_dim (
    customer_key, customer_id, customer_name, address,
    start_date, end_date, current_flag, version_num
)
SELECT
    customer_dim_seq.NEXTVAL,
    s.customer_id, s.customer_name, s.address,
    CURRENT_DATE, DATE '9999-12-31', 'Y',
    NVL((SELECT MAX(d.version_num) FROM customer_dim d
         WHERE d.customer_id = s.customer_id), 0) + 1
FROM staging_changes s;
Step 3: Insert New Records
-- Insert completely new customers
INSERT INTO customer_dim (
    customer_key, customer_id, customer_name, address,
    start_date, end_date, current_flag, version_num
)
SELECT
    customer_dim_seq.NEXTVAL,
    customer_id, customer_name, address,
    CURRENT_DATE, DATE '9999-12-31', 'Y', 1
FROM staging_new_customers;
Informatica Implementation Order:
Mapping 1: Close existing records (DD_UPDATE)
Mapping 2: Insert new versions (DD_INSERT)
Mapping 3: Insert new records (DD_INSERT)
Or a single mapping with a Router:
Group 1: Updates → Update Strategy (DD_UPDATE)
Group 2: New Versions → Update Strategy (DD_INSERT)
Group 3: New Records → Update Strategy (DD_INSERT)
36. Bulk Load with Constraints and Indexes - How to handle?
Answer: Challenges:
• Constraint validation overhead
• Index maintenance during the load
• Lock contention issues
• Performance degradation
Solutions:
1. Disable/Enable Approach:
-- Pre-session SQL
ALTER TABLE target_table DISABLE CONSTRAINT constraint_name;
ALTER INDEX index_name UNUSABLE;
-- Post-session SQL
ALTER TABLE target_table ENABLE CONSTRAINT constraint_name;
ALTER INDEX index_name REBUILD;
2. Bulk Mode Configuration:
Session Properties:
- Enable Bulk Mode: Yes
- Disable Constraints: Yes (if supported)
- Commit Interval: Large value (50,000)
- Enable High Precision: No
Target Properties:
- Truncate Target Table: Yes (if full refresh)
- Output File Type: Binary (for better performance)
3. Staging Table Approach:
-- Create a staging table without constraints
CREATE TABLE staging_target AS
SELECT * FROM target_table WHERE 1=0;
-- Bulk load to staging
-- Validate data
-- Insert into the main table with constraints
INSERT /*+ APPEND */ INTO target_table
SELECT * FROM staging_target
WHERE validation_rules_pass;
4. Partitioned Loading:
-- Load different partitions separately
-- Enables parallel constraint checking
INSERT INTO target_table PARTITION (p_2024_01)
SELECT * FROM source_data
WHERE date_column >= DATE '2024-01-01'
  AND date_column < DATE '2024-02-01';
37. Key Range Partitioning Example
Answer: Scenario: Sales data partitioned by date ranges for parallel processing.
Partition Configuration:
Partition 1: JAN-MAR 2024 (Q1) - Key Range: '2024-01-01' to '2024-03-31'
Partition 2: APR-JUN 2024 (Q2) - Key Range: '2024-04-01' to '2024-06-30'
Partition 3: JUL-SEP 2024 (Q3) - Key Range: '2024-07-01' to '2024-09-30'
Partition 4: OCT-DEC 2024 (Q4) - Key Range: '2024-10-01' to '2024-12-31'
Implementation:
Session Properties:
- Partition Type: Key Range
- Partition Key: SALE_DATE
- Number of Partitions: 4
Partition Details:
Partition 1: SALE_DATE >= '2024-01-01' AND SALE_DATE <= '2024-03-31'
Partition 2: SALE_DATE >= '2024-04-01' AND SALE_DATE <= '2024-06-30'
Partition 3: SALE_DATE >= '2024-07-01' AND SALE_DATE <= '2024-09-30'
Partition 4: SALE_DATE >= '2024-10-01' AND SALE_DATE <= '2024-12-31'
Benefits:
• Parallel processing by quarters
• Reduced memory usage per partition
• Better resource utilization
• Improved performance for date-based operations
Use Cases:
• Large historical data processing
• Date-based incremental loads
• Archival data processing
• Report generation by time periods
38. Load only first and last record scenarios
Answer: Multiple Implementation Methods:
Method 1 - Using Rank Transformation:
Source → Sorter (ORDER BY date_column) →
Rank (Group By: customer_id, Top 1) → Router →
Group 1: RANK = 1 (First Record)
Second pipeline:
Source → Sorter (ORDER BY date_column DESC) →
Rank (Group By: customer_id, Top 1) → Router →
Group 1: RANK = 1 (Last Record)
Method 2 - SQL Override in Source Qualifier:
SELECT * FROM (
    SELECT o.*,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) as rn_first,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn_last
    FROM orders o
) WHERE rn_first = 1 OR rn_last = 1
Method 3 - Using Aggregator + Joiner:
Pipeline 1: Source → Aggregator (Group By customer_id, MIN/MAX dates)
Pipeline 2: Source (all records)
Joiner: Join on customer_id, then filter (date = min_date OR date = max_date)
Oracle SQL for First Row (from your scenario):
-- Get the first row for ID = 1234
SELECT * FROM (
    SELECT h.*,
        ROW_NUMBER() OVER (PARTITION BY ID ORDER BY start_date ASC) as rn
    FROM customer_history h
    WHERE ID = 1234
) WHERE rn = 1;
-- Alternative using FIRST_VALUE
SELECT DISTINCT ID,
    FIRST_VALUE(location) OVER (PARTITION BY ID ORDER BY start_date) as first_location,
    FIRST_VALUE(end_date) OVER (PARTITION BY ID ORDER BY start_date) as first_end_date
FROM customer_history
WHERE ID = 1234;
39. Incremental Load Implementation in Informatica
Answer: Common Strategies:
1. Timestamp-Based Incremental:
Source Qualifier SQL:
SELECT * FROM source_table
WHERE last_modified_date > $$LAST_EXTRACT_DATE
   OR created_date > $$LAST_EXTRACT_DATE
Workflow/Mapping Variables (pseudo-logic):
$$LAST_EXTRACT_DATE = MAX(load_date) already loaded to the target (read from a control table)
$$CURRENT_EXTRACT_DATE = SYSDATE
2. Flag-Based Incremental:
Source Qualifier SQL:
SELECT * FROM source_table
WHERE processed_flag = 'N'
Post-session SQL:
UPDATE source_table
SET processed_flag = 'Y',
    processed_date = SYSDATE
WHERE processed_flag = 'N'
3. Sequence-Based CDC:
Source Qualifier SQL:
SELECT * FROM source_table
WHERE sequence_id > $$LAST_SEQUENCE_ID
ORDER BY sequence_id
Session Parameters (pseudo-logic):
$$LAST_SEQUENCE_ID = last processed sequence recorded in the control table
Implementation Components:
• Control Table: Store extraction metadata (see the sketch below)
• Workflow Variables: Dynamic parameter passing
• Error Handling: Recovery and restart capabilities
• Validation: Data completeness checks
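A minimal sketch of such a control table and how the watermark could be read before a run and advanced afterwards (table and column names are hypothetical):
-- Watermark table, one row per mapping/workflow
CREATE TABLE etl_control (
    mapping_name      VARCHAR2(100) PRIMARY KEY,
    last_extract_date DATE,
    last_run_status   VARCHAR2(20)
);
-- Read before the run (feeds $$LAST_EXTRACT_DATE, e.g. via a parameter file)
SELECT last_extract_date
FROM etl_control
WHERE mapping_name = 'M_CUSTOMER_INCR_LOAD';
-- Advance after a successful run (e.g. post-session SQL)
UPDATE etl_control
SET last_extract_date = SYSDATE,
    last_run_status   = 'SUCCESS'
WHERE mapping_name = 'M_CUSTOMER_INCR_LOAD';
COMMIT;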
40. Active vs Passive Transformations with Scenarios
Answer:
Active Transformations (change the number of rows):
1. Filter: Remove rows based on conditions
Scenario: Filter out cancelled orders
Condition: ORDER_STATUS != 'CANCELLED'
Input: 1000 rows → Output: 850 rows
2. Router: Route rows to multiple groups
Scenario: Route customers by region
Group 1: REGION = 'NORTH' → 300 rows
Group 2: REGION = 'SOUTH' → 400 rows
Group 3: Default → 300 rows
3. Aggregator: Group and summarize data
Scenario: Monthly sales summary
Input: 10,000 daily transactions
Output: 12 monthly summaries
4. Rank: Return top/bottom records
Scenario: Top 10 customers by sales
Input: 5,000 customers → Output: 10 customers
5. Update Strategy: Determine the operation type (active, because DD_REJECT can drop rows)
Scenario: Flag records for insert/update
DD_INSERT for new records, DD_UPDATE for changes
Input: 1000 rows → Output: 1000 rows (with flags)
Passive Transformations (same number of rows):
1. Expression: Calculate derived values
Scenario: Calculate age from birth_date
v_age = TRUNC(DATE_DIFF(SYSDATE, BIRTH_DATE, 'DD') / 365.25)
Input: 1000 rows → Output: 1000 rows
2. Lookup: Retrieve reference data
Scenario: Get customer name from customer_id
Each input row gets enriched with lookup data
Input: 1000 rows → Output: 1000 rows
41. Sequence Generator for 100 rows with Row ID
Answer: Implementation Options:
Method 1 - Standard Sequence Generator:
Sequence Generator Properties:
- Start Value: 1
- Increment By: 1
- End Value: 100
- Number of Cached Values: 1000
- Cycle: No
- Reset: No
Connection: NEXTVAL → ROW_ID (target column)
Method 2 - Expression with Variable Port:
Expression Transformation:
Variable Port: v_row_count = v_row_count + 1 (numeric variable ports start at 0)
Output Port: ROW_ID = v_row_count
Method 3 - Custom Formatting:
Expression Transformation:
v_sequence = NEXTVAL (from Sequence Generator)
ROW_ID = 'ROW_' || LPAD(TO_CHAR(v_sequence), 5, '0')
Output: ROW_00001, ROW_00002, ROW_00003, etc.
Method 4 - Date-based Sequence:
Expression Transformation:
v_date_part = TO_CHAR(SYSDATE, 'YYYYMMDD')
v_sequence = NEXTVAL
ROW_ID = v_date_part || '_' || LPAD(TO_CHAR(v_sequence), 4, '0')
Output: 20240115_0001, 20240115_0002, etc.
42. 100 records, 10 cities, city-wise separate data
Answer: Multiple Approaches:
Method 1 - Router Transformation:
Source (100 records) → Router Transformation
Router Groups:
Group 1: CITY = 'Mumbai' → Mumbai_Target (10 records)
Group 2: CITY = 'Delhi' → Delhi_Target (10 records)
Group 3: CITY = 'Bangalore' → Bangalore_Target (10 records)
Group 4: CITY = 'Chennai' → Chennai_Target (10 records)
Group 5: CITY = 'Hyderabad' → Hyderabad_Target (10 records)
Group 6: CITY = 'Kolkata' → Kolkata_Target (10 records)
Group 7: CITY = 'Pune' → Pune_Target (10 records)
Group 8: CITY = 'Ahmedabad' → Ahmedabad_Target (10 records)
Group 9: CITY = 'Jaipur' → Jaipur_Target (10 records)
Group 10: CITY = 'Lucknow' → Lucknow_Target (10 records)
Method 2 - Dynamic File Target (FileName port on the flat file target):
Expression Transformation:
v_filename = 'City_' || CITY || '_' || TO_CHAR(SYSDATE, 'YYYYMMDD') || '.csv'
File Target Properties:
- File Name: $PMTargetFileDir/$v_filename
- Create target file at runtime: Yes
Output Files:
City_Mumbai_20240115.csv
City_Delhi_20240115.csv
City_Bangalore_20240115.csv, etc.
Method 3 - Database Tables with City Suffix:
Expression Transformation:
v_table_name = 'CUSTOMER_' || UPPER(CITY)
Target Properties:
- Table Name: $v_table_name (parameterized per session run; relational table names cannot change per row, so this needs one session per city or the Router approach above)
- Create target table at runtime: Yes
Tables Created: CUSTOMER_MUMBAI, CUSTOMER_DELHI, CUSTOMER_BANGALORE, etc.
43. What happens if aggregate function has no GROUP BY?
Answer: Behavior: The Aggregator processes the entire dataset as a single group.
Example Scenario:
Source Data (Sales Table):
PRODUCT  AMOUNT  QUANTITY
Laptop   50000   1
Mobile   25000   2
Tablet   30000   1
Laptop   45000   1
Mobile   20000   1
Aggregator (No GROUP BY):
- SUM(AMOUNT) = 170000
- COUNT(*) = 5
- AVG(AMOUNT) = 34000
- MAX(AMOUNT) = 50000
- MIN(AMOUNT) = 20000
Output: A single row with the aggregate values
Use Cases:
1. Overall Statistics: Total sales, record counts
2. Data Validation: Check data completeness
3. Control Totals: For reconciliation purposes
4. Summary Reports: High-level dashboard metrics
Implementation:
Aggregator Transformation:
- No Group By ports
- Aggregate expressions only:
  - TOTAL_SALES = SUM(AMOUNT)
  - RECORD_COUNT = COUNT(*)
  - AVERAGE_SALE = AVG(AMOUNT)
  - MAX_SALE = MAX(AMOUNT)
  - MIN_SALE = MIN(AMOUNT)
Practical Application:
Source: Daily transaction file (100,000 records)
Aggregator: No GROUP BY
Output: A single summary record for the dashboard
LOAD_DATE  | TOTAL_AMOUNT | RECORD_COUNT | AVG_AMOUNT
2024-01-15 | 1,250,000    | 100,000      | 12.50
44. Questions on Partitions and Explain Plan
Answer: Partition Types in Informatica:
1. Round-Robin Partitioning:
Use Case: Equal distribution of data
Configuration: No partition key required
Example: 1000 records → 4 partitions = 250 records each
2. Hash Partitioning:
Use Case: Related data grouping
Configuration: Partition key = CUSTOMER_ID
Example: All records for the same customer land in the same partition
3. Key Range Partitioning:
Use Case: Sequential data processing
Configuration: Partition key = ORDER_DATE
Range 1: 2024-Q1, Range 2: 2024-Q2, etc.
4. Pass-Through Partitioning:
Use Case: Maintain source partitioning
Configuration: Inherit from the upstream transformation
Example: Source already partitioned by region
Database Explain Plan Analysis:
-- Oracle Explain Plan
EXPLAIN PLAN FOR
SELECT /*+ USE_HASH(a,b) */
    a.customer_id, b.customer_name, SUM(a.amount)
FROM sales a, customers b
WHERE a.customer_id = b.customer_id
GROUP BY a.customer_id, b.customer_name;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- Key metrics to analyze:
-- 1. Cost values
-- 2. Cardinality estimates
-- 3. Access methods (Index/Full Table Scan)
-- 4. Join methods (Hash/Nested Loop/Sort Merge)
-- 5. Sort operations
Informatica Session Performance Analysis:
Session Log Analysis:
- Partition-wise throughput
- Memory usage per partition
- CPU utilization
- I/O wait times
- Buffer pool statistics
45. Shell Script: Transfer data between tables without Informatica
Answer: Multiple Database Methods:
Method 1 - Direct SQL (Oracle):
#!/bin/bash
# Database connection details
DB_USER="etl_user"
DB_PASS="password"
DB_SID="PRODDB"
# Transfer data using sqlplus
sqlplus -s $DB_USER/$DB_PASS@$DB_SID << EOF
SET SERVEROUTPUT ON
-- Table-to-table transfer inside a PL/SQL block so SQL%ROWCOUNT can be reported
BEGIN
    INSERT INTO target_schema.customer_target
    SELECT customer_id, customer_name, address, phone, email, SYSDATE
    FROM source_schema.customer_source
    WHERE last_modified_date >= TRUNC(SYSDATE);
    DBMS_OUTPUT.PUT_LINE('Transferred ' || SQL%ROWCOUNT || ' records');
    COMMIT;
END;
/
EXIT;
EOF
echo "Data transfer completed at $(date)"
Method 2 - Using Database Link:
#!/bin/bash
sqlplus -s $DB_USER/$DB_PASS@$TARGET_DB << EOF
-- Create the database link (one-time setup)
CREATE DATABASE LINK source_db_link
    CONNECT TO source_user IDENTIFIED BY source_pass
    USING 'source_database_tns';
-- Transfer data via the database link
INSERT INTO local_target_table
SELECT * FROM source_table@source_db_link
WHERE process_date = TRUNC(SYSDATE);
COMMIT;
EXIT;
EOF
Method 3 - Export/Import Approach:
#!/bin/bash
# Export from source
expdp source_user/password@source_db \
    tables=source_schema.sales_data \
    query=sales_data:\"WHERE sale_date >= TRUNC\(SYSDATE\)\" \
    dumpfile=daily_sales.dmp \
    directory=DATA_PUMP_DIR \
    logfile=export_sales.log
# Import to target (remap the schema so the data lands in target_schema)
impdp target_user/password@target_db \
    remap_schema=source_schema:target_schema \
    dumpfile=daily_sales.dmp \
    directory=DATA_PUMP_DIR \
    table_exists_action=append \
    logfile=import_sales.log
echo "Export/Import completed successfully"
Method 4 - Using MERGE Statement:
-- Advanced MERGE for UPSERT operations
MERGE INTO target_table t
USING (
    SELECT customer_id, customer_name, address, phone, email
    FROM source_table
    WHERE last_modified_date >= TRUNC(SYSDATE)
) s ON (t.customer_id = s.customer_id)
WHEN MATCHED THEN
    UPDATE SET
        t.customer_name = s.customer_name,
        t.address = s.address,
        t.phone = s.phone,
        t.email = s.email,
        t.last_updated = SYSDATE
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, address, phone, email, created_date)
    VALUES (s.customer_id, s.customer_name, s.address, s.phone, s.email, SYSDATE);
46. Have you used Normalizer? Provide scenarios
Answer: Normalizer Use Cases:
Scenario 1 - Column to Row Conversion:
Source Data (Wide Format):
EMP_ID | JAN_SALES | FEB_SALES | MAR_SALES | APR_SALES
101    | 10000     | 15000     | 12000     | 18000
102    | 8000      | 9000      | 11000     | 14000
Normalizer Configuration:
- Occurs: 4 (for 4 months)
- Generated Column ID port: GCID_SALES (occurrence number of each output row)
- Generated Key port: GK_EMP
Output (Normalized):
EMP_ID | MONTH | SALES | GCID_SALES
101    | JAN   | 10000 | 1
101    | FEB   | 15000 | 2
101    | MAR   | 12000 | 3
101    | APR   | 18000 | 4
102    | JAN   | 8000  | 1
102    | FEB   | 9000  | 2
102    | MAR   | 11000 | 3
102    | APR   | 14000 | 4
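For comparison, the same column-to-row conversion can be pushed to the database with UNPIVOT, a sketch assuming an Oracle source and a hypothetical monthly_sales table holding the wide-format columns shown above:
-- Turn the four monthly columns into one row per (employee, month)
SELECT emp_id, month, sales
FROM monthly_sales
UNPIVOT (sales FOR month IN (jan_sales AS 'JAN',
                             feb_sales AS 'FEB',
                             mar_sales AS 'MAR',
                             apr_sales AS 'APR'));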
Scenario 2 - Multi-valued Field Processing:
Source Data:
CUSTOMER_ID | PHONE_NUMBERS
101         | 9876543210;9987654321;9123456789
102         | 9555666777;9444333222
Normalizer Configuration:
- Occurs: 3 (maximum phones per customer)
- The delimited string is first split into three ports (e.g., with SUBSTR/INSTR in an Expression), then normalized
Output:
CUSTOMER_ID | PHONE_NUMBER | OCCURRENCE
101         | 9876543210   | 1
101         | 9987654321   | 2
101         | 9123456789   | 3
102         | 9555666777   | 1
102         | 9444333222   | 2
102         | NULL         | 3
Scenario 3 - COBOL Copybook Processing:
Source: COBOL file with repeating groups
CUSTOMER-RECORD
    CUSTOMER-ID
    CUSTOMER-NAME
    ORDER-DETAILS OCCURS 5 TIMES
        ORDER-ID
        ORDER-DATE
        ORDER-AMOUNT
The Normalizer extracts each ORDER-DETAILS occurrence as a separate row.
Implementation Tips:
• Set the OCCURS value based on the maximum number of repetitions
• Handle NULL values for unused occurrences
• Use the GCID port for occurrence/sequence tracking
• Combine with an Expression transformation for data cleanup
47. Connected vs Unconnected Lookup - Performance comparison
Answer: Performance Analysis:
Connected Lookup:
Advantages:
✓ Single cache build for the entire mapping
✓ Pipeline processing with the data flow
✓ Better memory utilization
✓ Automatic caching optimization
✓ Supports multiple return values efficiently
Performance Characteristics:
- Cache built once at session start
- O(n) processing complexity
- Memory usage: single cache instance
- Best for: mandatory lookups, multiple return columns
Unconnected Lookup:
Advantages:
✓ Called only when needed (conditional)
✓ Can be called multiple times with different parameters
✓ Flexible parameter passing
✓ Selective cache usage
Performance Characteristics:
- Cache built on first call
- O(n*m) where m = number of calls
- Memory usage: separate cache per lookup
- Best for: optional lookups, single return values
Performance Comparison Test:
Test Scenario: 1 million source records, 100,000 lookup records
Connected Lookup:
- Cache build time: 30 seconds
- Processing time: 5 minutes
- Total time: 5.5 minutes
- Memory usage: 512 MB
Unconnected Lookup (called for every record):
- Cache build time: 30 seconds
- Processing time: 12 minutes
- Total time: 12.5 minutes
- Memory usage: 512 MB per lookup instance
Unconnected Lookup (conditional - 30% of records):
- Cache build time: 30 seconds
- Processing time: 4 minutes
- Total time: 4.5 minutes
- Memory usage: 512 MB
Recommendation:
• Use Connected for lookups needed by all or most records
• Use Unconnected for conditional or selective lookups
• Consider cache size and available memory
• Test with actual data volumes for the best decision
48. Data Quality and Validation Transformations
Answer: Data Quality Scenarios:
1. Address Standardization:
Data Quality Transformation:
Input: "123 main st apt 4b"
Rule: US Address Standardization
Output: "123 Main Street Apartment 4B"
Configuration:
- Reference Data: USPS database
- Standardization Level: Parsing + Correction
- Output Ports: Standardized address components
2. Name Matching and Deduplication:
Scenario: Customer deduplication across systems
Data Quality Rule:
- Phonetic matching (Soundex, Double Metaphone)
- Fuzzy string matching
- Composite matching score
Implementation:
Input: John Smith, Jon Smyth, J. Smith
Process: Calculate similarity scores
Output: Matched groups with confidence levels
3. Email Validation:
Expression Transformation:
v_email_valid = IIF(INSTR(EMAIL, '@') > 0 AND
    INSTR(EMAIL, '.') > INSTR(EMAIL, '@') AND
    LENGTH(EMAIL) > 5 AND
    INSTR(EMAIL, ' ') = 0, 'VALID', 'INVALID')
Router Groups:
- Valid Emails → Main processing
- Invalid Emails → Error handling
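If the validation is pushed to the database instead, a hedged SQL sketch of the same rule using Oracle REGEXP_LIKE (the pattern is a simplified illustration, not a full RFC-compliant check):
-- Flag emails that have exactly the shape something@something.something with no spaces
SELECT email,
       CASE WHEN REGEXP_LIKE(email, '^[^@ ]+@[^@ ]+\.[^@ ]+$')
            THEN 'VALID' ELSE 'INVALID' END AS email_status
FROM customers;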
4. Phone Number Standardization:
Expression Transformation:
-- Remove all non-numeric characters (REG_REPLACE, since REPLACESTR does not accept a regex)
v_clean_phone = REG_REPLACE(PHONE_NUMBER, '[^0-9]', '')
-- Format based on length
v_formatted_phone =
    IIF(LENGTH(v_clean_phone) = 10,
        '(' || SUBSTR(v_clean_phone, 1, 3) || ') ' ||
        SUBSTR(v_clean_phone, 4, 3) || '-' ||
        SUBSTR(v_clean_phone, 7, 4),
        'INVALID')
49. Workflow Design Patterns and Best Practices
Answer: Common Workflow Patterns:
1. Master-Detail Loading Pattern:
Workflow: WF_CUSTOMER_ORDER_LOAD
├── Session: S_LOAD_CUSTOMERS (Master)
├── Decision: Check customer load success
├── Session: S_LOAD_ORDERS (Detail)
└── Session: S_UPDATE_CONTROL_TABLE
Dependencies:
- Orders load only if customers succeed
- Foreign key integrity maintained
- Error handling at each level
2. Incremental with Fallback Pattern:
Workflow: WF_INCREMENTAL_LOAD
├── Decision: D_CHECK_INCREMENTAL_FLAG
├── Branch 1: Incremental Load
│   ├── Session: S_INCREMENTAL_EXTRACT
│   └── Session: S_INCREMENTAL_TRANSFORM
├── Branch 2: Full Load (if incremental fails)
│   ├── Session: S_FULL_EXTRACT
│   └── Session: S_FULL_TRANSFORM
└── Session: S_UPDATE_STATISTICS
3. Parallel Processing Pattern:
Workflow: WF_PARALLEL_REGIONS
├── Start
├── Concurrent Sessions:
│   ├── S_LOAD_REGION_NORTH
│   ├── S_LOAD_REGION_SOUTH
│   ├── S_LOAD_REGION_EAST
│   └── S_LOAD_REGION_WEST
├── Wait (for all sessions to complete)
└── S_CONSOLIDATE_RESULTS
Benefits:
- Reduced total processing time
- Better resource utilization
- Independent error handling per region
4. Data Validation Framework:
Workflow: WF_DATA_VALIDATION
├── Session: S_SOURCE_PROFILING
├── Decision: D_DATA_QUALITY_CHECK
├── Branch 1: Quality Passed
│   └── Session: S_MAIN_PROCESSING
├── Branch 2: Quality Failed
│   ├── Session: S_ERROR_LOGGING
│   └── Session: S_SEND_ALERT
└── Session: S_FINAL_REPORTING
50. Advanced PowerExchange and Real-time Processing
Answer: PowerExchange Capabilities:
1. Real-time CDC (Change Data Capture):
Configuration:
Source: Oracle database with PowerExchange Agent
Capture: Log-based CDC using archive logs
Target: Real-time warehouse updates
PowerExchange Settings:
- Capture Mode: Continuous
- Log Mining: Automatic
- Checkpoint Frequency: Every 100 changes
- Recovery: Automatic restart from checkpoint
2. Mainframe Data Integration:
Scenario: Extract VSAM files from the mainframe
PowerExchange Configuration:
- Platform: z/OS mainframe
- Data Format: COBOL copybook
- Transfer Method: TCP/IP or FTP
- Scheduling: Event-driven or time-based
Mapping Components:
Source: PowerExchange Source
Transformation: COBOL data type conversions
Target: Relational database
3. SAP Integration:
PowerExchange for SAP:
- Source: SAP R/3 tables and views
- Extraction: Real-time or batch
- Delta Processing: Built-in change capture
- Authentication: SAP user credentials
Implementation:
1. Install the PowerExchange Agent on the SAP system
2. Configure RFC connections
3. Create source definitions for SAP objects
4. Design mappings with SAP-specific transformations
4. Message Queue Integration:
Scenario: Process messages from IBM MQ
PowerExchange Configuration:
- Queue Manager: Connect to the MQ queue manager
- Message Format: XML, JSON, or fixed-width
- Processing Mode: Continuous monitoring
- Error Handling: Dead letter queue for failures
Real-time Processing:
Source: PowerExchange Message Queue
Transform: Parse message content
Target: Database table or file
Monitoring: Queue depth and processing latency
Benefits of PowerExchange:
• Native connectivity to various platforms
• Real-time processing capabilities
• Automatic metadata import
• Built-in data type conversions
• High performance for large volumes
• Minimal impact on source systems