Unix/Linux for ETL Developers - 50 Essential Q&A

Basic File Operations (Questions 1-10)

1. Difference between find and grep commands

Answer:

Aspect   | find                              | grep
Purpose  | Search for files and directories  | Search for text patterns within files
Scope    | File system navigation            | File content analysis
Output   | File/directory paths              | Matching lines from files

Examples:

# find examples
find /data -name "*.csv" -type f
find /logs -mtime -7

# grep examples 
grep "ERROR" /logs/app.log
grep -r "FAILED" /logs/

2. How to move files older than 15 days to archive folder?

Answer:

# Method 1: Using find
find /data/source -type f -mtime +15 -exec mv {} /data/archive/ \;

# Method 2: With logging
find /data/source -type f -mtime +15 -exec mv {} /data/archive/ \; \
  -exec echo "Moved: {}" \; >> /logs/archive.log

# Method 3: Create dated archive folder
ARCHIVE_DIR="/data/archive/$(date +%Y%m%d)"
mkdir -p $ARCHIVE_DIR
find /data/source -type f -mtime +15 -exec mv {} $ARCHIVE_DIR/ \;
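
If file names may contain spaces, or the archive may already hold a file with the same name, a find -print0 loop is a safer variant. A sketch assuming GNU coreutils (mv -n skips existing targets rather than overwriting):

find /data/source -type f -mtime +15 -print0 |
while IFS= read -r -d '' file; do
    mv -n "$file" /data/archive/ && echo "Archived: $file" >> /logs/archive.log
done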

3. How to check file permissions and change them?

Answer:

# Check permissions
ls -l filename.txt
ls -la /data/*.csv

# Change permissions
chmod 755 script.sh           # rwxr-xr-x
chmod 644 data.csv           # rw-r--r--
chmod u+x script.sh          # Add execute for owner
chmod g-w file.txt           # Remove write for group

# Change ownership
chown etluser:etlgroup file.csv
chgrp etlgroup *.txt

4. How to create, copy, and remove files/directories?

Answer:

# Create files and directories
touch newfile.txt
mkdir -p /data/processed/daily
mkdir /logs /archive

# Copy operations
cp source.csv target.csv
cp -r /data/source/* /data/backup/
cp -p file.txt /backup/      # Preserve permissions

# Remove operations
rm file.txt
rm -f *.tmp                  # Force remove
rm -rf /temp/old_data        # Recursive remove directory
rmdir empty_directory        # Remove empty directory only

5. How to view file contents using different commands?

Answer:

# View entire file
cat filename.txt
less filename.txt            # Paginated view
more filename.txt

# View parts of file
head -10 file.csv           # First 10 lines
tail -20 file.log           # Last 20 lines
head -1 data.csv            # Header line only
tail -n +2 data.csv         # Skip header line

# Real-time monitoring
tail -f /logs/etl.log       # Follow log file
tail -F /logs/etl.log       # Follow with file rotation

6. How to count lines, words, and characters in files?

Answer:

# Basic counting
wc filename.txt             # Lines, words, characters
wc -l file.csv              # Count lines only
wc -w file.txt              # Count words only
wc -c file.txt              # Count characters

# ETL specific counting
wc -l *.csv                 # Line count for all CSV files
find /data -name "*.csv" | xargs wc -l    # Total lines in all CSVs

# Count records (excluding header)
tail -n +2 data.csv | wc -l

# Count files in directory
ls -1 /data/*.csv | wc -l
find /data -name "*.csv" | wc -l

7. How to check file sizes and disk usage?

Answer:

# File sizes
ls -lh file.csv             # Human readable size
du -h file.csv              # Disk usage of file
du -sh /data/               # Total size of directory
du -sh /data/*              # Size of each subdirectory

# Largest files
du -h /data/* | sort -rh | head -10
find /data -type f -exec du -h {} + | sort -rh | head -20

# Disk space
df -h                       # All mounted filesystems
df -h /data                 # Specific path

8. How to compare two files?

Answer:

# Basic comparison
diff file1.txt file2.txt
diff -u file1.txt file2.txt    # Unified format

# Compare ignoring case/whitespace
diff -i file1.txt file2.txt    # Ignore case
diff -w file1.txt file2.txt    # Ignore whitespace

# Compare directories
diff -r dir1/ dir2/

# Check if files are identical
cmp file1.csv file2.csv
comm file1.txt file2.txt        # Show unique/common lines

# ETL validation comparison
sort file1.csv > file1_sorted.csv
sort file2.csv > file2_sorted.csv
diff file1_sorted.csv file2_sorted.csv

9. How to create symbolic links and hard links?

Answer:

# Symbolic (soft) links
ln -s /data/current/file.csv /data/latest.csv
ln -s /opt/scripts /home/user/scripts

# Hard links
ln /data/master.csv /data/backup.csv

# View links
ls -la                      # Shows link targets
find /data -type l          # Find all symbolic links
readlink /data/latest.csv   # Show target of symbolic link

# Remove links
rm linkname                 # Remove link (not target)
unlink linkname            # Alternative command

10. How to work with file timestamps?

Answer:

# View timestamps
stat filename.txt           # Detailed timestamp info
ls -lt                      # Sort by modification time
ls -ltc                     # Sort by change time (ctime)
ls -ltu                     # Sort by access time

# Change timestamps
touch file.txt              # Update to current time
touch -t 202401150830 file.txt    # Set specific time (YYYYMMDDhhmm)

# Find files by timestamp
find /data -mtime -1        # Modified in last 24 hours
find /data -mtime +7        # Modified more than 7 days ago
find /data -newer reference_file   # Modified after reference_file

Text Processing (Questions 11-20)

11. How to use cut command for field extraction?

Answer:

# Extract specific columns
cut -d',' -f1,3,5 data.csv       # Columns 1, 3, 5 from CSV
cut -d'|' -f2-4 data.txt         # Columns 2 to 4 from pipe-delimited
cut -c1-10 file.txt              # Characters 1 to 10

# ETL examples
cut -d',' -f1 customers.csv > customer_ids.txt
cut -d',' -f2,3 --output-delimiter='|' data.csv    # Change delimiter

12. How to use awk for data processing?

Answer:

# Basic awk usage
awk '{print $1, $3}' file.txt             # Print columns 1 and 3
awk -F',' '{print $2}' data.csv           # CSV column 2
awk 'NR>1 {print $0}' data.csv            # Skip header line

# Conditional processing
awk -F',' '$3 > 1000 {print $1, $3}' sales.csv    # Sales > 1000
awk -F',' 'NR==1 {print} NR>1 && $4=="ACTIVE"' customers.csv

# Calculations
awk -F',' '{sum+=$3} END {print "Total:", sum}' amounts.csv
awk -F',' '{count++; sum+=$2} END {print "Average:", sum/count}' data.csv

# Field validation
awk -F',' 'NF!=5 {print "Line " NR " has " NF " fields"}' data.csv

13. How to use sed for text substitution and editing?

Answer:

# Basic substitution
sed 's/old/new/' file.txt              # Replace first occurrence per line
sed 's/old/new/g' file.txt             # Replace all occurrences
sed 's/old/new/gi' file.txt            # Case insensitive replacement

# Delete lines
sed '1d' file.csv                      # Delete first line (header)
sed '/pattern/d' file.txt              # Delete lines containing pattern
sed '1,5d' file.txt                    # Delete lines 1 to 5

# Insert/append lines
sed '1i\New Header Line' file.csv      # Insert line before line 1
sed '$a\Footer Line' file.txt          # Append line at end

# In-place editing
sed -i 's/,/|/g' data.csv             # Replace commas with pipes
sed -i.bak 's/old/new/g' file.txt     # Create backup before editing

14. How to sort and remove duplicates from files?

Answer:

# Basic sorting
sort file.txt                          # Alphabetical sort
sort -n numbers.txt                    # Numerical sort
sort -r file.txt                       # Reverse sort
sort -k2,2 -t',' data.csv             # Sort by 2nd column in CSV

# Remove duplicates
sort file.txt | uniq                   # Remove adjacent duplicates
sort -u file.txt                      # Sort and remove duplicates
sort file.txt | uniq -c                # Count occurrences (uniq needs sorted input)

# ETL data cleaning
sort -t',' -k1,1 customers.csv | uniq > unique_customers.csv
awk -F',' '!seen[$1]++' data.csv > deduped_data.csv    # Remove duplicates by column 1

15. How to join files using Unix commands?

Answer:

# Basic join (requires sorted files)
join -t',' file1.csv file2.csv        # Join on first column
join -t',' -1 2 -2 1 file1.csv file2.csv    # Join file1 col2 with file2 col1

# Paste files side by side
paste file1.txt file2.txt             # Tab separated
paste -d',' file1.txt file2.txt       # Comma separated

# Merge files
cat file1.csv file2.csv > merged.csv  # Simple concatenation
sort file1.txt file2.txt > merged_sorted.txt

# ETL join example
sort -t',' -k1,1 customers.csv > customers_sorted.csv
sort -t',' -k1,1 orders.csv > orders_sorted.csv
join -t',' customers_sorted.csv orders_sorted.csv > customer_orders.csv

16. How to extract specific patterns using grep?

Answer:

# Basic pattern matching
grep "pattern" file.txt                # Simple string search
grep -i "error" log.txt                # Case insensitive
grep -v "DEBUG" log.txt                # Invert match (exclude)
grep -n "ERROR" log.txt                # Show line numbers

# Regular expressions
grep "^[0-9]" file.txt                 # Lines starting with digit
grep "[0-9]{3}-[0-9]{2}-[0-9]{4}" file.txt    # SSN pattern
grep -E "ERROR|WARN|FATAL" log.txt     # Multiple patterns

# File operations
grep -l "pattern" *.txt                # List files containing pattern
grep -c "SUCCESS" *.log                # Count occurrences
grep -r "config" /etc/                 # Recursive directory search

# ETL log analysis
grep -i "failed\|error" /logs/etl_*.log
grep "$(date +%Y-%m-%d)" /logs/process.log

17. How to work with CSV files using Unix commands?

Answer:

# View CSV structure
head -1 data.csv                       # Show header
awk -F',' '{print NF; exit}' data.csv  # Count columns

# Extract specific columns
awk -F',' '{print $1,$3,$5}' OFS=',' data.csv
cut -d',' -f2,4,6 data.csv

# Filter rows
awk -F',' '$3 > 100' sales.csv         # Numeric filter
awk -F',' '$2=="ACTIVE"' customers.csv # String filter

# Convert CSV to other formats
awk -F',' '{print $1"|"$2"|"$3}' data.csv    # CSV to pipe-delimited
sed 's/,/\t/g' data.csv                      # CSV to tab-delimited

# Validate CSV
awk -F',' 'NF!=5 {print "Invalid row at line " NR ": " $0}' data.csv
grep -n ',,\|,$\|^,' data.csv                # Find empty fields
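
Keep in mind that cut and a plain awk -F',' split on every comma, including commas inside quoted fields. GNU awk can handle simple quoted CSVs with FPAT, which defines what a field looks like instead of what separates fields. A sketch assuming gawk is installed:

# Treat either an unquoted run or a "quoted,value" as a single field (GNU awk only)
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $2 }' quoted_data.csv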

18. How to handle special characters and encoding?

Answer:

# View file encoding
file filename.txt                      # Detect file type and encoding
hexdump -C file.txt | head            # View hex representation

# Convert encoding
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
dos2unix filename.txt                  # Convert Windows to Unix line endings
unix2dos filename.txt                  # Convert Unix to Windows line endings

# Remove special characters
tr -d '\r' < file.txt > clean_file.txt # Remove carriage returns
sed 's/[^a-zA-Z0-9,]//g' data.csv     # Keep only alphanumeric and commas
iconv -c -f UTF-8 -t ASCII file.txt   # Remove non-ASCII characters

# Handle quotes in CSV
sed 's/"//g' data.csv                  # Remove all quotes
awk -F'"' '{print $2}' quoted_data.csv # Extract quoted fields

19. How to split large files into smaller chunks?

Answer:

# Split by number of lines
split -l 10000 largefile.csv chunk_    # 10,000 lines per file
split -l 5000 data.csv data_part_      # Create data_part_aa, data_part_ab, etc.

# Split by file size
split -b 100M largefile.txt part_      # 100MB per file
split -C 50M data.csv chunk_           # Split at line boundaries, ~50MB

# Keep CSV headers in each chunk
head -1 data.csv > header.csv
tail -n +2 data.csv | split -l 5000 - chunk_
for file in chunk_*; do
    cat header.csv $file > processed_$file
done

# Rejoin files
cat chunk_* > rejoined_file.txt
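
With GNU coreutils, split's --filter option can re-attach the header to every chunk in a single pass: each chunk's body is piped to the command, and $FILE holds the chunk's output name. A sketch assuming GNU split:

head -1 data.csv > header.csv
tail -n +2 data.csv | split -l 5000 --filter='{ cat header.csv; cat; } > "$FILE"' - chunk_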

20. How to generate reports and summaries from text files?

Answer:

# Count occurrences
sort file.txt | uniq -c | sort -nr    # Most frequent items first
awk '{count[$1]++} END {for(i in count) print i, count[i]}' data.txt

# Statistical summaries
awk '{sum+=$1; count++} END {print "Total:", sum, "Average:", sum/count}' numbers.txt
awk '{if(min==""){min=max=$1}; if($1>max) max=$1; if($1<min) min=$1; sum+=$1; count++}
     END {print "Min:", min, "Max:", max, "Avg:", sum/count}' numbers.txt

# Group by operations
awk -F',' '{sum[$1]+=$2} END {for(i in sum) print i, sum[i]}' sales.csv
sort -k1,1 data.txt | awk '{if($1==prev) sum+=$2; else {if(prev) print prev, sum; prev=$1; sum=$2}}
                           END {print prev, sum}'

# ETL data quality report (save as a script; $1 is the CSV file to profile)
echo "Data Quality Report for $(basename "$1")"
echo "Total records: $(wc -l < "$1")"
echo "Rows with empty fields: $(grep -c ',,' "$1")"
echo "Unique values in column 1: $(cut -d',' -f1 "$1" | sort -u | wc -l)"

Environment and Process Management (Questions 21-30)

21. How to set and use environment variables?

Answer:

# Set environment variables
export ETL_HOME=/opt/etl
export DATA_DIR=/data/current
export LOG_LEVEL=INFO

# Use in scripts
echo $ETL_HOME
cd $DATA_DIR
LOG_FILE="$ETL_HOME/logs/process.log"

# Make permanent (add to ~/.bashrc or ~/.profile)
echo 'export ETL_HOME=/opt/etl' >> ~/.bashrc
source ~/.bashrc

# View environment variables
env                                    # All environment variables
printenv ETL_HOME                     # Specific variable
echo $PATH                            # PATH variable

# ETL configuration example
export DB_HOST=localhost
export DB_PORT=1521
export DB_NAME=PROD
export ETL_CONFIG_FILE=/opt/etl/config/production.conf

22. How to manage processes and jobs?

Answer:

# View running processes
ps aux                                 # All processes
ps aux | grep etl                     # ETL-related processes
pgrep -f "informatica"                # Find processes by name

# Background jobs
./long_running_script.sh &            # Run in background
nohup ./script.sh > output.log 2>&1 & # Run and ignore hangup
jobs                                   # List background jobs
fg %1                                  # Bring job 1 to foreground
bg %1                                  # Send job 1 to background

# Kill processes
kill PID                              # Terminate process
kill -9 PID                           # Force kill
killall process_name                  # Kill by name
pkill -f "pattern"                    # Kill processes matching pattern

# Monitor processes
top                                   # Real-time process viewer
htop                                  # Enhanced process viewer

23. How to schedule jobs using cron?

Answer:

# View cron jobs
crontab -l                            # List current user's cron jobs
crontab -l -u username                # List another user's cron jobs

# Edit cron jobs
crontab -e                            # Edit cron table

# Cron syntax: minute hour day month dayofweek command
# Examples:
0 2 * * * /opt/etl/daily_load.sh             # Daily at 2 AM
0 */6 * * * /opt/etl/check_files.sh          # Every 6 hours
30 1 * * 1-5 /opt/etl/weekday_process.sh     # 1:30 AM weekdays
0 0 1 * * /opt/etl/monthly_archive.sh        # 1st of every month

# ETL scheduling examples
0 1 * * * /opt/etl/extract_data.sh > /logs/extract_$(date +\%Y\%m\%d).log 2>&1
15 2 * * * /opt/etl/transform_data.sh
30 3 * * * /opt/etl/load_data.sh
0 5 * * * /opt/etl/cleanup_temp_files.sh

# Log cron output
* * * * * /path/to/script.sh >> /logs/cron.log 2>&1
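
Long ETL jobs scheduled this way can overlap if one run outlasts its interval. On Linux, flock (from util-linux) is a simple guard: with -n, a second instance exits immediately instead of waiting for the lock. A sketch assuming flock is available:

# Only one daily_load.sh at a time; an overlapping run exits silently
0 2 * * * /usr/bin/flock -n /tmp/daily_load.lock /opt/etl/daily_load.sh >> /logs/daily_load.log 2>&1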

24. How to check system resources and performance?

Answer:

# CPU information
lscpu                                 # CPU details
cat /proc/cpuinfo                     # Detailed CPU info
uptime                               # System load average

# Memory usage
free -h                              # Memory usage (human readable)
cat /proc/meminfo                    # Detailed memory info
vmstat 1 5                          # Memory/CPU stats every second, 5 times

# Disk usage and I/O
df -h                               # Disk space usage
iostat -x 1                        # Disk I/O statistics
iotop                               # I/O usage by process

# Network
netstat -tuln                       # Network connections
ss -tuln                           # Socket statistics
iftop                              # Network bandwidth usage

# ETL performance monitoring
while true; do
  echo "$(date): CPU: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}'), Memory: $(free | grep Mem | awk '{print $3/$2 * 100.0}')%"
  sleep 60
done

25. How to work with file permissions and ownership?

Answer:

# Understanding permissions (rwx for user, group, others)
ls -l file.txt                       # View permissions
# Example: -rw-r--r-- (644) = read/write for owner, read for group/others

# Numeric permission values
chmod 755 script.sh                  # rwxr-xr-x
chmod 644 data.csv                   # rw-r--r--
chmod 600 config.conf               # rw------- (owner only)

# Symbolic permissions
chmod u+x script.sh                  # Add execute for user
chmod g+w data.txt                   # Add write for group
chmod o-r secret.txt                 # Remove read for others
chmod a+r public.txt                 # Add read for all

# Ownership
chown etluser file.txt               # Change owner
chgrp etlgroup file.txt              # Change group
chown etluser:etlgroup file.txt      # Change both

# Recursive changes
chmod -R 755 /opt/etl/scripts/       # Apply to directory and contents
chown -R etluser:etlgroup /data/etl/

26. How to create and execute shell scripts?

Answer:

# Create script file
vi process_data.sh

# Basic script structure
#!/bin/bash
# Script description
# Author: ETL Team
# Date: $(date)

# Variables
DATA_DIR="/data/input"
LOG_FILE="/logs/process.log"

# Functions
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S'): $1" >> $LOG_FILE
}

# Main logic
log_message "Processing started"
cd $DATA_DIR
for file in *.csv; do
    if [ -f "$file" ]; then
        log_message "Processing $file"
        # Process file
    fi
done
log_message "Processing completed"

# Make executable and run
chmod +x process_data.sh
./process_data.sh

# Pass arguments
./script.sh arg1 arg2
# Access in script: $1, $2, $@, $#

27. How to redirect input/output and use pipes?

Answer:

# Output redirection
command > output.txt                  # Redirect stdout to file (overwrite)
command >> output.txt                 # Redirect stdout to file (append)
command 2> error.txt                  # Redirect stderr to file
command > output.txt 2>&1             # Redirect both stdout and stderr
command &> output.txt                 # Redirect both (bash shortcut)

# Input redirection
command < input.txt                   # Use file as input
command << EOF                        # Here document
line 1
line 2
EOF

# Pipes
command1 | command2                   # Output of command1 to command2
grep "ERROR" log.txt | wc -l          # Count error lines
cut -d',' -f1 data.csv | sort | uniq # Extract and deduplicate column 1

# ETL examples
awk -F',' '$3 > 1000' sales.csv | sort -k2 > high_value_sales.csv
tail -f /logs/etl.log | grep --line-buffered "ERROR" | while read -r line; do
    echo "$line" | mail -s "ETL Error" admin@company.com    # one alert per error line
done

28. How to handle command-line arguments in scripts?

Answer:

#!/bin/bash

# Access arguments
echo "Script name: $0"
echo "First argument: $1"
echo "Second argument: $2"
echo "All arguments: $@"
echo "Number of arguments: $#"

# Check for required arguments
if [ $# -lt 2 ]; then
    echo "Usage: $0 <input_file> <output_file>"
    exit 1
fi

INPUT_FILE="$1"
OUTPUT_FILE="$2"

# Validate file exists
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: Input file $INPUT_FILE not found"
    exit 1
fi

# Process with arguments
process_file() {
    local input="$1"
    local output="$2"
    local option="${3:-default}"  # Third argument with default
   
    awk -F',' '{print $1,$2}' OFS=',' "$input" > "$output"
}

# Use shift for processing multiple arguments
while [ $# -gt 0 ]; do
    case "$1" in
        -f|--file)
            FILE="$2"
            shift 2
            ;;
        -d|--debug)
            DEBUG=1
            shift
            ;;
        *)
            echo "Unknown option: $1"
            exit 1
            ;;
    esac
done
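
For simple single-letter options, bash's built-in getopts is a more compact alternative to the shift/case loop above. A minimal sketch handling -f <file> and -d:

#!/bin/bash
DEBUG=0
while getopts "f:d" opt; do
    case "$opt" in
        f) FILE="$OPTARG" ;;                 # -f takes a value
        d) DEBUG=1 ;;                        # -d is a flag
        *) echo "Usage: $0 -f <file> [-d]"; exit 1 ;;
    esac
done
shift $((OPTIND - 1))                        # remaining positional arguments start at $1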

29. How to debug shell scripts and handle errors?

Answer:

# Debug options
#!/bin/bash
set -x                               # Print commands as executed
set -e                               # Exit on any error
set -u                               # Exit on undefined variable
set -o pipefail                      # Exit on pipe failure

# Combined options
set -euo pipefail

# Debug specific sections
set -x
# Debug this section
set +x

# Error handling
error_exit() {
    echo "Error: $1" >&2
    exit 1
}

# Validate file
[ -f "$INPUT_FILE" ] || error_exit "Input file not found: $INPUT_FILE"

# Check command success
if ! grep "pattern" file.txt > /dev/null; then
    echo "Pattern not found"
    exit 1
fi

# Trap errors
trap 'echo "Error at line $LINENO"' ERR
trap 'cleanup_temp_files' EXIT

# Test file operations
if [ -f "$file" ]; then
    echo "File exists"
elif [ -d "$file" ]; then
    echo "Directory exists"
else
    echo "File/directory not found"
fi

30. How to work with archives and compression?

Answer:

# Create archives
tar -czf archive.tar.gz /data/folder/    # Create compressed tar
tar -cf archive.tar /data/folder/        # Create uncompressed tar
zip -r archive.zip /data/folder/          # Create zip archive

# Extract archives
tar -xzf archive.tar.gz                  # Extract compressed tar
tar -xf archive.tar                      # Extract uncompressed tar
unzip archive.zip                        # Extract zip

# List archive contents
tar -tzf archive.tar.gz                  # List tar.gz contents
tar -tf archive.tar                      # List tar contents
unzip -l archive.zip                     # List zip contents

# Compress individual files
gzip file.txt                           # Creates file.txt.gz
gunzip file.txt.gz                      # Extracts to file.txt
compress file.txt                       # Creates file.txt.Z
uncompress file.txt.Z                   # Extracts to file.txt

# ETL archive examples
# Daily data archival
DATE=$(date +%Y%m%d)
tar -czf /archive/daily_data_$DATE.tar.gz /data/processed/
find /data/processed -type f -mtime +7 -delete

# Compress log files
gzip /logs/*.log
find /logs -name "*.log" -mtime +1 -exec gzip {} \;

Network and Data Transfer (Questions 31-40)

31. How to transfer files between systems?

Answer:

# SCP (Secure Copy)
scp file.txt user@server:/remote/path/        # Copy file to remote
scp user@server:/remote/file.txt ./           # Copy file from remote
scp -r /local/dir user@server:/remote/dir/    # Copy directory recursively

# RSYNC (synchronization)
rsync -av /local/dir/ user@server:/remote/dir/    # Sync directories
rsync -av --delete source/ destination/           # Sync and delete extras
rsync -av --exclude='*.log' source/ dest/         # Exclude patterns

# FTP/SFTP
sftp user@server
> put local_file.txt remote_file.txt
> get remote_file.txt local_file.txt
> put *.csv                                    # Upload multiple files (glob)
> get *.txt                                    # Download multiple files (glob)

# Wget/Curl
wget http://server.com/file.csv                # Download file
curl -O http://server.com/file.csv             # Download with curl
curl -T upload.csv ftp://server.com/path/      # Upload via FTP

# ETL data transfer example
#!/bin/bash
SOURCE_SERVER="data.company.com"
SOURCE_PATH="/exports/daily"
LOCAL_PATH="/data/incoming"

rsync -av --include='*.csv' --exclude='*' \
    user@$SOURCE_SERVER:$SOURCE_PATH/ $LOCAL_PATH/

32. How to test network connectivity and services?

Answer:

# Basic connectivity
ping server.com                       # Test reachability
ping -c 4 192.168.1.1                # Send 4 packets

# Port testing
telnet server.com 80                  # Test specific port
nc -zv server.com 80                  # Test port with netcat
nmap -p 80,443,22 server.com          # Scan multiple ports

# Service testing
curl -I http://server.com             # HTTP header check
wget --spider http://server.com       # Test without downloading

# DNS resolution
nslookup server.com                   # DNS lookup
dig server.com                        # Detailed DNS info
host server.com                       # Simple DNS lookup

# ETL connectivity script
#!/bin/bash
DB_HOST="db.company.com"
DB_PORT="1521"
WEB_SERVICE="http://api.company.com/health"

# Test database connectivity
if nc -zv $DB_HOST $DB_PORT; then
    echo "Database connection: OK"
else
    echo "Database connection: FAILED"
fi

# Test web service
if curl -f -s $WEB_SERVICE > /dev/null; then
    echo "Web service: OK"
else
    echo "Web service: FAILED"
fi

33. How to monitor logs in real-time?

Answer:

# Real-time log monitoring
tail -f /logs/application.log          # Follow single log
tail -F /logs/app.log                  # Follow with log rotation
multitail log1.txt log2.txt           # Monitor multiple logs

# Filter while monitoring
tail -f /logs/etl.log | grep "ERROR"
tail -f /logs/app.log | grep --color=always "ERROR\|WARN\|INFO"

# Multiple file monitoring
tail -f /logs/*.log                    # Follow all log files
find /logs -name "*.log" | xargs tail -f

# Advanced monitoring with alerts
tail -f /logs/etl.log | while read line; do
    if echo "$line" | grep -q "FATAL"; then
        echo "CRITICAL ERROR: $line" | mail -s "ETL Alert" admin@company.com
    fi
done

# Monitor log file size
while true; do
    for log in /logs/*.log; do
        size=$(du -m "$log" | cut -f1)
        if [ $size -gt 100 ]; then
            echo "Large log file: $log (${size}MB)"
        fi
    done
    sleep 300
done

34. How to parse and analyze log files?

Answer:

# Extract time ranges
awk '/2024-01-15 09:00/,/2024-01-15 17:00/' application.log

# Count log levels
grep -c "ERROR" /logs/app.log
grep -c "WARN" /logs/app.log
awk '/ERROR/{err++} /WARN/{warn++} /INFO/{info++}
     END {print "Errors:", err, "Warnings:", warn, "Info:", info}' /logs/app.log

# Find most frequent errors
grep "ERROR" /logs/app.log | cut -d' ' -f4- | sort | uniq -c | sort -nr

# Extract specific patterns
grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' access.log | sort | uniq -c
awk '{print $1}' access.log | sort | uniq -c | sort -nr    # Top IP addresses

# ETL log analysis
#!/bin/bash
LOG_FILE="/logs/etl_$(date +%Y%m%d).log"

echo "ETL Process Summary for $(date +%Y-%m-%d)"
echo "=========================================="
echo "Total log entries: $(wc -l < $LOG_FILE)"
echo "Errors: $(grep -c ERROR $LOG_FILE)"
echo "Warnings: $(grep -c WARN $LOG_FILE)"
echo "Successful processes: $(grep -c SUCCESS $LOG_FILE)"
echo ""
echo "Top 5 Error Messages:"
grep ERROR $LOG_FILE | awk -F'ERROR' '{print $2}' | sort | uniq -c | sort -nr | head -5

35. How to work with databases from command line?

Answer:

# MySQL client
mysql -h hostname -u username -p database_name
mysql -e "SELECT COUNT(*) FROM table_name;" database_name

# PostgreSQL client
psql -h hostname -U username -d database_name
psql -c "SELECT * FROM table_name LIMIT 10;" database_name

# Oracle SQL*Plus
sqlplus username/password@hostname:port/service_name
sqlplus -s username/password@database << EOF
SELECT COUNT(*) FROM table_name;
EXIT;
EOF

# Export data
mysqldump -h host -u user -p database_name > backup.sql
pg_dump -h host -U user database_name > backup.sql

# ETL database operations
#!/bin/bash
DB_HOST="localhost"
DB_USER="etl_user"
DB_NAME="warehouse"

# Check row counts
mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -e "
SELECT
    table_name,
    table_rows
FROM information_schema.tables
WHERE table_schema='$DB_NAME'
ORDER BY table_rows DESC;" $DB_NAME

# Validate data load
EXPECTED_COUNT=10000
ACTUAL_COUNT=$(mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -s -e "SELECT COUNT(*) FROM customer;" $DB_NAME)

if [ $ACTUAL_COUNT -eq $EXPECTED_COUNT ]; then
    echo "Data load validation: PASSED"
else
    echo "Data load validation: FAILED (Expected: $EXPECTED_COUNT, Actual: $ACTUAL_COUNT)"
fi
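
Passing -p$DB_PASS on the command line exposes the password in ps output. With the MySQL client, a safer pattern is a credentials file readable only by the ETL user, passed via --defaults-extra-file (which must be the first option). The file path below is illustrative:

# /secure/etl_my.cnf (chmod 600) is assumed to contain:
#   [client]
#   user=etl_user
#   password=********
mysql --defaults-extra-file=/secure/etl_my.cnf -h "$DB_HOST" -s -e "SELECT COUNT(*) FROM customer;" "$DB_NAME"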

36. How to automate file transfers and data synchronization?

Answer:

# Automated file transfer script
#!/bin/bash
SOURCE_DIR="/data/exports"
DEST_SERVER="backup.company.com"
DEST_DIR="/backup/daily"
LOG_FILE="/logs/transfer_$(date +%Y%m%d).log"

transfer_files() {
    echo "$(date): Starting file transfer" >> $LOG_FILE
   
    # Use rsync for efficient transfer
    if rsync -av --progress $SOURCE_DIR/ user@$DEST_SERVER:$DEST_DIR/; then
        echo "$(date): Transfer completed successfully" >> $LOG_FILE
    else
        echo "$(date): Transfer failed" >> $LOG_FILE
        exit 1
    fi
}

# Check network connectivity first
if ping -c 1 $DEST_SERVER > /dev/null 2>&1; then
    transfer_files
else
    echo "$(date): Cannot reach destination server" >> $LOG_FILE
    exit 1
fi

# Bi-directional sync
rsync -av --delete /local/data/ user@server:/remote/data/
rsync -av --delete user@server:/remote/configs/ /local/configs/

# Scheduled in cron
# 0 2 * * * /scripts/transfer_files.sh > /dev/null 2>&1

37. How to handle file encryption and security?

Answer:

# GPG encryption
gpg --symmetric --cipher-algo AES256 sensitive_data.csv    # Encrypt with password
gpg --decrypt sensitive_data.csv.gpg > decrypted_data.csv  # Decrypt

# OpenSSL encryption
openssl enc -aes-256-cbc -salt -in data.csv -out data.csv.enc -k password
openssl enc -aes-256-cbc -d -in data.csv.enc -out data.csv -k password

# SSH keys for automated transfers
ssh-keygen -t rsa -b 4096 -C "etl@company.com"
ssh-copy-id user@server.com

# Secure file permissions
chmod 600 ~/.ssh/id_rsa                # Private key
chmod 644 ~/.ssh/id_rsa.pub            # Public key
chmod 700 ~/.ssh                       # SSH directory

# ETL secure transfer
#!/bin/bash
SENSITIVE_FILE="customer_data.csv"
ENCRYPTED_FILE="${SENSITIVE_FILE}.gpg"
PASSPHRASE_FILE="/secure/passphrase.txt"

# Encrypt before transfer
gpg --batch --yes --passphrase-file $PASSPHRASE_FILE \
    --symmetric --cipher-algo AES256 $SENSITIVE_FILE

# Transfer encrypted file
scp $ENCRYPTED_FILE user@server:/secure/location/

# Clean up local encrypted file
rm $ENCRYPTED_FILE

38. How to monitor system performance for ETL processes?

Answer:

# CPU and memory monitoring during ETL
#!/bin/bash
ETL_PROCESS="informatica\|datastage\|talend"
LOG_FILE="/logs/performance_$(date +%Y%m%d).log"

monitor_performance() {
    while true; do
        timestamp=$(date '+%Y-%m-%d %H:%M:%S')
       
        # Overall system stats
        cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//')
        memory_usage=$(free | grep Mem | awk '{printf "%.1f", $3/$2 * 100.0}')
       
        # ETL process specific
        etl_processes=$(pgrep -cf "$ETL_PROCESS")
        etl_memory=$(ps aux | grep -E "$ETL_PROCESS" | awk '{sum+=$6} END {print sum/1024}')
       
        # Disk I/O
        disk_io=$(iostat -x 1 2 | tail -1 | awk '{print $4, $5}')
       
        echo "$timestamp,CPU:${cpu_usage}%,Memory:${memory_usage}%,ETL_Processes:$etl_processes,ETL_Memory:${etl_memory}MB,DiskIO:$disk_io" >> $LOG_FILE
       
        sleep 60
    done
}

# Alert on high resource usage
check_thresholds() {
    cpu_limit=80
    memory_limit=85
   
    current_cpu=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//' | cut -d'.' -f1)
    current_memory=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100.0}')
   
    if [ $current_cpu -gt $cpu_limit ]; then
        echo "High CPU usage: ${current_cpu}%" | mail -s "Performance Alert" admin@company.com
    fi
   
    if [ $current_memory -gt $memory_limit ]; then
        echo "High memory usage: ${current_memory}%" | mail -s "Performance Alert" admin@company.com
    fi
}

39. How to create data validation and quality checks?

Answer:

# File validation script
#!/bin/bash
validate_data_file() {
    local file="$1"
    local expected_columns="$2"
    local min_rows="$3"
   
    echo "Validating file: $file"
   
    # Check if file exists
    if [ ! -f "$file" ]; then
        echo "ERROR: File not found: $file"
        return 1
    fi
   
    # Check if file is not empty
    if [ ! -s "$file" ]; then
        echo "ERROR: File is empty: $file"
        return 1
    fi
   
    # Check column count
    actual_columns=$(head -1 "$file" | awk -F',' '{print NF}')
    if [ $actual_columns -ne $expected_columns ]; then
        echo "ERROR: Expected $expected_columns columns, found $actual_columns"
        return 1
    fi
   
    # Check minimum row count (excluding header)
    actual_rows=$(tail -n +2 "$file" | wc -l)
    if [ $actual_rows -lt $min_rows ]; then
        echo "ERROR: Expected at least $min_rows rows, found $actual_rows"
        return 1
    fi
   
    # Check for empty fields
    empty_fields=$(grep -c ',,' "$file")
    if [ $empty_fields -gt 0 ]; then
        echo "WARNING: Found $empty_fields rows with empty fields"
    fi
   
    echo "SUCCESS: File validation passed"
    return 0
}

# Data quality checks
check_data_quality() {
    local file="$1"
   
    echo "Data Quality Report for $file"
    echo "================================"
    echo "Total records: $(tail -n +2 "$file" | wc -l)"
    echo "Duplicate records: $(tail -n +2 "$file" | sort | uniq -d | wc -l)"
    echo "Records with null values: $(grep -c ',,' "$file")"
   
    # Check specific data patterns
    echo "Invalid email formats: $(awk -F',' '$3 !~ /@/ && $3 != "" {count++} END {print count+0}' "$file")"
    echo "Invalid phone numbers: $(awk -F',' '$4 !~ /^[0-9-+()\ ]+$/ && $4 != "" {count++} END {print count+0}' "$file")"
}

# Usage
validate_data_file "customers.csv" 5 1000
check_data_quality "customers.csv"

40. How to create automated backup and recovery procedures?

Answer:

# Automated backup script
#!/bin/bash
BACKUP_BASE="/backup"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30

create_backup() {
    local source_dir="$1"
    local backup_name="$2"
   
    backup_dir="$BACKUP_BASE/$backup_name/$DATE"
    mkdir -p "$backup_dir"
   
    echo "$(date): Starting backup of $source_dir"
   
    # Create compressed backup
    if tar -czf "$backup_dir/data.tar.gz" -C "$source_dir" .; then
        echo "$(date): Backup completed successfully"
       
        # Create checksum for verification
        md5sum "$backup_dir/data.tar.gz" > "$backup_dir/checksum.md5"
       
        # Log backup info
        echo "$DATE,$source_dir,$backup_dir,$(du -sh $backup_dir/data.tar.gz | cut -f1)" >> "$BACKUP_BASE/backup.log"
       
    else
        echo "$(date): Backup failed"
        exit 1
    fi
}

# Cleanup old backups
cleanup_old_backups() {
    local backup_path="$1"
   
    find "$backup_path" -type d -mtime +$RETENTION_DAYS -exec rm -rf {} +
    echo "$(date): Cleaned up backups older than $RETENTION_DAYS days"
}

# Database backup
backup_database() {
    local db_name="$1"
    local backup_dir="$BACKUP_BASE/database/$DATE"
   
    mkdir -p "$backup_dir"
   
    # MySQL backup (DB_PASSWORD is assumed to be exported in the environment, never hard-coded)
    mysqldump -h localhost -u backup_user -p"$DB_PASSWORD" "$db_name" | gzip > "$backup_dir/${db_name}.sql.gz"
   
    # Verify backup
    if [ -s "$backup_dir/${db_name}.sql.gz" ]; then
        echo "$(date): Database backup completed"
    else
        echo "$(date): Database backup failed"
        exit 1
    fi
}

# Recovery procedure
restore_from_backup() {
    local backup_file="$1"
    local restore_location="$2"
   
    echo "$(date): Starting restore from $backup_file"
   
    # Verify checksum
    if md5sum -c "$(dirname $backup_file)/checksum.md5"; then
        echo "Checksum verification passed"
       
        # Extract backup
        tar -xzf "$backup_file" -C "$restore_location"
        echo "$(date): Restore completed"
    else
        echo "ERROR: Checksum verification failed"
        exit 1
    fi
}

# Execute backups
create_backup "/data/etl" "etl_data"
backup_database "warehouse"
cleanup_old_backups "$BACKUP_BASE"

Advanced ETL Operations (Questions 41-50)

41. How to implement file watching and automatic processing?

Answer:

# Using inotify to watch for new files
#!/bin/bash
WATCH_DIR="/data/incoming"
PROCESS_SCRIPT="/opt/etl/process_file.sh"
LOG_FILE="/logs/file_watcher.log"

# Install inotify-tools if needed: yum install inotify-tools

watch_directory() {
    echo "$(date): Starting file watcher for $WATCH_DIR" >> $LOG_FILE
   
    inotifywait -m -e close_write,moved_to --format '%w%f %e' "$WATCH_DIR" | while read file event; do
        # Only process CSV files
        if [[ "$file" == *.csv ]]; then
            echo "$(date): New file detected: $file ($event)" >> $LOG_FILE
           
            # Wait a moment to ensure file is completely written
            sleep 2
           
            # Process the file
            if $PROCESS_SCRIPT "$file"; then
                echo "$(date): Successfully processed $file" >> $LOG_FILE
                mv "$file" "/data/processed/"
            else
                echo "$(date): Failed to process $file" >> $LOG_FILE
                mv "$file" "/data/error/"
            fi
        fi
    done
}

# Alternative using simple polling
poll_directory() {
    while true; do
        for file in "$WATCH_DIR"/*.csv; do
            if [ -f "$file" ] && [ ! -f "$file.processing" ]; then
                # Mark file as being processed
                touch "$file.processing"
               
                echo "$(date): Processing $file" >> $LOG_FILE
               
                if $PROCESS_SCRIPT "$file"; then
                    mv "$file" "/data/processed/"
                    rm "$file.processing"
                else
                    mv "$file" "/data/error/"
                    rm "$file.processing"
                fi
            fi
        done
        sleep 30
    done
}

# Start watching
watch_directory

42. How to implement data reconciliation and comparison?

Answer:

# Data reconciliation script
#!/bin/bash
reconcile_data() {
    local source_file="$1"
    local target_file="$2"
    local key_column="$3"
    local report_file="$4"
   
    echo "Data Reconciliation Report" > "$report_file"
    echo "=========================" >> "$report_file"
    echo "Source: $source_file" >> "$report_file"
    echo "Target: $target_file" >> "$report_file"
    echo "Date: $(date)" >> "$report_file"
    echo "" >> "$report_file"
   
    # Count records
    source_count=$(tail -n +2 "$source_file" | wc -l)
    target_count=$(tail -n +2 "$target_file" | wc -l)
   
    echo "Record Counts:" >> "$report_file"
    echo "Source: $source_count" >> "$report_file"
    echo "Target: $target_count" >> "$report_file"
    echo "Difference: $((source_count - target_count))" >> "$report_file"
    echo "" >> "$report_file"
   
    # Extract keys for comparison
    tail -n +2 "$source_file" | cut -d',' -f"$key_column" | sort > /tmp/source_keys.txt
    tail -n +2 "$target_file" | cut -d',' -f"$key_column" | sort > /tmp/target_keys.txt
   
    # Find missing records
    echo "Records in source but not in target:" >> "$report_file"
    comm -23 /tmp/source_keys.txt /tmp/target_keys.txt >> "$report_file"
    echo "" >> "$report_file"
   
    echo "Records in target but not in source:" >> "$report_file"
    comm -13 /tmp/source_keys.txt /tmp/target_keys.txt >> "$report_file"
    echo "" >> "$report_file"
   
    # Calculate match percentage
    common_count=$(comm -12 /tmp/source_keys.txt /tmp/target_keys.txt | wc -l)
    if [ $source_count -gt 0 ]; then
        match_percentage=$(echo "scale=2; $common_count * 100 / $source_count" | bc)
        echo "Match Percentage: ${match_percentage}%" >> "$report_file"
    fi
   
    # Cleanup
    rm -f /tmp/source_keys.txt /tmp/target_keys.txt
}

# Compare file checksums
compare_file_integrity() {
    local file1="$1"
    local file2="$2"
   
    checksum1=$(md5sum "$file1" | cut -d' ' -f1)
    checksum2=$(md5sum "$file2" | cut -d' ' -f1)
   
    if [ "$checksum1" = "$checksum2" ]; then
        echo "Files are identical"
        return 0
    else
        echo "Files differ"
        echo "File1 checksum: $checksum1"
        echo "File2 checksum: $checksum2"
        return 1
    fi
}

# Usage
reconcile_data "source_data.csv" "target_data.csv" 1 "reconciliation_report.txt"

43. How to implement error handling and retry mechanisms?

Answer:

# Retry mechanism with exponential backoff
#!/bin/bash
retry_with_backoff() {
    local max_attempts="$1"
    local delay="$2"
    local command="$3"
    local attempt=1
   
    while [ $attempt -le $max_attempts ]; do
        echo "Attempt $attempt of $max_attempts: $command"
       
        if eval "$command"; then
            echo "Command succeeded on attempt $attempt"
            return 0
        else
            echo "Command failed on attempt $attempt"
           
            if [ $attempt -lt $max_attempts ]; then
                echo "Waiting $delay seconds before retry..."
                sleep $delay
                delay=$((delay * 2))  # Exponential backoff
            fi
           
            attempt=$((attempt + 1))
        fi
    done
   
    echo "Command failed after $max_attempts attempts"
    return 1
}

# ETL process with error handling
process_with_error_handling() {
    local input_file="$1"
    local output_file="$2"
    local error_file="$3"
    local log_file="$4"
   
    # Function to log messages
    log_message() {
        echo "$(date '+%Y-%m-%d %H:%M:%S'): $1" | tee -a "$log_file"
    }
   
    # Error cleanup function
    cleanup_on_error() {
        log_message "ERROR: Process failed, cleaning up temporary files"
        rm -f /tmp/process_$$_*
        mv "$input_file" "$error_file" 2>/dev/null
    }
   
    # Set error trap
    trap cleanup_on_error ERR
   
    log_message "Starting process for $input_file"
   
    # Validate input
    if [ ! -f "$input_file" ]; then
        log_message "ERROR: Input file not found: $input_file"
        return 1
    fi
   
    # Process with retry
    if retry_with_backoff 3 5 "your_etl_command $input_file $output_file"; then
        log_message "SUCCESS: Process completed for $input_file"
        mv "$input_file" "/data/processed/"
    else
        log_message "ERROR: Process failed after retries for $input_file"
        cleanup_on_error
        return 1
    fi
}

# Circuit breaker pattern
circuit_breaker() {
    local service_name="$1"
    local command="$2"
    local failure_threshold=5              # consecutive failures before the breaker opens
    local timeout=300                      # seconds to stay open before allowing a retry
    local count_file="/tmp/circuit_breaker_${service_name}.count"
    local trip_file="/tmp/circuit_breaker_${service_name}.tripped"

    # If the breaker is open, refuse calls until the timeout has elapsed
    if [ -f "$trip_file" ]; then
        last_failure=$(cat "$trip_file")
        current_time=$(date +%s)
        if [ $((current_time - last_failure)) -lt $timeout ]; then
            echo "Circuit breaker OPEN for $service_name"
            return 1
        fi
        rm -f "$trip_file"                 # timeout elapsed: allow one trial call (half-open)
    fi

    # Try to execute the command
    if eval "$command"; then
        rm -f "$count_file" "$trip_file"   # reset on success
        return 0
    else
        failures=$(( $(cat "$count_file" 2>/dev/null || echo 0) + 1 ))
        echo "$failures" > "$count_file"
        if [ "$failures" -ge "$failure_threshold" ]; then
            date +%s > "$trip_file"
            echo "Circuit breaker OPENED for $service_name after $failures consecutive failures"
        fi
        return 1
    fi
}
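
A short usage example for the two helpers above; the commands are illustrative placeholders:

# Retry a flaky file pull up to 3 times, starting with a 5-second delay
retry_with_backoff 3 5 "scp user@data.company.com:/exports/daily/orders.csv /data/incoming/"

# Guard the warehouse connectivity check with the circuit breaker
circuit_breaker "warehouse_db" "nc -z db.company.com 1521"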

44. How to implement parallel processing for large datasets?

Answer:

# Parallel file processing
#!/bin/bash
process_files_parallel() {
    local input_dir="$1"
    local output_dir="$2"
    local max_jobs="$3"
    local job_count=0
   
    # Function to process a single file
    process_single_file() {
        local file="$1"
        local output_dir="$2"
        local filename=$(basename "$file")
       
        echo "Processing $filename..."
       
        # Your processing logic here
        awk -F',' '{print $1,$2,$3}' OFS=',' "$file" > "$output_dir/$filename"
       
        echo "Completed $filename"
    }
   
    # Export function for parallel execution
    export -f process_single_file
   
    # Process files in parallel
    find "$input_dir" -name "*.csv" | \
    xargs -
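
If GNU parallel is installed, the xargs pipeline inside process_files_parallel could be replaced with a one-liner; exported bash functions are callable directly (this assumes GNU parallel, not the moreutils tool of the same name):

find "$input_dir" -name "*.csv" | parallel -j "$max_jobs" process_single_file {} "$output_dir"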
