Unix/Linux for ETL Developers - 50 Essential Q&A
Basic File Operations (Questions 1-10)
1. Difference between find and grep commands
Answer:
Aspect  | find                              | grep
--------|-----------------------------------|--------------------------------------
Purpose | Search for files and directories  | Search for text patterns within files
Scope   | File system navigation            | File content analysis
Output  | File/directory paths              | Matching lines from files
Examples:
# find examples
find /data -name "*.csv" -type f
find /logs -mtime -7
# grep examples
grep "ERROR" /logs/app.log
grep -r "FAILED" /logs/
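The two are often combined in ETL log checks: find selects the files, grep scans their contents. A minimal sketch (paths are assumptions):
# List yesterday's log files that contain ERROR entries
find /logs -name "*.log" -type f -mtime -1 -exec grep -l "ERROR" {} +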
2. How to move files older than 15 days to an archive folder?
Answer:
# Method 1: Using find
find /data/source -type f -mtime +15 -exec mv {} /data/archive/ \;
# Method 2: With logging
find /data/source -type f -mtime +15 -exec mv {} /data/archive/ \; -exec echo "Moved: {}" \; >> /logs/archive.log
# Method 3: Create a dated archive folder
ARCHIVE_DIR="/data/archive/$(date +%Y%m%d)"
mkdir -p $ARCHIVE_DIR
find /data/source -type f -mtime +15 -exec mv {} $ARCHIVE_DIR/ \;
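Before moving anything in bulk, it can help to preview what would be affected; a small sketch using the same selection criteria:
# Dry run: list candidate files and report how much space they occupy
find /data/source -type f -mtime +15 -print
find /data/source -type f -mtime +15 -exec du -ch {} + | tail -1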
3. How to check file permissions and change them?
Answer:
# Check permissions
ls -l filename.txt
ls -la /data/*.csv
# Change permissions
chmod 755 script.sh      # rwxr-xr-x
chmod 644 data.csv       # rw-r--r--
chmod u+x script.sh      # Add execute for owner
chmod g-w file.txt       # Remove write for group
# Change ownership
chown etluser:etlgroup file.csv
chgrp etlgroup *.txt
4. How to create, copy, and remove files/directories?
Answer:
# Create files and directories
touch newfile.txt
mkdir -p /data/processed/daily
mkdir /logs /archive
# Copy operations
cp source.csv target.csv
cp -r /data/source/* /data/backup/
cp -p file.txt /backup/      # Preserve permissions
# Remove operations
rm file.txt
rm -f *.tmp                  # Force remove
rm -rf /temp/old_data        # Recursively remove a directory
rmdir empty_directory        # Remove an empty directory only
5. How to view file contents using different commands?
Answer:
# View an entire file
cat filename.txt
less filename.txt        # Paginated view
more filename.txt
# View parts of a file
head -10 file.csv        # First 10 lines
tail -20 file.log        # Last 20 lines
head -1 data.csv         # Header line only
tail -n +2 data.csv      # Skip the header line
# Real-time monitoring
tail -f /logs/etl.log    # Follow a log file
tail -F /logs/etl.log    # Keep following across log rotation
6. How to count lines, words, and characters in files?
Answer:
# Basic counting
wc filename.txt          # Lines, words, characters
wc -l file.csv           # Count lines only
wc -w file.txt           # Count words only
wc -c file.txt           # Count bytes (use wc -m for characters)
# ETL-specific counting
wc -l *.csv                              # Line count for every CSV file
find /data -name "*.csv" | xargs wc -l   # Line counts for all CSVs, with a total
# Count records (excluding the header)
tail -n +2 data.csv | wc -l
# Count files in a directory
ls -1 /data/*.csv | wc -l
find /data -name "*.csv" | wc -l
7. How to check file sizes and disk usage?
Answer:
# File sizes
ls -lh file.csv          # Human-readable size
du -h file.csv           # Disk usage of a file
du -sh /data/            # Total size of a directory
du -sh /data/*           # Size of each subdirectory
# Largest files
du -h /data/* | sort -rh | head -10
find /data -type f -exec du -h {} + | sort -rh | head -20
# Disk space
df -h                    # All mounted filesystems
df -h /data              # Specific path
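On unattended ETL servers these commands can feed a simple threshold check; a sketch assuming a 90% limit and a /data mount point:
# Alert when /data usage exceeds 90%
usage=$(df -h /data | awk 'NR==2 {gsub(/%/,"",$5); print $5}')
if [ "$usage" -gt 90 ]; then
    echo "WARNING: /data is ${usage}% full"
fi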
8. How to compare two files?
Answer:
# Basic comparison
diff file1.txt file2.txt
diff -u file1.txt file2.txt      # Unified format
# Compare ignoring case/whitespace
diff -i file1.txt file2.txt      # Ignore case
diff -w file1.txt file2.txt      # Ignore whitespace
# Compare directories
diff -r dir1/ dir2/
# Check if files are identical
cmp file1.csv file2.csv
comm file1.txt file2.txt         # Show unique/common lines (expects sorted input)
# ETL validation comparison
sort file1.csv > file1_sorted.csv
sort file2.csv > file2_sorted.csv
diff file1_sorted.csv file2_sorted.csv
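In bash, the intermediate sorted files can be skipped with process substitution; a short sketch of the same comparison:
# Compare the two files after sorting, without temporary files (bash)
diff <(sort file1.csv) <(sort file2.csv)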
9. How to create symbolic links and hard links?
Answer:
# Symbolic (soft) links
ln -s /data/current/file.csv /data/latest.csv
ln -s /opt/scripts /home/user/scripts
# Hard links
ln /data/master.csv /data/backup.csv
# View links
ls -la                       # Shows link targets
find /data -type l           # Find all symbolic links
readlink /data/latest.csv    # Show the target of a symbolic link
# Remove links
rm linkname                  # Remove the link (not the target)
unlink linkname              # Alternative command
10. How to work with file timestamps?
Answer:
# View timestamps
stat filename.txt                # Detailed timestamp info
ls -lt                           # Sort by modification time
ls -ltc                          # Sort by status-change time
ls -ltu                          # Sort by access time
# Change timestamps
touch file.txt                   # Update to the current time
touch -t 202401150830 file.txt   # Set a specific time (YYYYMMDDhhmm)
# Find files by timestamp
find /data -mtime -1             # Modified in the last 24 hours
find /data -mtime +7             # Modified more than 7 days ago
find /data -newer reference_file # Modified after reference_file
Text Processing (Questions 11-20)
11. How to use the cut command for field extraction?
Answer:
# Extract specific columns
cut -d',' -f1,3,5 data.csv       # Columns 1, 3, 5 from a CSV
cut -d'|' -f2-4 data.txt         # Columns 2 to 4 from a pipe-delimited file
cut -c1-10 file.txt              # Characters 1 to 10
# ETL examples
cut -d',' -f1 customers.csv > customer_ids.txt
cut -d',' -f2,3 --output-delimiter='|' data.csv   # Change the output delimiter
12. How to use awk for data processing?
Answer:
# Basic awk usage
awk '{print $1, $3}' file.txt       # Print columns 1 and 3
awk -F',' '{print $2}' data.csv     # CSV column 2
awk 'NR>1 {print $0}' data.csv      # Skip the header line
# Conditional processing
awk -F',' '$3 > 1000 {print $1, $3}' sales.csv                  # Sales > 1000
awk -F',' 'NR==1 {print} NR>1 && $4=="ACTIVE"' customers.csv    # Keep the header plus ACTIVE rows
# Calculations
awk -F',' '{sum+=$3} END {print "Total:", sum}' amounts.csv
awk -F',' '{count++; sum+=$2} END {print "Average:", sum/count}' data.csv
# Field validation
awk -F',' 'NF!=5 {print "Line " NR " has " NF " fields"}' data.csv
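awk can also fan records out to multiple files in one pass, which is handy for partitioning extracts; a sketch that assumes column 2 holds a region or category code:
# Split a CSV into one file per value of column 2 (header excluded)
awk -F',' 'NR>1 {print > ($2 "_records.csv")}' data.csv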
13. How to use sed for text substitution and editing?
Answer:
# Basic substitution
sed 's/old/new/' file.txt        # Replace the first occurrence per line
sed 's/old/new/g' file.txt       # Replace all occurrences
sed 's/old/new/gi' file.txt      # Case-insensitive replacement
# Delete lines
sed '1d' file.csv                # Delete the first line (header)
sed '/pattern/d' file.txt        # Delete lines containing a pattern
sed '1,5d' file.txt              # Delete lines 1 to 5
# Insert/append lines
sed '1i\New Header Line' file.csv    # Insert a line before line 1
sed '$a\Footer Line' file.txt        # Append a line at the end
# In-place editing
sed -i 's/,/|/g' data.csv            # Replace commas with pipes
sed -i.bak 's/old/new/g' file.txt    # Create a backup before editing
14. How to sort and remove duplicates from files?
Answer:
# Basic sorting
sort file.txt                    # Alphabetical sort
sort -n numbers.txt              # Numerical sort
sort -r file.txt                 # Reverse sort
sort -k2,2 -t',' data.csv        # Sort by the 2nd column of a CSV
# Remove duplicates
sort file.txt | uniq             # Sort, then remove adjacent duplicates
sort -u file.txt                 # Sort and remove duplicates
uniq -c file.txt                 # Count adjacent duplicate lines
# ETL data cleaning
sort -t',' -k1,1 customers.csv | uniq > unique_customers.csv
awk -F',' '!seen[$1]++' data.csv > deduped_data.csv    # Remove duplicates by column 1
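To report which keys are duplicated rather than silently dropping them, the same tools can be chained; a sketch assuming the key is column 1:
# List duplicated key values, optionally with counts
cut -d',' -f1 data.csv | sort | uniq -d
cut -d',' -f1 data.csv | sort | uniq -c | awk '$1 > 1' | sort -nr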
15. How to join files using Unix commands?
Answer:
# Basic join (requires sorted files)
join -t',' file1.csv file2.csv               # Join on the first column
join -t',' -1 2 -2 1 file1.csv file2.csv     # Join file1 col2 with file2 col1
# Paste files side by side
paste file1.txt file2.txt                    # Tab-separated
paste -d',' file1.txt file2.txt # Comma separated
# Merge files
cat file1.csv file2.csv > merged.csv # Simple concatenation
sort file1.txt file2.txt > merged_sorted.txt
# ETL join example
sort -t',' -k1,1 customers.csv > customers_sorted.csv
sort -t',' -k1,1 orders.csv > orders_sorted.csv
join -t',' customers_sorted.csv orders_sorted.csv > customer_orders.csv
16. How to extract specific patterns using grep?
Answer:
# Basic pattern matching
grep "pattern" file.txt          # Simple string search
grep -i "error" log.txt          # Case-insensitive
grep -v "DEBUG" log.txt          # Invert match (exclude)
grep -n "ERROR" log.txt          # Show line numbers
# Regular expressions
grep "^[0-9]" file.txt                           # Lines starting with a digit
grep -E "[0-9]{3}-[0-9]{2}-[0-9]{4}" file.txt    # SSN pattern (extended regex)
grep -E "ERROR|WARN|FATAL" log.txt               # Multiple patterns
# File operations
grep -l "pattern" *.txt          # List files containing the pattern
grep -c "SUCCESS" *.log          # Count matching lines per file
grep -r "config" /etc/           # Recursive directory search
# ETL log analysis
grep -i "failed\|error" /logs/etl_*.log
grep "$(date +%Y-%m-%d)" /logs/process.log
17. How to work with CSV files using Unix commands?
Answer:
# View CSV structure
head -1 data.csv                         # Show the header
awk -F',' '{print NF; exit}' data.csv    # Count columns
# Extract specific columns
awk -F',' '{print $1,$3,$5}' OFS=',' data.csv
cut -d',' -f2,4,6 data.csv
# Filter rows
awk -F',' '$3 > 100' sales.csv            # Numeric filter
awk -F',' '$2=="ACTIVE"' customers.csv    # String filter
# Convert CSV to other formats
awk -F',' '{print $1"|"$2"|"$3}' data.csv    # CSV to pipe-delimited
sed 's/,/\t/g' data.csv                      # CSV to tab-delimited
# Validate CSV
awk -F',' 'NF!=5 {print "Invalid row at line " NR ": " $0}' data.csv
grep -n ',,\|,$\|^,' data.csv                # Find empty fields
18. How to handle special characters and encoding?
Answer:
# View file encoding
file filename.txt                # Detect file type and encoding
hexdump -C file.txt | head       # View a hex representation
# Convert encoding
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
dos2unix filename.txt            # Convert Windows to Unix line endings
unix2dos filename.txt            # Convert Unix to Windows line endings
# Remove special characters
tr -d '\r' < file.txt > clean_file.txt    # Remove carriage returns
sed 's/[^a-zA-Z0-9,]//g' data.csv         # Keep only alphanumerics and commas
iconv -c -f UTF-8 -t ASCII file.txt       # Drop non-ASCII characters
# Handle quotes in CSV
sed 's/"//g' data.csv                     # Remove all quotes
awk -F'"' '{print $2}' quoted_data.csv    # Extract the first quoted field
19. How to split large files into smaller chunks?
Answer:
# Split by number of lines
split -l 10000 largefile.csv chunk_    # 10,000 lines per file
split -l 5000 data.csv data_part_      # Creates data_part_aa, data_part_ab, etc.
# Split by file size
split -b 100M largefile.txt part_      # 100 MB per file
split -C 50M data.csv chunk_           # Split at line boundaries, ~50 MB per file
# Keep CSV headers in each chunk
head -1 data.csv > header.csv
tail -n +2 data.csv | split -l 5000 - chunk_
for file in chunk_*; do
cat header.csv $file > processed_$file
done
# Rejoin files
cat chunk_* > rejoined_file.txt
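With GNU coreutils split, numeric suffixes and a file extension can make the chunks easier to track downstream; a sketch (these flags are GNU-specific):
# Produce chunk_00.csv, chunk_01.csv, ... instead of chunk_aa, chunk_ab, ...
split -d -l 5000 --additional-suffix=.csv data.csv chunk_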
20. How to generate reports and summaries from text files?
Answer:
# Count occurrences
sort file.txt | uniq -c | sort -nr    # Most frequent items first
awk '{count[$1]++} END {for(i in count) print i, count[i]}' data.txt
# Statistical summaries
awk '{sum+=$1; count++} END {print "Total:", sum, "Average:", sum/count}' numbers.txt
awk '{if(min==""){min=max=$1}; if($1>max) max=$1; if($1<min) min=$1; sum+=$1; count++} END {print "Min:", min, "Max:", max, "Avg:", sum/count}' numbers.txt
# Group-by operations
awk -F',' '{sum[$1]+=$2} END {for(i in sum) print i, sum[i]}' sales.csv
sort -k1,1 data.txt | awk '{if($1==prev) sum+=$2; else {if(prev) print prev, sum; prev=$1; sum=$2}} END {print prev, sum}'
# ETL data quality report (run inside a script; $1 is the input file argument)
echo "Data Quality Report for $(basename $1)"
echo "Total records: $(wc -l < $1)"
echo "Null values: $(grep -c ',,' $1)"
echo "Unique values in column 1: $(cut -d',' -f1 $1 | sort -u | wc -l)"
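The report lines above can also be wrapped in a small reusable function; a minimal sketch (the function name dq_report and the comma-delimited layout are assumptions):
dq_report() {
    local file="$1"
    echo "Data Quality Report for $(basename "$file")"
    echo "Total records: $(wc -l < "$file")"
    echo "Null values: $(grep -c ',,' "$file")"
    echo "Unique values in column 1: $(cut -d',' -f1 "$file" | sort -u | wc -l)"
}
dq_report customers.csv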
Environment and Process Management (Questions 21-30)
21. How to set and use environment variables?
Answer:
# Set environment variables
export ETL_HOME=/opt/etl
export DATA_DIR=/data/current
export LOG_LEVEL=INFO
# Use in scripts
echo $ETL_HOME
cd $DATA_DIR
LOG_FILE="$ETL_HOME/logs/process.log"
# Make it permanent (add to ~/.bashrc or ~/.profile)
echo 'export ETL_HOME=/opt/etl' >> ~/.bashrc
source ~/.bashrc
# View environment variables
env                      # All environment variables
printenv ETL_HOME        # A specific variable
echo $PATH               # The PATH variable
# ETL configuration example
export DB_HOST=localhost
export DB_PORT=1521
export DB_NAME=PROD
export ETL_CONFIG_FILE=/opt/etl/config/production.conf
22. How to manage processes and jobs?
Answer:
# View running processes
ps aux                       # All processes
ps aux | grep etl            # ETL-related processes
pgrep -f "informatica"       # Find processes by name
# Background jobs
./long_running_script.sh &              # Run in the background
nohup ./script.sh > output.log 2>&1 &   # Run and ignore hangup (survives logout)
jobs                         # List background jobs
fg %1                        # Bring job 1 to the foreground
bg %1                        # Resume job 1 in the background
# Kill processes
kill PID                     # Terminate a process
kill -9 PID                  # Force kill
killall process_name         # Kill by name
pkill -f "pattern"           # Kill processes matching a pattern
# Monitor processes
top                          # Real-time process viewer
htop                         # Enhanced process viewer
23. How to schedule jobs using cron?
Answer:
# View cron jobs
crontab -l                   # List the current user's cron jobs
crontab -l -u username       # List another user's cron jobs
# Edit cron jobs
crontab -e                   # Edit the cron table
# Cron syntax: minute hour day month day-of-week command
# Examples:
0 2 * * *    /opt/etl/daily_load.sh         # Daily at 2 AM
0 */6 * * *  /opt/etl/check_files.sh        # Every 6 hours
30 1 * * 1-5 /opt/etl/weekday_process.sh    # 1:30 AM on weekdays
0 0 1 * *    /opt/etl/monthly_archive.sh    # 1st of every month
# ETL scheduling examples
0 1 * * *  /opt/etl/extract_data.sh > /logs/extract_$(date +\%Y\%m\%d).log 2>&1
15 2 * * * /opt/etl/transform_data.sh
30 3 * * * /opt/etl/load_data.sh
0 5 * * *  /opt/etl/cleanup_temp_files.sh
# Log cron output
* * * * * /path/to/script.sh >> /logs/cron.log 2>&1
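Long-running ETL jobs launched from cron can overlap if the previous run is still going; a sketch using flock (available on most Linux systems via util-linux) to enforce a single instance:
# Wrap the job with flock so an overlapping run is skipped instead of started twice
0 1 * * * /usr/bin/flock -n /tmp/extract_data.lock /opt/etl/extract_data.sh >> /logs/extract.log 2>&1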
24. How to check system resources and performance?
Answer:
# CPU information
lscpu                    # CPU details
cat /proc/cpuinfo        # Detailed CPU info
uptime                   # System load average
# Memory usage
free -h                  # Memory usage (human readable)
cat /proc/meminfo        # Detailed memory info
vmstat 1 5               # Memory/CPU stats every second, 5 times
# Disk usage and I/O
df -h                    # Disk space usage
iostat -x 1              # Disk I/O statistics
iotop                    # I/O usage by process
# Network
netstat -tuln            # Network connections
ss -tuln                 # Socket statistics
iftop                    # Network bandwidth usage
# ETL performance monitoring
while true; do
    echo "$(date): CPU: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}'), Memory: $(free | grep Mem | awk '{print $3/$2 * 100.0}')%"
    sleep 60
done
25. How to work with file permissions and ownership?
Answer:
# Understanding permissions (rwx for user, group, others)
ls -l file.txt                   # View permissions
# Example: -rw-r--r-- (644) = read/write for owner, read for group/others
# Numeric permission values
chmod 755 script.sh              # rwxr-xr-x
chmod 644 data.csv               # rw-r--r--
chmod 600 config.conf            # rw------- (owner only)
# Symbolic permissions
chmod u+x script.sh              # Add execute for the user
chmod g+w data.txt               # Add write for the group
chmod o-r secret.txt             # Remove read for others
chmod a+r public.txt             # Add read for all
# Ownership
chown etluser file.txt           # Change the owner
chgrp etlgroup file.txt          # Change the group
chown etluser:etlgroup file.txt  # Change both
# Recursive changes
chmod -R 755 /opt/etl/scripts/   # Apply to a directory and its contents
chown -R etluser:etlgroup /data/etl/
26. How to create and execute shell scripts?
Answer:
# Create a script file
vi process_data.sh
# Basic script structure
#!/bin/bash
# Script description
# Author: ETL Team
# Date: $(date)
# Variables
DATA_DIR="/data/input"
LOG_FILE="/logs/process.log"
# Functions
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S'): $1" >> $LOG_FILE
}
# Main logic
log_message "Processing started"
cd $DATA_DIR
for file in *.csv; do
    if [ -f "$file" ]; then
        log_message "Processing $file"
        # Process the file
    fi
done
log_message "Processing completed"
# Make it executable and run it
chmod +x process_data.sh
./process_data.sh
# Pass arguments
./script.sh arg1 arg2
# Access them in the script as $1, $2, $@, $#
27. How to redirect input/output and use pipes?
Answer:
# Output redirection
command > output.txt             # Redirect stdout to a file (overwrite)
command >> output.txt            # Redirect stdout to a file (append)
command 2> error.txt             # Redirect stderr to a file
command > output.txt 2>&1        # Redirect both stdout and stderr
command &> output.txt            # Redirect both (bash shortcut)
# Input redirection
command < input.txt              # Use a file as input
command << EOF                   # Here document
line 1
line 2
EOF
# Pipes
command1 | command2                     # Output of command1 becomes input of command2
grep "ERROR" log.txt | wc -l            # Count error lines
cut -d',' -f1 data.csv | sort | uniq    # Extract and deduplicate column 1
# ETL examples
awk -F',' '$3 > 1000' sales.csv | sort -k2 > high_value_sales.csv
tail -f /logs/etl.log | grep --line-buffered "ERROR" | mail -s "ETL Errors" admin@company.com
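When a pipeline's intermediate output should also be kept on disk, tee fits naturally between stages; a short sketch:
# Keep a copy of the filtered rows while still feeding the next stage
awk -F',' '$3 > 1000' sales.csv | tee high_value_sales.csv | wc -l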
28. How to handle command-line arguments in scripts?
Answer:
#!/bin/bash
# Access arguments
echo "Script name: $0"
echo "First argument: $1"
echo "Second argument: $2"
echo "All arguments: $@"
echo "Number of arguments: $#"
# Check for required arguments
if [ $# -lt 2 ]; then
    echo "Usage: $0 <input_file> <output_file>"
    exit 1
fi
INPUT_FILE="$1"
OUTPUT_FILE="$2"
# Validate that the input file exists
if [ ! -f "$INPUT_FILE" ]; then
    echo "Error: Input file $INPUT_FILE not found"
    exit 1
fi
# Process with arguments
process_file() {
    local input="$1"
    local output="$2"
    local option="${3:-default}"    # Third argument with a default value
    awk -F',' '{print $1,$2}' OFS=',' "$input" > "$output"
}
# Use shift to process multiple arguments
while [ $# -gt 0 ]; do
    case "$1" in
        -f|--file)
            FILE="$2"
            shift 2
            ;;
        -d|--debug)
            DEBUG=1
            shift
            ;;
        *)
            echo "Unknown option: $1"
            exit 1
            ;;
    esac
done
29. How to debug shell scripts and handle errors?
Answer:
# Debug options
#!/bin/bash
set -x               # Print commands as they are executed
set -e               # Exit on any error
set -u               # Treat undefined variables as errors
set -o pipefail      # Fail a pipeline if any command in it fails
# Combined options
set -euo pipefail
# Debug specific sections
set -x
# Debug this section
set +x
# Error handling
error_exit() {
    echo "Error: $1" >&2
    exit 1
}
# Validate a file
[ -f "$INPUT_FILE" ] || error_exit "Input file not found: $INPUT_FILE"
# Check command success
if ! grep "pattern" file.txt > /dev/null; then
    echo "Pattern not found"
    exit 1
fi
# Trap errors
trap 'echo "Error at line $LINENO"' ERR
trap 'cleanup_temp_files' EXIT
# Test file operations
if [ -f "$file" ]; then
    echo "File exists"
elif [ -d "$file" ]; then
    echo "Directory exists"
else
    echo "File/directory not found"
fi
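Put together, these pieces form a reusable header for production ETL scripts; a minimal sketch (the cleanup function and scratch path are illustrative):
#!/bin/bash
set -euo pipefail                               # Fail fast on errors, unset variables, pipe failures
trap 'echo "Error at line $LINENO" >&2' ERR     # Report where a failure happened
cleanup() { rm -f /tmp/etl_scratch.$$; }        # Hypothetical scratch-file cleanup
trap cleanup EXIT                               # Always run cleanup on exit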
30. How to work with archives and compression?
Answer:
# Create archives
tar -czf archive.tar.gz /data/folder/    # Create a compressed tar
tar -cf archive.tar /data/folder/        # Create an uncompressed tar
zip -r archive.zip /data/folder/         # Create a zip archive
# Extract archives
tar -xzf archive.tar.gz      # Extract a compressed tar
tar -xf archive.tar          # Extract an uncompressed tar
unzip archive.zip            # Extract a zip
# List archive contents
tar -tzf archive.tar.gz      # List tar.gz contents
tar -tf archive.tar          # List tar contents
unzip -l archive.zip         # List zip contents
# Compress individual files
gzip file.txt                # Creates file.txt.gz
gunzip file.txt.gz           # Extracts to file.txt
compress file.txt            # Creates file.txt.Z
uncompress file.txt.Z        # Extracts to file.txt
# ETL archive examples
# Daily data archival
DATE=$(date +%Y%m%d)
tar -czf /archive/daily_data_$DATE.tar.gz /data/processed/
find /data/processed -type f -mtime +7 -delete
# Compress log files
gzip /logs/*.log
find /logs -name "*.log" -mtime +1 -exec gzip {} \;
Network and Data Transfer (Questions 31-40)
31. How to transfer files between systems?
Answer:
# SCP (Secure Copy)
scp file.txt user@server:/remote/path/           # Copy a file to a remote host
scp user@server:/remote/file.txt ./              # Copy a file from a remote host
scp -r /local/dir user@server:/remote/dir/       # Copy a directory recursively
# RSYNC (synchronization)
rsync -av /local/dir/ user@server:/remote/dir/   # Sync directories
rsync -av --delete source/ destination/          # Sync and delete extras
rsync -av --exclude='*.log' source/ dest/        # Exclude patterns
# FTP/SFTP
sftp user@server
> put local_file.txt remote_file.txt
> get remote_file.txt local_file.txt
> mput *.csv                                     # Upload multiple files
> mget *.txt                                     # Download multiple files
# Wget/Curl
wget http://server.com/file.csv                  # Download a file
curl -O http://server.com/file.csv               # Download with curl
curl -T upload.csv ftp://server.com/path/        # Upload via FTP
# ETL data transfer example
#!/bin/bash
SOURCE_SERVER="data.company.com"
SOURCE_PATH="/exports/daily"
LOCAL_PATH="/data/incoming"
rsync -av --include='*.csv' --exclude='*' \
user@$SOURCE_SERVER:$SOURCE_PATH/ $LOCAL_PATH/
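After a transfer, a checksum comparison is a cheap way to confirm a file arrived intact; a sketch assuming SSH access to the source server and an example file name (daily.csv is an assumption):
# Compare local and remote MD5 checksums for a transferred file
local_sum=$(md5sum /data/incoming/daily.csv | awk '{print $1}')
remote_sum=$(ssh user@data.company.com "md5sum /exports/daily/daily.csv" | awk '{print $1}')
if [ "$local_sum" = "$remote_sum" ]; then
    echo "Transfer verified"
else
    echo "Checksum mismatch - retransfer required"
fi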
32. How to test network connectivity and services?
Answer:
# Basic connectivity
ping server.com                  # Test reachability
ping -c 4 192.168.1.1            # Send 4 packets
# Port testing
telnet server.com 80             # Test a specific port
nc -zv server.com 80             # Test a port with netcat
nmap -p 80,443,22 server.com     # Scan multiple ports
# Service testing
curl -I http://server.com        # HTTP header check
wget --spider http://server.com  # Test without downloading
# DNS resolution
nslookup server.com              # DNS lookup
dig server.com                   # Detailed DNS info
host server.com                  # Simple DNS lookup
# ETL connectivity script
#!/bin/bash
DB_HOST="db.company.com"
DB_PORT="1521"
WEB_SERVICE="http://api.company.com/health"
# Test database connectivity
if nc -zv $DB_HOST $DB_PORT; then
    echo "Database connection: OK"
else
    echo "Database connection: FAILED"
fi
# Test the web service
if curl -f -s $WEB_SERVICE > /dev/null; then
    echo "Web service: OK"
else
    echo "Web service: FAILED"
fi
33. How to monitor logs in real-time?
Answer:
# Real-time log monitoring
tail -f /logs/application.log    # Follow a single log
tail -F /logs/app.log            # Follow across log rotation
multitail log1.txt log2.txt      # Monitor multiple logs
# Filter while monitoring
tail -f /logs/etl.log | grep "ERROR"
tail -f /logs/app.log | grep --color=always "ERROR\|WARN\|INFO"
# Multiple file monitoring
tail -f /logs/*.log              # Follow all log files
find /logs -name "*.log" | xargs tail -f
# Advanced monitoring with alerts
tail -f /logs/etl.log | while read line; do
    if echo "$line" | grep -q "FATAL"; then
        echo "CRITICAL ERROR: $line" | mail -s "ETL Alert" admin@company.com
    fi
done
# Monitor log file sizes
while true; do
    for log in /logs/*.log; do
        size=$(du -m "$log" | cut -f1)
        if [ $size -gt 100 ]; then
            echo "Large log file: $log (${size}MB)"
        fi
    done
    sleep 300
done
34. How to parse and analyze log files?
Answer:
# Extract time ranges
awk '/2024-01-15 09:00/,/2024-01-15 17:00/' application.log
# Count log levels
grep -c "ERROR" /logs/app.log
grep -c "WARN" /logs/app.log
awk '/ERROR/{err++} /WARN/{warn++} /INFO/{info++} END {print "Errors:", err, "Warnings:", warn, "Info:", info}' /logs/app.log
# Find the most frequent errors
grep "ERROR" /logs/app.log | cut -d' ' -f4- | sort | uniq -c | sort -nr
# Extract specific patterns
grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' access.log | sort | uniq -c
awk '{print $1}' access.log | sort | uniq -c | sort -nr    # Top IP addresses
# ETL log analysis
#!/bin/bash
LOG_FILE="/logs/etl_$(date +%Y%m%d).log"
echo "ETL Process Summary for $(date +%Y-%m-%d)"
echo "=========================================="
echo "Total log entries: $(wc -l < $LOG_FILE)"
echo "Errors: $(grep -c ERROR $LOG_FILE)"
echo "Warnings: $(grep -c WARN $LOG_FILE)"
echo "Successful processes: $(grep -c SUCCESS $LOG_FILE)"
echo ""
echo "Top 5 Error Messages:"
grep ERROR $LOG_FILE | awk -F'ERROR' '{print $2}' | sort | uniq -c | sort -nr | head -5
35. How to work with databases from the command line?
Answer:
# MySQL client
mysql -h hostname -u username -p database_name
mysql -e "SELECT COUNT(*) FROM table_name;" database_name
# PostgreSQL client
psql -h hostname -U username -d database_name
psql -c "SELECT * FROM table_name LIMIT 10;" database_name
# Oracle SQL*Plus
sqlplus username/password@hostname:port/service_name
sqlplus -s username/password@database << EOF
SELECT COUNT(*) FROM table_name;
EXIT;
EOF
# Export data
mysqldump -h host -u user -p database_name > backup.sql
pg_dump -h host -U user database_name > backup.sql
# ETL database operations
#!/bin/bash
DB_HOST="localhost"
DB_USER="etl_user"
DB_NAME="warehouse"
# Check row counts
mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -e "
SELECT table_name, table_rows
FROM information_schema.tables
WHERE table_schema='$DB_NAME'
ORDER BY table_rows DESC;" $DB_NAME
# Validate a data load
EXPECTED_COUNT=10000
ACTUAL_COUNT=$(mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -s -e "SELECT COUNT(*) FROM customer;" $DB_NAME)
if [ $ACTUAL_COUNT -eq $EXPECTED_COUNT ]; then
    echo "Data load validation: PASSED"
else
    echo "Data load validation: FAILED (Expected: $EXPECTED_COUNT, Actual: $ACTUAL_COUNT)"
fi
36. How to automate file transfers and data synchronization?
Answer:
# Automated file transfer script
#!/bin/bash
SOURCE_DIR="/data/exports"
DEST_SERVER="backup.company.com"
DEST_DIR="/backup/daily"
LOG_FILE="/logs/transfer_$(date +%Y%m%d).log"
transfer_files() {
    echo "$(date): Starting file transfer" >> $LOG_FILE
    # Use rsync for an efficient transfer
    if rsync -av --progress $SOURCE_DIR/ user@$DEST_SERVER:$DEST_DIR/; then
        echo "$(date): Transfer completed successfully" >> $LOG_FILE
    else
        echo "$(date): Transfer failed" >> $LOG_FILE
        exit 1
    fi
}
# Check network connectivity first
if ping -c 1 $DEST_SERVER > /dev/null 2>&1; then
    transfer_files
else
    echo "$(date): Cannot reach destination server" >> $LOG_FILE
    exit 1
fi
# Bi-directional sync
rsync -av --delete /local/data/ user@server:/remote/data/
rsync -av --delete user@server:/remote/configs/ /local/configs/
# Scheduled in cron
# 0 2 * * * /scripts/transfer_files.sh > /dev/null 2>&1
37. How to handle file encryption and security?
Answer:
# GPG encryption
gpg --symmetric --cipher-algo AES256 sensitive_data.csv        # Encrypt with a passphrase
gpg --decrypt sensitive_data.csv.gpg > decrypted_data.csv      # Decrypt
# OpenSSL encryption
openssl enc -aes-256-cbc -salt -in data.csv -out data.csv.enc -k password
openssl enc -aes-256-cbc -d -in data.csv.enc -out data.csv -k password
# SSH keys for automated transfers
ssh-keygen -t rsa -b 4096 -C "etl@company.com"
ssh-copy-id user@server.com
# Secure file permissions
chmod 600 ~/.ssh/id_rsa        # Private key
chmod 644 ~/.ssh/id_rsa.pub    # Public key
chmod 700 ~/.ssh               # SSH directory
# ETL secure transfer
#!/bin/bash
SENSITIVE_FILE="customer_data.csv"
ENCRYPTED_FILE="${SENSITIVE_FILE}.gpg"
PASSPHRASE_FILE="/secure/passphrase.txt"
# Encrypt before transfer
gpg --batch --yes --passphrase-file $PASSPHRASE_FILE \
    --symmetric --cipher-algo AES256 $SENSITIVE_FILE
# Transfer the encrypted file
scp $ENCRYPTED_FILE user@server:/secure/location/
# Clean up the local encrypted file
rm $ENCRYPTED_FILE
38. How to monitor system performance for ETL processes?
Answer:
# CPU and memory monitoring during ETL
#!/bin/bash
ETL_PROCESS="informatica\|datastage\|talend"
LOG_FILE="/logs/performance_$(date +%Y%m%d).log"
monitor_performance() {
while true; do
timestamp=$(date '+%Y-%m-%d %H:%M:%S')
# Overall system stats
cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//')
memory_usage=$(free | grep Mem | awk '{printf "%.1f", $3/$2 * 100.0}')
# ETL process specific
etl_processes=$(pgrep -cf "$ETL_PROCESS")
etl_memory=$(ps aux | grep -E "$ETL_PROCESS" | awk '{sum+=$6} END {print sum/1024}')
# Disk I/O
disk_io=$(iostat -x 1 2 | tail -1 | awk '{print $4, $5}')
echo "$timestamp,CPU:${cpu_usage}%,Memory:${memory_usage}%,ETL_Processes:$etl_processes,ETL_Memory:${etl_memory}MB,DiskIO:$disk_io" >> $LOG_FILE
sleep 60
done
}
# Alert on high resource usage
check_thresholds() {
cpu_limit=80
memory_limit=85
current_cpu=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//' | cut -d'.' -f1)
current_memory=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100.0}')
if [ $current_cpu -gt $cpu_limit ]; then
echo "High CPU usage: ${current_cpu}%" | mail -s "Performance Alert" admin@company.com
fi
if [ $current_memory -gt $memory_limit ]; then
echo "High memory usage: ${current_memory}%" | mail -s "Performance Alert" admin@company.com
fi
}
39. How to create data validation and quality checks?
Answer:
# File validation script
#!/bin/bash
validate_data_file() {
    local file="$1"
    local expected_columns="$2"
    local min_rows="$3"
    echo "Validating file: $file"
    # Check that the file exists
    if [ ! -f "$file" ]; then
        echo "ERROR: File not found: $file"
        return 1
    fi
    # Check that the file is not empty
    if [ ! -s "$file" ]; then
        echo "ERROR: File is empty: $file"
        return 1
    fi
    # Check the column count
    actual_columns=$(head -1 "$file" | awk -F',' '{print NF}')
    if [ $actual_columns -ne $expected_columns ]; then
        echo "ERROR: Expected $expected_columns columns, found $actual_columns"
        return 1
    fi
    # Check the minimum row count (excluding the header)
    actual_rows=$(tail -n +2 "$file" | wc -l)
    if [ $actual_rows -lt $min_rows ]; then
        echo "ERROR: Expected at least $min_rows rows, found $actual_rows"
        return 1
    fi
    # Check for empty fields
    empty_fields=$(grep -c ',,' "$file")
    if [ $empty_fields -gt 0 ]; then
        echo "WARNING: Found $empty_fields rows with empty fields"
    fi
    echo "SUCCESS: File validation passed"
    return 0
}
# Data quality checks
check_data_quality() {
    local file="$1"
    echo "Data Quality Report for $file"
    echo "================================"
    echo "Total records: $(tail -n +2 "$file" | wc -l)"
    echo "Duplicate records: $(tail -n +2 "$file" | sort | uniq -d | wc -l)"
    echo "Records with null values: $(grep -c ',,' "$file")"
    # Check specific data patterns
    echo "Invalid email formats: $(awk -F',' '$3 !~ /@/ && $3 != "" {count++} END {print count+0}' "$file")"
    echo "Invalid phone numbers: $(awk -F',' '$4 !~ /^[0-9()+ -]+$/ && $4 != "" {count++} END {print count+0}' "$file")"
}
# Usage
validate_data_file "customers.csv" 5 1000
check_data_quality "customers.csv"
40. How to create automated backup and recovery procedures?
Answer:
# Automated backup script
#!/bin/bash
BACKUP_BASE="/backup"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30
create_backup() {
local source_dir="$1"
local backup_name="$2"
backup_dir="$BACKUP_BASE/$backup_name/$DATE"
mkdir -p "$backup_dir"
echo "$(date): Starting backup of $source_dir"
# Create a compressed backup
if tar -czf "$backup_dir/data.tar.gz" -C "$source_dir" .; then
echo "$(date): Backup completed successfully"
# Create a checksum for verification
md5sum "$backup_dir/data.tar.gz" > "$backup_dir/checksum.md5"
# Log backup info
echo "$DATE,$source_dir,$backup_dir,$(du -sh $backup_dir/data.tar.gz | cut -f1)" >> "$BACKUP_BASE/backup.log"
else
echo "$(date): Backup failed"
exit 1
fi
}
# Clean up old backups
cleanup_old_backups() {
local backup_path="$1"
find "$backup_path" -type d -mtime +$RETENTION_DAYS -exec rm -rf {} +
echo "$(date): Cleaned up backups older than $RETENTION_DAYS days"
}
# Database backup
backup_database() {
local db_name="$1"
local backup_dir="$BACKUP_BASE/database/$DATE"
mkdir -p "$backup_dir"
# MySQL backup
mysqldump -h localhost -u backup_user -p$DB_PASSWORD $db_name | gzip > "$backup_dir/${db_name}.sql.gz"
# Verify backup
if [ -s "$backup_dir/${db_name}.sql.gz" ]; then
echo "$(date): Database backup completed"
else
echo "$(date): Database backup failed"
exit 1
fi
}
# Recovery procedure
restore_from_backup() {
local backup_file="$1"
local restore_location="$2"
echo "$(date): Starting restore from $backup_file"
# Verify the checksum
if md5sum -c "$(dirname $backup_file)/checksum.md5"; then
echo "Checksum verification passed"
# Extract the backup
tar -xzf "$backup_file" -C "$restore_location"
echo "$(date): Restore completed"
else
echo "ERROR: Checksum verification failed"
exit 1
fi
}
# Execute backups
create_backup "/data/etl" "etl_data"
backup_database "warehouse"
cleanup_old_backups "$BACKUP_BASE"
Advanced ETL Operations (Questions 41-50)
41. How to implement file watching and automatic processing?
Answer:
# Using inotify to watch for new files
#!/bin/bash
WATCH_DIR="/data/incoming"
PROCESS_SCRIPT="/opt/etl/process_file.sh"
LOG_FILE="/logs/file_watcher.log"
# Install inotify-tools if needed: yum install inotify-tools
watch_directory() {
echo "$(date): Starting file watcher for $WATCH_DIR" >> $LOG_FILE
inotifywait -m -e close_write,moved_to --format '%w%f %e' "$WATCH_DIR" | while read file event; do
# Only process CSV files
if [[ "$file" == *.csv ]]; then
echo "$(date): New file detected: $file ($event)" >> $LOG_FILE
# Wait a moment to ensure the file is completely written
sleep 2
# Process the file
if $PROCESS_SCRIPT "$file"; then
echo "$(date): Successfully processed $file" >> $LOG_FILE
mv "$file" "/data/processed/"
else
echo "$(date): Failed to process $file" >> $LOG_FILE
mv "$file" "/data/error/"
fi
fi
done
}
# Alternative: simple polling
poll_directory() {
while true; do
for file in "$WATCH_DIR"/*.csv; do
if [ -f "$file" ] && [ ! -f "$file.processing" ]; then
# Mark the file as being processed
touch "$file.processing"
echo "$(date): Processing $file" >> $LOG_FILE
if $PROCESS_SCRIPT "$file"; then
mv "$file" "/data/processed/"
rm "$file.processing"
else
mv "$file" "/data/error/"
rm "$file.processing"
fi
fi
done
sleep 30
done
}
# Start watching
watch_directory
42. How to implement data reconciliation and comparison?
Answer:
# Data reconciliation script
#!/bin/bash
reconcile_data() {
local source_file="$1"
local target_file="$2"
local key_column="$3"
local report_file="$4"
echo "Data Reconciliation Report" > "$report_file"
echo "=========================" >> "$report_file"
echo "Source: $source_file" >> "$report_file"
echo "Target: $target_file" >> "$report_file"
echo "Date: $(date)" >> "$report_file"
echo "" >> "$report_file"
# Count records
source_count=$(tail -n +2 "$source_file" | wc -l)
target_count=$(tail -n +2 "$target_file" | wc -l)
echo "Record Counts:" >> "$report_file"
echo "Source: $source_count" >> "$report_file"
echo "Target: $target_count" >> "$report_file"
echo "Difference: $((source_count - target_count))" >> "$report_file"
echo "" >> "$report_file"
# Extract keys for comparison
tail -n +2 "$source_file" | cut -d',' -f"$key_column" | sort > /tmp/source_keys.txt
tail -n +2 "$target_file" | cut -d',' -f"$key_column" | sort > /tmp/target_keys.txt
# Find missing records
echo "Records in source but not in target:" >> "$report_file"
comm -23 /tmp/source_keys.txt /tmp/target_keys.txt >> "$report_file"
echo "" >> "$report_file"
echo "Records in target but not in source:" >> "$report_file"
comm -13 /tmp/source_keys.txt /tmp/target_keys.txt >> "$report_file"
echo "" >> "$report_file"
# Calculate the match percentage
common_count=$(comm -12 /tmp/source_keys.txt /tmp/target_keys.txt | wc -l)
if [ $source_count -gt 0 ]; then
match_percentage=$(echo "scale=2; $common_count * 100 / $source_count" | bc)
echo "Match Percentage: ${match_percentage}%" >> "$report_file"
fi
# Cleanup
rm -f /tmp/source_keys.txt /tmp/target_keys.txt
}
# Compare file checksums
compare_file_integrity() {
local file1="$1"
local file2="$2"
checksum1=$(md5sum "$file1" | cut -d' ' -f1)
checksum2=$(md5sum "$file2" | cut -d' ' -f1)
if [ "$checksum1" = "$checksum2" ]; then
echo "Files are identical"
return 0
else
echo "Files differ"
echo "File1 checksum: $checksum1"
echo "File2 checksum: $checksum2"
return 1
fi
}
# Usage
reconcile_data "source_data.csv" "target_data.csv" 1 "reconciliation_report.txt"
43. How to implement error handling and retry mechanisms?
Answer:
# Retry mechanism with exponential backoff
#!/bin/bash
retry_with_backoff() {
local max_attempts="$1"
local delay="$2"
local command="$3"
local attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Attempt $attempt of $max_attempts: $command"
if eval "$command"; then
echo "Command succeeded on attempt $attempt"
return 0
else
echo "Command failed on attempt $attempt"
if [ $attempt -lt $max_attempts ]; then
echo "Waiting $delay seconds before retry..."
sleep $delay
delay=$((delay * 2)) # Exponential backoff
fi
attempt=$((attempt + 1))
fi
done
echo "Command failed after $max_attempts attempts"
return 1
}
# ETL process with error handling
process_with_error_handling() {
local input_file="$1"
local output_file="$2"
local error_file="$3"
local log_file="$4"
# Function to log messages
log_message() {
echo "$(date '+%Y-%m-%d %H:%M:%S'): $1" | tee -a "$log_file"
}
# Error cleanup function
cleanup_on_error() {
log_message "ERROR: Process failed, cleaning up temporary files"
rm -f /tmp/process_$$_*
mv "$input_file" "$error_file" 2>/dev/null
}
# Set an error trap
trap cleanup_on_error ERR
log_message "Starting process for $input_file"
# Validate input
if [ ! -f "$input_file" ]; then
log_message "ERROR: Input file not found: $input_file"
return 1
fi
# Process with retry
if retry_with_backoff 3 5 "your_etl_command $input_file $output_file"; then
log_message "SUCCESS: Process completed for $input_file"
mv "$input_file" "/data/processed/"
else
log_message "ERROR: Process failed after retries for $input_file"
cleanup_on_error
return 1
fi
}
# Circuit breaker pattern
circuit_breaker() {
local service_name="$1"
local command="$2"
local failure_threshold=5
local timeout=300
local state_file="/tmp/circuit_breaker_${service_name}"
# Check circuit state
if [ -f "$state_file" ]; then
last_failure=$(cat "$state_file")
current_time=$(date +%s)
if [ $((current_time - last_failure)) -lt $timeout ]; then
echo "Circuit breaker OPEN for $service_name"
return 1
fi
fi
# Try to execute command
if eval "$command"; then
rm -f "$state_file" # Reset on success
return 0
else
echo "$(date +%s)" > "$state_file"
echo "Circuit breaker triggered for $service_name"
return 1
fi
}
44. How to implement parallel processing for large datasets?
Answer:
# Parallel file processing
#!/bin/bash
process_files_parallel() {
local input_dir="$1"
local output_dir="$2"
local max_jobs="$3"
local job_count=0
# Function to process a single file
process_single_file() {
local file="$1"
local output_dir="$2"
local filename=$(basename "$file")
echo "Processing $filename..."
# Your processing logic here
awk -F',' '{print $1,$2,$3}' OFS=',' "$file" > "$output_dir/$filename"
echo "Completed $filename"
}
# Export the function for parallel execution
export -f process_single_file
# Process files in parallel
find "$input_dir" -name "*.csv" | \
    xargs -P "$max_jobs" -I {} bash -c 'process_single_file "$1" "$2"' _ {} "$output_dir"    # run up to max_jobs files at a time (GNU xargs -P)
}