Senior Data Engineer with 6+ years of experience in cloud-native data engineering, specializing in AWS, PySpark, Snowflake, and Hadoop. Proven track record of building scalable ETL pipelines, reducing data processing time by 63%, and optimizing cloud data platforms for high-volume analytics. Strong expertise in data integration, automation, and distributed systems, with hands-on experience in Docker and modern data stacks.
- Migrated enterprise data warehouse workloads from Snowflake to an internal AWS-based central data lake (S3 + Glue + EMR), reducing query latency by 45% and saving $200K annually.
- Automated daily ETL pipelines processing 50M+ records using Python and AWS Glue, achieving 99% data accuracy and reducing manual intervention by 90%.
- Implemented incremental loading strategies using AWS Glue and Spark (see the sketch below), enabling near-zero downtime during migration and ensuring continuous access for end users.
- Developed a monitoring system in Python to detect anomalies in data flow across AWS and Snowflake, cutting pipeline downtime by 30% and improving pipeline reliability.
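A minimal sketch of the incremental-load pattern above, assuming hypothetical Glue catalog and S3 names; Glue job bookmarks (enabled via `transformation_ctx`) are what let each run pick up only data that arrived since the previous run:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate; job bookmarks must also be
# enabled in the job configuration for incremental tracking.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical catalog database/table names for illustration.
incremental = glue_context.create_dynamic_frame.from_catalog(
    database="payments_db",
    table_name="raw_transactions",
    transformation_ctx="incremental",  # enables bookmark tracking
)

# Land only the new records in the central data lake as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=incremental,
    connection_type="s3",
    connection_options={"path": "s3://central-data-lake/transactions/"},
    format="parquet",
)

job.commit()  # advances the bookmark only after a successful run
```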
- Implemented an AWS-based data pipeline processing 20M+ payment records/day, achieving a 63% reduction in processing time.
- Utilized PySpark DataFrames to process extensive payment data, achieving a 37% reduction in job execution time and a 56% decrease in resource consumption through optimized partitioning, caching, and broadcast joins (see the sketch below).
- Designed and optimized Snowflake schemas (star/snowflake), leveraging micro-partitioning, clustering keys, and secure data sharing for analytics.
- Increased data accessibility for data analysts by 30% by providing clear and consistent access to processed data in Redshift.
- Integrated data sources into AWS Glue, optimized DynamoDB, automated S3 backups with Boto3, and prototyped CI/CD with Jenkins.
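A minimal sketch of the partition/cache/broadcast pattern referenced above, with hypothetical S3 paths and column names; broadcasting the small dimension table avoids shuffling the large fact table across the cluster:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("payments-optimization").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
payments = spark.read.parquet("s3://bucket/payments/")
merchants = spark.read.parquet("s3://bucket/merchants/")

# Repartition on the join key to reduce shuffle skew, then cache
# because the frame feeds multiple downstream aggregations.
payments = payments.repartition(200, "merchant_id").cache()

# Broadcasting the small side turns the shuffle join into a map-side join.
enriched = payments.join(F.broadcast(merchants), "merchant_id")

daily_totals = (
    enriched.groupBy("merchant_id", "payment_date")
    .agg(F.sum("amount").alias("total_amount"))
)
daily_totals.write.mode("overwrite").parquet("s3://bucket/daily_totals/")
```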
- Designed ETL pipelines with FastAPI to streamline data flow between systems, reducing daily processing of incoming datasets by an average of two hours without compromising data integrity (see the sketch below).
- Leveraged Docker for rapid iteration and reproducible development environments.
- Analyzed application performance metrics using Splunk, diagnosing 15 critical bottlenecks in real-time data processing; implemented system enhancements that increased application uptime to 99.9% and improved user satisfaction ratings.
- Optimized processing of 6 million records per day, speeding up report generation by 1.5x. Validated code with pytest, maintaining 80% test coverage for robust data pipelines.
- Implemented string matching, a rule engine, and n-gram generation to classify Protection Groups against reference data, extracting data from various sources using Python ML libraries.
- Applied data preprocessing techniques to improve dataset accuracy and remove outliers.
- Developed a rule engine in Python to apply business rules to statements, and an n-gram matching algorithm using NLTK to compare sentences (see the sketch below).
- Saved 120 person-hours by developing a Python API to automate classification with a rule-based approach.
- Achieved 81% classification accuracy with the rule engine, exceeding SME results by more than 15%.
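A minimal sketch of n-gram sentence comparison with NLTK, using Jaccard similarity over word bigrams; the function and example strings are illustrative, not the exact production logic:

```python
from nltk.util import ngrams

def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity over word n-grams of two sentences."""
    grams_a = set(ngrams(a.lower().split(), n))
    grams_b = set(ngrams(b.lower().split(), n))
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Example: score a statement against a reference-data entry.
score = ngram_similarity(
    "payment failed due to insufficient funds",
    "transaction failed due to insufficient balance",
)
print(f"bigram similarity: {score:.2f}")
```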
- Upgraded SQL Server infrastructure, migrating 33 servers from Unix to Linux with a seamless transition and a 33% improvement in system performance.
- Developed replication views, stored procedures, triggers, and cron jobs for task scheduling, using SCP to transfer files and scripts between servers (see the sketch below), streamlining operations and reducing manual intervention.
- Optimized query performance by analyzing execution plans and implementing appropriate indexing strategies, leading to a 28% reduction in query execution time.
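A minimal sketch of the scripted server-to-server file transfer, using paramiko's SFTP client as an SCP-style copy over SSH; hostnames, usernames, and paths are hypothetical, and credentials would come from an SSH agent or vault in practice:

```python
import paramiko

def push_script(host: str, local_path: str, remote_path: str) -> None:
    """Copy a local file to a remote server over SSH (SCP-style)."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username="deploy")  # auth via SSH agent/keys
    try:
        sftp = ssh.open_sftp()
        sftp.put(local_path, remote_path)
        sftp.close()
    finally:
        ssh.close()

# Example: distribute a backup script to a migrated Linux host.
push_script("db-server-01", "backup.sh", "/opt/scripts/backup.sh")
```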