K\&K Global Talent Solutions Inc. is an international recruiting agency that has been providing technical resources in Canada and the USA since 1993.

This position is with one of our clients in the USA, which is actively hiring candidates to expand its teams.

Job Role: Full-time Only

Experience: 10\+ years (required)

DevOps \& Site Reliability Lead

Job Description

Must Have Technical/Functional Skills

We are seeking a Site Reliability Engineer (SRE) with strong expertise in Talend and Big Data platforms to support and operate large-scale data processing environments.
The role requires close collaboration with customers, application teams, and offshore delivery teams to ensure platform reliability, incident management, and operational excellence. Experience with Databricks is a strong plus.

Key Responsibilities

Act as an SRE for Big Data and ETL platforms, ensuring high availability, performance, and reliability of data pipelines and applications.
Provide operational support and Major Incident Management (MIM), including triage, root cause analysis, and resolution of production issues.
Serve as a primary point of contact for customers, providing timely updates, issue resolution, and operational insights.
Collaborate closely with application teams to support ETL jobs, data processing workflows, and platform enhancements.
Coordinate with offshore teams for day-to-day operations, incident resolution, and continuous improvement initiatives.
Monitor, troubleshoot, and optimize Talend, Hadoop, Spark, and Big Data ecosystems.
Implement and support monitoring, alerting, runbooks, and automation to improve platform stability and reduce manual effort.
Participate in problem management, change management, and post-incident reviews to drive preventive measures.
Support capacity planning, performance tuning, and reliability improvements across the data landscape.

Required Skills \& Qualifications

Strong hands-on experience with Talend (development, support, and troubleshooting).
Solid understanding of Big Data technologies, including the Hadoop ecosystem and Apache Spark.

Proven experience handling Major Incident Management (MIM) and production support in a 24x7 or on-call environment.
Experience working directly with customers, business stakeholders, and cross-functional teams.
Strong coordination skills to manage and guide offshore teams.
Knowledge of ITIL processes, especially Incident, Problem, and Change Management.
Excellent communication, documentation, and stakeholder management skills.

DevOps Engineer

Job Description