Location
Deerfield, IL
Salary
Not specified
Type
Full-time
Posted
Today
via LinkedIn
Job Description
K&K Global Talent Solutions Inc. is an international recruiting agency that has been providing technical resources in Canada and the USA since 1993.
This position is with one of our clients in the USA, who is actively hiring candidates to expand their teams.
Job Role: Full-time only
Experience: 10+ years (required)
DevOps & Site Reliability Lead
Must Have Technical/Functional Skills
- We are seeking a Site Reliability Engineer (SRE) with strong expertise in Talend and Big Data platforms to support and operate large-scale data processing environments.
- The role requires close collaboration with customers, application teams, and offshore delivery teams to ensure platform reliability, incident management, and operational excellence. Experience with Databricks is a strong plus.
Key Responsibilities
- Act as an SRE for Big Data and ETL platforms, ensuring high availability, performance, and reliability of data pipelines and applications.
- Provide operational support and Major Incident Management (MIM), including triage, root cause analysis, and resolution of production issues.
- Serve as a primary point of contact for customers, providing timely updates, issue resolution, and operational insights.
- Collaborate closely with application teams to support ETL jobs, data processing workflows, and platform enhancements.
- Coordinate with offshore teams for day-to-day operations, incident resolution, and continuous improvement initiatives.
- Monitor, troubleshoot, and optimize Talend, Hadoop, Spark, and Big Data ecosystems.
- Implement and support monitoring, alerting, runbooks, and automation to improve platform stability and reduce manual effort.
- Participate in problem management, change management, and post-incident reviews to drive preventive measures.
- Support capacity planning, performance tuning, and reliability improvements across the data landscape.
Required Skills & Qualifications
- Strong hands-on experience with Talend (development, support, and troubleshooting).
- Solid understanding of Big Data technologies, including:
  - Hadoop ecosystem
  - Apache Spark
- Proven experience handling Major Incident Management (MIM) and production support in a 24x7 or on-call environment.
- Experience working directly with customers, business stakeholders, and cross-functional teams.
- Strong coordination skills to manage and guide offshore teams.
- Knowledge of ITIL processes, especially Incident, Problem, and Change Management.
- Excellent communication, documentation, and stakeholder management skills.