Location
Chandler, AZ
Salary
Not specified
Type
Contract
Posted
Today
Job Description
Site Reliability Engineer (SRE)
Chandler, AZ
$60-$70/hour
Hybrid: 3 days onsite, 2 days remote
18 Month W2 Contract
The individual in this role partners directly with Application Development and Production Support teams to implement the measures prescribed by the Site Reliability Engineer (SRE) Lead or Senior SRE in collaboration with their partners. This individual will ensure that the appropriate instrumentation, tooling, ticketing, alerting and on-call routines are in place for key services. The role takes part in production triage and works with Problem Management to identify the root cause of issues as required, using the knowledge gained to partner closely with the Senior SRE in addressing any gaps in the reliability measurements and dashboards. The role also focuses heavily on software development, delivering automated solutions that eliminate 'toil' and suggesting code enhancements to the Application Development teams.
Key responsibilities
- Collaborate with Development and Infrastructure teams to understand technical solutions and to implement the monitoring capabilities outlined in the application and system monitoring designs put forward by the SRE Lead.
- Mentor SRE resources on reliability practices and established tools/capabilities.
- Develop and maintain a catalog of extensible reliability scripts, tools and libraries that can be leveraged for common instrumentation, automation, and operational needs.
- Partner with infrastructure engineers and application teams to implement the code changes needed to use common reliability libraries and tools, and help Application Production Services (APS) and Application Development teammates understand how to use them.
- Engage as a subject matter expert (SME) in major incident triage and failure scenario modelling, and work with the Problem Manager to diagnose root causes in major incident and problem management investigations.
- Identify vulnerabilities and opportunities for reliability improvement, such as investigating low-level error rates and 'noise' in monitoring, and help define solutions that reduce manual support effort and/or improve system reliability.
- Participate regularly in an on-call rotation with Production Support teammates to learn more about the reliability issues affecting their portfolio.