We’re seeking an AWS Site Reliability Engineer (SRE) with strong incident operations experience to support and improve the reliability of cloud and data platform services across AWS and Snowflake.

This role is hands-on and operationally focused: proactive monitoring, rapid incident response, service restoration, root cause analysis, and automation to improve resilience and reduce MTTR.

What you’ll do

Incident triage, coordination, and resolution for AWS and Snowflake services in production

Alerts, dashboards, and service health indicators

Root cause analysis (RCA), post-incident remediation, and continuous improvement

Runbooks, operational procedures, and on-call readiness

On-call rotations (including operational handovers)

MTTR reduction

What you’ll bring (required)

EC2, S3, IAM, VPC, Lambda, CloudWatch

Snowflake administration and troubleshooting

CloudWatch, Datadog, Grafana, and/or Splunk

SLIs, SLOs, error budgets, incident management

Python, Bash, and/or Terraform
