Sr. Site Reliability Engineer

The Value Maximizer

Location: Austin, TX

Salary: Not specified

Type: Full-time

Posted: Today, via LinkedIn

Job Description

Role: Sr. Site Reliability Engineer (SRE) - Unified Observability & AIOps

Location: Austin, TX / Fort Mill, SC (Hybrid)

Job Type: Full Time

Role Summary

We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.

Key Responsibilities

Observability & Reliability Engineering

  • Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
  • Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
  • Build actionable dashboards for operations, engineering, and leadership
  • Implement alerting strategies using static and dynamic thresholds

Proactive Detection & AIOps

  • Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
  • Transition monitoring from reactive alerts to proactive insights
  • Implement noise reduction, alert correlation, and root cause analysis
  • Apply baseline modeling, seasonality detection, and anomaly scoring

Distributed Systems & Dependency Analysis

  • Monitor and troubleshoot multi-service architectures involving:
      • Microservices
      • Downstream APIs
      • Kafka / streaming platforms
      • Cloud infrastructure (Terraform, IaC)
  • Identify whether issues originate from:
      • Upstream/downstream dependencies
      • Streaming platform
      • Infrastructure
      • Application code

Tooling & Platforms

  • Deep hands-on experience with Dynatrace (mandatory)
  • Experience with:
      • OpenTelemetry
      • Prometheus / Grafana
      • ELK / EFK
      • Cloud-native monitoring (AWS/Azure/GCP)
  • Strong JSON-based telemetry manipulation and enrichment

GenAI & LLM Enablement

  • Apply GenAI / LLMs for:
      • Incident summarization
      • Root cause explanation
      • Runbook recommendations
      • Auto-remediation suggestions
  • Collaborate with platform teams to operationalize GenAI safely

Required Skills & Experience

  ✅ 15+ years in SRE / Production Engineering
  ✅ Strong Unified Observability background (not infra-only)
  ✅ Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
  ✅ SLI/SLO engineering experience in production systems
  ✅ Experience implementing dynamic thresholds and anomaly detection
  ✅ Knowledge of AI/ML concepts applied to Ops (AIOps)
  ✅ Distributed systems troubleshooting expertise
  ✅ Experience with Kafka or streaming data platforms

Differentiators (Highly Valued)

  • Experience in financial services or regulated environments
  • Proven reduction of alert noise and MTTR using AIOps
  • GenAI / LLM integration into operations workflows
