Location: Reston, VA (Hybrid)
Salary: Not specified
Type: Full-time
Posted: Today
Job Description
Site Reliability Engineer (SRE) / Platform Engineer
Location: Reston, VA (Hybrid — 2 days onsite / 3 days remote)
Employment Type: Full-time
About the Organization
Join a mission-driven, national financial services organization at the heart of the U.S. housing finance ecosystem. This is a mid-sized, highly regulated enterprise operating at market scale, supporting platforms and analytics that enable trillions of dollars in annual economic activity. You’ll work in a modern tech environment with strong engineering partners, clear business impact, and a mandate for reliability, security, and continuous improvement.
The Role
Our client is hiring a hands-on SRE / Platform Engineer to operate, tune, and scale OpenShift/Kubernetes platforms while bridging on-prem infrastructure to Azure to power their analytics ecosystem. You’ll own reliability, automation, and observability across a hybrid estate, partnering closely with developers, data engineers, infrastructure operations, and security to deliver secure, performant platform services using modern DevSecOps practices.
Why This Role Stands Out
- Hybrid impact: Operate critical OpenShift clusters and manage Azure services used by data and analytics teams.
- Hybrid architecture: Help design and support the bridge from on-prem to cloud: migration, integration, and steady-state operations.
- Real-world scale: Reliability work that directly supports high-volume financial market operations and enterprise analytics.
- Automation-first: Lean into Terraform, Ansible, and GitOps to make reliability repeatable.
What You’ll Do in the First 180 Days
- Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies).
- Stand up and/or refine observability (Datadog, Prometheus, Grafana): dashboards, alerts, SLOs, runbooks.
- Map the current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible).
- Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams.
- Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails.
- Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams.
- Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement.
- Advance the hybrid service model: migrations, integrations, reliability/latency tuning, cost and performance optimization.
Day-to-Day Responsibilities
- Operate and optimize OpenShift/Kubernetes clusters, ingress (e.g., Nginx), and container networking/service mesh.
- Manage Azure services (compute, VNet, storage, data services) supporting analytics workloads.
- Build and maintain automated infrastructure with Terraform, Ansible, and GitOps workflows.
- Implement and evolve observability (Datadog, Prometheus, Grafana): metrics, traces, logs, alerting, SLOs, runbooks.
- Design, harden, and support delivery pipelines with ArgoCD/Jenkins/GitHub Actions.
- Provide platform tooling and enablement for application developers, data engineers, and operations teams.
- Ensure security and access management (HashiCorp Vault, secrets management, least privilege).
- Lead incident response, coordinate cross-functional resolution, and drive corrective actions and platform improvements.
- Script or develop tools in Bash, Python, or Go to eliminate toil and improve developer experience.
Tech You’ll Work With
- Kubernetes / OpenShift
- Azure (compute, networking, storage, and data services)
- Automation & IaC: Terraform, Ansible, GitOps
- Observability: Datadog, Prometheus, Grafana
- Networking & Ingress: Nginx, service meshes, container networking
- Messaging: Kafka, AMQ
- Secrets & Access: HashiCorp Vault
- CI/CD: ArgoCD, Jenkins, GitHub Actions
- Scripting/Coding: Bash, Python, Go
Must-Have Qualifications
- 2+ years hands-on operating and managing Kubernetes and OpenShift clusters.
- Strong experience with Microsoft Azure (compute, networking, storage, and data services).
- Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps).
- Proficiency with observability tooling (Datadog, Prometheus, Grafana).
- Scripting/coding ability in Bash, Python, or Go.
Preferred / Stand-Out Skills
- Experience bridging on-prem and cloud in a hybrid service model (migration, integration, optimization).
- Expertise with Kafka/AMQ, HashiCorp Vault, and ArgoCD/Jenkins/GitHub Actions.
- Background leading incident response and postmortems with strong RCA and continuous improvement practices.
Work Model & Team
- Hybrid: 2 days onsite in Reston, VA; 3 days remote.
- You’ll be part of the IT organization, collaborating daily with developers, data engineers, infrastructure operations, and security.
How to Succeed Here
- You’re a hands-on engineer who thrives in regulated, high-impact environments.
- You favor automation over repetition, and observability over guesswork.
- You collaborate openly, communicate clearly, and leave systems better than you found them.