Location: Reston, VA (Hybrid)
Salary: Not specified
Type: Full-time
Posted: Today
Job Description
Site Reliability Engineer (SRE) / Platform Engineer
Location: Reston, VA (Hybrid — 2 days onsite / 3 days remote)
Employment Type: Full-time
About the Organization
Join a mission-driven, national financial services organization at the heart of the U.S. housing finance ecosystem. This is a mid-sized, highly regulated enterprise operating at market scale, supporting platforms and analytics that enable trillions of dollars in annual economic activity. You’ll work in a modern tech environment with strong engineering partners, clear business impact, and a mandate for reliability, security, and continuous improvement.
The Role
Our client is hiring a hands-on SRE / Platform Engineer to operate, tune, and scale OpenShift/Kubernetes platforms while bridging on-prem infrastructure to Azure to power their analytics ecosystem. You’ll own reliability, automation, and observability across a hybrid estate, partnering closely with developers, data engineers, infrastructure operations, and security to deliver secure, performant platform services using modern DevSecOps practices.
Why This Role Stands Out
- Hybrid impact: Operate critical OpenShift clusters and manage Azure services used by data and analytics teams.
- Hybrid architecture: Help design and support the bridge from on-prem to cloud: migration, integration, and steady-state operations.
- Real-world scale: Reliability work that directly supports high-volume financial market operations and enterprise analytics.
- Automation-first: Lean into Terraform, Ansible, and GitOps to make reliability repeatable.
What You’ll Do in the First 180 Days
- Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies).
- Stand up and/or refine observability (Datadog, Prometheus, Grafana): dashboards, alerts, SLOs, runbooks.
- Map the current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible).
- Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams.
- Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails.
- Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams.
- Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement.
- Advance the hybrid service model: migrations, integrations, reliability/latency tuning, cost and performance optimization.
Day-to-Day Responsibilities
- Operate and optimize OpenShift/Kubernetes clusters, ingress (e.g., Nginx), and container networking/service mesh.
- Manage Azure services (compute, VNet, storage, data services) supporting analytics workloads.
- Build and maintain automated infrastructure with Terraform, Ansible, and GitOps workflows.
- Implement and evolve observability (Datadog, Prometheus, Grafana): metrics, traces, logs, alerting, SLOs, runbooks.
- Design, harden, and support delivery pipelines with ArgoCD/Jenkins/GitHub Actions.
- Provide platform tooling and enablement for application developers, data engineers, and operations teams.
- Ensure security and access management (HashiCorp Vault, secrets management, least privilege).
- Lead incident response, coordinate cross-functional resolution, and drive corrective actions and platform improvements.
- Script or develop tools in Bash, Python, or Go to eliminate toil and improve developer experience.
Tech You’ll Work With
- Kubernetes / OpenShift
- Azure (compute, networking, storage, and data services)
- Automation & IaC: Terraform, Ansible, GitOps
- Observability: Datadog, Prometheus, Grafana
- Networking & Ingress: Nginx, service meshes, container networking
- Messaging: Kafka, AMQ
- Secrets & Access: HashiCorp Vault
- CI/CD: ArgoCD, Jenkins, GitHub Actions
- Scripting/Coding: Bash, Python, Go
Must-Have Qualifications
- 2+ years hands-on operating and managing Kubernetes and OpenShift clusters.
- Strong experience with Microsoft Azure (compute, networking, storage, and data services).
- Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps).
- Proficiency with observability tooling (Datadog, Prometheus, Grafana).
- Scripting/coding ability in Bash, Python, or Go.
Preferred / Stand-Out Skills
- Experience bridging on-prem and cloud in a hybrid service model (migration, integration, optimization).
- Expertise with Kafka/AMQ, HashiCorp Vault, and ArgoCD/Jenkins/GitHub Actions.
- Background leading incident response and postmortems with strong RCA and continuous improvement practices.
Work Model & Team
- Hybrid: 2 days onsite in Reston, VA; 3 days remote.
- You’ll be part of the IT organization, collaborating daily with developers, data engineers, infrastructure operations, and security.
How to Succeed Here
- You’re a hands-on engineer who thrives in regulated, high-impact environments.
- You favor automation over repetition, and observability over guesswork.
- You collaborate openly, communicate clearly, and leave systems better than you found them.