Senior Site Reliability Engineer

BillingPlatform is an industry-leading, fast-growing SaaS company. Our award-winning, cloud-based revenue lifecycle management platform is leveraged by leading global enterprises to automate and streamline the entire quote-to-cash process. At BillingPlatform, our employees are our most valuable asset, and we believe deeply in a culture of collaboration, accountability, innovation, and transparency. We seek bright, enthusiastic, and creative professionals looking to be part of our incredible team focused on challenging the status quo and driving transformational value to customers.

Backed by leading private equity firms FTV Capital and Columbia Capital, we have achieved remarkable industry recognition for growth, including being listed for the fifth consecutive year on Deloitte’s Technology Fast 500™ list of fastest-growing technology companies and ranked on the Inc. 5000 list for four years running.

Our ability to innovate market-leading solutions has been validated by all major industry analyst firms, including being named a Leader in the first-ever Gartner® Magic Quadrant™ for Recurring Billing Applications, and being recognized as the Leader in Forrester Research’s “The Forrester Wave™: SaaS Recurring Billing Solutions.” To learn more about us, visit billingplatform.com.

Responsibilities

Own and improve on-call processes, incident response playbooks, and post-mortem culture
Define, track, and manage SLOs, SLIs, and error budgets for critical services
Lead blameless post-mortems and drive systematic reliability improvements
Respond to production incidents and coordinate cross-functional resolution
Design, build, and maintain scalable AWS infrastructure using IaC (Terraform, Pulumi)
Manage Kubernetes clusters and containerized workloads in production
Build and maintain CI/CD pipelines to improve deployment speed and reliability
Evaluate and implement tooling to enhance developer productivity and system stability
Implement monitoring, alerting, and distributed tracing (Prometheus, Grafana, Datadog, Jaeger)
Identify and resolve performance bottlenecks across services, networks, and databases
Build dashboards and runbooks for self-service operational insights
Partner with engineering teams to embed reliability practices (load testing, capacity planning, chaos engineering)
Conduct architecture reviews with a focus on reliability and operability

Qualifications

5+ years of experience in SRE, DevOps, or infrastructure engineering
Deep expertise with AWS and cloud-native architectures
Strong experience with Kubernetes and container orchestration at scale
Hands-on experience with infrastructure-as-code tools (Terraform or Pulumi)
Proficiency in Python, Go, or Bash
Experience with observability tools (Prometheus, Grafana, Datadog, or similar)
Strong understanding of SLOs, SLIs, and error budgets
Experience with service mesh technologies (Istio, Linkerd)
Familiarity with chaos engineering tools (Chaos Monkey, Gremlin, LitmusChaos)
Background in Oracle database reliability and administration
Contributions to open-source infrastructure projects
Experience in a high-growth SaaS or product-led environment
Excellent English communication skills (written and spoken)

Benefits

Competitive compensation with a robust benefits package, including medical, dental, vision, LTD, HSA, FSA, free virtual mental health counseling, and health and wellness perks
Medical insurance coverage effective on the first day of employment
401(k) match that is 100% immediately vested
Discretionary and charitable time off program
Home office setup allowance for fully remote employees
