Location: Denver, CO
Salary: Not specified
Type: Full-time
Posted: Today
Job Description
Senior Site Reliability Engineer
BillingPlatform is an industry-leading, fast-growing SaaS company. Our award-winning, cloud-based revenue lifecycle management platform is leveraged by leading global enterprises to automate and streamline the entire quote-to-cash process. At BillingPlatform, our employees are our most valuable asset, and we believe deeply in a culture of collaboration, accountability, innovation, and transparency. We seek bright, enthusiastic, and creative professionals looking to be part of our incredible team focused on challenging the status quo and driving transformational value to customers.
Backed by leading private equity firms FTV Capital and Columbia Capital, we have achieved remarkable industry recognition for growth, including being listed for the fifth consecutive year on Deloitte’s Technology Fast 500™ list of fastest-growing technology companies and being ranked on the Inc. 5000 list for four years running.
Our ability to innovate market-leading solutions has been validated by all major industry analyst firms, including being named a Leader in the first-ever Gartner® Magic Quadrant™ for Recurring Billing Applications, and being recognized as the Leader in Forrester Research’s “The Forrester Wave™: SaaS Recurring Billing Solutions.” To learn more about us, visit billingplatform.com.
Responsibilities
- Own and improve on-call processes, incident response playbooks, and post-mortem culture
- Define, track, and manage SLOs, SLIs, and error budgets for critical services
- Lead blameless post-mortems and drive systematic reliability improvements
- Respond to production incidents and coordinate cross-functional resolution
- Design, build, and maintain scalable AWS infrastructure using IaC (Terraform, Pulumi)
- Manage Kubernetes clusters and containerized workloads in production
- Build and maintain CI/CD pipelines to improve deployment speed and reliability
- Evaluate and implement tooling to enhance developer productivity and system stability
- Implement monitoring, alerting, and distributed tracing (Prometheus, Grafana, Datadog, Jaeger)
- Identify and resolve performance bottlenecks across services, networks, and databases
- Build dashboards and runbooks for self-service operational insights
- Partner with engineering teams to embed reliability practices (load testing, capacity planning, chaos engineering)
- Conduct architecture reviews with a focus on reliability and operability
Qualifications
- 5+ years of experience in SRE, DevOps, or infrastructure engineering
- Deep expertise with AWS and cloud-native architectures
- Strong experience with Kubernetes and container orchestration at scale
- Hands-on experience with infrastructure-as-code tools (Terraform or Pulumi)
- Proficiency in Python, Go, or Bash
- Experience with observability tools (Prometheus, Grafana, Datadog, or similar)
- Strong understanding of SLOs, SLIs, and error budgets
- Experience with service mesh technologies (Istio, Linkerd)
- Familiarity with chaos engineering tools (Chaos Monkey, Gremlin, LitmusChaos)
- Background in Oracle database reliability and administration
- Contributions to open-source infrastructure projects
- Experience in a high-growth SaaS or product-led environment
- Excellent English communication skills (written and spoken)
Benefits
- Competitive compensation with a robust benefits package, including medical, dental, vision, LTD, HSA, FSA, free virtual mental health counseling, and health and wellness perks
- Medical insurance coverage effective on the first day of employment
- 401(k) match that is 100% immediately vested
- Discretionary and charitable time off program
- Home office setup allowance for fully remote employees