Location: Remote
Salary: $140,000 - $180,000 per year
Type: Full-time
Posted: Today
Job Description
Senior / Staff DevOps Engineer (Platform & Reliability)
Location: Remote (U.S. or Canada)
Company: Peerlogic
The Role
Peerlogic is hiring a Senior / Staff DevOps Engineer to own the platform, infrastructure, and reliability of a production system that spans application services, AI/ML workloads, and real-time voice infrastructure.
You are replacing a strong DevOps leader and not building from scratch. The system works. Your job is to make it exceptional.
This is not a support role.
This is not a ticket-driven role.
You will:
- Own reliability end-to-end
- Make architectural decisions with real consequences
- Operate in ambiguity without waiting for direction
If you prefer clearly defined scopes, narrow ownership, or “assigned work,” this is not the role.
What You’ll Own
Platform & Infrastructure
- End-to-end ownership of cloud + hybrid infrastructure (AWS, GCP, and physical environments)
- Multi-region architecture targeting 99.999% uptime
- Kubernetes clusters and container orchestration across all services
- CI/CD pipelines (GitHub Actions): reliability, speed, and developer experience
- Infrastructure as Code (Terraform, Ansible)
Reliability & Observability
- Design and enforce SLOs, SLIs, and error budgets
- Build a best-in-class observability stack (metrics, logs, traces)
- Drive incident response, postmortems, and systemic fixes (not band-aids)
- Reduce MTTR and eliminate repeat incidents
Data & Event Systems
- Ownership of event-driven architecture (RabbitMQ or equivalent)
- Ensure durability, replayability, and correctness of pipelines
- Design and maintain backfill and recovery strategies
- Improve debuggability of asynchronous systems
AI / ML Infrastructure
- Operate and scale LLM-powered systems (Bedrock, SageMaker, or equivalent)
- Manage inference workloads with a focus on:
  - Latency
  - Cost
  - Reliability
- Build and maintain:
  - Evaluation pipelines
  - Dataset versioning
  - Reproducible ML workflows
Performance & Cost
- Own infrastructure cost efficiency across:
  - Compute
  - Storage
  - LLM usage
- Continuously optimize tradeoffs between:
  - Performance
  - Reliability
  - Cost
Security & Compliance
- Own infrastructure posture for SOC 2 and HIPAA
- Ensure secure handling of PHI (encryption, access controls, auditability)
- Implement and enforce:
  - Secrets management
  - IAM best practices
  - Network isolation
- Partner with compliance tooling (e.g., Sprinto)
What You Will NOT Own
- SIP routing, dial plans, or telecom call flows
- Carrier integrations or VoIP-specific logic
(You will collaborate closely with a dedicated VoIP Infrastructure Engineer where systems intersect.)
What We’re Looking For
Experience
- 5–10+ years in DevOps, SRE, or Infrastructure Engineering
- Proven ownership of production systems at scale
- Experience operating multi-region, high-availability systems
Technical Depth
Strong hands-on experience with:
- Kubernetes, ECS, and containerized systems
- Terraform and infrastructure as code
- CI/CD systems (GitHub Actions preferred)
- Networking fundamentals (TCP/IP, DNS, iptables, load balancing)
You should also:
- Be comfortable writing code (Python, Go, or similar)
- Have experience with real-time or low-latency systems
- Understand event-driven architectures deeply
Mindset (this matters more than tools)
- You take ownership beyond your “area”
- You fix root causes, not symptoms
- You make decisions with incomplete information
- You care about systems, not just infrastructure
Our Stack (Partial)
- AWS, GCP, Kubernetes
- Python, Postgres
- RabbitMQ / async pipelines
- LLM systems (multi-agent, inference pipelines)
- VoIP + EHR integrations (adjacent systems)
What Success Looks Like
Within 3–6 months:
- Reliability improves measurably (fewer incidents, faster recovery)
- Observability provides clear, actionable insights across systems
- CI/CD becomes faster, safer, and more predictable
- Event-driven systems are easier to debug and recover
Within 6–12 months:
- Platform operates at or near 5-nines reliability
- Infrastructure scales cleanly across app, AI, and voice workloads
- AI systems are cost-efficient and production-grade
- Engineering velocity increases due to strong platform foundations
Team & Environment
- ~10-person engineering team
- Reports directly to the CTO
- High-ownership, fast-moving startup
- Expectation of after-hours ownership when needed
Compensation
- $140K – $180K CAD base (flexible for Senior vs Staff)
- Equity included
- Will stretch for the right candidate
Why This Role Matters
Peerlogic sits at the intersection of healthcare, AI, and real-time communication.
This role ensures the platform is:
- Fast enough for real-time interaction
- Reliable enough for healthcare workflows
- Scalable enough to support rapid growth