Job Description
About Vumedi:
Vumedi is the largest video education platform for doctors worldwide, dedicated to advancing medical education through innovative video-based learning. Our mission is to empower healthcare professionals by providing them with access to the latest clinical knowledge and surgical techniques from experts around the globe. We curate a vast library of high-quality educational content, enabling users to enhance their skills, stay informed about industry trends, and improve patient outcomes. We are headquartered in Oakland, CA, and have additional offices in Minneapolis, MN, and Zagreb, Croatia.
We're hiring a Senior/Staff/Principal DevOps Engineer to lead the development of our digital platform and products at this critical stage of Vumedi's growth.
Why join Vumedi right now?
- Build technology that matters in a fast-scaling Silicon Valley digital healthcare company: Your work directly impacts how doctors across the world learn and make decisions that save lives.
- Grow as we grow: Be part of a company in an accelerated growth phase, where expanding teams, products, and markets create real opportunities for ownership, leadership, and career progression.
- Build with AI: Work on applied LLM systems - from intelligent search to AI-driven content agents - and shape how AI transforms medical knowledge delivery.
- Own your craft end-to-end: Take full responsibility for building systems that scale globally and power mission-critical workflows.
- Collaborate globally: Join a world-class team of passionate engineers working on a modern tech stack that will further your career development.
- Have real product impact: Influence the direction of product development by collaborating closely with product and leadership teams.
About the role:
We are looking for a DevOps Engineer to join our engineering team and take ownership of our infrastructure, deployment processes, and overall platform reliability. You will work closely with backend and data teams to support a growing video and data platform used by millions of healthcare professionals worldwide.
In this role, you will focus on improving our CI/CD pipelines, system reliability, and developer experience, while helping scale our cloud infrastructure in a secure and cost-efficient way. You will work extensively with AWS services (compute, storage, networking, IAM, monitoring) and help ensure our systems are reliable, observable, and well-architected.
You'll also support and enable emerging AI/ML and LLM-powered systems used for large-scale medical content processing, helping build and operate the infrastructure required for these workloads. This includes improving data pipelines, optimizing resource usage, and ensuring production-grade reliability of AI-driven services.
This is a high-impact role with a broad scope, from supporting production systems and data pipelines to driving long-term improvements in how we build, deploy, and operate our platform, with strong ownership and autonomy in shaping DevOps practices.
What you will do:
- Own and improve our infrastructure, CI/CD pipelines, and deployment processes across multiple environments
- Work with AWS services (compute, storage, networking, IAM, monitoring) to ensure scalable, secure, and reliable systems
- Collaborate closely with backend and data teams to support production systems, data pipelines, and overall platform reliability
- Continuously improve developer experience by streamlining workflows, reducing friction, and enabling faster, safer deployments
- Contribute to improving security practices, access control, and compliance of our infrastructure
- Automate infrastructure and workflows using Python
- Improve observability by implementing and maintaining monitoring, logging, and alerting systems
- Troubleshoot production issues, participate in incident response, and implement long-term fixes to improve system stability
- Identify and drive improvements in performance, scalability, and cost efficiency across the platform
- Support and scale AI/ML and LLM-based systems, ensuring reliable infrastructure for data processing and content classification workloads
Who you are:
- You have 5+ years of experience in DevOps, SRE, or infrastructure engineering, with a strong focus on cloud-native environments (preferably AWS)
- You have managed cloud infrastructure (networking, IAM, compute, storage) with a strong understanding of security best practices and cost optimization
- You have experience building and maintaining CI/CD pipelines to support rapid, reliable software delivery across multiple environments
- You are comfortable writing Python for automation, scripting, and building internal tooling to improve infrastructure and developer workflows
- You have a strong understanding of monitoring, logging, and observability (e.g., Datadog, Prometheus, CloudWatch), and you proactively identify and resolve issues
- You are comfortable debugging production issues across systems and collaborating with engineering teams to resolve them
- You are proactive, take ownership, and enjoy working in environments with high autonomy and evolving processes
- You communicate clearly and collaborate effectively with engineers, product managers, and other stakeholders
- You are curious and motivated to learn, especially in areas like AI/ML infrastructure and large-scale systems
Required Qualifications:
- 5+ years of experience in DevOps, Site Reliability Engineering, or infrastructure-focused roles
- Proven experience designing and operating scalable, reliable, and secure cloud infrastructure (preferably AWS) in production environments
- Strong understanding of cloud security best practices (IAM, network security, secrets management), preferably within AWS
- Proficiency in Python for automation, scripting, and tooling
- Hands-on experience building and maintaining CI/CD pipelines
- Experience with monitoring, logging, and alerting tools (e.g., Datadog, CloudWatch, Prometheus)
- Experience working in a Linux-based environment
- Ability to drive infrastructure and DevOps strategy, balancing scalability, reliability, and cost
- Experience working cross-functionally and influencing engineering teams on best practices and architectural decisions
- Strong ownership mindset with the ability to operate autonomously in ambiguous environments
Preferred Qualifications:
- Experience supporting or scaling AI/ML or LLM-based systems in production
- Experience with containerized applications (Docker) and familiarity with orchestration concepts (Kubernetes or ECS)
- Familiarity with Infrastructure as Code principles (e.g., Terraform), including experience introducing Infrastructure as Code into existing environments
- Experience working with or supporting backend systems and data platforms (e.g., Postgres; Airflow is a plus)
- Background in backend engineering or software development
- Experience working in a fast-paced startup or scale-up environment
- Experience leading and mentoring engineers, while contributing to team-wide best practices
This is a hybrid role, working 3 days a week (Monday, Wednesday, and Friday) in our Oakland office.