Job Description
About Vumedi:
Vumedi is the largest video education platform for doctors worldwide, dedicated to advancing medical education through innovative video-based learning. Our mission is to empower healthcare professionals by providing them with access to the latest clinical knowledge and surgical techniques from experts around the globe. We curate a vast library of high-quality educational content, enabling users to enhance their skills, stay informed about industry trends, and improve patient outcomes. We are headquartered in Oakland, CA, and have additional offices in Minneapolis, MN, and Zagreb, Croatia.
We're hiring a Senior/Staff/Principal DevOps Engineer to lead the development of our digital platform and products at this critical stage of Vumedi's growth.
Why join Vumedi right now?
- Build technology that matters in a fast-scaling Silicon Valley digital healthcare company: Your work directly impacts how doctors across the world learn and make decisions that save lives.
- Grow as we grow: Be part of a company in an accelerated growth phase, where expanding teams, products, and markets create real opportunities for ownership, leadership, and career progression.
- Build with AI: Work on applied LLM systems - from intelligent search to AI-driven content agents - and shape how AI transforms medical knowledge delivery.
- Own your craft end-to-end: Take full responsibility for building systems that scale globally and power mission-critical workflows.
- Collaborate globally: Join a world-class team of passionate engineers working on a modern tech stack that will further your career development.
- Have real product impact: Influence the direction of product development by collaborating closely with product and leadership teams.
About the role:
We are looking for a DevOps Engineer to join our engineering team and take ownership of our infrastructure, deployment processes, and overall platform reliability. You will work closely with backend and data teams to support a growing video and data platform used by millions of healthcare professionals worldwide.
In this role, you will focus on improving our CI/CD pipelines, system reliability, and developer experience, while helping scale our cloud infrastructure in a secure and cost-efficient way. You will work extensively with AWS services (compute, storage, networking, IAM, monitoring) and help ensure our systems are reliable, observable, and well-architected.
You'll also support and enable emerging AI/ML and LLM-powered systems used for large-scale medical content processing, helping build and operate the infrastructure required for these workloads. This includes improving data pipelines, optimizing resource usage, and ensuring production-grade reliability of AI-driven services.
This is a high-impact role with a broad scope, from supporting production systems and data pipelines to driving long-term improvements in how we build, deploy, and operate our platform, with strong ownership and autonomy in shaping DevOps practices.
What you will do:
- Own and improve our infrastructure, CI/CD pipelines, and deployment processes across multiple environments
- Work with AWS services (compute, storage, networking, IAM, monitoring) to ensure scalable, secure, and reliable systems
- Collaborate closely with backend and data teams to support production systems, data pipelines, and overall platform reliability
- Continuously improve developer experience by streamlining workflows, reducing friction, and enabling faster, safer deployments
- Contribute to improving security practices, access control, and compliance of our infrastructure
- Automate infrastructure and workflows using Python
- Improve observability by implementing and maintaining monitoring, logging, and alerting systems
- Troubleshoot production issues, participate in incident response, and implement long-term fixes to improve system stability
- Identify and drive improvements in performance, scalability, and cost efficiency across the platform
- Support and scale AI/ML and LLM-based systems, ensuring reliable infrastructure for data processing and content classification workloads
Who you are:
- You have 5+ years of experience in DevOps, SRE, or infrastructure engineering, with a strong focus on cloud-native environments (preferably AWS)
- You have managed cloud infrastructure (networking, IAM, compute, storage) with a strong understanding of security best practices and cost optimization
- You have experience building and maintaining CI/CD pipelines to support rapid, reliable software delivery across multiple environments
- You are comfortable writing Python for automation, scripting, and building internal tooling to improve infrastructure and developer workflows
- You have a strong understanding of monitoring, logging, and observability (e.g., Datadog, Prometheus, CloudWatch), and you proactively identify and resolve issues
- You are comfortable debugging production issues across systems and collaborating with engineering teams to resolve them
- You are proactive, take ownership, and enjoy working in environments with high autonomy and evolving processes
- You communicate clearly and collaborate effectively with engineers, product managers, and other stakeholders
- You are curious and motivated to learn, especially in areas like AI/ML infrastructure and large-scale systems
Required Qualifications:
- 5+ years of experience in DevOps, Site Reliability Engineering, or infrastructure-focused roles
- Proven experience designing and operating scalable, reliable, and secure cloud infrastructure (preferably AWS) in production environments
- Strong understanding of cloud security best practices (IAM, network security, secrets management), preferably within AWS
- Proficiency in Python for automation, scripting, and tooling
- Hands-on experience building and maintaining CI/CD pipelines
- Experience with monitoring, logging, and alerting tools (e.g., Datadog, CloudWatch, Prometheus)
- Experience working in a Linux-based environment
- Ability to drive infrastructure and DevOps strategy, balancing scalability, reliability, and cost
- Experience working cross-functionally and influencing engineering teams on best practices and architectural decisions
- Strong ownership mindset with the ability to operate autonomously in ambiguous environments
Preferred Qualifications:
- Experience supporting or scaling AI/ML or LLM-based systems in production
- Experience with containerized applications (Docker) and familiarity with orchestration concepts (Kubernetes or ECS)
- Familiarity with Infrastructure as Code principles (e.g., Terraform), including experience introducing Infrastructure as Code into existing environments
- Experience working with or supporting backend systems and data platforms (e.g., Postgres; Airflow is a plus)
- Background in backend engineering or software development
- Experience working in a fast-paced startup or scale-up environment
- Experience leading and mentoring engineers, while contributing to team-wide best practices
This is a hybrid role, working 3 days a week (Monday, Wednesday, and Friday) in our Oakland office.