Location
Paramus, NJ
Salary
Not specified
Type
Full-time
Posted
Today
via LinkedIn
Job Description
Key Responsibilities
- Design and implement comprehensive monitoring and observability systems for all live AI agents, tracking response quality, latency, error rates, and conversation outcomes
- Build and maintain evaluation frameworks to measure agent performance against defined benchmarks, including automated quality scoring and regression detection
- Manage token usage, API costs, and resource allocation across all agents and LLM providers; provide regular cost reports and optimization recommendations
- Develop and maintain conversation logging infrastructure for analysis, debugging, and compliance purposes
- Implement hallucination detection, content safety filters, and guardrail systems to protect end users and maintain brand integrity
- Create and manage alerting systems for agent failures, performance degradation, and anomalous behavior patterns
- Build A/B testing and prompt versioning infrastructure to support the Prompt Architect in iterative agent improvement
- Establish and maintain CI/CD pipelines for prompt deployments, ensuring changes are tested, staged, and rolled out safely
- Develop dashboards and reporting tools that give leadership visibility into agent performance, ROI, and operational health
- Collaborate with the AI/ML Engineer on infrastructure optimization and with the Solutions Engineer on production reliability
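The monitoring and cost-tracking duties above can be sketched as a minimal in-memory instrumentation layer. Everything here is illustrative, not part of the posting: the `AgentMetrics` class, the per-1K-token prices, and the metric choices are assumptions showing the kind of data this role would collect.

```python
from dataclasses import dataclass, field

# Illustrative per-1K-token prices; real prices vary by provider and model.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}

@dataclass
class AgentMetrics:
    """Minimal per-agent tracker (a sketch, not a production observability system)."""
    calls: int = 0
    errors: int = 0
    latencies_ms: list = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, latency_ms: float, input_tokens: int,
               output_tokens: int, error: bool = False) -> None:
        """Record one agent call: latency, token counts, and whether it failed."""
        self.calls += 1
        self.errors += int(error)
        self.latencies_ms.append(latency_ms)
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0

    @property
    def cost_usd(self) -> float:
        """Token spend under the assumed token-based pricing model."""
        return (self.input_tokens / 1000 * PRICE_PER_1K["input"]
                + self.output_tokens / 1000 * PRICE_PER_1K["output"])

    def p95_latency_ms(self) -> float:
        """Approximate 95th-percentile latency over recorded calls."""
        if not self.latencies_ms:
            return 0.0
        ordered = sorted(self.latencies_ms)
        return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
```

In practice these counters would feed a time-series backend (Prometheus, Datadog, etc.) rather than stay in process memory; the sketch only shows which signals the responsibilities above imply.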
Required Qualifications
- 3+ years of experience in DevOps, SRE, MLOps, or a similar operations-focused engineering role
- Strong proficiency in Python and experience building monitoring/observability systems
- Experience with logging and monitoring tools (Datadog, Grafana, Prometheus, CloudWatch, or similar)
- Understanding of LLM APIs, token-based pricing models, and AI system architectures
- Experience building evaluation frameworks, testing pipelines, or quality assurance systems for software products
- Familiarity with CI/CD tools and deployment automation (GitHub Actions, Jenkins, or similar)
- Strong analytical skills with the ability to identify patterns in data and translate them into actionable insights
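The evaluation-framework and regression-detection requirements can be illustrated with a minimal check: compare a candidate prompt's mean eval score against a baseline and flag a drop. The `tolerance` threshold and the 0-to-1 scoring scale are assumptions made for this sketch.

```python
def detect_regression(baseline_scores: list[float],
                      candidate_scores: list[float],
                      tolerance: float = 0.02) -> bool:
    """Return True when the candidate's mean quality score falls more than
    `tolerance` below the baseline mean. Scores are assumed to lie in [0, 1];
    the 0.02 tolerance is an illustrative default, not a recommended value."""
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return (baseline_mean - candidate_mean) > tolerance
```

A production framework would also account for sample size and variance (e.g., a significance test) before blocking a deployment; the mean-difference check is only the simplest possible gate.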
Preferred Qualifications
- Direct experience with LLMOps tooling (LangSmith, Weights & Biases, Humanloop, or similar)
- Experience managing costs and optimizing resource usage for API-heavy systems
- Background in building dashboards and data visualization (Metabase, Looker, custom solutions)
- Familiarity with prompt engineering and understanding of how prompt changes affect model behavior
- Experience with multi-agent systems or orchestration platform monitoring
- Knowledge of AI safety, content moderation, and responsible AI deployment practices
What Success Looks Like
- Within 30 days: Full monitoring and logging coverage for all active agents; baseline performance metrics established
- Within 60 days: Cost optimization implemented saving 15%+ on token spend; automated alerting catching issues before users report them
- Within 90 days: Evaluation framework live with automated quality scoring; prompt versioning and A/B testing infrastructure operational; leadership dashboard delivering weekly insights
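The 90-day A/B testing goal can be sketched as deterministic variant assignment: hashing a stable user ID keeps each user on the same prompt version across sessions, which is what makes per-variant outcome comparisons valid. The function name and variant labels are hypothetical.

```python
import hashlib

def assign_variant(user_id: str,
                   variants: tuple[str, ...] = ("prompt_v1", "prompt_v2")) -> str:
    """Deterministically bucket a user into a prompt variant for A/B testing.
    Hash-based assignment is stable: the same user_id always maps to the
    same variant, with roughly even traffic across variants."""
    digest = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16)
    return variants[digest % len(variants)]
```

Real prompt-versioning infrastructure would layer rollout percentages, holdout groups, and per-variant metric logging on top of this; the hash bucket is just the core assignment primitive.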