Job Profile: LLM Observability Engineer

Location – Bhubaneshwar

Job Description:

We are looking for a skilled LLM Observability Engineer to join our team and ensure optimal performance, reliability, and cost-efficiency of our large language model (LLM) applications. You will be instrumental in designing performance tests, implementing observability practices, and providing insights to enhance model quality and system robustness.

Key Responsibilities:

Must Have

Design and Execute Performance Tests: Develop and implement comprehensive test plans and scripts to evaluate the performance, scalability, and stability of LLM applications under various loads.
Implement LLM Observability: Instrument LLM applications to capture rich telemetry data, including prompts, responses, token usage, latency, and error information, using specialized tools and frameworks like Datadog, LangChain, or OpenTelemetry.

Monitor and Analyze Metrics: Track key performance indicators (KPIs) such as response time, throughput, cost per query, accuracy, and resource utilization using real-time dashboards and monitoring systems.
Identify and Mitigate Bottlenecks: Analyze performance test results and production data to pinpoint performance bottlenecks, errors, and potential issues (e.g., high latency in RAG pipelines) and collaborate with development teams on optimization.
Conduct Automated Evaluations: Implement automated quality checks and evaluations (e.g., hallucination detection, toxicity classifiers, relevance scoring) to continuously assess model output quality.
Strong knowledge of performance testing methodologies and load testing tools such as JMeter, LoadRunner, or Gatling.
Familiarity with the unique challenges of LLMs, including non-determinism, hallucinations, and prompt sensitivity.

Experience with LLM observability platforms and tools (e.g., Datadog LLM Observability, Arize AI, Langfuse) is highly desirable.
Proficiency in programming/scripting languages (e.g., Python, Java).

Nice to have

Troubleshoot Production Issues: Utilize tracing and logging data to quickly diagnose the root cause of issues in complex LLM workflows and agentic applications.
Ensure Security and Compliance: Monitor model behavior for potential security risks, such as prompt injections or sensitive data leaks, and ensure compliance with data protection regulations.
Optimize Costs: Track and manage token usage and computational resource consumption to ensure cost-effectiveness and alert teams to potential budget overruns.
Collaborate and Report: Work closely with data scientists, ML engineers, and QA teams to provide actionable insights and recommendations for model fine-tuning and system architecture improvements.
Excellent analytical, problem-solving, and communication skills.

Experience Required:

4 positions (2-8 years’ experience)
