Machine Learning Engineer

PwC India

Location

Bhubaneswar, Odisha, India

Salary

Not specified

Type

Full-time

Posted

Today

via LinkedIn

Job Description

Job Profile: LLM Observability Engineer

Location – Bhubaneswar

Job Description:

We are looking for a skilled LLM Observability Engineer to join our team and ensure optimal performance, reliability, and cost-efficiency of our large language model (LLM) applications. You will be instrumental in designing performance tests, implementing observability practices, and providing insights to enhance model quality and system robustness.

Key Responsibilities:

Must Have

  • Design and Execute Performance Tests: Develop and implement comprehensive test plans and scripts to evaluate the performance, scalability, and stability of LLM applications under various loads.
  • Implement LLM Observability: Instrument LLM applications to capture rich telemetry data, including prompts, responses, token usage, latency, and error information, using specialized tools and frameworks such as Datadog, LangChain, or OpenTelemetry.

  • Monitor and Analyze Metrics: Track key performance indicators (KPIs) such as response time, throughput, cost per query, accuracy, and resource utilization using real-time dashboards and monitoring systems.
  • Identify and Mitigate Bottlenecks: Analyze performance test results and production data to pinpoint performance bottlenecks, errors, and potential issues (e.g., high latency in RAG pipelines) and collaborate with development teams on optimization.
  • Conduct Automated Evaluations: Implement automated quality checks and evaluations (e.g., hallucination detection, toxicity classifiers, relevance scoring) to continuously assess model output quality.
  • Strong knowledge of performance testing methodologies and load testing tools such as JMeter, LoadRunner, or Gatling. Familiarity with the unique challenges of LLMs, including non-determinism, hallucinations, and prompt sensitivity.

  • Experience with LLM observability platforms and tools (e.g., Datadog LLM Observability, Arize AI, Langfuse) is highly desirable. Proficiency in programming/scripting languages (e.g., Python, Java).

Nice to have

  • Troubleshoot Production Issues: Utilize tracing and logging data to quickly diagnose the root cause of issues in complex LLM workflows and agentic applications.
  • Ensure Security and Compliance: Monitor model behavior for potential security risks, such as prompt injections or sensitive data leaks, and ensure compliance with data protection regulations.
  • Optimize Costs: Track and manage token usage and computational resource consumption to ensure cost-effectiveness and alert teams to potential budget overruns.
  • Collaborate and Report: Work closely with data scientists, ML engineers, and QA teams to provide actionable insights and recommendations for model fine-tuning and system architecture improvements.
  • Excellent analytical, problem-solving, and communication skills.

Experience Required:

  • 2-8 years’ experience (4 open positions)
