Senior AI Engineer – AI Center of Excellence (AI CoE)

Location: New Jersey, Dallas TX, Santa Clara CA, Ashburn VA, Northern Virginia (Hybrid)

Job Type: Full-time

Domain: AI & Data Centers

Experience: 12+ Years

Role Overview

This is a strategic, hands-on senior engineering role within the AI Center of Excellence (AI CoE), responsible for designing, building, and operating AI infrastructure and AI Factory platforms across hybrid environments (on‑prem, private cloud, and public cloud).

The role works closely with client and leading OEM partners, as well as internal Sales, Pre‑Sales, and Delivery teams, to identify, shape, and execute AI‑driven business opportunities across the US and EU regions.

This is a quota‑driven, techno‑commercial role requiring deep technical execution along with stakeholder interaction and customer‑facing leadership.

Key Responsibilities

AI Infrastructure & Platform Engineering

Design, deploy, and operate hybrid Kubernetes clusters across AWS, Azure, GCP, and on‑prem environments (bare metal, NVIDIA DGX, Grace Hopper).
Own production-grade GPU infrastructure using NVIDIA GPU Operator, including:
CUDA, drivers, MIG
GPU‑aware scheduling and resource isolation policies
Build and maintain high‑availability, scalable AI platforms supporting enterprise workloads.

MLOps & GenAI Platform Development

Build production‑grade MLOps pipelines using:
Kubeflow Pipelines
GitOps (Argo CD / Flux)
MLflow / DVC
Deploy and operate Large Language Models (LLMs) using:
NVIDIA Triton Inference Server
TensorRT‑LLM
vLLM
Custom FastAPI / gRPC services
Implement advanced inference techniques:
Quantization, LoRA
Dynamic batching
Tenant‑level quota enforcement
Safety & content filtering integrations

Data & Retrieval-Augmented Generation (RAG)

Integrate and optimize vector databases for RAG and similarity search:
Milvus, Pinecone, Qdrant, Weaviate, FAISS
Enable scalable semantic search and GenAI-powered enterprise applications.

Observability, Security & Reliability

Implement full‑stack observability using:
Prometheus, Grafana
Loki / ELK
OpenTelemetry
Define and monitor SLIs / SLOs for AI platforms.
Enforce security and compliance standards:
Kubernetes RBAC
OPA / Gatekeeper
Vault / KMS
Image signing, policy enforcement
GDPR / HIPAA compliance

Cost, Performance & Capacity Optimization

Optimize GPU utilization through:
Capacity planning
Auto‑scaling & spot instances
Cost transparency and chargeback models
Improve platform efficiency while maintaining performance SLAs.

Enablement & Technical Leadership

Convert experimentation into reproducible production pipelines.
Enable engineering teams through:
Technical documentation
Tutorials and best practices
Office hours and knowledge sessions
Evaluate emerging technologies and lead PoCs across:
NVIDIA innovations
Open‑source ecosystems (Kubeflow, LangChain, vLLM, TGI, etc.)
Drive the AI Infra & Platform technology roadmap.

Required Experience & Skills

Technical Expertise

8+ years of hands‑on experience designing and operating production Kubernetes platforms (cloud + on‑prem).
Deep expertise in NVIDIA GPU stack (CUDA, MIG, GPU Operator).
Strong hands‑on experience with:
Kubeflow Pipelines or equivalent MLOps platforms
Large‑scale LLM deployment and inference optimization
Proficiency in Python and AI frameworks:
PyTorch, TensorFlow
Hugging Face, LangChain
Infrastructure as Code (IaC):
Helm, Kustomize, Terraform
Experience with vector databases and RAG architectures.
Strong SRE / observability background.
Security‑first mindset with enterprise compliance exposure.

Nice to Have

Experience with NVIDIA DGX and Grace Hopper platforms.
Knowledge of OpenShift, k3s, or edge‑focused deployments.
Experience with:
KServe, LWS, serverless inference
Contributions to open‑source projects (Kubernetes, Kubeflow, Triton, Milvus, vLLM).
Certifications:
CKA
Cloud AI/ML certifications
NVIDIA certifications

Qualifications

B.E / B.Tech with a minimum of 60% across academics.
Proven experience delivering AI solutions across on‑prem, cloud, and hybrid environments.
Strong analytical, strategic thinking, and stakeholder communication skills.
Solid understanding of data centers, cloud platforms, and AI & GenAI ecosystems.

Role Specifics

Hands‑on senior engineering role
Strong Techno‑Commercial orientation
High ownership, visibility, and impact role

Disclaimer

*HCL is an equal opportunity employer, committed to providing equal employment opportunities to all applicants and employees regardless of race, religion, sex, color, age, national origin, pregnancy, sexual orientation, physical disability or genetic information, military or veteran status, or any other protected classification, in accordance with federal, state, and/or local law. Should any applicant have concerns about discrimination in the hiring process, they should provide a detailed report of those concerns to [email protected] for investigation.*

Compensation and Benefits

A candidate’s pay within the range will depend on their work location, skills, experience, education, and other factors permitted by law. This role may also be eligible for performance-based bonuses subject to company policies. In addition, this role is eligible for the following benefits subject to company policies: medical, dental, vision, pharmacy, life, accidental death & dismemberment, and disability insurance; employee assistance program; 401(k) retirement plan; 10 days of paid time off per year (some positions are eligible for need-based leave with no designated number of leave days per year); and 10 paid holidays per year.
