Location: Ashburn, VA
Salary: Not specified
Type: Full-time
Posted: Today
Job Description
Senior AI Engineer – AI Center of Excellence (AI CoE)
Location: New Jersey, Dallas TX, Santa Clara CA, Ashburn VA, Northern Virginia (Hybrid)
Job Type: Full-time
Domain: AI \& Data Centers
Experience: 12\+ Years
Role Overview
This is a strategic, hands-on senior engineering role within the AI Center of Excellence (AI CoE), responsible for designing, building, and operating AI infrastructure and AI Factory platforms across hybrid environments (on‑prem, private cloud, and public cloud).
The role works closely with client and leading OEM partners, as well as internal Sales, Pre‑Sales, and Delivery teams, to identify, shape, and execute AI‑driven business opportunities across the US and EU regions.
This is a quota‑driven, techno‑commercial role requiring deep technical execution along with stakeholder interaction and customer‑facing leadership.
Key Responsibilities
AI Infrastructure \& Platform Engineering
- Design, deploy, and operate hybrid Kubernetes clusters across AWS, Azure, GCP, and on‑prem environments (bare metal, NVIDIA DGX, Grace Hopper).
- Own production-grade GPU infrastructure using NVIDIA GPU Operator, including:
  - CUDA, drivers, MIG
  - GPU‑aware scheduling and resource isolation policies
- Build and maintain high‑availability, scalable AI platforms supporting enterprise workloads.
MLOps \& GenAI Platform Development
- Build production‑grade MLOps pipelines using:
  - Kubeflow Pipelines
  - GitOps (Argo CD / Flux)
  - MLflow / DVC
- Deploy and operate Large Language Models (LLMs) using:
  - NVIDIA Triton Inference Server
  - TensorRT‑LLM
  - vLLM
  - Custom FastAPI / gRPC services
- Implement advanced inference techniques:
  - Quantization, LoRA
  - Dynamic batching
  - Tenant‑level quota enforcement
  - Safety \& content filtering integrations
Data \& Retrieval-Augmented Generation (RAG)
- Integrate and optimize vector databases for RAG and similarity search:
  - Milvus, Pinecone, Qdrant, Weaviate, FAISS
- Enable scalable semantic search and GenAI-powered enterprise applications.
Observability, Security \& Reliability
- Implement full‑stack observability using:
  - Prometheus, Grafana
  - Loki / ELK
  - OpenTelemetry
- Define and monitor SLIs / SLOs for AI platforms.
- Enforce security and compliance standards:
  - Kubernetes RBAC
  - OPA / Gatekeeper
  - Vault / KMS
  - Image signing, policy enforcement
  - GDPR / HIPAA compliance
Cost, Performance \& Capacity Optimization
- Optimize GPU utilization through:
  - Capacity planning
  - Auto‑scaling \& spot instances
  - Cost transparency and chargeback models
- Improve platform efficiency while maintaining performance SLAs.
Enablement \& Technical Leadership
- Convert experimentation into reproducible production pipelines.
- Enable engineering teams through:
  - Technical documentation
  - Tutorials and best practices
  - Office hours and knowledge sessions
- Evaluate emerging technologies and lead PoCs across:
  - NVIDIA innovations
  - Open‑source ecosystems (Kubeflow, LangChain, vLLM, TGI, etc.)
- Drive the AI Infra \& Platform technology roadmap.
Required Experience \& Skills
Technical Expertise
- 8\+ years of hands‑on experience designing and operating production Kubernetes platforms (cloud \+ on‑prem).
- Deep expertise in the NVIDIA GPU stack (CUDA, MIG, GPU Operator).
- Strong hands‑on experience with:
  - Kubeflow Pipelines or equivalent MLOps platforms
  - Large‑scale LLM deployment and inference optimization
- Proficiency in Python and AI frameworks:
  - PyTorch, TensorFlow
  - Hugging Face, LangChain
- Infrastructure as Code (IaC):
  - Helm, Kustomize, Terraform
- Experience with vector databases and RAG architectures.
- Strong SRE / observability background.
- Security‑first mindset with enterprise compliance exposure.
Nice to Have
- Experience with NVIDIA DGX and Grace Hopper platforms.
- Knowledge of OpenShift, k3s, or edge‑focused deployments.
- Experience with:
  - KServe, LWS, serverless inference
- Contributions to open‑source projects (Kubernetes, Kubeflow, Triton, Milvus, vLLM).
- Certifications:
  - CKA
  - Cloud AI/ML certifications
  - NVIDIA certifications
Qualifications
- B.E / B.Tech with a minimum of 60% across academics.
- Proven experience delivering AI solutions across on‑prem, cloud, and hybrid environments.
- Strong analytical, strategic thinking, and stakeholder communication skills.
- Solid understanding of data centers, cloud platforms, AI \& GenAI ecosystems.
Role Specifics
- Hands‑on senior engineering role
- Strong Techno‑Commercial orientation
- High ownership, visibility, and impact role
Disclaimer
*HCL is an equal opportunity employer, committed to providing equal employment opportunities to all applicants and employees regardless of race, religion, sex, color, age, national origin, pregnancy, sexual orientation, physical disability or genetic information, military or veteran status, or any other protected classification, in accordance with federal, state, and/or local law. Should any applicant have concerns about discrimination in the hiring process, they should provide a detailed report of those concerns to
for investigation.*
Compensation and Benefits
A candidate’s pay within the range will depend on their work location, skills, experience, education, and other factors permitted by law. This role may also be eligible for performance-based bonuses subject to company policies. In addition, this role is eligible for the following benefits subject to company policies: medical, dental, vision, pharmacy, life, accidental death \& dismemberment, and disability insurance; employee assistance program; 401(k) retirement plan; 10 days of paid time off per year (some positions are eligible for need-based leave with no designated number of leave days per year); and 10 paid holidays per year.