Location
Remote
Salary
Not specified
Type
Full-time
Posted
Today
Job Description
Company Description
VOLTO Consulting empowers organizations by blending technology, talent, and strategy to drive transformation and innovation. Offering services such as SAP, AI, digital engineering, cloud, and infrastructure support, VOLTO enables clients to scale and adapt rapidly. With expertise across industries like banking, healthcare, energy, and manufacturing, the company delivers tailored, outcome-driven solutions. Its nearshore and offshore delivery models, combined with deep domain knowledge, help clients build and optimize systems and teams for sustainable growth. Trusted by enterprises and startups alike, VOLTO bridges strategy with execution to deliver measurable results.
Role Description
Key Responsibilities
- Own end-to-end optimization of ML models across ASR, TTS, NLP, and LLM pipelines for production
- Reduce model size, memory footprint, and inference latency for real-time applications
- Implement model compression techniques (quantization, pruning, distillation) and optimize inference using ONNX, TensorRT, TVM, or similar frameworks
- Design and deploy efficient inference pipelines for edge and cloud environments, enabling low-compute, on-device AI
- Develop performance-critical components in C/C++ and build high-concurrency backend services using Go (Golang)
- Architect scalable, low-latency microservices (gRPC/REST) and optimize throughput via batching, caching, and parallel processing
- Maximize CPU/GPU utilization and reduce cost per inference
- Integrate optimized inference engines with production systems and manage deployments via Docker and Kubernetes
- Define and track performance metrics (latency, memory, throughput, cost) and establish monitoring/observability frameworks
- Collaborate with ASR, TTS, and LLM teams to productionize models and drive trade-offs between accuracy, latency, and cost
- Mentor engineers and establish best practices in ML systems optimization and performance engineering
Education & Experience
- 5–10+ years in ML engineering / systems / performance optimization
- Strong experience deploying ML models at scale (edge + cloud)
Technical Skills
- Strong programming skills in Go (Golang) and C/C++ (both mandatory)
- Experience with low-level optimization, memory management, and high-performance computing
- Deep learning frameworks: PyTorch / TensorFlow
- Inference optimization tools: ONNX Runtime, TensorRT, TVM, OpenVINO
- Backend systems: gRPC, REST APIs, microservices architecture
- Strong understanding of hardware-aware optimization (CPU, GPU, edge accelerators)
- Infrastructure: Docker, Kubernetes, cloud platforms (AWS/GCP/Azure)
Preferred Qualifications
- Experience with speech models (ASR/TTS) and/or LLMs
- Edge AI deployment (Android, iOS, IoT, embedded systems)
- Familiarity with WebRTC / real-time streaming systems
- Contributions to performance-critical ML systems or infra