This role is for one of Weekday's clients
Min Experience: 6 years
Location: Remote (India)
Job Type: Full-time
As the Machine Learning Operations Manager, you will oversee the end-to-end ML lifecycle — from model training and deployment to monitoring and optimization. You will lead a small, high-performing team of engineers while remaining hands-on in building scalable, reliable, and efficient ML infrastructure. This role combines strategic leadership with deep technical expertise to ensure smooth collaboration between research, engineering, and operations teams.
Key Responsibilities:
- End-to-End ML Lifecycle: Manage training infrastructure, experiment tracking, deployment, and continuous optimization.
- Collaboration with Researchers: Partner with research teams to streamline training, evaluation, and fine-tuning workflows.
- Team Leadership: Mentor and guide a small team of ML engineers (3–4) while remaining a hands-on individual contributor.
- Performance Optimization: Improve latency, throughput, and cost efficiency; ensure robust packaging and runtime reliability.
- Automation & Reliability: Develop systems for CI/CD, versioning, rollback, A/B testing, monitoring, and alerting.
- Infrastructure Management: Maintain scalable, secure, and compliant AI environments across training and inference stages.
- Cloud & AI Integration: Collaborate with cloud providers (AWS, GCP, Azure) and AI platforms to enhance tooling and optimize costs.
- Cross-Functional Collaboration: Support GenAI and AI-driven projects across teams beyond core MLOps responsibilities.
- Architecture & Roadmap: Contribute to architectural planning, documentation, and the continuous evolution of the ML stack.
- Best Practices: Promote automation, MLOps standards, and operational excellence throughout the ML lifecycle.
Requirements:
- 5+ years of hands-on experience in MLOps or ML/AI Engineering.
- Strong understanding of ML/DL concepts and applied experience in model training and deployment infrastructure.
- Proficiency with cloud-native ML tools (e.g., GCP Vertex AI, AWS SageMaker, Kubernetes).
- Experience working across both model training and inference systems.
- Familiarity with model optimization methods such as quantization, distillation, TensorRT, or FasterTransformer.
- Demonstrated ability to lead complex technical projects independently.
- Excellent communication and collaboration skills with a cross-functional mindset.
- Ownership-oriented approach and comfort driving clarity in ambiguous situations.
Skills:
MLOps, ML Engineering, Machine Learning Infrastructure, Model Deployment, Model Monitoring, CI/CD, Vertex AI, AWS SageMaker, GCP AI Platform, Kubernetes, Docker, MLflow, Kubeflow.
At Weekday (backed by YC; also Product Hunt #1 product of the day), we are building the next frontier in hiring. We have built the largest database of white-collar talent in India and have built outreach tools on top of it to generate the highest response rates.