AWS, Cloud, Python, Terraform, R, AI, ML, LLM, RAG, Agentic, MLOps, EKS, CloudFormation, IAM, Service Mesh, CI/CD, OKRs, Product Management, Remote Work
About this role
Role Overview
Define and evolve the target architecture and roadmap for enterprise‑scale Data and AI platforms, covering experimentation, training, feature management, model registry, CI/CD, serving, and observability.
Design and build multi‑tenant, multi‑region, highly available AI platforms with clear governance and guardrails.
Partner with product management to define platform vision, backlogs, OKRs, and golden paths that enable self‑service from ideation to production.
Lead capacity planning and cost optimization strategies for GPU and CPU workloads, driving performance and scalability for distributed training and inference.
Integrate AI platforms with enterprise data ecosystems to enable governed, reproducible, and scalable ML pipelines.
Act as a technical leader, translating complex platform concepts into clear value propositions for senior stakeholders across R&D, Commercial, and Operations.
Requirements
Bachelor’s, Master’s, or PhD in Computer Science, Engineering, or a related quantitative field.
Proven experience as a platform or infrastructure engineer supporting ML/AI at scale.
Hands‑on experience with Domino Data Lab.
Strong experience with AWS (or equivalent cloud providers), including compute, storage, networking, IAM, and cost management.
Production experience administering EKS clusters, including GPU workloads, operators, storage classes, and service mesh.
Strong Python development experience, especially for platform automation and tooling.
Solid background in Infrastructure as Code (Terraform, CloudFormation, or similar).
Experience with MLOps practices: model pipelines, lifecycle management, CI/CD, and monitoring.
Experience with LLM serving, RAG architectures, vector databases, prompt safety, and token‑aware scaling.
Experience designing and operating agentic systems, including multi‑agent orchestration, tool/action frameworks, safety guardrails, and evaluation of reliability and cost.
Tech Stack
AWS
Cloud
Python
Terraform
Benefits
Permanent contract with a competitive salary.
Flexible working model with remote work options.
Personalized career path and continuous learning (certifications, English training, etc.).
Participation in stable, long‑term projects with high technical complexity.
Flexible working hours and a strong focus on work‑life balance.