
AI Reliability Engineer (SRE) for Gen AI Systems
Position Overview
We are seeking a highly skilled AI Reliability Engineer (SRE) for Gen AI Systems to join our Application Engineering team. In this role, you will bridge the gap between advanced application development, cloud infrastructure, and machine learning operations. You will have two core mandates: building and maintaining the application and infrastructure that power our large language models (LLMs), and designing autonomous, agentic AI workflows that eliminate operational toil and automate incident response.
Key Responsibilities
AI Infrastructure Reliability: Design, scale, and maintain highly available infrastructure for LLM training, fine-tuning, and inference workloads.
Agentic Operations: Architect and deploy multi-agent GenAI systems to automate alert triage, root cause analysis (RCA), and self-healing system remediation.
GPU & Cluster Management: Optimize GPU orchestration, cluster health, and compute utilization across large-scale Kubernetes clusters.
Performance Monitoring: Define and monitor non-traditional SLOs/SLIs, including Time-to-First-Token (TTFT), Inter-Token Latency, and cost-per-query limits.
Data & Vector Pipeline Ops: Ensure the reliability, latency, and synchronization of vector databases and Retrieval-Augmented Generation (RAG) pipelines.
Incident Management & ChatOps: Integrate LLMs and agentic frameworks into ChatOps tooling (e.g., Slack, Teams) to provide real-time, natural-language incident assistance.
Security & Guardrails: Implement infrastructure-level boundaries to protect LLM endpoints from prompt injection, hallucinations, and data or compliance leaks.
Required Technical Skills
Infrastructure & DevOps: Deep expertise in Kubernetes (EKS/GKE), Infrastructure as Code (Terraform), and CI/CD deployment pipelines.
Software Engineering: Strong proficiency in Python or Go, with experience building tool integrations via APIs and the Model Context Protocol (MCP).
GenAI Engineering: Hands-on experience with LLM orchestration frameworks (e.g., AutoGen, LangChain, LlamaIndex).
Data & Vector Systems: Experience managing distributed vector databases (e.g., Pinecone, Milvus, Qdrant, or pgvector).
Observability: Advanced knowledge of cloud monitoring stacks (Datadog, Prometheus, OpenTelemetry) applied to both standard infrastructure and AI workloads (e.g., Triton Inference Server monitoring).
Preferred Qualifications
Background in implementing semantic caching layers to optimize cloud and API token costs.
Proven track record of turning traditional engineering runbooks into executable code for automated agents.