Responsible for ensuring the production-grade reliability, accuracy, and performance of our AWS-based agentic AI ecosystem.
Bridges platform engineering and agent development to ensure agentic workflows are observable, secure, resilient, and performant.
Lead investigations of complex agent/AI workflow failures using logs, metrics, and traces (CloudWatch, X-Ray, Splunk, New Relic, or similar).
Run blameless post-mortems and drive preventive actions.
Improve the quality and performance of Retrieval-Augmented Generation (RAG) and agent workflows by tuning retrieval, ranking/re-ranking, prompt/tooling behavior, and data access patterns across stores such as PostgreSQL/Redshift (or equivalent).
Establish and oversee evaluation approaches for models, RAG, and agents (automated test suites, scorecards, success criteria; e.g., LLM-as-a-judge/RAGAS concepts or equivalents) to improve fidelity and reduce regressions.
Partner with InfoSec/AppSec to review architectures and ensure designs follow enterprise security patterns, identity controls (IAM, SSO/federation such as Okta/Cognito or equivalents), and data residency requirements.
Work with Governance teams to implement and monitor guardrails and controls (e.g., model safety guardrails, policy enforcement, cost/usage controls) across the AI platform.
Contribute to or help operationalize agent interoperability/protocol patterns (A2A/ACP/AP2 or similar concepts), where applicable.
Drive “Design for Reliability” patterns across both Platform and Agent Building teams—fault tolerance, graceful degradation, load/performance testing, incident readiness, and operational excellence.
Translate reliability risks, performance trends, and operational metrics into clear business language for senior leaders, risk, and product owners.
Coach DevLeads and architects on debugging agent behaviors, strengthening observability pipelines, improving orchestration, and hardening production deployments.
Requirements
Bachelor's degree in Computer Science, Engineering, Information Systems, or related field (or equivalent experience)
10–14 years of IT experience (or 12–16 years in lieu of a degree), including meaningful roles in application development, platform engineering, SRE/operations, and/or architecture.
Strong experience operating and improving reliability of cloud-native systems (AWS preferred; comparable cloud experience acceptable), including containers/compute, networking, and security fundamentals.
Experience supporting AI/ML systems is beneficial but not mandatory if you demonstrate strong troubleshooting ability, systems thinking, and either a proactive plan for learning AI reliability patterns quickly or a track record of doing so.
Strong ability to script/build tooling in Python (or similar language) for reliability automation, analysis, testing, and operational workflows.
Hands-on experience with observability practices and tools (CloudWatch/X-Ray/Splunk/New Relic or similar)—dashboards, alerts, tracing, log analysis, incident response and post-mortems.
Experience with Infrastructure-as-Code (Terraform preferred; similar tools acceptable) and practical knowledge of data stores used in production systems (SQL proficiency helpful; PostgreSQL/Redshift experience a plus, but equivalents acceptable).
Working knowledge of identity and security patterns (OAuth2, SSO/federation, IAM roles/policies/SCP concepts) and secure API/service design.
Proven ability to lead through influence, drive standards/guardrails, and align multiple agile teams in a matrixed environment.
Tech Stack
Amazon Redshift
AWS
Cloud
Postgres
Python
X-Ray
Splunk
SQL
Terraform
Benefits
best-in-class employee benefits and programs
work-life integration
overall well-being
career advancement opportunities
upskilling opportunities
a focus on Advancing Diverse Talent into leadership roles