The University of Texas MD Anderson Cancer Center, a leading institution in cancer care and research, is seeking a Senior Machine Learning Engineer – Agentic AI to enhance its AI capabilities. This role involves designing and operating enterprise-scale agentic AI platforms that support the safe deployment of AI systems in healthcare environments, with a focus on platform architecture and operational safeguards.
Responsibilities:
- Lead the design, evolution, and operation of the enterprise agentic AI platform in collaboration with enterprise architects and platform ML engineers
- Build platform components that enable interoperability between first‑party and third‑party agents, including identity, state, memory, tool access, orchestration, auditability, and policy enforcement
- Define and document standardized integration patterns connecting agents with enterprise business systems, data platforms, APIs, and health IT systems
- Provide reusable platform services, reference implementations, and SDKs that reduce risk and accelerate delivery for applied teams
- Design and operate validation and de‑risking frameworks, including simulation, sandboxing, shadow execution, canary releases, and continuous behavior monitoring
- Establish and enforce platform standards for agent development, including interfaces, execution contracts, evaluation hooks, safety constraints, and observability requirements
- Participate in platform governance, release coordination, and incident response, supporting investigation and remediation of agent‑related failures
- Implement platform safeguards such as fallback mechanisms, rollback strategies, approval gates, rate limiting, audit trails, and kill‑switch capabilities
- Partner with software engineering, security, IT, and health IT stakeholders to deploy agentic AI capabilities in secure enterprise environments
- Support responsible AI practices through traceability of prompts, policies, tools, models, and agent actions, and through documentation of known failure modes and limitations
Requirements:
- Bachelor's degree in Computer Science, Software Engineering, Data Science, Physics, Math & Statistics, or another related engineering discipline
- Five years of experience in machine learning engineering, data science, data engineering, and/or software engineering
- Experience building AI or ML platforms that serve multiple downstream teams and production workloads
- Strong proficiency in Python and integration of modern ML frameworks (e.g., PyTorch) with large language models and agent systems
- Hands-on experience with agentic AI frameworks such as LangGraph, LangChain, AutoGen, CrewAI, Semantic Kernel, or equivalent
- Working knowledge of agentic AI protocols and interoperability standards (e.g., MCP, agent-to-agent communication, structured tool invocation)
- Experience implementing planner-executor loops, hierarchical agents, and multi-agent coordination patterns
- Familiarity with workflow orchestration tools (Airflow, Prefect, Temporal) and distributed execution frameworks (Ray or equivalent)
- Experience deploying containerized AI platforms using Kubernetes in enterprise cloud environments with lineage, auditability, and controlled promotion to production
- Ability to reason at the systems and platform level, balancing safety, performance, flexibility, and usability
- Experience designing quantitative evaluation strategies for agentic systems, including success rates, latency, cost, recovery behavior, and safety metrics
- Strong understanding of enterprise data governance, security, and privacy requirements, including healthcare and health IT considerations
- Ability to identify systemic risks stemming from agent autonomy, non-determinism, tool access, and multi-agent interactions
- Experience analyzing failure modes caused by prompt drift, model updates, tool changes, and cross-system dependencies
- Ability to collaborate effectively with architects, applied MLEs, data scientists, software engineers, and IT partners
- Ability to produce clear documentation covering platform architecture, APIs, integration patterns, validation frameworks, and operational runbooks
- Ability to communicate platform capabilities, risks, and limitations to leadership and partner teams
- Willingness to contribute to internal standards and shared practices that improve safety, scalability, and consistency of agentic AI development
- Ability to provide hands-on technical guidance, mentorship, and troubleshooting support to platform adopters
- Ability to present technical and non-technical concepts clearly in meetings and institutional forums
Preferred Qualifications:
- Master's degree or PhD with a concentration in science, engineering, or a related field
- Experience designing, deploying, and maintaining agentic AI systems that operate autonomously and collaboratively across distributed environments
- Experience in monitoring and troubleshooting autonomous agents post-deployment, including performance degradation, clinical incidents, model updates, or corrective actions
- Experience raising the technical bar for team members, such as establishing reproducibility practices, review standards, or shared patterns
- Experience technically evaluating third-party agentic AI platforms within clinical workflows