Dice is seeking a skilled Machine Learning Engineer with a strong Site Reliability Engineering (SRE) mindset to join their team. The ideal candidate will have hands-on experience maintaining applications in both Windows and Linux environments, managing on-premises servers, and working with Kubernetes clusters.

Responsibilities:

Maintain and support machine learning applications running on Windows and Linux servers in on-premises environments
Manage and troubleshoot Kubernetes clusters hosting ML workloads
Collaborate with data scientists and engineers to deploy machine learning models reliably and efficiently
Implement and maintain monitoring and alerting solutions using Datadog to ensure system health and performance
Debug and resolve issues in production environments using Python and monitoring tools
Automate operational tasks to improve system reliability and scalability
Ensure best practices in security, performance, and availability for ML applications
Document system architecture, deployment processes, and troubleshooting guides

Requirements:

Proven experience working with Windows and Linux operating systems in production environments
Hands-on experience managing on-premises servers, Kubernetes clusters, and Docker containers
Strong proficiency in Python programming
Solid understanding of machine learning concepts and workflows
Experience with machine learning model deployment and lifecycle management
Familiarity with monitoring and debugging tools such as Datadog
Ability to troubleshoot complex issues in distributed systems
Experience with CI/CD pipelines for ML applications
Familiarity with the AWS cloud platform
Background in Site Reliability Engineering or DevOps practices
Strong problem-solving skills and attention to detail
Excellent communication and collaboration skills
Familiarity with machine learning model development

Machine Learning Engineer (SRE Focus)