Dice is seeking a skilled Machine Learning Engineer with a strong Site Reliability Engineering (SRE) mindset to join their team. The ideal candidate will have hands-on experience maintaining applications in both Windows and Linux environments, managing on-premises servers, and working with Kubernetes clusters.
Responsibilities:
- Maintain and support machine learning applications running on Windows and Linux servers in on-premises environments
- Manage and troubleshoot Kubernetes clusters hosting ML workloads
- Collaborate with data scientists and engineers to deploy machine learning models reliably and efficiently
- Implement and maintain monitoring and alerting solutions using Datadog to ensure system health and performance
- Debug and resolve issues in production environments using Python and monitoring tools
- Automate operational tasks to improve system reliability and scalability
- Ensure best practices in security, performance, and availability for ML applications
- Document system architecture, deployment processes, and troubleshooting guides
Requirements:
- Proven experience working with Windows and Linux operating systems in production environments
- Hands-on experience managing on-premises servers, Kubernetes clusters, and Docker containers
- Strong proficiency in Python programming
- Solid understanding of machine learning concepts and workflows
- Experience with machine learning model deployment and lifecycle management
- Familiarity with monitoring and debugging tools such as Datadog
- Ability to troubleshoot complex issues in distributed systems
- Experience with CI/CD pipelines for ML applications
- Familiarity with the AWS cloud platform
- Background in Site Reliability Engineering or DevOps practices
- Strong problem-solving skills and attention to detail
- Excellent communication and collaboration skills
- Familiarity with machine learning model development