FLUIX AI is a rapidly growing enterprise B2B SaaS startup based in the San Francisco Bay Area, specializing in machine learning and artificial intelligence solutions for data centers and facilities. They are seeking a skilled Site Reliability Engineer to ensure the reliability and performance of their hybrid platform while collaborating with engineering and data science teams to support their AI/ML infrastructure.

Responsibilities:

Design, implement, and maintain scalable systems while optimizing performance, ensuring high availability and disaster recovery, and assisting with codebase refactoring for modular deployment
Develop and maintain automation tools to streamline operations, improve efficiency, and automate repetitive tasks to enhance system reliability
Collaborate with engineering and data science teams to integrate ML and AI models into production environments, while working with the GenAI community to ensure seamless integration and high performance of cutting-edge models within our technology stack
Respond to and resolve incidents promptly to minimize impact, conduct post-incident reviews, and implement improvements to prevent recurrence
Create and manage multiple cloud instances (dev, staging, test), optimize cloud infrastructure and data center operations, and ensure the security and compliance of both infrastructure and applications
Identify areas for improvement and drive initiatives to enhance system reliability and performance, while staying updated on industry trends and advancements in SRE practices, ML, and AI technologies

Requirements:

Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
Proven experience as a Site Reliability Engineer or in a similar role in a SaaS environment, with a strong background in managing and optimizing cloud infrastructure (AWS preferred, or GCP or Azure), experience with ML and AI technologies including GenAI model integration, and familiarity with data center operations and manufacturing site integrations
Proficiency in programming and scripting languages (e.g., Python, Go, Bash), experience with containerization and orchestration tools (Docker, Kubernetes), a strong understanding of networking, security, and performance optimization, and knowledge of CI/CD pipelines and DevOps practices
Excellent problem-solving skills with attention to detail, strong communication and collaboration abilities, and the capacity to thrive in a fast-paced, dynamic startup environment
