About this role
Role Overview
Own and optimize Noxtua's infrastructure across OTC and our self-hosted GPU servers — ensuring efficient architecture, reliable operation, and cost control.
Lead and grow a team of 4–5 DevOps engineers: set technical direction, support their development, and bring a strong ownership mindset.
Operate our self-managed GPU server fleet — provisioning, driver installation, hardening, and connectivity via Ansible — and manage provider SLAs to keep heavy AI workloads running reliably.
Build and maintain infrastructure automation using Infrastructure as Code (Terraform & Ansible).
Run our container platform on Kubernetes, support teams with Docker, and keep our services (APIs) stable, accessible, and secure.
Set up and maintain monitoring and alerting (e.g., Prometheus, Grafana) to ensure system reliability and performance.
Develop and maintain CI/CD pipelines and collaborate with the development and AI teams to automate deployments and support AI-driven workloads.
Requirements
Leadership: Experience leading or mentoring a team, setting technical direction, and balancing hands-on operations with people responsibility.
Managing server fleets: You've managed a fleet of servers and understand the methodology behind it — not just rented cloud instances.
Experience with GPU servers is a strong plus, but not required.
Strong proficiency in Linux and Bash, plus a scripting language such as Python.
Proven track record designing, operating, and cost-managing cloud-based architectures — ideally OTC (Open Telecom Cloud), or transferable experience from AWS, Azure, or Google Cloud — with solid networking fundamentals (DNS, OSI model).
Strong focus on automating provisioning and configuration with Terraform and Ansible.
Expertise in containerizing applications with Docker and running them at scale on Kubernetes.
Able to set up and maintain monitoring/alerting tools (e.g., Prometheus, Grafana), aggregate data, visualize insights, and derive actions.
Tech Stack
Ansible
AWS
Azure
Cloud
DNS
Docker
Grafana
Kubernetes
Linux
Prometheus
Python
Terraform
Benefits
100% remote work possible with residence in Germany; other countries upon request
Flexible working hours
Vacation: 26 days, plus December 24th and 31st off, plus 1 additional vacation day per year of employment (up to 30 days)
Discounts: e.g., Urban Sports Club Membership, depending on location
Equipment: Laptop (Lenovo or Mac), plus €1,000 net home office setup budget (paid with your first salary)