Langfuse is an open-source LLM engineering platform that helps teams build useful AI applications via tracing, evaluation, and prompt management. The company is seeking a Senior Cloud Infrastructure Engineer to own the operations of Langfuse Cloud: ensuring uptime, performance, and cost efficiency while managing deployments and scaling the infrastructure to meet growing demand.
Responsibilities:
- Own Langfuse Cloud operations: You'll run our production environments on AWS ECS Fargate and ClickHouse Cloud. You'll manage deployments, autoscaling, capacity planning, and cost optimization — making sure we stay fast and affordable as traffic scales
- Build world-class observability: You'll own our Datadog setup end to end — dashboards, alerts, and SLOs. When something degrades, you'll ensure we know before our customers do. You'll build the monitoring culture that lets the whole team ship with confidence
- Make self-hosting effortless: Thousands of teams run Langfuse on their own infrastructure. You'll own and evolve our Helm chart, Docker Compose configuration, and deployment documentation. You'll turn 'works on my machine' into 'works on every machine' — from a single-node setup to a multi-region enterprise deployment
- Automate everything: CI/CD pipelines, infrastructure-as-code, automated scaling, zero-downtime deployments. You'll replace manual processes with automation that makes the team faster and the platform more reliable
- Scale for what's next: We're growing fast, and new product directions such as complex long-running agent observability and real-time evaluation push the infrastructure in new ways. You'll think ahead about what breaks at 10x scale and build the foundation before we get there. 10x is always just one quarter away here at Langfuse
- Harden security and compliance: As more enterprises adopt Langfuse, you'll help ensure our cloud and self-hosted deployments meet the security and compliance bar that large organizations require