Grafana Labs is a remote-first, open-source powerhouse whose visualization tool has over 20 million users. The company is seeking a Staff Engineer to lead technical initiatives for Tempo, with a focus on improving operational excellence and evolving the platform for Grafana’s observability products.
Responsibilities:
- Lead multi-quarter technical initiatives from problem framing through rollout, e.g., trace aggregation APIs, Limitless Tempo, autoscaling cells and customer limits, or query engine improvements
- Own the architecture of core Tempo components: ingestion, storage, query, and metrics generation. Drive design reviews, make sharp trade-offs on performance, cost, and complexity, and document the 'why' for the team
- Design APIs for humans and agents. Shape the next generation of Tempo’s interfaces (structured, deterministic, discoverable) so that Act 3 products, LLM-driven assistants, and external integrators can build on Tempo reliably
- Drive operational excellence. Own outcomes against concrete SLOs (P99 write latency, incident recurrence, TCO per ingested GB) and push the team toward Zero Ops through automation, parameterized rollouts, and actionable alerts
- Partner with Product and sibling teams. Work closely with PMs and with App Observability, Asserts, Drilldown, and Grafana Assistant teams to understand how Tempo gets consumed and to ship what unblocks them
- Mentor engineers. Raise the engineering bar through code review, design feedback, pairing on hard problems, and writing that leaves the team smarter than you found it
- Participate in on-call for the services you help build, and be a force multiplier in incident response and post-incident learning
- Contribute to open source. Tempo is OSS. You will engage the community, review external contributions, and help steer the project in the open
Requirements:
- Technical leadership. A track record of leading complex, multi-quarter initiatives that spanned design, delivery, and operations, and made the teams around you better
- Deep systems experience. Substantial hands-on experience building and operating distributed data systems in production: ingestion pipelines, storage engines, query execution, or similar
- Strong software craftsmanship. You write clean, robust, performant software that others can maintain, and you know when to optimize vs. when to ship
- Strong Go, or a path to it. We write Tempo in Go. Deep experience in other systems languages (Rust, C, C++) translates well
- Operational mindset. You've owned production services, carried a pager, reduced toil, and treated SLOs as a product feature, not a chore
- Customer focus and pragmatism. You break complex problems into short feedback loops: analyze, design, deliver an MVP, learn, iterate
- Leadership through writing and collaboration. You lead through design docs, reviews, and shipped code, not hierarchy. You communicate clearly in a fully remote, asynchronous environment
- Experience with tracing, OpenTelemetry, or large-scale observability systems
- Experience designing query languages, SQL/TraceQL-like engines, or APIs intended to be consumed programmatically (by services or agents)
- Experience with columnar storage formats (e.g., Parquet) or purpose-built on-disk formats for analytical workloads
- Experience operating multi-tenant, multi-cell SaaS infrastructure at scale on Kubernetes
- Experience building for AI/LLM consumers: structured APIs, metadata/discovery endpoints, deterministic outputs, evaluation harnesses
- Open-source contribution or maintainership, and comfort engaging a community in the open
- Experience as an on-call user of Grafana, Prometheus, Loki, or Tempo in a previous role (or on a homelab)
- Experience in a fully remote, globally distributed team