Quantiphi is an award-winning, AI-First global digital engineering company that helps leading Fortune 1000 organizations transform bold ideas into measurable business impact. They are seeking a highly experienced Senior DevOps/Observability Engineer to lead the design and implementation of a unified observability platform, focusing on architecting sophisticated observability pipelines and deploying various monitoring tools.
Responsibilities:
- Lead the design and implementation of our next-generation, unified observability platform
- Focus on architecting a sophisticated observability pipeline from the ground up, leveraging a modern, open-source-centric stack on Amazon Web Services (AWS)
- Deploying, configuring, and integrating a suite of tools including Prometheus, Grafana, and Splunk to provide comprehensive insights into our complex, distributed systems
- Building scalable, reliable, and efficient monitoring and logging systems
Requirements:
- Unified Pipeline Architecture: Proven ability to design and implement end-to-end observability pipelines using OpenTelemetry, Prometheus, and Grafana on centralized infrastructure
- Cross-Account AWS Observability: Deep expertise in centralizing AWS telemetry, including multi-account CloudTrail organization trails, cross-account CloudWatch metrics/logs, and VPC Flow Logs
- Log Aggregation & Routing: Strong experience designing log aggregation strategies, implementing noise reduction/filtering at the collector level, and configuring Splunk HTTP Event Collector (HEC) integrations
- Advanced Alerting & Dashboarding: Hands-on experience building comprehensive alerting frameworks using Alertmanager and CloudWatch Alarms, coupled with advanced dashboard engineering in Grafana (using PromQL)
- Infrastructure as Code (IaC): Advanced proficiency in writing Terraform modules specifically for deploying and managing observability stacks and EC2 infrastructure
- Enterprise Scale Log Management: Demonstrated experience managing, routing, and optimizing log pipelines at massive scale (TB/day)
- Kubernetes/Container Observability: Experience deploying Prometheus and OTel within Kubernetes (EKS) or containerized (ECS) environments
- Cost Optimization: Proven track record of reducing observability spend through strategic metric dropping, log filtering, and efficient storage tiering