Triple Whale is a complete intelligence platform for ecommerce, helping brands make data-driven decisions to drive growth and efficiency. They are seeking a Senior Cloud Backend Engineer to join their Infrastructure Team, focusing on building reliable and scalable systems, supporting service infrastructure, and participating in on-call rotations for platform reliability.
Responsibilities:
- Deploy and support our service infrastructure in Kubernetes
- Identify the right tools and technologies for major initiatives and then build them
- Help other teams and developers design robust and scalable systems
- Scale and optimize multiple databases
- Build internal tooling that accelerates developer velocity
- Provide observability, monitoring, and visibility across our systems
- Participate in a shared on-call rotation, typically 2–3 times per month, covering the period from Friday at 7:00 AM ET through Saturday at 5:00 PM ET
- During your rotation, you are the primary point of escalation for production issues
- On-call is not a daily responsibility, only during your assigned weekends
- Strong understanding of system architecture and cross-service dependencies
- Previous real-world experience in production on-call environments
- Ability to quickly assess incidents, identify scope/root cause, and understand platform impact
- Ability to classify severity and prioritize response appropriately
- Capability to deploy safe production hotfixes when needed
- Solid judgment under pressure - especially when operating independently
- Ownership mindset: from detection to mitigation to resolution
Requirements:
- You are located in the New York tri-state area
- 3+ years of experience as an independent backend or infrastructure engineer
- Ability to design and build scalable, reliable systems
- Strong communication skills
- Hands-on builder mentality — this is a coding role
- Experience with relational and non-relational databases
- Experience with major cloud platforms (GCP, AWS, Azure); GCP is an advantage
- Experience with streaming systems
- Experience with scaling large systems
- Experience with message queues
- Experience with monitoring systems such as Datadog, Grafana, and Groundcover
- Experience with CI/CD, Git
- Kubernetes and Knative (production experience)
- ClickHouse