Build and evolve the internal platform infrastructure that enables engineering teams to deploy, test, monitor, and scale services independently
Improve the software delivery lifecycle by developing reliable CI/CD workflows for both application and AI-related workloads
Establish and maintain reliability standards across the platform through effective monitoring, alerting, and performance measurement practices
Take an active role in incident response by troubleshooting production issues and driving post-incident improvements focused on long-term stability
Identify bottlenecks and operational risks across infrastructure and distributed systems, and proactively implement solutions to improve resilience and scalability
Develop automation and internal tooling to streamline operations, reduce manual intervention, and improve platform efficiency
Oversee and optimize cloud-native infrastructure running on GCP, including containerized and event-driven services
Support and optimize large-scale search and indexing systems powered by Elasticsearch
Enhance edge infrastructure, security, and traffic management through Cloudflare configuration and optimization
Contribute to infrastructure cost optimization initiatives while maintaining strong performance and reliability standards
Collaborate closely with Product, Engineering, and AI teams to support technical initiatives, platform requirements, and long-term architectural decisions
Requirements
5+ years of experience working as a Platform Engineer, Site Reliability Engineer, or Senior/Staff Software Engineer in production environments
Strong hands-on experience with GCP or equivalent cloud platforms in scalable, high-availability systems
Deep expertise with Kubernetes (GKE preferred), Terraform, and infrastructure-as-code practices
Proven experience designing and maintaining CI/CD pipelines using tools such as GitHub Actions, Cloud Build, Argo CD, or similar
Strong understanding of modern observability practices, including metrics, logging, tracing, and alerting systems (OpenTelemetry or equivalent)
Experience operating and supporting distributed systems with a strong focus on reliability, scalability, and operational excellence
Comfortable participating in on-call rotations and leading incident management processes in production-critical environments
Strong communication and collaboration skills
Ability to operate autonomously and make sound technical decisions in fast-paced, ambiguous startup environments