Lob is a company focused on transforming the way businesses use direct mail through technology. It is seeking a Senior Platform Engineer to improve the reliability and performance of its platform infrastructure, with a focus on observability engineering and cost efficiency in AWS environments.
Responsibilities:
- Lead observability initiatives across infrastructure and applications
- Design and maintain monitoring, telemetry, dashboards, tracing, and alerting systems
- Build actionable visibility into platform health, reliability, and performance
- Improve incident detection, troubleshooting, and operational response capabilities
- Define observability standards and best practices across engineering teams
- Drive infrastructure cost optimization initiatives across AWS services and platform environments
- Analyze infrastructure utilization and recommend performance and cost efficiency improvements
- Maintain and improve infrastructure-as-code standards and workflows
- Design, build, and maintain scalable performance testing environments and tooling
- Execute and analyze load/performance testing initiatives
- Support and improve Nomad-based orchestration environments
- Troubleshoot complex production and infrastructure issues across distributed systems
- Collaborate closely with engineering teams to improve scalability, reliability, operational visibility, and infrastructure efficiency
- Create and maintain operational documentation and platform best practices
Requirements:
- 7+ years of experience in platform engineering, infrastructure engineering, or site reliability engineering
- Strong hands-on experience with HashiCorp Nomad
- Deep expertise with Datadog
- Strong experience implementing and operating observability platforms using OpenTelemetry and modern monitoring tooling
- Experience with Grafana or similar visualization and observability platforms
- Strong understanding of distributed tracing, metrics, logging, and monitoring best practices
- Experience building dashboards, alerts, telemetry pipelines, and operational visibility tooling
- Strong experience identifying and implementing AWS cost optimization strategies in production environments
- Strong knowledge of S3 optimization, lifecycle management, and storage cost reduction
- Experience building and running performance/load testing environments
- Strong troubleshooting and performance analysis skills across distributed systems
- Strong experience operating infrastructure in AWS environments
- Strong experience with Terraform and infrastructure-as-code practices
- Experience balancing platform reliability, observability, and infrastructure cost efficiency at scale
- Experience working with distributed and event-driven architectures using technologies such as Redis, SQS, or Temporal
- Experience managing and tuning Elasticsearch or OpenSearch clusters
- Experience working in fast-paced engineering environments
- Strong communication and collaboration skills
- Exposure to migrations from RDS for PostgreSQL to Aurora
- Experience with Kubernetes
- Experience with CI/CD systems and deployment automation
- Experience with Go, Python, or TypeScript