Improve resiliency engineering practices across platforms and applications, including resilient application design patterns, system observability, and deployment strategies.
Detect, troubleshoot, and resolve incidents.
Develop automation for incident response and infrastructure management.
Develop and support OpenTelemetry integrations for multiple application platforms (browser, ECS, Lambda, etc.) and languages (JavaScript, Java).
Contribute to architectural decisions and support implementation of solutions.
Requirements
Expertise in JavaScript (server-side and client-side execution environments) or Java.
Working knowledge of Python (or a similar scripting language).
Strong knowledge of resiliency engineering techniques for both platforms and applications.
Experience troubleshooting complex production issues and implementing effective mitigations.
Hands-on experience with AWS services and cloud infrastructure.
Familiarity with the OpenTelemetry specification and core APIs.
Practical experience developing and operating software in distributed systems environments.