AWS, Cloud, Grafana, Java, Prometheus, Splunk, EKS, RDS, Datadog, OpenTelemetry, CI/CD, Remote Work
About this role
Role Overview
Responding to production incidents
Working with business partners to answer application-specific questions
Promoting availability, resilience, and stability
Designing and implementing observability solutions, including monitoring, logging, alerting, and distributed tracing, using tools such as Prometheus, Grafana, ELK/OpenSearch, OpenTelemetry, Datadog, and Splunk
Instrumenting applications and infrastructure to provide end-to-end visibility into system health, performance, and reliability
Analyzing and reverse-engineering existing applications to understand system behavior, integrations, and dependencies
Continuously evaluating emerging technologies, tools, and industry trends to improve platform reliability and operational efficiency
Requirements
Bachelor’s degree or higher in a technology-related field (such as Engineering, Computer Science, or Information Technology) required
Minimum 5 years of combined experience across Production Support, Application Development (Java), and Site Reliability Engineering (SRE)
Build, manage, and optimize resilient, scalable cloud platforms using AWS-native services
3 years of hands-on experience with Amazon EKS and RDS
Lead and execute cloud migration initiatives
Implement and maintain CI/CD pipelines
Ensure platforms meet high availability, scalability, fault tolerance, and disaster recovery requirements
Design, implement, and continuously improve observability solutions
Proactively identify performance bottlenecks, capacity risks, and failure points
Lead incident response
Conduct root cause analysis (RCA) for critical incidents
Collaborate closely with development, infrastructure, security, and business teams