EPAM Systems is a large Wealth Management firm seeking an experienced Site Reliability Engineer to support feature development on its newly built Trading Platform. The role involves implementing DevOps and SRE best practices, managing monitoring solutions, and collaborating with application teams to ensure performance and availability.

Responsibilities:

Implement and champion DevOps and SRE best practices across the organization
Drive technology roadmap discussions for the SRE team
Define, craft, and maintain SLIs and SLOs, along with key metrics including MTTR, Lead Time for Change, Deployment Frequency, and Change Failure Rate
Design, develop, and manage monitoring, alerting, and observability solutions using Dynatrace, Splunk, and Grafana
Conduct performance assessments, identify bottlenecks, and recommend enhancements to improve system performance
Partner with application teams to enforce performance and availability SLAs
Collaborate with product owners to manage error budgets, prioritize toil backlogs, and validate against team, application, and incident metrics
Participate in an on-call rotation to respond to production events and outages
Continuously improve CI/CD pipelines and deployment processes
Lead troubleshooting efforts, incident management, and root cause analysis
Identify and build automated processes wherever possible
Implement cybersecurity measures through ongoing vulnerability assessments and risk management
Provide periodic progress reports to management and stakeholders
Partner with application teams to support and ease their adoption of the platform
Facilitate clear coordination and communication within the team and with customers
Analyze existing systems and develop plans for enhancements and improvements

Requirements:

Bachelor's degree in Computer Science or a related field, and/or equivalent work experience
5+ years of experience working within DevOps or SRE teams
Proven experience supporting production infrastructure
Strong knowledge of CI/CD principles and pipelines
Solid understanding of observability concepts, including monitoring, logging, and tracing
Hands-on experience with Dynatrace and Splunk
Experience with at least one major cloud provider (AWS, Azure, or GCP)
Demonstrated experience operating high-availability, fault-tolerant, scalable, and distributed systems in production
