ICD Portal is part of Tradeweb, a global leader in electronic trading across asset classes. As a Senior Site Reliability Engineer (SRE), you will ensure the reliability and seamless operation of Tradeweb's global platform and AWS infrastructure, focusing on developing highly reliable software systems and improving overall system performance.
Responsibilities:
- High-Performance Engineering Organization: Contribute to an Agile-Agentic organization through planning, grooming, and story ideation that lead to iterative improvements of our platform
- Security: Prioritize security in all aspects of work, treating it as the foundational consideration in every task
- IaC Automation and Tooling: Practice GitSecOps by contributing to the development and delivery of a highly available platform through automation. Continually improve the reliability and efficiency of systems through iterative processes while reducing toil
- Reliability Engineering: Work to ensure the reliability and availability of systems. Develop and maintain monitoring tools, analyze system performance, and implement solutions to improve overall system reliability
- Incident Triage and Resolution: Triage issues, assess risk, and prioritize remediation with service teams. Take full ownership and drive resolution of production, quality engineering and development-related infrastructure issues
- Communication and Collaboration: Communicate issue statuses effectively to both R&D and non-technical audiences. Manage context switching when required. Collaborate closely with software development teams to influence architecture and design decisions that affect the reliability and performance of systems
- Observability: Define and measure Service Level Objectives (SLOs) to ensure that systems meet reliability standards, and develop the observability tools needed to support those SLOs
- On-call Responsibilities: Fulfill regular on-call duties to enable high system availability
Requirements:
- 6+ years of technology operations and engineering experience, or equivalent (ArgoCD, Kustomize, Pulumi, K8s, LGTM)
- 4+ years of scripting/coding experience in any modern language (Python preferred)
- 4+ years in an SRE or similar individual-contributor role supporting public cloud (AWS) and cloud-native technologies (Lambda, EKS, SNS, SMS, etc.)
- Bachelor's degree or higher in Computer Science or a related field
- Cloud-based virtualization expertise, particularly with AWS native services
- Strong multitasking skills in a dynamic environment
- Proven ability to work independently with a proactive, task-ownership approach, applying critical and creative thinking
- Collaborative mindset, adept at negotiating, influencing, and developing partnerships within a team environment
- Knowledge of information, network, and Internet security, including threat modeling, cloud architecture, web protocols, and common attack surfaces
- Deep understanding of Linux/Unix tools and architecture
- Demonstrated proficiency in designing, implementing, and troubleshooting diverse network infrastructures, with comprehensive knowledge of protocols, performance and routing