Job Title - Network SRE (Site Reliability Engineer, Networking)
Location - Phoenix, AZ (Hybrid)
Experience - 10-16+ years in Network Engineering with SRE / Automation focus
Role Summary
- We are seeking a Network SRE to ensure the reliability, scalability, and performance of cloud and hybrid network platforms.
- This role applies SRE principles to networking by shifting from manual network operations to automated, observable, and resilient network services.
- The ideal candidate is a network engineer who thinks like a software engineer and SRE.
Key Responsibilities
Network Reliability Engineering
- Define SLIs, SLOs, and Error Budgets for network services.
- Design networks for:
  - High availability
  - Fault tolerance
  - Low latency
  - Predictable performance
- Improve network reliability while reducing operational toil.
Cloud & Hybrid Networking
- Architect and operate AWS networking:
  - VPCs, Subnets, Route Tables
  - Transit Gateway
  - NAT, IGW
  - PrivateLink, VPC Endpoints
- Design hybrid connectivity:
  - VPN
  - Direct Connect
- Support multi-account and multi-region architectures.
Network Observability & Monitoring
- Build deep network observability using:
  - VPC Flow Logs
  - CloudWatch
  - Datadog
  - Prometheus / Grafana
- Analyze packet loss, latency, and throughput.
- Implement proactive alerting based on SLOs.
- Correlate network signals with application performance.
Automation & Infrastructure as Code
- Automate network provisioning and changes using:
  - Terraform / CloudFormation
- Implement CI/CD for network changes.
- Reduce manual configuration and human error.
- Version-control network definitions.
Incident Response & Troubleshooting
- Lead network-related incident response.
- Perform deep root-cause analysis for:
  - Packet drops
  - Routing issues
  - DNS failures
  - Load balancer degradation
- Participate in on-call rotation and post-incident reviews.
- Drive permanent fixes rather than workarounds.
Security & Traffic Management
- Design and enforce:
  - Network segmentation
  - Zero-Trust principles
  - Firewall rules (Security Groups, NACLs)
- Implement secure ingress/egress patterns.
- Support DDoS protection (AWS Shield, WAF).
- Work with Security teams on audits and remediation.
Performance & Capacity Planning
- Conduct traffic modeling and capacity forecasting.
- Tune load balancers (ALB, NLB).
- Optimize routing and failover strategies.
- Validate resilience through failure testing.
Collaboration & Enablement
- Partner with:
  - Cloud Platform teams
  - Application SREs
  - Security & Infra teams
- Enable application teams with network best practices.
- Produce architecture diagrams, runbooks, and SOPs.
- Influence platform design decisions.
Required Skills & Qualifications
Must-Have
- Strong networking fundamentals (TCP/IP, DNS, BGP, routing)
- AWS networking expertise
- SRE concepts & practices
- Network observability & monitoring
- Infrastructure as Code
- Production incident handling experience