Docusign is a leading company in e-signature and contract lifecycle management, transforming into an Intelligent Agreement Management platform. The Principal Software Engineer will lead the evolution of the Site Reliability Engineering organization, focusing on building reliable systems and mentoring other engineers.
Responsibilities:
- Lead and code with the team
- Lead the cultural and technical shift toward treating reliability as a product feature
- Move the org away from reactive "ops" work toward building durable platforms and self-healing systems
- Bring elite Incident Commander skills: although not expected to be in the daily on-call rotation, step in during high-stakes outages to bring calm and clarity, and use those experiences to architect systems that ensure those incidents never happen again
- Define the "Golden Paths" for our cloud migration, ensuring that as Docusign scales globally, our architecture remains "Multi-Active" and resilient to regional cloud failures
- Challenge the status quo, mentoring Senior and Staff SREs to think like software architects
- Advocate for "Error Budgets" that have real teeth, influencing product roadmaps to prioritize long-term stability
Requirements:
- 15+ years of experience in large-scale distributed systems, software engineering, or infrastructure roles, with a track record of driving system architecture
- A software engineer by trade with deep proficiency in Go or Python, a 'code-first' approach, and a passion for writing production-grade automation services alongside the engineering team
- Proven technical leadership in building global, active-active distributed systems at hyperscale, functioning simultaneously as an architect and an engineering peer
- Production-hardened mastery of Kubernetes and Terraform for managing complex, multi-tenant cloud topologies
- Experience acting as a primary Lead Incident Commander for tier-0 global outages, with the ability to translate operational chaos into actionable technical stabilization
- Experience defining 'Developer Experience' strategies and contributing to Internal Developer Platforms (IDPs) that bake resilience and infrastructure abstractions directly into developer workflows
- Technical expertise executing high-stakes on-premises-to-cloud migrations on Microsoft Azure (specifically using Azure Kubernetes Service (AKS) and Azure traffic routing)
- Hands-on experience architecting global distributed tracing capabilities using the OpenTelemetry ecosystem to track deep, user-centric SLO metrics across microservices
- Experience developing self-healing infrastructure patterns through a blend of deterministic code and AI-assisted/predictive anomaly remediation models
- Experience championing and setting up automated fault-injection frameworks to proactively prove system recoverability before a real production incident occurs
- Experience building safe deployment architectures (Canary, Blue/Green) managed via secure pipelines (GitHub Actions, Azure DevOps) with automated safety policies embedded directly into the code lifecycle