Alpaca is a US-headquartered self-clearing broker-dealer and brokerage infrastructure provider for various financial products. As a Site Reliability Engineer, you will ensure the reliability and performance of our systems and services, collaborating with development and operations teams to maintain robust applications.
Responsibilities:
- Triage difficult technical problems and implement solutions
- Improve our observability stack (monitoring, logging, profiling)
- Incident Management: Respond to and resolve incidents in a timely manner, conducting post-incident reviews to identify and implement improvements
- Collaboration: Work closely with development teams to ensure new features and services are designed with reliability and scalability in mind
- Capacity Planning: Monitor system capacity and performance, making recommendations and implementing changes to handle future growth
Requirements:
- 5+ years of experience in Site Reliability Engineering, Performance Engineering, or similar roles
- 5+ years of experience with multi-terabyte scale PostgreSQL clusters
- Proven track record of managing and maintaining large-scale, high-availability, high-performance PostgreSQL databases
- Experience designing and implementing SLIs, SLOs, and SLAs for internal systems and databases
- Experience troubleshooting PostgreSQL performance problems and slow queries
- Extensive experience with efficient schema and query design
- Experience migrating multi-terabyte tables into more efficient schemas
- Proficient with Go
- Proficient with Prometheus
- Proficient with Linux
- Knowledgeable in trading/fintech domains
- Experience with low-latency systems
- Experience with distributed tracing
- Experience scaling PostgreSQL clusters rapidly
- Experience with pgx, gorm, or sqlc