Alpaca is a US-headquartered self-clearing broker-dealer and brokerage infrastructure provider for various financial products. As a Site Reliability Engineer, you will ensure the reliability and performance of our systems and services, collaborating with development and operations teams to maintain robust applications.

Responsibilities:

Triage difficult technical problems and implement solutions
Improve our observability stack (monitoring, logging, profiling)
Incident Management: Respond to and resolve incidents in a timely manner, conducting post-incident reviews to identify and implement improvements
Collaboration: Work closely with development teams to ensure new features and services are designed with reliability and scalability in mind
Capacity Planning: Monitor system capacity and performance, making recommendations and implementing changes to handle future growth

Requirements:

5+ years of experience in Site Reliability Engineering, Performance Engineering, or similar roles
5+ years of experience with multi-terabyte scale PostgreSQL clusters
Proven track record of managing and maintaining large-scale, high-availability, high-performance PostgreSQL databases
Experience designing and implementing SLIs, SLOs, and SLAs for internal systems and databases
Experience troubleshooting PostgreSQL performance problems and slow queries
Extensive experience designing efficient schemas and queries
Experience migrating multi-terabyte tables into more efficient schemas
Proficient with Go
Proficient with Prometheus
Proficient with Linux
Knowledgeable in trading/fintech domains
Experience with low-latency systems
Experience with distributed tracing
Experience scaling PostgreSQL clusters rapidly
Experience with pgx, gorm, or sqlc

Staff Site Reliability Engineer, Database
