Okta is a company that secures AI by building trusted infrastructure for organizations. They are seeking a Senior Database Reliability Engineer (DBRE) to design, operationalize, and optimize the data persistence layer for their mission-critical systems, working closely with various teams to ensure performance and reliability.
Responsibilities:
- Design, implement, and operate highly available PostgreSQL clusters (physical replication, logical replication, sharding/partitioning, failover automation)
- Optimize query performance, indexing strategies, schema design, and storage engines
- Perform capacity planning, growth forecasting, and workload modeling
- Own high-availability strategies including automatic failover, multi-AZ/multi-region setups, and disaster recovery
- Automate operational tasks, including provisioning, configuration, backups, failovers, vacuum tuning, and schema management, with tools such as Terraform, Ansible, Kubernetes Operators, or custom tooling
- Build monitoring, alerting, and self-healing systems for PostgreSQL and MySQL
- Lead the response to database incidents such as performance regressions, replication lag, deadlocks, bloat, and storage failures
- Conduct root-cause analysis and implement permanent fixes
- Partner with software engineers to review SQL, optimize schemas, and ensure efficient use of PostgreSQL features
- Provide guidance on database-related design patterns, migrations, version upgrades, and best practices
Requirements:
- 4+ years of hands-on PostgreSQL experience in high-volume, distributed, or large-scale production environments
- Strong knowledge of PostgreSQL internals (WAL, MVCC, bloat/vacuum tuning, query planner, indexing, logical replication)
- Production experience with MySQL (InnoDB internals, replication, performance tuning)
- Advanced SQL and strong understanding of schema design and query optimization
- Experience with Linux systems, networking fundamentals, and systems troubleshooting
- Experience building automation with Go or Python
- Production experience with monitoring tools (Prometheus, Grafana, Datadog, PMM, pg_stat_statements, etc.)
- Hands-on experience with cloud environments (AWS or GCP)
- Experience with PgBouncer, HAProxy, or other connection-pooling/load-balancing layers
- Exposure to event streaming (Kafka, Debezium) and change data capture
- Experience supporting 24/7 production environments with on-call rotation
- Contributions to the open-source PostgreSQL ecosystem