TherapyNotes is a leading provider of behavioral health Practice Management and EHR software, seeking a Database Site Reliability Engineer. In this role, you will ensure the reliability and operability of PostgreSQL services for a 24x7 SaaS platform, collaborating with cross-functional teams to enhance performance and incident response.
Responsibilities:
- Design, implement, and maintain high-availability, high-throughput, data- and compute-intensive critical database systems running PostgreSQL that support a growing 24x7 SaaS platform
- Define and improve database service reliability through monitoring/alerting, SLO-oriented metrics, and operational readiness
- Participate in and help drive incident response, root cause analysis, and post-incident corrective actions for database-related production events
- Partner with other technical leaders to ensure all newly introduced systems are supportable and maintainable by both development and operations
- Provide escalated technical guidance and support to other technology teams throughout the organization
- Provide on-call coverage for production support and perform other duties as required
- Be accountable for compliance with HIPAA security policies within the database platform
- Ensure all solutions and operational activities adhere to the security and operating policies established by the organization
- Own and continuously improve our Datadog database observability by building actionable dashboards, alerts, and service-level views using an observability stack (e.g., Prometheus, Grafana, New Relic, or equivalent). Familiarity with PGAnalyze or Percona is a plus
- Automate system maintenance tasks using Bash, PowerShell, Python, or Ansible. Manage infrastructure as code (IaC) by writing Ansible playbooks. Some exposure to Terraform is a plus
- Experience designing and writing ETL pipelines in Python is a plus
- Understand and maintain PostgreSQL ecosystem components such as PgBouncer, pgBackRest, HAProxy, and repmgr (a plus)
- Excellent communication and interpersonal skills
Requirements:
- Strong skill set in managing PostgreSQL
- Focus on the reliability and operability of PostgreSQL services supporting a growing 24x7 SaaS platform, with emphasis on availability, performance, observability, incident response, and automation
- BS degree in Information Systems, Engineering, or equivalent experience
- 7-10+ years of engineering experience in database engineering, systems engineering, DevOps, and/or SRE
- Experience in cloud-based compute, storage, and containerization solutions
- Proficiency with operating PostgreSQL in a Linux environment
- Expertise with an observability/monitoring platform
- Experience working in Agile/DevOps environments and operating production services with ITSM practices where applicable
- Familiarity with PGAnalyze or Percona is a plus
- Some exposure to Terraform is a plus
- Experience designing and writing ETL pipelines in Python is a plus
- Ability to understand and maintain PostgreSQL ecosystem components such as PgBouncer, pgBackRest, HAProxy, and repmgr is a plus
- Azure and Kubernetes experience preferred
- Datadog experience is a plus