Treeswift helps energy companies modernize their field work by deploying sensors and processing the collected data with AI models. The Site Reliability and Infrastructure Engineer will help scale and harden the platform that manages data pipelines and machine learning training, ensuring operational reliability and observability as data volume grows.
Responsibilities:
- Partner with the data platform and engineering teams to understand how changes propagate across pipeline execution (Astronomer-hosted Airflow DAGs), containerized workers (Kubernetes), and AWS services (S3, SQS, Lambda, Step Functions, ECS)
- Design and implement reliability and observability for high-volume pipeline operations, including:
  - actionable monitoring/alerting for DAG/task failures and reruns
  - visibility into operational workflows like flight orchestration (including DLQ/failed-message alerting and notification pathways)
  - dashboards and SLO/SLI definitions focused on correctness, throughput, and pipeline health
- Own CI/CD guardrails for production changes: build/deploy validation and safe rollout mechanics for Astronomer deployments (image builds pushed to ECR and Airflow configuration changes applied as variable updates through the Astronomer CLI)
- Make machine learning inference operations more reliable and observable:
  - instrument inference runs executed inside pipeline runners (model checkpoint resolution, S3 sync behavior, thresholds and fallback behavior, and output correctness)
  - add operational visibility for inference outcomes (e.g., unknown classification rates, fallback usage, and failure modes)
- Create operational tooling and continuously improve systems (‘leave it better than you found it’), including:
  - runbooks, incident learnings, and engineering standards for debugging at scale
  - automation that removes toil from deployment and operations workflows as we learn what hurts most
- There is currently no established on-call rotation for this platform, and the pipelines do not require real-time processing. That said, you’ll still help lead reliability improvements and operational readiness, so the team has faster diagnosis, better alerts, and safer releases when issues do occur