Poolside is a company focused on building a world where AI drives economically valuable work and scientific progress. The role involves working with the pre-training data team to acquire high-quality pre-training data for frontier models, ensuring alignment with training needs and building systems for efficient data acquisition.

Responsibilities:

Design, build, and operate a large-scale web crawler responsible for acquiring all openly accessible data on the internet
Develop specialized deep crawlers targeting high-value sources to improve recall and coverage
In collaboration with data researchers, own the long-term roadmap for data acquisition
Build observability, monitoring, and debugging tooling to ensure reliability and transparency across crawl infrastructure
Collaborate with pre-training, post-training, and evaluations teams to align data acquisition priorities with model training needs
Build high-throughput ingestion pipelines for rapidly onboarding partner data and evaluating it for quality

Requirements:

Strong distributed systems background with proven experience building and operating large-scale infrastructure — data pipelines, web crawlers, or similar
Proficiency in Python, with experience optimizing performance and debugging complex systems under production conditions
Hands-on experience with web crawling or large-scale data extraction: understanding of HTTP protocols, distributed job queues, and data parsing at scale
Familiarity with cloud platforms (AWS) and container orchestration (Kubernetes, Docker) for deploying and managing high-throughput workloads
Awareness of the non-technical dimensions of internet-scale crawling: data privacy, robots.txt adherence, and responsible crawl practices
Prior experience pre-training LLMs
Experience in building trillion-scale SOTA pre-training datasets
Experience translating research to production at scale

Member of Engineering (Pre-training / Data Acquisition)
