Member of Technical Staff, Data Infrastructure
San Mateo, USA
Posted 3 months ago
Full-time, remote
Job Description
The Role
We seek experienced engineers to architect and scale the core infrastructure behind distributed training pipelines and petabyte-scale data catalogs. You'll work directly with researchers to accelerate experiments, develop new datasets, improve infrastructure efficiency, and enable key insights across our data assets.
Key Responsibilities
- Design, build, and operate scalable, fault-tolerant infrastructure for LLM research: distributed compute, data orchestration, and storage across modalities.
- Develop high-throughput systems for data ingestion, processing, and transformation, including training data catalogs, deduplication, quality checks, and search.
- Build systems for web crawling, data ingestion, and real-time data processing to support model training operations.
- Develop tools and frameworks for efficient data storage, retrieval, and versioning across distributed systems.
- Ensure data collection adheres to privacy regulations.
Qualifications
- BS/MS/PhD in Computer Science, Machine Learning, or a related field (or equivalent experience).
- 3+ years of experience building data processing pipelines at scale, particularly with AI/ML applications.
- Strong proficiency in Python and experience with data processing frameworks (Apache Spark, Beam, Airflow).
- Familiarity with synthetic data generation techniques and data augmentation strategies.
- Familiarity with web scraping, crawling technologies, and Common Crawl datasets.
- Solid understanding of machine learning fundamentals and experience with ML frameworks (PyTorch, TensorFlow).
- Experience with SQL and NoSQL databases for managing structured and unstructured data.
Preferred Skills
- Experience with large language models and understanding of tokenization, embeddings, and model architectures.
- Experience managing human annotation workflows and quality control processes.
- Experience with vector databases and embedding-based retrieval systems.
- Knowledge of data privacy regulations and ethical AI practices.
- Experience with distributed computing and large-scale data storage systems (HDFS, S3, BigQuery).