Research Crawling Engineer

About the Role

You will design, build, and operate large-scale web data acquisition systems for research and model development. You will implement and maintain distributed crawlers, handle anti-bot and rate-limiting challenges, extract and normalize data from dynamic sites, and build pipelines for cleaning, deduplication, and dataset construction. You will monitor crawl performance and data quality, optimize infrastructure for cost and reliability, and collaborate with researchers to meet modeling needs.
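The cleaning-and-deduplication step described above can be sketched minimally. This is a hypothetical illustration, not the team's actual pipeline: it assumes exact-duplicate detection by hashing whitespace- and case-normalized text (production systems typically add near-duplicate methods such as MinHash on top of this):

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies hash alike.
    return " ".join(text.lower().split())

def dedupe(docs):
    # Keep the first occurrence of each normalized document; drop exact duplicates.
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing normalized text rather than comparing documents pairwise keeps the step linear in corpus size, which matters at TB–PB scale.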

Requirements

  • Strong programming experience in one or more of Go, Rust, Python, Java, or C++
  • Experience building web crawlers or large-scale data pipelines
  • Solid understanding of HTTP networking and browser behavior
  • Familiarity with distributed systems and parallel processing
  • Experience working with large datasets (TB–PB scale preferred)
  • Ability to debug unstable or adversarial environments

Responsibilities

  • Build and maintain large-scale web crawlers
  • Design high-throughput, fault-tolerant systems for data collection
  • Handle anti-bot systems, rate limits, and dynamic, JS-heavy sites
  • Develop pipelines for cleaning, deduplication, filtering, and normalization
  • Construct and maintain datasets for research and model training
  • Monitor crawl performance, coverage, and data quality, and iterate quickly
  • Collaborate with research teams to align data collection with modeling needs
  • Optimize infrastructure for cost, latency, and reliability
  • Own end-to-end data acquisition pipelines
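As a flavor of the rate-limiting work above, here is a minimal per-host politeness-delay sketch. The class name `HostRateLimiter` and the fixed-interval policy are illustrative assumptions; real crawlers usually combine this with robots.txt rules and adaptive backoff:

```python
import time
from collections import defaultdict

class HostRateLimiter:
    """Enforce a minimum delay between successive requests to the same host.

    NOTE: hypothetical sketch; a fixed per-host interval stands in for
    whatever adaptive policy a production crawler would use.
    """

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self.last_request = defaultdict(float)  # host -> monotonic timestamp

    def wait(self, host: str) -> float:
        # Sleep just long enough to honor the per-host interval;
        # return the delay actually imposed.
        now = time.monotonic()
        elapsed = now - self.last_request[host]
        delay = max(0.0, self.min_interval - elapsed)
        if delay:
            time.sleep(delay)
        self.last_request[host] = time.monotonic()
        return delay
```

Keying the delay on the host rather than the whole crawl lets many hosts be fetched in parallel while each individual site still sees a bounded request rate.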

Benefits

  • Benefits package
  • Equity package
  • Fully remote work