This is a teaser of an upcoming integration. Join the waitlist to get early access and shape the direction.
What you get
The bulk API will route across 8 extraction providers with fallback on failure. Rows will conform to a user-defined schema; malformed extractions will be quarantined, not silently dropped, so eval sets stay intact.
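The quarantine behavior can be sketched in a few lines. The schema, field names, and rows below are hypothetical stand-ins, not the actual API:

```python
# Hypothetical user-defined schema: required fields mapped to expected types.
SCHEMA = {"url": str, "title": str, "text": str}

def validate(row: dict) -> bool:
    """True if the row carries every schema field with the right type."""
    return all(isinstance(row.get(field), typ) for field, typ in SCHEMA.items())

def partition(rows):
    """Split rows into (valid, quarantined) rather than dropping failures."""
    valid, quarantined = [], []
    for row in rows:
        (valid if validate(row) else quarantined).append(row)
    return valid, quarantined

rows = [
    {"url": "https://example.com/a", "title": "ok", "text": "body text"},
    {"url": "https://example.com/b", "title": None, "text": "body text"},  # malformed
]
good, bad = partition(rows)
```

Because malformed rows land in a quarantine bucket instead of vanishing, an eval set built over the input URLs stays auditable end to end.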
Planned provenance fields: source_url, fetched_at, language, provider, http_status, content_hash, request_id. Designed so you can dedup on content_hash, filter soft-404s on http_status, and audit any row back to its fetch.
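A minimal sketch of the intended downstream use, assuming each row arrives as a dict with those provenance fields and that failed or soft-404 fetches surface as a non-200 `http_status`:

```python
def dedup_and_filter(rows):
    """Drop rows whose fetch did not return 200, then keep the first
    row seen per content_hash (exact-duplicate removal)."""
    seen, kept = set(), []
    for row in rows:
        if row["http_status"] != 200:
            continue
        if row["content_hash"] in seen:
            continue
        seen.add(row["content_hash"])
        kept.append(row)
    return kept

rows = [
    {"source_url": "https://example.com/a", "http_status": 200, "content_hash": "h1"},
    {"source_url": "https://example.com/a2", "http_status": 200, "content_hash": "h1"},  # duplicate body
    {"source_url": "https://example.com/gone", "http_status": 404, "content_hash": "h2"},
]
kept = dedup_and_filter(rows)
```

The `request_id` and `fetched_at` fields are not used here, but they are what lets you trace any surviving row back to the fetch that produced it.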
Target quality signals per row: language detection, encoding validation, and content-length stats. Filter before ingestion, then plug into your dedup and quality-scoring stage in Datatrove or NeMo Curator.
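A pre-ingestion filter over those signals might look like the following sketch. The thresholds and the reliance on the row's `language` provenance field (rather than re-detecting language) are assumptions for illustration:

```python
def passes_quality(row, allowed_langs=("en",), min_chars=200, max_chars=1_000_000):
    """Gate a row on language, encoding validity, and content length."""
    text = row["text"]
    # Encoding validation: text must round-trip cleanly through UTF-8
    # (catches lone surrogates left over from a bad decode upstream).
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    # Language filter reuses the per-row language signal.
    if row["language"] not in allowed_langs:
        return False
    # Content-length bounds catch empty shells and runaway pages.
    return min_chars <= len(text) <= max_chars

ok_row = {"text": "x" * 500, "language": "en"}
short_row = {"text": "hi", "language": "en"}
fr_row = {"text": "x" * 500, "language": "fr"}
```

Rows that pass this gate would then feed the fuzzy-dedup and quality-scoring stages in Datatrove or NeMo Curator.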
Designed to keep long bulk runs moving through transient provider failures without restarting the job, using per-provider rate limiting, circuit breakers, and Temporal-based resume-from-checkpoint.
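The circuit-breaker piece can be sketched standalone; the thresholds are illustrative, and in the actual design this state would live inside Temporal activities with retry policies rather than in-process:

```python
import time

class CircuitBreaker:
    """Per-provider breaker: opens after `threshold` consecutive failures,
    then allows a probe request once `cooldown` seconds have passed."""

    def __init__(self, threshold=5, cooldown=60.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def available(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True  # closed: provider is healthy
        return now - self.opened_at >= self.cooldown  # half-open probe

    def record(self, ok, now=None):
        now = time.monotonic() if now is None else now
        if ok:
            self.failures, self.opened_at = 0, None  # reset on success
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now  # trip the breaker
```

A router would consult `available()` before dispatching to a provider and fall back to the next one while the breaker is open, so a flapping provider slows one lane instead of stalling the whole run.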