For ML data engineers · Preview · in development

Web-to-JSONL training data pipelines

Planned: domain-specific dataset curation with schema-aligned rows, provenance on every record, and fallback-resilient extraction across 8 providers.

This is a teaser of an upcoming integration. Join the waitlist to get early access and shape the direction.

What you get

Key capabilities

Schema-aligned rows with quarantine

The bulk API will route across 8 extraction providers with fallback on failure. Rows will conform to a user-defined schema; malformed extractions will be quarantined, not silently dropped, so eval sets stay intact.
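Since the API is not yet released, here is a minimal illustrative sketch of the quarantine idea: records are checked against a user-defined schema (here a hypothetical dict of required keys and types), and malformed ones are set aside rather than dropped.

```python
# Hypothetical sketch: partition extracted records into schema-conforming
# rows and a quarantine bucket. The schema shape and field names are
# illustrative assumptions, not the product's actual API.
SCHEMA = {"url": str, "title": str, "text": str}

def partition_rows(records, schema=SCHEMA):
    rows, quarantine = [], []
    for rec in records:
        # A record conforms if every schema key is present with the right type.
        ok = isinstance(rec, dict) and all(
            isinstance(rec.get(key), typ) for key, typ in schema.items()
        )
        (rows if ok else quarantine).append(rec)
    return rows, quarantine
```

Because quarantined records are kept rather than discarded, you can inspect them later without re-running the extraction.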

Provenance on every row

Planned provenance fields: source_url, fetched_at, language, provider, http_status, content_hash, request_id. Designed so you can dedup on content_hash, filter soft-404s on http_status, and audit any row back to its fetch.
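A sketch of the two filtering patterns those fields are designed for, assuming rows are plain dicts carrying the planned `http_status` and `content_hash` keys:

```python
def dedup_and_filter(rows):
    """Drop soft-404s and exact duplicates using planned provenance fields.

    Illustrative only: assumes each row is a dict with `http_status` and
    `content_hash` keys, per the planned field list.
    """
    seen = set()
    kept = []
    for row in rows:
        if row["http_status"] != 200:
            continue  # filter soft-404s and other non-200 fetches
        if row["content_hash"] in seen:
            continue  # exact-duplicate content, keep first occurrence
        seen.add(row["content_hash"])
        kept.append(row)
    return kept
```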

Quality signals exposed

Target quality signals per row: language detection, encoding validation, and content-length stats. Filter before ingestion, then plug into your dedup and quality-scoring stage in Datatrove or NeMo Curator.
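A pre-ingestion filter over those signals might look like this; the field names (`language`, `encoding_valid`, `content_length`) and thresholds are assumptions for illustration:

```python
def passes_quality(row, allowed_langs=frozenset({"en"}),
                   min_chars=200, max_chars=100_000):
    # Hypothetical gate on the exposed per-row quality signals:
    # language detection, encoding validation, and content-length stats.
    return (
        row.get("language") in allowed_langs
        and row.get("encoding_valid", False)
        and min_chars <= row.get("content_length", 0) <= max_chars
    )
```

Rows surviving this gate would then feed into the dedup and quality-scoring stages of Datatrove or NeMo Curator.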

Resilient at scale

Designed to keep long bulk runs moving through transient provider failures without restarting the job, using per-provider rate limiting, circuit breakers, and Temporal-based resume-from-checkpoint.
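To make the circuit-breaker idea concrete, here is a minimal standalone sketch (not the product's implementation): after a run of consecutive failures a provider is skipped for a cooldown window, so the bulk run falls back to other providers instead of retrying a failing one.

```python
class CircuitBreaker:
    """Illustrative per-provider circuit breaker.

    After `max_failures` consecutive failures, `allow()` returns False for
    `cooldown` seconds, then permits a trial request (half-open state).
    Time is passed in explicitly to keep the sketch deterministic.
    """

    def __init__(self, max_failures=3, cooldown=60.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            # Half-open: reset and permit a trial request.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success, now):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now  # open the circuit
```

Checkpoint-based resume (Temporal in the planned design) handles the complementary failure mode: a killed job restarts from the last completed batch rather than from row zero.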

Shape the direction with us

Join the waitlist. Early adopters get direct input on scope and priorities before GA.