For RAG engineers
Preview · in development

Clean web data for your RAG pipeline

Planned: strip boilerplate, skip unchanged pages, and emit LLM-ready markdown with metadata that you can chunk and embed with your preferred splitter.

This is a teaser of an upcoming integration. Join the waitlist to get early access and shape the direction.

What you get

Key capabilities

Clean markdown, not HTML soup

Raw HTML typically carries 3-5x more tokens than the readable content: navigation, scripts, inline styles, and tracking snippets. The API will strip that boilerplate and emit semantic markdown that preserves headings, lists, and tables.
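To make the idea concrete, here is a minimal, illustrative sketch of boilerplate stripping using only the Python standard library. It is not the product's implementation, just a toy showing why dropping nav/script/style content shrinks what you pass to a tokenizer:

```python
from html.parser import HTMLParser

# Tags whose contents are boilerplate rather than readable content
# (illustrative list; a real cleaner uses far richer heuristics).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class BoilerplateStripper(HTMLParser):
    """Collect text that appears outside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_boilerplate(html_doc: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html_doc)
    return "\n".join(parser.chunks)

page = "<nav>Home | About</nav><h1>Title</h1><p>Body text.</p><script>track()</script>"
print(strip_boilerplate(page))  # only the heading and body survive
```

Even this naive version drops the navigation and tracking script entirely; the planned API goes further by emitting structured markdown instead of flat text.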

Metadata and change detection

Each page will return the URL, fetch timestamp, detected language, and a content hash computed over the normalized markdown. Target behaviour: skip unchanged pages on re-crawl so you never re-embed content you already processed.
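The change-detection idea can be sketched in a few lines. This is an assumed scheme (whitespace-collapsing normalization, SHA-256, an in-memory cache), not the service's actual hash format:

```python
import hashlib

def content_hash(markdown: str) -> str:
    # Normalize whitespace so cosmetic reflows don't change the hash.
    normalized = " ".join(markdown.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen: dict[str, str] = {}  # url -> last seen hash (stand-in for a real store)

def needs_reembedding(url: str, markdown: str) -> bool:
    """Return True only when the page's content actually changed."""
    h = content_hash(markdown)
    if seen.get(url) == h:
        return False  # unchanged: skip chunking and embedding
    seen[url] = h
    return True

print(needs_reembedding("https://example.com/a", "# Title\n\nBody"))   # first fetch
print(needs_reembedding("https://example.com/a", "# Title\n\n Body ")) # whitespace-only change
```

The second call returns False: a re-crawl that only reflows whitespace produces the same hash, so no embedding work is repeated.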

Structured extraction, not blind chunking

Define the fields you care about (title, section, body, price) and the API will return them as discrete records with stable ids. Chunk with your preferred splitter; the service is designed not to force a chunking strategy on you.
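One way stable ids could work, sketched with assumed names and an assumed id scheme (hash of URL plus section path), so the same section maps to the same record across re-crawls:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Record:
    id: str
    title: str
    section: str
    body: str

def stable_id(url: str, section: str) -> str:
    # Deterministic id derived from URL + section path; this scheme is
    # an assumption for illustration, not the service's actual format.
    return hashlib.sha256(f"{url}#{section}".encode("utf-8")).hexdigest()[:16]

def to_records(url: str, extracted: list[dict]) -> list[Record]:
    """Turn extracted field dicts into discrete records with stable ids."""
    return [
        Record(stable_id(url, e["section"]), e["title"], e["section"], e["body"])
        for e in extracted
    ]

records = to_records(
    "https://example.com/pricing",
    [{"title": "Pricing", "section": "plans/pro", "body": "Pro plan details"}],
)
print(records[0].id)  # same id on every crawl of this URL + section
```

Because the id ignores the body, a changed body updates the existing record in your vector DB instead of creating a duplicate.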

Cost-aware routing across 8 providers

Planned routing: each URL tries the cheapest provider that works, drawn from Firecrawl, Jina, Brightdata, Zyte, Scrapingbee, Oxylabs, ScraperAPI, and Apify, with automatic failover. One API, one bill, no per-provider rewrites when one breaks.
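Cheapest-first routing with failover reduces to a small loop. Everything here is illustrative: the costs are made-up numbers, the provider functions are stubs, and the real service handles this server-side:

```python
class ProviderError(Exception):
    """Raised when a provider blocks, times out, or returns junk."""

def route(url: str, providers):
    """Try providers cheapest-first; fail over to the next on error.

    `providers` is a list of (name, cost_per_request, fetch_fn) tuples.
    """
    for name, cost, fetch in sorted(providers, key=lambda p: p[1]):
        try:
            return name, fetch(url)
        except ProviderError:
            continue  # automatic failover to the next-cheapest provider
    raise ProviderError(f"all providers failed for {url}")

# Stubs standing in for real provider calls (costs are invented).
def cheap_but_blocked(url):
    raise ProviderError("blocked")

def pricier_but_works(url):
    return "# clean markdown"

name, body = route(
    "https://example.com",
    [("provider_a", 0.002, cheap_but_blocked),
     ("provider_b", 0.004, pricier_but_works)],
)
print(name, body)
```

The cheap provider is tried first and fails; the router silently falls through to the next-cheapest one, which is the "one API, one bill" behaviour described above.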

Pipeline

How it fits in your stack

Web pages -> webscraping.app -> clean markdown + records + metadata -> embeddings (OpenAI, Cohere, Jina, Voyage) -> vector DB (Pinecone, Qdrant, pgvector) -> RAG query engine.
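The chain above can be glued together in a few lines. This sketch uses stand-ins throughout: a fixed markdown payload instead of the scraping API, a naive heading splitter instead of your preferred one, a toy embedder instead of OpenAI/Cohere/Jina/Voyage, and a list instead of Pinecone/Qdrant/pgvector:

```python
def split_by_heading(markdown: str) -> list[str]:
    """Naive splitter: one chunk per '## ' section. Swap in your own."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def embed(text: str) -> list[float]:
    # Toy deterministic "embedding" for illustration only;
    # replace with a real embeddings API call.
    return [len(text) / 100.0, text.count(" ") / 10.0]

# Stand-in for the clean markdown the scraping step would return.
markdown = "## Intro\nHello world\n## Usage\nRun it"

# Chunk -> embed -> "index" (a plain list standing in for a vector DB).
index = [(chunk, embed(chunk)) for chunk in split_by_heading(markdown)]
print(len(index))  # one (chunk, vector) pair per section
```

Each stage is a drop-in boundary: the planned API replaces the `markdown` stand-in, and the downstream pieces stay whatever your stack already uses.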

Shape the direction with us

Join the waitlist. Early adopters get direct input on scope and priorities before GA.