This is a teaser for an upcoming integration. Join the waitlist to get early access and help shape its direction.
What you get
Raw HTML typically carries 3-5x more tokens than the readable content it wraps, most of it navigation, scripts, inline styles, and tracking code. The API will strip that boilerplate and emit semantic markdown that preserves headings, lists, and tables.
Each page will return its URL, fetch timestamp, language, and a content hash computed over the normalized markdown. Target behaviour: skip unchanged pages on re-crawl so you avoid re-embedding content you have already processed.
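A minimal sketch of how a content hash can drive skip-on-unchanged logic on your side. The normalization rules, the SHA-256 choice, and the `markdown` / `url` field names are assumptions for illustration; the service's actual hashing scheme and response shape are not published yet.

```python
import hashlib
import re


def content_hash(markdown: str) -> str:
    """Hash markdown after light normalization (an assumed scheme, not the service's)."""
    # Unify line endings, trim trailing whitespace per line, collapse runs of blank lines.
    text = markdown.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_embed(pages, seen_hashes):
    """Return pages whose hash differs from the stored one; updates seen_hashes in place."""
    changed = []
    for page in pages:
        digest = content_hash(page["markdown"])
        if seen_hashes.get(page["url"]) != digest:
            changed.append(page)
            seen_hashes[page["url"]] = digest
    return changed
```

With this shape, whitespace-only differences between crawls do not trigger re-embedding, while any change to the visible text does.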
Define the fields you care about (title, section, body, price) and the API will return them as discrete records with stable ids. Chunk with your preferred splitter; the service is designed not to force a chunking strategy on you.
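To illustrate bringing your own splitter, here is a sketch that chunks a record's body by paragraph and keeps each chunk traceable to its record. The `Record` shape, the `rec-1`-style id, and the `chunk_id` format are hypothetical; the real record schema has not been announced.

```python
from dataclasses import dataclass


@dataclass
class Record:
    id: str        # hypothetical stable id assigned by the service
    title: str
    section: str
    body: str


def split_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def chunk_record(record: Record, max_chars: int = 500) -> list[dict]:
    """Attach the record id and a chunk index so each chunk maps back to its source."""
    return [
        {"chunk_id": f"{record.id}#{i}", "section": record.section, "text": chunk}
        for i, chunk in enumerate(split_paragraphs(record.body, max_chars))
    ]
```

Because the ids are meant to be stable across crawls, a chunk id derived from them can serve as an upsert key in your vector store. Any other splitter can replace `split_paragraphs` without touching the rest.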
Planned routing: for each URL, the API will try the cheapest working provider among Firecrawl, Jina, Bright Data, Zyte, ScrapingBee, Oxylabs, ScraperAPI, and Apify, with automatic failover. One API, one bill, no per-provider rewrites when one breaks.
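The cheapest-first routing idea can be sketched as a loop over providers ordered by cost. This is an illustration only: the `(name, cost, fetch_fn)` tuple, the cost figures, and `AllProvidersFailed` are made up here, and the fetch functions stand in for real provider clients rather than calling any actual provider SDK.

```python
from typing import Callable

# (provider name, assumed cost per request, function that fetches a URL)
Provider = tuple[str, float, Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed for a URL."""


def fetch_with_failover(url: str, providers: list[Provider]) -> tuple[str, str]:
    """Try providers from cheapest to most expensive; return (name, content) of the first success."""
    errors = {}
    for name, _cost, fetch in sorted(providers, key=lambda p: p[1]):
        try:
            return name, fetch(url)
        except Exception as exc:  # any failure moves on to the next provider
            errors[name] = exc
    raise AllProvidersFailed(f"no provider could fetch {url}: {errors}")
```

A production router would also need timeouts, per-site success statistics, and retry limits, none of which this sketch attempts.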
Pipeline
Web pages -> webscraping.app -> clean markdown + records + metadata -> embeddings (OpenAI, Cohere, Jina, Voyage) -> vector DB (Pinecone, Qdrant, pgvector) -> RAG query engine.
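The downstream half of that pipeline can be sketched end to end with stand-ins. Everything here is a toy assumption: `toy_embed` replaces a real embedding model with a bag-of-words count, and `InMemoryIndex` replaces a vector DB with a linear scan, so the example runs without any external service.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Stand-in for a vector DB: stores (url, text, vector) rows and scans them linearly."""

    def __init__(self):
        self.rows = []

    def add(self, url: str, text: str) -> None:
        self.rows.append((url, text, toy_embed(text)))

    def query(self, question: str, top_k: int = 1):
        q = toy_embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[2]), reverse=True)
        return [(url, text) for url, text, _vec in ranked[:top_k]]


index = InMemoryIndex()
index.add("https://example.com/pricing", "pricing plans start at ten dollars")
index.add("https://example.com/docs", "install the client library with pip")
top_hit = index.query("how do I install the client")[0]
```

In a real deployment the retrieved text, along with its URL and fetch timestamp, would be passed to the RAG query engine as context.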