Lightfeed Extractor
Robust Web Data Extractor Using LLMs and Browser Automation
Overview
Lightfeed Extractor is a Typescript library built for robust web data extraction using LLMs and Playwright. Use natural language prompts to navigate web pages and extract structured data. Get complete, accurate results with great token efficiency โ critical for production data pipelines.
Features
๐ค Browser Automation in Stealth Mode - Launch Playwright browsers locally, in serverless clouds, or connect to a remote browser server. Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.
๐งญ AI Browser Navigation - Pair with @lightfeed/browser-agent to navigate pages using natural language commands before extracting structured data.
๐งน LLM-ready Markdown - Convert HTML to LLM-ready markdown, with options to extract only main content and clean URLs by removing tracking parameters.
โก๏ธ LLM Extraction - Use LLMs in JSON mode to extract structured data according to input Zod schema. Token usage limit and tracking included.
๐ ๏ธ JSON Recovery - Sanitize and recover failed JSON output. This makes complex schema extraction much more robust, especially with deeply nested objects and arrays.
... continue reading