How to Use AI (LLM) Extraction

Describe what you want to extract in plain English and let the AI figure out the rest. Works on any page, including JavaScript-heavy and unstructured content.

What is it?

AI extraction lets you write a plain English instruction describing what data to pull from a page. Verid sends the page content and your prompt to a fast, JSON-capable LLM (with an automatic fallback model if the primary fails), which reads the content and returns structured JSON.

No selectors. No regex patterns. No inspecting the DOM. Just describe what you want.

When to use it

The page is rendered by JavaScript and the content doesn't exist in the raw HTML (CSS/XPath won't work)
The data is written in natural language - descriptions, summaries, prose - not a labelled field
The page structure varies or changes over time and you need something resilient
You want to extract something complex: "the names and prices of all featured products" or "what the current system status is in one word"
You need to interpret or classify content, not just extract a literal string

Limitations

Slightly slower than other methods (LLM inference takes a few extra seconds)
Results are cached for 30 days - if the page content and the prompt are both identical to a previous run, the LLM is not re-run
Page content is truncated at 50,000 characters before being sent to the model
Requires clearer prompts when the page has a lot of irrelevant content
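
The caching and truncation limits above can be pictured with a small sketch. This is not Verid's actual implementation: the function names and the idea of hashing the truncated content together with the prompt are assumptions made for illustration. Only the 50,000-character limit comes from this guide.

```python
import hashlib

MAX_CONTENT_CHARS = 50_000  # truncation limit stated in this guide


def prepare_content(page_text: str) -> str:
    """Cut the page text down to the documented limit before it reaches the model."""
    return page_text[:MAX_CONTENT_CHARS]


def cache_key(page_text: str, prompt: str) -> str:
    """Hypothetical cache key: SHA-256 over the truncated content plus the prompt."""
    digest = hashlib.sha256()
    digest.update(prepare_content(page_text).encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    digest.update(prompt.encode("utf-8"))
    return digest.hexdigest()
```

Under this assumption, any change to either the page content or the prompt produces a new key and therefore a fresh LLM run.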

How to configure it

Pick AI / LLM Extraction as your method and write a prompt:

{
  "method": "prompt",
  "prompt": "Extract the current price and availability status of the product. Return JSON with keys: price, availability."
}

You can optionally add a schema to enforce the output structure:

{
  "method": "prompt",
  "prompt": "Extract the job title, company, location, and salary if listed.",
  "schema": {
    "title": "string",
    "company": "string",
    "location": "string",
    "salary": "string or null"
  }
}

Prompt requirements

Minimum 10 characters, maximum 2,000 characters
Be specific about what keys you want in the output
Mention how to handle missing data (e.g., "use null if not listed")
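
If you generate prompts programmatically, a local pre-check along these lines can catch problems before you save a configuration. It is a minimal sketch: `check_prompt` is a hypothetical helper, the "null" test is a crude heuristic for the missing-data advice, and it assumes Verid counts characters the same way Python's `len` does.

```python
MIN_PROMPT_CHARS = 10      # minimum stated in this guide
MAX_PROMPT_CHARS = 2_000   # maximum stated in this guide


def check_prompt(prompt: str) -> list[str]:
    """Return a list of problems; an empty list means the prompt passed these checks."""
    problems = []
    length = len(prompt)
    if length < MIN_PROMPT_CHARS:
        problems.append(f"too short: {length} < {MIN_PROMPT_CHARS} characters")
    if length > MAX_PROMPT_CHARS:
        problems.append(f"too long: {length} > {MAX_PROMPT_CHARS} characters")
    # Heuristic only: looks for the word "null" as a sign that missing data is handled.
    if "null" not in prompt.lower():
        problems.append("no instruction for missing data (e.g. 'use null if not listed')")
    return problems
```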

Example 1 - Extract a job listing

Goal: Track the details of a job posting on a site with no clean DOM structure.

URL: https://jobs.example.com/position/sr-engineer-backend

Page content (prose):

We're looking for a Senior Backend Engineer to join our platform team.
This is a fully remote role. Salary range: $160,000–$200,000 USD.
You'll work primarily with Go and Kubernetes.

Configuration:

{
  "method": "prompt",
  "prompt": "Extract the job title, whether the role is remote or on-site, the salary range if listed, and the main technologies mentioned. Return JSON with keys: title, remote, salary, technologies.",
  "schema": {
    "title": "string",
    "remote": "boolean",
    "salary": "string or null",
    "technologies": "array of strings"
  }
}

Output:

{
  "title": "Senior Backend Engineer",
  "remote": true,
  "salary": "$160,000–$200,000 USD",
  "technologies": ["Go", "Kubernetes"]
}
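
Downstream code may still want to confirm that a result has the expected shape before using it. The sketch below checks the output above against the schema from this example. It is an illustration only: `schema_errors` is a hypothetical helper, and it understands just the four type descriptors that appear in this example, not whatever descriptor grammar Verid accepts.

```python
import json

# Predicates for the type descriptors used in Example 1 only.
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    "string or null": lambda v: v is None or isinstance(v, str),
    "array of strings": lambda v: isinstance(v, list)
    and all(isinstance(item, str) for item in v),
}


def schema_errors(result: dict, schema: dict) -> list[str]:
    """Compare a result against a schema; return human-readable mismatches."""
    errors = []
    for key, descriptor in schema.items():
        if key not in result:
            errors.append(f"missing key: {key}")
        elif not CHECKS[descriptor](result[key]):
            errors.append(f"{key}: expected {descriptor}, got {result[key]!r}")
    return errors


schema = {
    "title": "string",
    "remote": "boolean",
    "salary": "string or null",
    "technologies": "array of strings",
}
output = json.loads("""
{
  "title": "Senior Backend Engineer",
  "remote": true,
  "salary": "$160,000–$200,000 USD",
  "technologies": ["Go", "Kubernetes"]
}
""")
errors = schema_errors(output, schema)
print(errors)  # prints [] for the output shown above
```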

Example 2 - Summarize a system status page

Goal: Get a concise machine-readable status summary from a status page written in prose.

Page content:

All Systems Operational
API: Operational
Dashboard: Degraded Performance - we are investigating elevated response times
Database: Operational
CDN: Operational

Configuration:

{
  "method": "prompt",
  "prompt": "Read the system status page. Return JSON with keys: overall (one of: operational, degraded, outage) and services (an array where each item has a name and a status field)."
}

Output:

{
  "overall": "degraded",
  "services": [
    { "name": "API", "status": "operational" },
    { "name": "Dashboard", "status": "degraded" },
    { "name": "Database", "status": "operational" },
    { "name": "CDN", "status": "operational" }
  ]
}
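
Once the status is structured, acting on it takes a few lines. The sketch below lists the services that are not operational and derives the worst status from the individual entries. The helper names and the severity ordering (operational, then degraded, then outage) are assumptions for illustration.

```python
import json

# Assumed ordering, from least to most severe.
SEVERITY = {"operational": 0, "degraded": 1, "outage": 2}


def unhealthy_services(report: dict) -> list[str]:
    """Names of services whose status is anything other than operational."""
    return [s["name"] for s in report["services"] if s["status"] != "operational"]


def worst_status(report: dict) -> str:
    """Most severe status among the services; operational if there are none."""
    statuses = [s["status"] for s in report["services"]]
    return max(statuses, key=lambda s: SEVERITY[s], default="operational")


report = json.loads("""
{
  "overall": "degraded",
  "services": [
    { "name": "API", "status": "operational" },
    { "name": "Dashboard", "status": "degraded" },
    { "name": "Database", "status": "operational" },
    { "name": "CDN", "status": "operational" }
  ]
}
""")
print(unhealthy_services(report))                  # ['Dashboard']
print(worst_status(report) == report["overall"])   # True
```

Here the derived value agrees with the `overall` field, even though the banner on the page reads "All Systems Operational".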

Example 3 - Extract featured products from a homepage

Goal: Track which products are currently featured on an e-commerce homepage.

Configuration:

{
  "method": "prompt",
  "prompt": "Find all featured or promoted products on this page. For each product return its name, price, and whether it's marked as on sale. Return an array under the key 'products'.",
  "schema": {
    "products": "array of {name: string, price: string, on_sale: boolean}"
  }
}

Output:

{
  "products": [
    { "name": "Wireless Headphones Pro", "price": "$79.99", "on_sale": true },
    { "name": "USB-C Hub 7-in-1", "price": "$34.99", "on_sale": false },
    { "name": "Mechanical Keyboard TKL", "price": "$129.00", "on_sale": false }
  ]
}
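
Because the schema returns prices as strings, numeric comparisons need a parsing step. The following sketch assumes US-dollar strings like the ones in the output above; `parse_price` is a hypothetical helper and would need adjusting for other currencies or formats.

```python
from decimal import Decimal


def parse_price(price: str) -> Decimal:
    """Convert a string such as "$1,299.00" into a Decimal (US-dollar format assumed)."""
    return Decimal(price.replace("$", "").replace(",", ""))


products = [
    {"name": "Wireless Headphones Pro", "price": "$79.99", "on_sale": True},
    {"name": "USB-C Hub 7-in-1", "price": "$34.99", "on_sale": False},
    {"name": "Mechanical Keyboard TKL", "price": "$129.00", "on_sale": False},
]

on_sale_names = [p["name"] for p in products if p["on_sale"]]
cheapest = min(products, key=lambda p: parse_price(p["price"]))

print(on_sale_names)      # ['Wireless Headphones Pro']
print(cheapest["name"])   # USB-C Hub 7-in-1
```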

Example 4 - Detect sentiment or intent

Goal: Monitor a company's blog for posts that signal pricing or feature changes.

Configuration:

{
  "method": "prompt",
  "prompt": "Read this blog post and return: 1) whether it announces a pricing change (true/false), 2) whether it announces a new feature (true/false), and 3) a one-sentence summary. Keys: pricing_change, feature_announcement, summary."
}

Output:

{
  "pricing_change": false,
  "feature_announcement": true,
  "summary": "The post announces a new AI-powered search feature launching in June."
}
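
A result like this can drive a notification directly. The sketch below builds a one-line alert only when one of the flags is true. `alert_message` is a hypothetical helper, and the strict `is True` comparison is a defensive choice in case a flag ever arrives as a string instead of a boolean.

```python
from typing import Optional


def alert_message(result: dict) -> Optional[str]:
    """Return a short alert line, or None when the post announces nothing relevant."""
    tags = []
    if result.get("pricing_change") is True:
        tags.append("pricing")
    if result.get("feature_announcement") is True:
        tags.append("feature")
    if not tags:
        return None
    return f"[{', '.join(tags)}] {result.get('summary', '')}"


result = {
    "pricing_change": False,
    "feature_announcement": True,
    "summary": "The post announces a new AI-powered search feature launching in June.",
}
print(alert_message(result))
# [feature] The post announces a new AI-powered search feature launching in June.
```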

Example 5 - Handle JavaScript-rendered content

Goal: Extract data from a React or Vue SPA where CSS selectors return nothing because the HTML is empty on load.

When a page is fully client-side rendered, the raw HTML might look like:

<div id="root"></div>

The AI extractor works by reading the page content after rendering - so it can still find and interpret the visible text even when CSS/XPath/Regex would all fail.

Configuration:

{
  "method": "prompt",
  "prompt": "Find the product name, price, and stock level on this page. Return JSON with keys: product, price, stock."
}
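
To judge whether a page falls into this category, you can check how much visible text its raw HTML contains. This is a rough heuristic built on Python's standard `html.parser`, not something Verid requires or performs. The helper names and the 50-character threshold are assumptions.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Text a reader would see in the raw HTML, before any JavaScript runs."""
    parser = _TextCollector()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.chunks)


def looks_client_rendered(raw_html: str, min_chars: int = 50) -> bool:
    """True when the raw HTML carries almost no visible text (threshold is arbitrary)."""
    return len(visible_text(raw_html)) < min_chars


print(looks_client_rendered('<div id="root"></div>'))  # True
```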

Writing good prompts

Be specific about output keys. Don't say "extract the price" - say "return JSON with a key called price."

❌ "Tell me about the product"
✅ "Extract the product name and price. Return JSON with keys: name, price."

Handle missing data explicitly:

"If salary is not listed, use null for the salary field."

Use the schema field for consistency. When you define a schema, the AI validates its output against it and retries if the structure is wrong.
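
The validate-and-retry idea can be sketched as a loop. This is a generic illustration, not Verid's code: `call_model` stands in for whatever function sends the prompt to an LLM, and checking only for required top-level keys is a simplification of real schema validation.

```python
import json
from typing import Callable


def extract_with_retry(
    call_model: Callable[[str], str],  # hypothetical: takes a prompt, returns raw model text
    prompt: str,
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call the model until it returns a JSON object containing every required key."""
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
            continue
        if not isinstance(data, dict):
            last_error = "top-level value is not a JSON object"
            continue
        missing = required_keys - set(data)
        if not missing:
            return data
        last_error = f"missing keys: {sorted(missing)}"
    raise ValueError(f"gave up after {max_attempts} attempts: {last_error}")
```

In a test, `call_model` can be a stub that returns canned strings, which makes the retry behavior easy to exercise without calling a real model.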

Describe the context. The AI reads the whole page - telling it where to look helps:

"In the pricing table on this page, find the Pro plan's monthly price."

Tips

Results are cached - if you test with the same URL and prompt multiple times while the page hasn't changed, you'll get the cached response. Change the prompt slightly to force a fresh run during development.
Use specific keys in your prompt. The AI will use them exactly as named, making downstream processing predictable.
Combine with other methods. You can use Full Page Hash to detect that a page changed, then AI extraction to understand what changed. In most cases this is unnecessary, because Verid's built-in change detection already tracks the fields returned by any extraction method.
Prompt length. Keep prompts under 500 characters when possible - concise prompts get more consistent results than long, elaborate ones.
