How to Use Regex Extraction

Run regular expressions against raw page source to extract versioned strings, counts, prices, or any pattern that appears in the HTML or text.

Verid Guides · 5 min read

What is it?

Regex extraction runs your regular expression patterns against the full raw source of the page, including all HTML markup, script tags, and embedded data. It is the most flexible of the text-matching methods and works even when there is no clean DOM structure to target.

Use it when the value you need is buried in page source, injected by JavaScript as a string literal, or follows a predictable text pattern like a version number or price format.

When to use it

  • Extracting a version number embedded in a script tag (e.g., window.__APP_VERSION__ = "2.4.1")
  • Counting occurrences of a pattern (e.g., how many <loc> tags are in a sitemap)
  • Pulling a formatted value like a date or price from mixed HTML/text content
  • Working with a page that has no clean DOM structure to target with CSS or XPath

How to configure it

Pick Regex as your extraction method, then map field names to regex patterns:

{
  "method": "regex",
  "fields": {
    "version": "/(\\d+\\.\\d+\\.\\d+)/",
    "url_count": "<loc>"
  }
}

Two modes

1. Pattern with capture groups → extracts the first capture group's value

Wrap a pattern in /…/ with optional flags. The value of the first (…) capture group is returned.

{ "version": "/(\\d+\\.\\d+\\.\\d+)/" }

Returns: "1.2.3"

2. Plain string pattern → counts occurrences

A plain string (no /…/ wrapping) is counted across the full page source.

{ "url_count": "<loc>" }

Returns: 42 (number of times <loc> appears)
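
To make the two modes concrete, here is a minimal Python sketch of how a field pattern could be interpreted. It is not Verid's implementation: run_field and FLAG_MAP are hypothetical names, and it assumes that plain strings are counted literally, that a /…/ pattern without a capture group returns a match count, and that a capture pattern with no match returns nothing.

```python
import re

# Hypothetical mapping from /…/flags letters to Python regex flags.
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def run_field(pattern: str, source: str):
    """Sketch of the two modes: capture-group extraction or occurrence count."""
    wrapped = re.fullmatch(r"/(.*)/([a-z]*)", pattern, flags=re.DOTALL)
    if wrapped is None:
        # Mode 2: plain string, counted literally (an assumption).
        return source.count(pattern)

    body, flag_chars = wrapped.groups()
    flags = 0
    for ch in flag_chars:
        flags |= FLAG_MAP.get(ch, 0)  # letters such as "g" have no Python flag
    compiled = re.compile(body, flags)

    if compiled.groups == 0:
        # No capture group: return the number of matches (an assumption).
        return len(compiled.findall(source))

    # Mode 1: value of the first capture group in the first match.
    match = compiled.search(source)
    return match.group(1) if match else None


page = "<p>Build 1.2.3</p><loc>a</loc><loc>b</loc>"
print(run_field(r"/(\d+\.\d+\.\d+)/", page))  # 1.2.3
print(run_field("<loc>", page))               # 2
```

The patterns here are written as Python raw strings, so they use single backslashes; inside a JSON config the same patterns need doubled backslashes.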


Example 1 - Extract a version number from a script tag

Goal: Track the app version injected into the page as a JavaScript variable.

Page source (fragment):

<script>
  window.__CONFIG__ = {
    version: "3.14.2",
    env: "production"
  };
</script>

Configuration:

{
  "method": "regex",
  "fields": {
    "version": "/version: \"(\\d+\\.\\d+\\.\\d+)\"/"
  }
}

Output:

{
  "version": "3.14.2"
}

The capture group (\\d+\\.\\d+\\.\\d+) isolates just the version number from the surrounding text.
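
One way to check this configuration before saving it is to replay it locally. The sketch below uses only Python's standard library and is not part of Verid; it decodes the JSON, strips the /…/ wrapper by hand (this example has no flags), and runs the pattern against the fragment shown above.

```python
import json
import re

# The configuration exactly as it would be written in JSON.
config_text = r'''
{
  "method": "regex",
  "fields": {
    "version": "/version: \"(\\d+\\.\\d+\\.\\d+)\"/"
  }
}
'''

page_source = '''<script>
  window.__CONFIG__ = {
    version: "3.14.2",
    env: "production"
  };
</script>'''

config = json.loads(config_text)
raw_pattern = config["fields"]["version"]

# JSON decoding turns each doubled backslash into a single one.
regex_body = raw_pattern[1:-1]  # drop the leading and trailing "/"

match = re.search(regex_body, page_source)
version = match.group(1) if match else None
print(version)  # 3.14.2
```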


Example 2 - Count sitemap URLs

Goal: Know when a sitemap grows or shrinks (new pages added or removed).

URL: https://example.com/sitemap.xml

File contents (fragment):

<urlset>
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>

Configuration:

{
  "method": "regex",
  "fields": {
    "url_count": "<loc>"
  }
}

Output:

{
  "url_count": 3
}

No capture group - the field returns the number of times the plain string occurs.
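
As a rough local check, the same count can be reproduced in Python and compared with a proper XML parse. This is a sketch only: previous_count is a made-up value standing in for an earlier check, and the fragment has no XML namespace, which real sitemaps usually declare.

```python
import xml.etree.ElementTree as ET

sitemap = """<urlset>
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""

# Plain-string count, as in the configuration above.
url_count = sitemap.count("<loc>")

# Cross-check by parsing the XML and counting <loc> elements.
# (Works here because the fragment declares no namespace.)
parsed_count = sum(1 for _ in ET.fromstring(sitemap).iter("loc"))

previous_count = 2  # hypothetical value from an earlier check
change = url_count - previous_count
print(url_count, parsed_count, change)  # 3 3 1
```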


Example 3 - Extract a price from mixed content

Goal: Pull a price from a page where it's written inline in a paragraph with no dedicated element.

Page HTML:

<p>
  The annual plan is currently priced at <strong>$149.00</strong> per seat,
  billed yearly.
</p>

Configuration:

{
  "method": "regex",
  "fields": {
    "annual_price": "/\\$([\\d,]+\\.\\d{2})/"
  }
}

Output:

{
  "annual_price": "149.00"
}

The $ sign is matched but not captured - only the amount inside (…), digits plus any commas and the decimal point, is returned.
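
The extracted value is a string, and for larger amounts it can contain thousands separators. If you compare prices downstream, a small conversion step helps. The sketch below is illustrative Python, not part of Verid; extract_price is a hypothetical helper name.

```python
import re
from decimal import Decimal
from typing import Optional

PRICE_PATTERN = re.compile(r"\$([\d,]+\.\d{2})")


def extract_price(html: str) -> Optional[Decimal]:
    """Return the first dollar amount in the text as a Decimal, or None."""
    match = PRICE_PATTERN.search(html)
    if match is None:
        return None
    # Remove thousands separators before converting: "1,999.00" -> "1999.00".
    return Decimal(match.group(1).replace(",", ""))


html = "priced at <strong>$149.00</strong> per seat"
print(extract_price(html))                   # 149.00
print(extract_price("Enterprise: $1,999.00"))  # 1999.00
```

Decimal is used instead of float so that currency values keep their exact two decimal places.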


Example 4 - Extract a date from a rendered string

Goal: Track the "Last updated" date on a policy or terms page.

Page HTML:

<footer class="doc-meta">
  Last updated: March 15, 2026
</footer>

Configuration:

{
  "method": "regex",
  "fields": {
    "last_updated": "/Last updated: ([A-Za-z]+ \\d{1,2}, \\d{4})/"
  }
}

Output:

{
  "last_updated": "March 15, 2026"
}
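
The returned value is plain text. To compare dates or sort by them, it can be parsed after extraction. A minimal Python sketch, assuming English month names and the default locale; parse_last_updated is a hypothetical helper, not a Verid function.

```python
import re
from datetime import date, datetime
from typing import Optional

PATTERN = re.compile(r"Last updated: ([A-Za-z]+ \d{1,2}, \d{4})")


def parse_last_updated(html: str) -> Optional[date]:
    """Extract the 'Last updated' text and convert it to a date object."""
    match = PATTERN.search(html)
    if match is None:
        return None
    # %B is the full month name; it is locale-dependent (English by default).
    return datetime.strptime(match.group(1), "%B %d, %Y").date()


footer = '<footer class="doc-meta">\n  Last updated: March 15, 2026\n</footer>'
print(parse_last_updated(footer))  # 2026-03-15
```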

Example 5 - Count keywords or links

Goal: Track how many external links appear on a page.

Configuration:

{
  "method": "regex",
  "fields": {
    "external_links": "href=\"https://"
  }
}

Output:

{
  "external_links": 17
}

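Plain string, no capture group - returns a count. The count includes every absolute https:// link, so same-site links written as full URLs are counted too, and external links that use http:// are not.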
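
To see what this plain-string count does and does not include, the sketch below compares it with a host-aware count in Python. The sample HTML and the count_links helper are made up for illustration, and the host filtering happens outside of Verid.

```python
import re
from urllib.parse import urlparse


def count_links(html: str, own_host: str):
    """Return (all absolute https links, links pointing to other hosts)."""
    urls = re.findall(r'href="(https://[^"]+)"', html)
    external = [u for u in urls if urlparse(u).hostname != own_host]
    return len(urls), len(external)


html = (
    '<a href="https://example.com/pricing">Pricing</a>'
    '<a href="https://docs.python.org/3/">Python docs</a>'
    '<a href="https://regex101.com/">regex101</a>'
    '<a href="/about">About</a>'
)

plain_count = html.count('href="https://')  # what the plain-string field counts
total, external = count_links(html, own_host="example.com")
print(plain_count, total, external)  # 3 3 2
```

The plain-string count is 3 because the absolute link back to example.com is included, while only 2 links actually leave the site.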


Regex pattern reference

Pattern                 What it matches
\\d                     Any digit (0–9)
\\d+                    One or more digits
\\d+\\.\\d+\\.\\d+      Semantic version like 1.2.3
[A-Za-z]+               One or more letters
\\$[\\d,]+\\.\\d{2}     A price like $1,999.00
(…)                     Capture group - what gets returned
/pattern/i              Case-insensitive match
/pattern/g              Global match (used for counting without capture groups)

Tips

  • Double-escape backslashes in JSON. Regex \d becomes \\d inside a JSON string. Write \\d+\\.\\d+ not \d+\.\d+.
  • Use capture groups to isolate the value. Without a capture group the field returns a count; with one, it returns the text matched by the first group.
  • Test your regex before saving. Paste a fragment of the page source into regex101.com and verify the match.
  • Regex runs on raw source. It sees the full HTML including tags, attributes, and script contents - which makes it great for embedded data, but your pattern needs to be specific enough to avoid accidental matches.
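
The double-escaping tip can be verified with any JSON parser. The short Python sketch below shows that the doubled form decodes to the intended regex, while the single-backslash form is rejected because \d is not a valid JSON escape sequence.

```python
import json

good = r'{"version": "/(\\d+\\.\\d+)/"}'  # backslashes doubled for JSON
bad = r'{"version": "/(\d+\.\d+)/"}'      # single backslashes

# The doubled form decodes to the regex you actually meant.
decoded = json.loads(good)["version"]
print(decoded)  # /(\d+\.\d+)/

# The single-backslash form is not valid JSON at all.
try:
    json.loads(bad)
    bad_is_valid = True
except json.JSONDecodeError as err:
    bad_is_valid = False
    print("Invalid JSON:", err.msg)
```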

Common issues

Problem                            Fix
Returns a count instead of text    Add a capture group (…) around the part you want
Returns 0                          The pattern doesn't match - check the page source and adjust
Getting the wrong match            Make the surrounding context more specific so only one match exists
Backslash errors                   Double-escape in JSON: \\d not \d
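
The "wrong match" case is easiest to see with a page that contains more than one version-like string. In the Python sketch below the page fragment is invented for illustration: the loose pattern picks up a library version from a file name, while adding the surrounding context pins the match to the intended variable.

```python
import re

# Hypothetical page source with two version-like strings.
source = '''<script src="/static/jquery-3.7.1.min.js"></script>
<script>window.__APP_VERSION__ = "2.4.1";</script>'''

# Loose pattern: the first version-like string wins, which is the wrong one.
loose = re.search(r"(\d+\.\d+\.\d+)", source).group(1)

# Specific pattern: the surrounding context leaves only one possible match.
specific = re.search(r'__APP_VERSION__ = "(\d+\.\d+\.\d+)"', source).group(1)

print(loose)     # 3.7.1
print(specific)  # 2.4.1
```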
