How to Use Regex Extraction
Run regular expressions against raw page source to extract version strings, counts, prices, or any other pattern that appears in the HTML or text.
What is it?
Regex extraction runs your regular expression patterns against the full raw source of the page - including all HTML markup, script tags, and embedded data. It's the most flexible text-matching method and works even when there's no clean DOM structure to target.
Use it when the value you need is buried in page source, embedded in a script as a string literal, or follows a predictable text pattern like a version number or price format.
When to use it
- Extracting a version number embedded in a script tag (e.g., window.__APP_VERSION__ = "2.4.1")
- Counting occurrences of a pattern (e.g., how many <loc> tags are in a sitemap)
- Pulling a formatted value like a date or price from mixed HTML/text content
- The page has no clean DOM structure to target with CSS or XPath
How to configure it
Pick Regex as your extraction method, then map field names to regex patterns:
{
"method": "regex",
"fields": {
"version": "/(\\d+\\.\\d+\\.\\d+)/",
"url_count": "<loc>"
}
}
Two modes
1. Pattern with capture groups → extracts the first capture group's value
Wrap a pattern in /…/ with optional flags. The value of the first (…) capture group is returned.
{ "version": "/(\\d+\\.\\d+\\.\\d+)/" }
Returns: "1.2.3"
2. Plain string pattern → counts occurrences
A plain string (no /…/ wrapping) is counted across the full page source.
{ "url_count": "<loc>" }
Returns: 42 (number of times <loc> appears)
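The two modes can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not the tool's actual implementation: the function name extract_field is hypothetical, only the i flag is handled, and returning None for a /…/ pattern that has no capture group is an assumption.

```python
import re

def extract_field(source, pattern):
    """Hypothetical model of the two modes: /regex/ with a capture group, or plain string."""
    wrapped = re.fullmatch(r"/(.+)/([a-z]*)", pattern, flags=re.DOTALL)
    if wrapped is None:
        # Mode 2: plain string, so count literal occurrences in the source.
        return source.count(pattern)
    body, flags = wrapped.groups()
    compiled = re.compile(body, re.IGNORECASE if "i" in flags else 0)
    match = compiled.search(source)
    if match is None or compiled.groups == 0:
        # Assumption: nothing to return without a match or a capture group.
        return None
    # Mode 1: return the text of the first capture group.
    return match.group(1)

source = "app 1.2.3 <loc>a</loc> <loc>b</loc>"
print(extract_field(source, r"/(\d+\.\d+\.\d+)/"))  # 1.2.3
print(extract_field(source, "<loc>"))               # 2
```

Note that the Python raw string holds the pattern as it looks after JSON decoding, so backslashes appear once rather than twice.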
Example 1 - Extract a version number from a script tag
Goal: Track the app version injected into the page as a JavaScript variable.
Page source (fragment):
<script>
window.__CONFIG__ = {
version: "3.14.2",
env: "production"
};
</script>
Configuration:
{
"method": "regex",
"fields": {
"version": "/version: \"(\\d+\\.\\d+\\.\\d+)\"/"
}
}
Output:
{
"version": "3.14.2"
}
The capture group (\\d+\\.\\d+\\.\\d+) isolates just the version number from the surrounding text.
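If you want to check a pattern like this before saving it, a quick local test is one option. The snippet below is a sketch using Python's re module, which may differ in small ways from the engine the tool uses; the variable names are illustrative.

```python
import re

page_source = """<script>
window.__CONFIG__ = {
version: "3.14.2",
env: "production"
};
</script>"""

# The pattern as it looks after JSON decoding, without the /…/ delimiters.
pattern = r'version: "(\d+\.\d+\.\d+)"'

match = re.search(pattern, page_source)
version = match.group(1) if match else None
print(version)  # 3.14.2
```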
Example 2 - Count sitemap URLs
Goal: Know when a sitemap grows or shrinks (new pages added or removed).
URL: https://example.com/sitemap.xml
File contents (fragment):
<urlset>
<url><loc>https://example.com/</loc></url>
<url><loc>https://example.com/about</loc></url>
<url><loc>https://example.com/pricing</loc></url>
</urlset>
Configuration:
{
"method": "regex",
"fields": {
"url_count": "<loc>"
}
}
Output:
{
"url_count": 3
}
There is no capture group, so the plain string <loc> is counted wherever it occurs. Closing </loc> tags do not match, so each URL is counted once.
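The same count can be reproduced locally with a literal substring count. This is a minimal sketch in Python, assuming the tool counts non-overlapping literal occurrences the way str.count does.

```python
sitemap = """<urlset>
<url><loc>https://example.com/</loc></url>
<url><loc>https://example.com/about</loc></url>
<url><loc>https://example.com/pricing</loc></url>
</urlset>"""

# "<loc>" does not occur inside "</loc>", so closing tags are not counted.
url_count = sitemap.count("<loc>")
print(url_count)  # 3
```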
Example 3 - Extract a price from mixed content
Goal: Pull a price from a page where it's written inline in a paragraph with no dedicated element.
Page HTML:
<p>
The annual plan is currently priced at <strong>$149.00</strong> per seat,
billed yearly.
</p>
Configuration:
{
"method": "regex",
"fields": {
"annual_price": "/\\$([\\d,]+\\.\\d{2})/"
}
}
Output:
{
"annual_price": "149.00"
}
The $ sign is matched but not captured - only the number inside (…) is returned, including any thousands separators and the decimal point.
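Because the captured value is text, anything that compares prices downstream may want to turn it into a number first. The sketch below shows one way to do that in Python; the conversion step is an illustration and not something the extraction itself performs.

```python
import re
from decimal import Decimal

html = ('<p>The annual plan is currently priced at <strong>$149.00</strong> '
        'per seat, billed yearly.</p>')

# The pattern as it looks after JSON decoding, without the /…/ delimiters.
pattern = r"\$([\d,]+\.\d{2})"

match = re.search(pattern, html)
raw_price = match.group(1) if match else None
print(raw_price)  # 149.00

# Strip thousands separators before converting, e.g. "1,999.00" -> 1999.00.
price = Decimal(raw_price.replace(",", "")) if raw_price else None
```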
Example 4 - Extract a date from a rendered string
Goal: Track the "Last updated" date on a policy or terms page.
Page HTML:
<footer class="doc-meta">
Last updated: March 15, 2026
</footer>
Configuration:
{
"method": "regex",
"fields": {
"last_updated": "/Last updated: ([A-Za-z]+ \\d{1,2}, \\d{4})/"
}
}
Output:
{
"last_updated": "March 15, 2026"
}
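The pattern accepts any word in the month position, so it can be worth validating the captured text as a real date. Here is a small sketch in Python, assuming English month names and a locale where %B matches them.

```python
import re
from datetime import datetime

html = """<footer class="doc-meta">
Last updated: March 15, 2026
</footer>"""

# The pattern as it looks after JSON decoding, without the /…/ delimiters.
pattern = r"Last updated: ([A-Za-z]+ \d{1,2}, \d{4})"

match = re.search(pattern, html)
raw_date = match.group(1) if match else None

# strptime raises ValueError if the captured word is not a month name.
last_updated = datetime.strptime(raw_date, "%B %d, %Y").date() if raw_date else None
print(last_updated)  # 2026-03-15
```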
Example 5 - Count keywords or links
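Goal: Track how many absolute https:// links appear on a page, as a rough proxy for external links. The count also includes absolute links back to the same site and misses http:// links.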
Configuration:
{
"method": "regex",
"fields": {
"external_links": "href=\"https://"
}
}
Output:
{
"external_links": 17
}
Plain string, no capture group - returns a count.
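To see what this count does and does not include, the sketch below compares the literal count with a stricter host-based count in Python. The sample HTML and the site host example.com are made up for illustration, and the host filtering is not something the plain-string mode does.

```python
import re
from urllib.parse import urlsplit

# Hypothetical page on example.com with one same-site absolute link.
html = (
    '<a href="https://example.com/about">About</a>'
    '<a href="https://partner.example.org/">Partner</a>'
    '<a href="/pricing">Pricing</a>'
    '<a href="https://docs.example.net/guide">Docs</a>'
)

# What the plain-string pattern counts: every absolute https link.
literal_count = html.count('href="https://')
print(literal_count)  # 3

# A stricter count that excludes links pointing back to the same host.
hosts = [urlsplit(url).hostname
         for url in re.findall(r'href="(https://[^"]+)"', html)]
external_count = sum(1 for host in hosts if host != "example.com")
print(external_count)  # 2
```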
Regex pattern reference
| Pattern | What it matches |
|---|---|
| \\d | Any digit (0–9) |
| \\d+ | One or more digits |
| \\d+\\.\\d+\\.\\d+ | Semantic version like 1.2.3 |
| [A-Za-z]+ | One or more letters |
| \\$[\\d,]+\\.\\d{2} | A price like $1,999.00 |
| (…) | Capture group - what gets returned |
| /pattern/i | Case-insensitive match |
| /pattern/g | Global match (used for counting without capture groups) |
Tips
- Double-escape backslashes in JSON. Regex \d becomes \\d inside a JSON string. Write \\d+\\.\\d+ not \d+\.\d+.
- Use capture groups to isolate the value. Without a capture group the field returns a count; with one, it returns the matched text.
- Test your regex before saving. Paste a fragment of the page source into regex101.com and verify the match.
- Regex runs on raw source. It sees the full HTML including tags, attributes, and script contents - which makes it great for embedded data, but your pattern needs to be specific enough to avoid accidental matches.
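The double-escaping tip is easy to verify with any JSON parser. The sketch below uses Python's json module to show that \\d in the config decodes to the regex \d, and that a single backslash is rejected as invalid JSON.

```python
import json
import re

# Raw string, so the two backslashes are exactly what the JSON file contains.
config_text = r'{"version": "/(\\d+\\.\\d+\\.\\d+)/"}'
fields = json.loads(config_text)
print(fields["version"])  # /(\d+\.\d+\.\d+)/

# After decoding, the pattern body is an ordinary regex.
body = fields["version"].strip("/")
decoded_match = re.search(body, "build v1.2.3").group(1)

# A single backslash before d is not a valid JSON escape sequence.
try:
    json.loads(r'{"version": "/(\d+)/"}')
    single_escape_ok = True
except json.JSONDecodeError:
    single_escape_ok = False
print(single_escape_ok)  # False
```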
Common issues
| Problem | Fix |
|---|---|
| Returns a count instead of text | Add a capture group (…) around the part you want |
| Returns 0 | The pattern doesn't match - check the page source and adjust |
| Getting the wrong match | Make the surrounding context more specific so only one match exists |
| Backslash errors | Double-escape in JSON: \\d not \d |
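The "wrong match" case usually comes from a pattern that fits more than one place in the source. The sketch below illustrates this in Python with a made-up page fragment; Python's re.search returns the first match, and which match the tool picks is not stated here, so anchoring the pattern to unique surrounding text is the safer fix.

```python
import re

# Hypothetical source with two version-like strings.
page_source = """<script src="/static/jquery-3.7.1.min.js"></script>
<script>window.__APP_VERSION__ = "2.4.1";</script>"""

# A generic pattern fits both places; re.search picks the library version.
generic = re.search(r"(\d+\.\d+\.\d+)", page_source).group(1)
print(generic)  # 3.7.1

# Adding surrounding context leaves only one possible match.
specific = re.search(r'__APP_VERSION__ = "(\d+\.\d+\.\d+)"', page_source).group(1)
print(specific)  # 2.4.1
```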