What this page covers (and why it matters)
Most “PHP scraping” examples work on a single clean page. Real web crawling is different: you’ll see malformed HTML, inconsistent URLs, anti-bot defenses, and content that moves into JSON payloads or JSON-LD. The techniques below focus on repeatable crawling — code patterns that reduce breakage over time.
A simple crawler pipeline (mental model)
1. Download: fetch HTML/JSON with cURL (headers, cookies, timeouts, retries).
2. Parse: DOMDocument/XPath for HTML, json_decode for APIs and JSON-LD.
3. Extract + normalize: turn pages into stable fields (absolute URLs, canonical IDs, clean text).
4. Store + monitor: save structured outputs, detect drift, alert on failures and anomalies.
If you need a downloader function, see: Downloading a Webpage Using PHP and cURL.
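Tied together, the loop can be as small as the sketch below. downloadPage(), extractLinks(), and saveRecord() are placeholder names for your own downloader, extractor, and storage code, not functions defined on this page:
// Minimal crawl loop (sketch): download, parse/extract, store, repeat politely.
$queue = ['https://www.example.com/'];   // seed URL (example)
$seen  = [];                             // normalized URLs already visited

while (!empty($queue)) {
    $url = array_shift($queue);
    if (isset($seen[$url])) { continue; }
    $seen[$url] = true;

    $html = downloadPage($url);                      // 1. Download
    if ($html === null) { continue; }                // log repeated failures for monitoring

    foreach (extractLinks($html, $url) as $next) {   // 2. Parse + 3. Extract/normalize
        if ($next !== null && !isset($seen[$next])) {
            $queue[] = $next;
        }
    }

    saveRecord($url, $html);                         // 4. Store + monitor
    usleep(random_int(500000, 1500000));             // politeness delay: 0.5 to 1.5 seconds
}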
Extract links from a page in PHP (DOMDocument + safer parsing)
DOMDocument is a solid baseline for HTML parsing in PHP. The main “gotcha” is that real-world HTML is often invalid. Use libxml internal errors so broken markup doesn’t explode your crawl.
// Assuming your HTML is in $contents and the page URL is in $pageUrl
libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);
$links = $document->getElementsByTagName('a');
foreach ($links as $node) {
    $href = trim($node->getAttribute('href'));
    if ($href === '') { continue; }
    $text = trim($node->textContent);
    // OPTIONAL: normalize to absolute URLs + strip fragments
    // $absolute = normalizeUrl($href, $pageUrl);
    // $absolute = preg_replace('/#.*$/', '', $absolute);
    // Use $href / $text (or $absolute) as needed
}
libxml_clear_errors();
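If you need more targeted selection than getElementsByTagName, DOMXPath can query the same parsed $document. A short sketch (the //article//a[@href] query is only an example; adjust it to the markup you're actually targeting):
// Reuse the $document parsed above; DOMXPath works on the same tree
$xpath = new DOMXPath($document);
$nodes = $xpath->query('//article//a[@href]');   // example: links inside <article> elements
foreach ($nodes as $node) {
    $href = trim($node->getAttribute('href'));
    $text = trim($node->textContent);
    // Normalize and enqueue $href just like the links above
}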
Normalize URLs (relative → absolute, dedupe, clean)
URL normalization is one of the easiest ways to improve crawl quality. It reduces duplicate downloads and prevents missing pages due to relative paths. This helper resolves relative links against the current page URL.
function normalizeUrl($maybeRelative, $baseUrl) {
    $maybeRelative = trim($maybeRelative);

    // Ignore anchors, javascript:, mailto:, tel:
    if ($maybeRelative === '' ||
        str_starts_with($maybeRelative, '#') ||
        preg_match('/^(javascript:|mailto:|tel:)/i', $maybeRelative)) {
        return null;
    }

    // If it's already absolute, return it (minus fragment)
    if (preg_match('/^https?:\/\//i', $maybeRelative)) {
        return preg_replace('/#.*$/', '', $maybeRelative);
    }

    $base = parse_url($baseUrl);
    if (!$base || empty($base['scheme']) || empty($base['host'])) { return null; }

    $scheme = $base['scheme'];
    $host   = $base['host'];
    $port   = isset($base['port']) ? ':' . $base['port'] : '';
    $path   = isset($base['path']) ? $base['path'] : '/';

    // Build base directory (strip filename)
    $dir = preg_replace('/\/[^\/]*$/', '/', $path);

    // Handle protocol-relative URLs: //cdn.example.com/x
    if (str_starts_with($maybeRelative, '//')) {
        return preg_replace('/#.*$/', '', $scheme . ':' . $maybeRelative);
    }

    // Root-relative: /x/y
    if (str_starts_with($maybeRelative, '/')) {
        $abs = $scheme . '://' . $host . $port . $maybeRelative;
        return preg_replace('/#.*$/', '', $abs);
    }

    // Path-relative: x/y
    $abs = $scheme . '://' . $host . $port . $dir . $maybeRelative;

    // Clean /./ and empty // segments; the (?<!:) lookbehind keeps the "//" in "https://" intact
    $abs = preg_replace('#(?<!:)/\.?/#', '/', $abs);

    // Collapse /segment/../ pairs (simple normalization)
    while (preg_match('#/(?!\.\.)[^/]+/\.\./#', $abs)) {
        $abs = preg_replace('#/(?!\.\.)[^/]+/\.\./#', '/', $abs);
    }

    return preg_replace('/#.*$/', '', $abs);
}
- Deduplicate: store normalized URLs in a set (hash map) before enqueueing.
- Canonicalize: optionally strip tracking parameters (utm_*) if they cause duplicates.
- Scope control: restrict the crawl to a domain or path prefix to prevent an "infinite crawl." (All three checks are combined in the sketch below.)
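Here is one way to combine the three checks before enqueueing. canonicalizeForQueue(), $allowedHost, and $pathPrefix are names introduced here for illustration, not part of any library:
// Sketch: canonicalize (drop utm_* parameters), enforce scope, and deduplicate.
// Returns the canonical URL to enqueue, or null if it should be skipped.
function canonicalizeForQueue($url, array &$seen, $allowedHost, $pathPrefix = '/') {
    if ($url === null) { return null; }

    $parts = parse_url($url);
    if (!$parts || empty($parts['scheme']) || empty($parts['host'])) { return null; }

    // Scope control: stay on one host and under a path prefix
    if (strcasecmp($parts['host'], $allowedHost) !== 0) { return null; }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    if (!str_starts_with($path, $pathPrefix)) { return null; }

    // Canonicalize: drop utm_* tracking parameters from the query string
    $query = '';
    if (!empty($parts['query'])) {
        parse_str($parts['query'], $params);
        foreach (array_keys($params) as $key) {
            if (str_starts_with($key, 'utm_')) { unset($params[$key]); }
        }
        $query = http_build_query($params);
    }
    $canonical = $parts['scheme'] . '://' . $parts['host']
               . (isset($parts['port']) ? ':' . $parts['port'] : '')
               . $path . ($query !== '' ? '?' . $query : '');

    // Deduplicate: a hash map doubles as a set
    if (isset($seen[$canonical])) { return null; }
    $seen[$canonical] = true;
    return $canonical;
}
// Usage: $ready = canonicalizeForQueue($absolute, $seen, 'www.example.com', '/blog/');
//        if ($ready !== null) { $queue[] = $ready; }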
Extract images from a webpage in PHP
For images, you usually care about src, alt, and sometimes srcset (responsive images).
Like links, image URLs often need normalization.
// Assuming your HTML is in $contents and current URL is $pageUrl
libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);
$imgs = $document->getElementsByTagName('img');
foreach ($imgs as $node) {
    $src    = trim($node->getAttribute('src'));
    $alt    = trim($node->getAttribute('alt'));
    $srcset = trim($node->getAttribute('srcset'));
    // OPTIONAL: normalize URL
    // $srcAbs = normalizeUrl($src, $pageUrl);
    // Store or process: $src / $alt / $srcset
}
libxml_clear_errors();
Lazy-loading tip: if src is empty, check attributes like data-src, data-original, or data-lazy.
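A small sketch of that fallback, meant for the body of the foreach loop above (the attribute list is an example; lazy-loading attribute names vary by site):
// If src is empty, try common lazy-loading attributes before giving up
if ($src === '') {
    foreach (['data-src', 'data-original', 'data-lazy'] as $attr) {
        $candidate = trim($node->getAttribute($attr));
        if ($candidate !== '') { $src = $candidate; break; }
    }
}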
Extract JSON from a webpage (use JSON-LD first)
If you’re crawling e-commerce pages, job pages, or listings, the cleanest data is often embedded as JSON-LD inside <script type="application/ld+json"> tags. Parse that first; regex extraction should be a fallback, not your primary method.
// Returns an array of decoded JSON-LD objects found in the HTML
function extractJsonLd($contents) {
    libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

    $out = [];
    $scripts = $document->getElementsByTagName('script');
    foreach ($scripts as $s) {
        $type = strtolower(trim($s->getAttribute('type')));
        if ($type !== 'application/ld+json') { continue; }

        $raw = trim($s->textContent);
        if ($raw === '') { continue; }

        $decoded = json_decode($raw, true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $out[] = $decoded;
        }
    }
    libxml_clear_errors();
    return $out;
}
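Once decoded, filter for the schema.org types you care about. The Product lookup below is only an illustration; real JSON-LD varies (top-level arrays, @graph wrappers, @type given as an array), so treat it as a starting point:
// Example usage: pull Product entities out of the decoded JSON-LD blocks.
$blocks = extractJsonLd($contents);
foreach ($blocks as $block) {
    // Some sites wrap multiple entities in an @graph array
    $entities = isset($block['@graph']) ? $block['@graph'] : [$block];
    foreach ($entities as $entity) {
        if (!is_array($entity) || !isset($entity['@type'])) { continue; }
        if ($entity['@type'] === 'Product') {   // note: @type can also be an array
            $name  = isset($entity['name']) ? $entity['name'] : null;
            $price = isset($entity['offers']['price']) ? $entity['offers']['price'] : null; // offer shapes vary
            // Store $name / $price as structured fields
        }
    }
}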
// Regex-based JSON extraction is fragile; prefer JSON-LD when available.
function extractJsonFallback($contents) {
    $matches = [];
    // Recursive pattern (?R): matches balanced {...} blocks, including nested braces
    preg_match_all('/\{(?:[^{}]|(?R))*\}/', $contents, $matches);

    $valid = [];
    foreach ($matches[0] as $candidate) {
        $decoded = json_decode($candidate, true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $valid[] = $decoded;
        }
    }
    return $valid;
}
Avoid getting blocked (the responsible ladder)
If your PHP crawler is getting blocked, start by assuming the site is protecting itself from high request volume or “bot-like” behavior. You’ll get further by being polite and consistent than by immediately rotating infrastructure.
1. Rate limit + backoff: slow down, add random jitter, and use exponential backoff on 429/503 responses (see the politeDownload() sketch below).
2. Send realistic headers: set User-Agent, Accept, and Accept-Language, and maintain cookies if the site expects sessions.
3. Cache and avoid re-downloading: store responses and use ETags/If-Modified-Since when possible to reduce load.
4. Robots + scope control: check robots.txt and keep your frontier bounded (domain/path rules) to prevent accidental hammering.
5. Infrastructure choices (when permitted): for recurring crawls, a stable IP strategy can improve reliability and debugging more than constant rotation.
// If you are permitted to use a proxy for your use case:
curl_setopt($handle, CURLOPT_PROXY, $proxyIpAddress);
curl_setopt($handle, CURLOPT_PROXYPORT, $proxyPort);
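For steps 1 and 2, a download helper that sends realistic headers and retries with exponential backoff plus jitter might look like the sketch below. The header values, timeouts, and cookie-jar path are example choices, and politeDownload() is a name introduced here, not an established API:
// Sketch: polite download with realistic headers, a cookie jar, and
// exponential backoff + random jitter on 429/503 responses.
function politeDownload($url, $maxAttempts = 4) {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($handle, CURLOPT_TIMEOUT, 30);
        curl_setopt($handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyCrawler/1.0)');
        curl_setopt($handle, CURLOPT_HTTPHEADER, [
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: en-US,en;q=0.9',
        ]);
        curl_setopt($handle, CURLOPT_COOKIEFILE, '/tmp/crawler-cookies.txt'); // read cookies
        curl_setopt($handle, CURLOPT_COOKIEJAR, '/tmp/crawler-cookies.txt');  // persist cookies

        $body = curl_exec($handle);
        $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        curl_close($handle);

        if ($body !== false && $code > 0 && $code < 400) {
            return $body;
        }
        if ($code === 429 || $code === 503) {
            // Exponential backoff with jitter: roughly 2s, 4s, 8s...
            $delaySeconds = (2 ** $attempt) + random_int(0, 1000) / 1000;
            usleep((int) ($delaySeconds * 1000000));
            continue;
        }
        return null; // other failures: log them and move on
    }
    return null;
}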
Questions About PHP Web Crawlers
Common questions we see when teams build PHP web crawlers for link discovery, e-commerce extraction, monitoring, and recurring data collection.
What’s the best way to extract data in a PHP web crawler?
For HTML pages, start with DOMDocument and (when needed) XPath selectors. For structured fields on modern sites, look for JSON-LD or API endpoints that return clean JSON. The “best” method depends on how stable the page is.
Why do my DOMDocument extractions break on real pages?
Real-world HTML is frequently invalid. Enable libxml internal errors and treat parsing as “best effort.” If you need precision, add XPath rules and fallback selectors rather than relying on a single brittle pattern.
How do I handle relative links when crawling a site?
Normalize every discovered URL against the current page URL (relative → absolute), remove fragments, and deduplicate before enqueueing. Without normalization, crawlers miss pages and waste time re-downloading duplicates.
What should I do if my PHP crawler gets blocked?
Start with politeness and stability: rate limiting, caching, backoff on errors, realistic headers, and session handling. Proxies are sometimes useful, but they’re not the first or best fix for most durable crawlers.
Can Potent Pages build and maintain this as a managed pipeline?
Yes. We build production crawlers that run on schedules, monitor for breakage, and deliver structured outputs (CSV, database exports, API feeds, dashboards).
Want this to run in production (monitoring included)?
If your PHP crawler is feeding a business process, research workflow, or recurring monitoring job, consider designing it like infrastructure: stable schemas, quality checks, and alerts when extraction breaks.
