What this page covers (and why it matters)
Most “PHP scraping” examples work on a single clean page. Real web crawling is different: you’ll see malformed HTML, inconsistent URLs, anti-bot defenses, and content that moves into JSON payloads or JSON-LD. The techniques below focus on repeatable crawling — code patterns that reduce breakage over time.
A simple crawler pipeline (mental model)
1. Download: fetch HTML/JSON with cURL (headers, cookies, timeouts, retries).
2. Parse: DOMDocument/XPath for HTML, json_decode for APIs and JSON-LD.
3. Extract + normalize: turn pages into stable fields (absolute URLs, canonical IDs, clean text).
4. Store + monitor: save structured outputs, detect drift, alert on failures and anomalies.
If you need a downloader function, see: Downloading a Webpage Using PHP and cURL.
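Tied together, the loop can be as small as the sketch below. downloadPage(), extractLinks(), and saveRecord() are placeholder names for your own downloader, extractor, and storage code, not functions defined on this page:
// Minimal crawl loop (sketch): download, parse/extract, store, repeat politely.
$queue = ['https://www.example.com/'];   // seed URL (example)
$seen  = [];                             // normalized URLs already visited

while (!empty($queue)) {
    $url = array_shift($queue);
    if (isset($seen[$url])) { continue; }
    $seen[$url] = true;

    $html = downloadPage($url);                      // 1. Download
    if ($html === null) { continue; }                // log repeated failures for monitoring

    foreach (extractLinks($html, $url) as $next) {   // 2. Parse + 3. Extract/normalize
        if ($next !== null && !isset($seen[$next])) {
            $queue[] = $next;
        }
    }

    saveRecord($url, $html);                         // 4. Store + monitor
    usleep(random_int(500000, 1500000));             // politeness delay: 0.5 to 1.5 seconds
}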
Extract links from a page in PHP (DOMDocument + safer parsing)
DOMDocument is a solid baseline for HTML parsing in PHP. The main “gotcha” is that real-world HTML is often invalid. Use libxml internal errors so broken markup doesn’t explode your crawl.
// Assuming your HTML is in $contents and the page URL is in $pageUrl
libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);
$links = $document->getElementsByTagName('a');
foreach ($links as $node) {
    $href = trim($node->getAttribute('href'));
    if ($href === '') { continue; }
    $text = trim($node->textContent);
    // OPTIONAL: normalize to absolute URLs + strip fragments
    // $absolute = normalizeUrl($href, $pageUrl);
    // $absolute = preg_replace('/#.*$/', '', $absolute);
    // Use $href / $text (or $absolute) as needed
}
libxml_clear_errors();
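If you need more targeted selection than getElementsByTagName, DOMXPath can query the same parsed $document. A short sketch (the //article//a[@href] query is only an example; adjust it to the markup you're actually targeting):
// Reuse the $document parsed above; DOMXPath works on the same tree
$xpath = new DOMXPath($document);
$nodes = $xpath->query('//article//a[@href]');   // example: links inside <article> elements
foreach ($nodes as $node) {
    $href = trim($node->getAttribute('href'));
    $text = trim($node->textContent);
    // Normalize and enqueue $href just like the links above
}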
Normalize URLs (relative → absolute, dedupe, clean)
URL normalization is one of the easiest ways to improve crawl quality. It reduces duplicate downloads and prevents missing pages due to relative paths. This helper resolves relative links against the current page URL.
function normalizeUrl($maybeRelative, $baseUrl) {
    $maybeRelative = trim($maybeRelative);

    // Ignore anchors, javascript:, mailto:, tel:
    if ($maybeRelative === '' ||
        str_starts_with($maybeRelative, '#') ||
        preg_match('/^(javascript:|mailto:|tel:)/i', $maybeRelative)) {
        return null;
    }

    // If it's already absolute, return it (minus fragment)
    if (preg_match('/^https?:\/\//i', $maybeRelative)) {
        return preg_replace('/#.*$/', '', $maybeRelative);
    }

    $base = parse_url($baseUrl);
    if (!$base || empty($base['scheme']) || empty($base['host'])) { return null; }

    $scheme = $base['scheme'];
    $host   = $base['host'];
    $port   = isset($base['port']) ? ':' . $base['port'] : '';
    $path   = isset($base['path']) ? $base['path'] : '/';

    // Build base directory (strip filename)
    $dir = preg_replace('/\/[^\/]*$/', '/', $path);

    // Handle protocol-relative URLs: //cdn.example.com/x
    if (str_starts_with($maybeRelative, '//')) {
        return preg_replace('/#.*$/', '', $scheme . ':' . $maybeRelative);
    }

    // Root-relative: /x/y
    if (str_starts_with($maybeRelative, '/')) {
        $abs = $scheme . '://' . $host . $port . $maybeRelative;
        return preg_replace('/#.*$/', '', $abs);
    }

    // Path-relative: x/y
    $abs = $scheme . '://' . $host . $port . $dir . $maybeRelative;

    // Clean /./ and empty // segments; the (?<!:) lookbehind keeps the "//" in "https://" intact
    $abs = preg_replace('#(?<!:)/\.?/#', '/', $abs);

    // Collapse /segment/../ pairs (simple normalization)
    while (preg_match('#/(?!\.\.)[^/]+/\.\./#', $abs)) {
        $abs = preg_replace('#/(?!\.\.)[^/]+/\.\./#', '/', $abs);
    }

    return preg_replace('/#.*$/', '', $abs);
}
- Deduplicate: store normalized URLs in a set (hash map) before enqueueing.
- Canonicalize: optionally strip tracking parameters (utm_*) if they cause duplicates.
- Scope control: restrict the crawl to a domain or path prefix to prevent an "infinite crawl." (All three checks are combined in the sketch below.)
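Here is one way to combine the three checks before enqueueing. canonicalizeForQueue(), $allowedHost, and $pathPrefix are names introduced here for illustration, not part of any library:
// Sketch: canonicalize (drop utm_* parameters), enforce scope, and deduplicate.
// Returns the canonical URL to enqueue, or null if it should be skipped.
function canonicalizeForQueue($url, array &$seen, $allowedHost, $pathPrefix = '/') {
    if ($url === null) { return null; }

    $parts = parse_url($url);
    if (!$parts || empty($parts['scheme']) || empty($parts['host'])) { return null; }

    // Scope control: stay on one host and under a path prefix
    if (strcasecmp($parts['host'], $allowedHost) !== 0) { return null; }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    if (!str_starts_with($path, $pathPrefix)) { return null; }

    // Canonicalize: drop utm_* tracking parameters from the query string
    $query = '';
    if (!empty($parts['query'])) {
        parse_str($parts['query'], $params);
        foreach (array_keys($params) as $key) {
            if (str_starts_with($key, 'utm_')) { unset($params[$key]); }
        }
        $query = http_build_query($params);
    }
    $canonical = $parts['scheme'] . '://' . $parts['host']
               . (isset($parts['port']) ? ':' . $parts['port'] : '')
               . $path . ($query !== '' ? '?' . $query : '');

    // Deduplicate: a hash map doubles as a set
    if (isset($seen[$canonical])) { return null; }
    $seen[$canonical] = true;
    return $canonical;
}
// Usage: $ready = canonicalizeForQueue($absolute, $seen, 'www.example.com', '/blog/');
//        if ($ready !== null) { $queue[] = $ready; }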
Extract images from a webpage in PHP
For images, you usually care about src, alt, and sometimes srcset (responsive images).
Like links, image URLs often need normalization.
// Assuming your HTML is in $contents and current URL is $pageUrl
libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);
$imgs = $document->getElementsByTagName('img');
foreach ($imgs as $node) {
    $src    = trim($node->getAttribute('src'));
    $alt    = trim($node->getAttribute('alt'));
    $srcset = trim($node->getAttribute('srcset'));
    // OPTIONAL: normalize URL
    // $srcAbs = normalizeUrl($src, $pageUrl);
    // Store or process: $src / $alt / $srcset
}
libxml_clear_errors();
Lazy-loading tip: if src is empty, check attributes like data-src, data-original, or data-lazy.
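A small sketch of that fallback, meant for the body of the foreach loop above (the attribute list is an example; lazy-loading attribute names vary by site):
// If src is empty, try common lazy-loading attributes before giving up
if ($src === '') {
    foreach (['data-src', 'data-original', 'data-lazy'] as $attr) {
        $candidate = trim($node->getAttribute($attr));
        if ($candidate !== '') { $src = $candidate; break; }
    }
}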
Extract JSON from a webpage (use JSON-LD first)
If you’re crawling e-commerce pages, job pages, or listings, the cleanest data is often embedded as JSON-LD inside <script type="application/ld+json"> tags. Parse that first; regex extraction should be a fallback, not your primary method.
// Returns an array of decoded JSON-LD objects found in the HTML
function extractJsonLd($contents) {
    libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

    $out = [];
    $scripts = $document->getElementsByTagName('script');
    foreach ($scripts as $s) {
        $type = strtolower(trim($s->getAttribute('type')));
        if ($type !== 'application/ld+json') { continue; }

        $raw = trim($s->textContent);
        if ($raw === '') { continue; }

        $decoded = json_decode($raw, true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $out[] = $decoded;
        }
    }
    libxml_clear_errors();
    return $out;
}
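Once decoded, filter for the schema.org types you care about. The Product lookup below is only an illustration; real JSON-LD varies (top-level arrays, @graph wrappers, @type given as an array), so treat it as a starting point:
// Example usage: pull Product entities out of the decoded JSON-LD blocks.
$blocks = extractJsonLd($contents);
foreach ($blocks as $block) {
    // Some sites wrap multiple entities in an @graph array
    $entities = isset($block['@graph']) ? $block['@graph'] : [$block];
    foreach ($entities as $entity) {
        if (!is_array($entity) || !isset($entity['@type'])) { continue; }
        if ($entity['@type'] === 'Product') {   // note: @type can also be an array
            $name  = isset($entity['name']) ? $entity['name'] : null;
            $price = isset($entity['offers']['price']) ? $entity['offers']['price'] : null; // offer shapes vary
            // Store $name / $price as structured fields
        }
    }
}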
// Regex-based JSON extraction is fragile; prefer JSON-LD when available.
function extractJsonFallback($contents) {
    $matches = [];
    // Recursive pattern (?R): matches balanced {...} blocks, including nested braces
    preg_match_all('/\{(?:[^{}]|(?R))*\}/', $contents, $matches);

    $valid = [];
    foreach ($matches[0] as $candidate) {
        $decoded = json_decode($candidate, true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $valid[] = $decoded;
        }
    }
    return $valid;
}
Avoid getting blocked (the responsible ladder)
If your PHP crawler is getting blocked, start by assuming the site is protecting itself from high request volume or “bot-like” behavior. You’ll get further by being polite and consistent than by immediately rotating infrastructure.
1. Rate limit + backoff: slow down, add random jitter, and use exponential backoff on 429/503 responses (see the politeDownload() sketch below).
2. Send realistic headers: set User-Agent, Accept, and Accept-Language, and maintain cookies if the site expects sessions.
3. Cache and avoid re-downloading: store responses and use ETags/If-Modified-Since when possible to reduce load.
4. Robots + scope control: check robots.txt and keep your frontier bounded (domain/path rules) to prevent accidental hammering.
5. Infrastructure choices (when permitted): for recurring crawls, a stable IP strategy can improve reliability and debugging more than constant rotation.
// If you are permitted to use a proxy for your use case:
curl_setopt($handle, CURLOPT_PROXY, $proxyIpAddress);
curl_setopt($handle, CURLOPT_PROXYPORT, $proxyPort);
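For steps 1 and 2, a download helper that sends realistic headers and retries with exponential backoff plus jitter might look like the sketch below. The header values, timeouts, and cookie-jar path are example choices, and politeDownload() is a name introduced here, not an established API:
// Sketch: polite download with realistic headers, a cookie jar, and
// exponential backoff + random jitter on 429/503 responses.
function politeDownload($url, $maxAttempts = 4) {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($handle, CURLOPT_TIMEOUT, 30);
        curl_setopt($handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyCrawler/1.0)');
        curl_setopt($handle, CURLOPT_HTTPHEADER, [
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: en-US,en;q=0.9',
        ]);
        curl_setopt($handle, CURLOPT_COOKIEFILE, '/tmp/crawler-cookies.txt'); // read cookies
        curl_setopt($handle, CURLOPT_COOKIEJAR, '/tmp/crawler-cookies.txt');  // persist cookies

        $body = curl_exec($handle);
        $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        curl_close($handle);

        if ($body !== false && $code > 0 && $code < 400) {
            return $body;
        }
        if ($code === 429 || $code === 503) {
            // Exponential backoff with jitter: roughly 2s, 4s, 8s...
            $delaySeconds = (2 ** $attempt) + random_int(0, 1000) / 1000;
            usleep((int) ($delaySeconds * 1000000));
            continue;
        }
        return null; // other failures: log them and move on
    }
    return null;
}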
Questions About PHP Web Crawlers
Common questions we see when teams build PHP web crawlers for link discovery, e-commerce extraction, monitoring, and recurring data collection.
What’s the best way to extract data in a PHP web crawler?
For HTML pages, start with DOMDocument and (when needed) XPath selectors. For structured fields on modern sites, look for JSON-LD or API endpoints that return clean JSON. The “best” method depends on how stable the page is.
Why do my DOMDocument extractions break on real pages?
Real-world HTML is frequently invalid. Enable libxml internal errors and treat parsing as “best effort.” If you need precision, add XPath rules and fallback selectors rather than relying on a single brittle pattern.
How do I handle relative links when crawling a site?
Normalize every discovered URL against the current page URL (relative → absolute), remove fragments, and deduplicate before enqueueing. Without normalization, crawlers miss pages and waste time re-downloading duplicates.
What should I do if my PHP crawler gets blocked?
Start with politeness and stability: rate limiting, caching, backoff on errors, realistic headers, and session handling. Proxies are sometimes useful, but they’re not the first or best fix for most durable crawlers.
Can Potent Pages build and maintain this as a managed pipeline?
Yes. We build production crawlers that run on schedules, monitor for breakage, and deliver structured outputs (CSV, database exports, API feeds, dashboards).
Want this to run in production (monitoring included)?
If your PHP crawler is feeding a business process, research workflow, or recurring monitoring job, consider designing it like infrastructure: stable schemas, quality checks, and alerts when extraction breaks.
