
PHP WEB CRAWLER
Development Techniques That Hold Up in Real Crawls

If you’re building a PHP web crawler, the hard part isn’t “extract a tag once” — it’s extracting reliably across messy HTML, changing layouts, relative URLs, and long-running crawl schedules. This guide collects practical techniques we use to build sturdier crawlers: link discovery, image extraction, JSON-LD parsing, URL normalization, and block-avoidance the responsible way.

  • Extract links + images safely
  • Parse JSON-LD for product data
  • Normalize URLs & dedupe
  • Reduce breakage & blocks

What this page covers (and why it matters)

Most “PHP scraping” examples work on a single clean page. Real web crawling is different: you’ll see malformed HTML, inconsistent URLs, anti-bot defenses, and content that moves into JSON payloads or JSON-LD. The techniques below focus on repeatable crawling — code patterns that reduce breakage over time.

Reminder: Only crawl pages you have permission to access and always respect site policies and robots rules. “Avoid getting blocked” should start with being polite (rate limiting, caching, and correct headers) — not trying to overpower a site.

A simple crawler pipeline (mental model)

1. Download: Fetch HTML/JSON with cURL (headers, cookies, timeouts, retries).
2. Parse: DOMDocument/XPath for HTML, json_decode for APIs and JSON-LD.
3. Extract + normalize: Turn pages into stable fields (absolute URLs, canonical IDs, clean text).
4. Store + monitor: Save structured outputs, detect drift, alert on failures and anomalies.

If you need a downloader function, see: Downloading a Webpage Using PHP and cURL.
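
If you just want something minimal to experiment with, here is a bare-bones sketch. The function name, timeout values, and retry count are illustrative placeholders rather than recommended settings.

PHP: Minimal cURL downloader (illustrative)
// Bare-bones downloader: timeouts plus simple retries. All values are placeholders.
function fetchPage($url, $maxRetries = 3) {
    for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
        curl_setopt($handle, CURLOPT_CONNECTTIMEOUT, 10);    // connection timeout (seconds)
        curl_setopt($handle, CURLOPT_TIMEOUT, 30);           // total request timeout (seconds)
        curl_setopt($handle, CURLOPT_ENCODING, '');          // accept gzip/deflate transparently

        $body   = curl_exec($handle);
        $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        curl_close($handle);

        if ($body !== false && $status >= 200 && $status < 300) {
            return $body;
        }

        sleep($attempt); // brief pause before retrying
    }
    return null;
}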

Normalize URLs (relative → absolute, dedupe, clean)

URL normalization is one of the easiest ways to improve crawl quality. It reduces duplicate downloads and prevents missing pages due to relative paths. This helper resolves relative links against the current page URL.

PHP: Normalize a URL against a base URL
// Note: uses str_starts_with(), so PHP 8.0+ is assumed.
function normalizeUrl($maybeRelative, $baseUrl) {
    $maybeRelative = trim($maybeRelative);

    // Ignore anchors, javascript:, mailto:, tel:
    if ($maybeRelative === '' ||
        str_starts_with($maybeRelative, '#') ||
        preg_match('/^(javascript:|mailto:|tel:)/i', $maybeRelative)) {
        return null;
    }

    // If it's already absolute, return it (minus fragment)
    if (preg_match('/^https?:\/\//i', $maybeRelative)) {
        return preg_replace('/#.*$/', '', $maybeRelative);
    }

    $base = parse_url($baseUrl);
    if (!$base || empty($base['scheme']) || empty($base['host'])) { return null; }

    $scheme = $base['scheme'];
    $host   = $base['host'];
    $port   = isset($base['port']) ? ':' . $base['port'] : '';
    $path   = isset($base['path']) ? $base['path'] : '/';

    // Build base directory (strip filename)
    $dir = preg_replace('/\/[^\/]*$/', '/', $path);

    // Handle protocol-relative URLs: //cdn.example.com/x
    if (str_starts_with($maybeRelative, '//')) {
        return preg_replace('/#.*$/', '', $scheme . ':' . $maybeRelative);
    }

    // Root-relative: /x/y
    if (str_starts_with($maybeRelative, '/')) {
        $abs = $scheme . '://' . $host . $port . $maybeRelative;
        return preg_replace('/#.*$/', '', $abs);
    }

    // Path-relative: x/y
    $abs = $scheme . '://' . $host . $port . $dir . $maybeRelative;

    // Clean /./ and /../ segments (simple normalization).
    // Collapse only "/./" here so the "//" in the scheme stays intact.
    while (strpos($abs, '/./') !== false) {
        $abs = str_replace('/./', '/', $abs);
    }
    while (preg_match('#/(?!\.\.)[^/]+/\.\./#', $abs)) {
        $abs = preg_replace('#/(?!\.\.)[^/]+/\.\./#', '/', $abs);
    }

    return preg_replace('/#.*$/', '', $abs);
}
  • Deduplicate: store normalized URLs in a set (hash map) before enqueueing (see the sketch after this list).
  • Canonicalize: optionally strip tracking parameters (utm_*) if they cause duplicates.
  • Scope control: restrict to a domain or path prefix to prevent “infinite crawl.”
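
Putting those pieces together, a sketch of link discovery that normalizes, dedupes against a seen set, and stays on one host might look like this. The $seen/$queue arrays and the host check are illustrative, not a full frontier implementation.

PHP: Link discovery with normalization, dedupe, and scope control (illustrative)
// Assumes $contents holds the page HTML and $pageUrl is its URL (see normalizeUrl above).
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

$seen  = [];  // set of already-enqueued URLs (url => true); persist this across pages in a real crawl
$queue = [];  // crawl frontier

$allowedHost = parse_url($pageUrl, PHP_URL_HOST);

foreach ($document->getElementsByTagName('a') as $a) {
    $abs = normalizeUrl($a->getAttribute('href'), $pageUrl);
    if ($abs === null) { continue; }

    // Scope control: stay on the same host.
    if (parse_url($abs, PHP_URL_HOST) !== $allowedHost) { continue; }

    // Deduplicate before enqueueing.
    if (isset($seen[$abs])) { continue; }
    $seen[$abs] = true;
    $queue[]    = $abs;
}

libxml_clear_errors();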

Extract images from a webpage in PHP

For images, you usually care about src, alt, and sometimes srcset (responsive images). Like links, image URLs often need normalization.

PHP: Extract images (src, alt, srcset)
// Assuming your HTML is in $contents and current URL is $pageUrl
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

$imgs = $document->getElementsByTagName('img');

foreach ($imgs as $node) {
    $src    = trim($node->getAttribute('src'));
    $alt    = trim($node->getAttribute('alt'));
    $srcset = trim($node->getAttribute('srcset'));

    // OPTIONAL: normalize URL
    // $srcAbs = normalizeUrl($src, $pageUrl);

    // Store or process: $src / $alt / $srcset
}

libxml_clear_errors();
Tip: Many sites lazy-load images. If src is empty, check attributes like data-src, data-original, or data-lazy.
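
A short sketch of that fallback, to drop inside the foreach loop above. The attribute names are common lazy-loading conventions, not a standard; each site may use its own.

PHP: Fallback for lazy-loaded image attributes (illustrative)
// If src is empty, try common lazy-load attributes (names vary by site/library).
if ($src === '') {
    foreach (['data-src', 'data-original', 'data-lazy'] as $attr) {
        $candidate = trim($node->getAttribute($attr));
        if ($candidate !== '') {
            $src = $candidate;
            break;
        }
    }
}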

Extract JSON from a webpage (use JSON-LD first)

If you’re crawling e-commerce pages, job pages, or listings, the cleanest data is often embedded as JSON-LD: <script type="application/ld+json">. Parse that first. Regex should be a fallback, not your primary method.

PHP: Extract JSON-LD blocks (recommended)
// Returns an array of decoded JSON-LD objects found in the HTML
function extractJsonLd($contents) {
    libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

    $out = [];
    $scripts = $document->getElementsByTagName('script');

    foreach ($scripts as $s) {
        $type = strtolower(trim($s->getAttribute('type')));
        if ($type !== 'application/ld+json') { continue; }

        $raw = trim($s->textContent);
        if ($raw === '') { continue; }

        $decoded = json_decode($raw, true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $out[] = $decoded;
        }
    }

    libxml_clear_errors();
    return $out;
}
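
Usage sketch: once blocks are decoded, you still need to find the entity you care about. The helper below walks the decoded output of extractJsonLd() looking for schema.org Product entries, handling single objects, lists, and @graph wrappers. The function name and the Product/name/offers fields assume standard schema.org markup.

PHP: Find Product entries in decoded JSON-LD (illustrative)
// Collect schema.org Product entries from the output of extractJsonLd().
function findProducts(array $jsonLdBlocks) {
    $products = [];

    foreach ($jsonLdBlocks as $block) {
        if (!is_array($block)) { continue; }

        // A block may be a single entity, a list of entities, or an @graph wrapper.
        $candidates = isset($block['@graph']) ? $block['@graph']
                    : (isset($block[0]) ? $block : [$block]);
        if (!is_array($candidates)) { continue; }

        foreach ($candidates as $entity) {
            if (!is_array($entity)) { continue; }
            $type  = $entity['@type'] ?? null;
            $types = is_array($type) ? $type : [$type];
            if (in_array('Product', $types, true)) {
                $products[] = $entity;
            }
        }
    }
    return $products;
}

// Example:
// $products = findProducts(extractJsonLd($contents));
// $name  = $products[0]['name'] ?? null;
// $price = $products[0]['offers']['price'] ?? null;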
PHP: Regex JSON fallback (use carefully)
// Regex-based JSON extraction is fragile; prefer JSON-LD when available.
function extractJsonFallback($contents) {
    $matches = [];
    preg_match_all('/\{(?:[^{}]|(?R))*\}/', $contents, $matches);

    $valid = [];
    foreach ($matches[0] as $candidate) {
        $decoded = json_decode($candidate, true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $valid[] = $decoded;
        }
    }
    return $valid;
}
Practical strategy: In production, you’ll get better results by targeting known JSON containers: JSON-LD, “window.__STATE__”, “__NEXT_DATA__”, or documented API endpoints — rather than scanning the entire HTML document with regex.
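
For example, Next.js sites typically embed their page state in a <script id="__NEXT_DATA__" type="application/json"> tag. A targeted sketch is below; the id is a Next.js convention and the payload structure varies between sites and versions, so treat this as an assumption to verify per target.

PHP: Extract __NEXT_DATA__ (targeted, not universal)
// Pull the Next.js page-state JSON if present; returns null when not found or invalid.
function extractNextData($contents) {
    libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

    $xpath = new DOMXPath($document);
    $nodes = $xpath->query('//script[@id="__NEXT_DATA__"]');
    libxml_clear_errors();

    if ($nodes === false || $nodes->length === 0) { return null; }

    $decoded = json_decode(trim($nodes->item(0)->textContent), true);
    return json_last_error() === JSON_ERROR_NONE ? $decoded : null;
}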

Avoid getting blocked (the responsible ladder)

If your PHP crawler is getting blocked, start by assuming the site is protecting itself from high request volume or “bot-like” behavior. You’ll get farther by being polite and consistent than by immediately rotating infrastructure.

1. Rate limit + backoff: Slow down. Add random jitter. Use exponential backoff on 429/503 responses (see the sketch after this list).
2. Send realistic headers: Set User-Agent, Accept, Accept-Language. Maintain cookies if the site expects sessions.
3. Cache and avoid re-downloading: Store responses. Use ETags/If-Modified-Since when possible to reduce load.
4. Robots + scope control: Check robots.txt and keep your frontier bounded (domain/path rules) to prevent accidental hammering.
5. Infrastructure choices (when permitted): For recurring crawls, a stable IP strategy can improve reliability and debugging more than constant rotation.
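
A sketch combining steps 1 through 3: delays with jitter, exponential backoff on 429/503, realistic headers, and a cookie jar. The delay values, retry cap, user agent, and file path are illustrative placeholders; tune them to each site’s tolerance and your permissions.

PHP: Polite fetching with jitter and backoff (illustrative)
// Fetch with a base delay, random jitter, and exponential backoff on 429/503.
function politeFetch($url, $baseDelaySeconds = 2, $maxAttempts = 5) {
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        // Base delay plus random jitter so requests don't land on a fixed cadence.
        sleep($baseDelaySeconds);
        usleep(mt_rand(0, 500000)); // up to 0.5s of jitter

        $handle = curl_init($url);
        curl_setopt_array($handle, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 30,
            CURLOPT_USERAGENT      => 'MyCrawler/1.0 (+https://example.com/contact)', // placeholder
            CURLOPT_HTTPHEADER     => [
                'Accept: text/html,application/xhtml+xml',
                'Accept-Language: en-US,en;q=0.9',
            ],
            CURLOPT_COOKIEJAR      => '/tmp/crawler-cookies.txt', // placeholder path; keeps session cookies
            CURLOPT_COOKIEFILE     => '/tmp/crawler-cookies.txt',
        ]);

        $body   = curl_exec($handle);
        $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        curl_close($handle);

        // Back off exponentially when the site signals overload or rate limiting.
        if ($status === 429 || $status === 503) {
            sleep((int) pow(2, $attempt + 1));
            continue;
        }

        return ($body !== false && $status >= 200 && $status < 300) ? $body : null;
    }
    return null;
}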

PHP: Proxy settings in cURL (only when appropriate)
// If you are permitted to use a proxy for your use case:
curl_setopt($handle, CURLOPT_PROXY, $proxyIpAddress);
curl_setopt($handle, CURLOPT_PROXYPORT, $proxyPort);
Important: Proxies aren’t a magic fix. The best crawlers reduce blocks by behaving well (rate limiting, caching, sessions) and by engineering for durability.

Questions About PHP Web Crawlers

Common questions we see when teams build PHP web crawlers for link discovery, e-commerce extraction, monitoring, and recurring data collection.

What’s the best way to extract data in a PHP web crawler?

For HTML pages, start with DOMDocument and (when needed) XPath selectors. For structured fields on modern sites, look for JSON-LD or API endpoints that return clean JSON. The “best” method depends on how stable the page is.
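
For instance, a DOMDocument + XPath pass over a page might look like the following; the XPath expressions are placeholders, and real selectors depend on the target markup.

PHP: DOMDocument + XPath extraction (illustrative selectors)
// Parse the HTML and pull fields with XPath; the selectors below are examples only.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

$xpath = new DOMXPath($document);

// Example field: page title from the first <h1>.
$titleNodes = $xpath->query('//h1');
$title = $titleNodes->length > 0 ? trim($titleNodes->item(0)->textContent) : null;

// Example field: a price element identified by a (placeholder) class name.
$priceNodes = $xpath->query('//*[contains(@class, "price")]');
$price = $priceNodes->length > 0 ? trim($priceNodes->item(0)->textContent) : null;

libxml_clear_errors();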

Why do my DOMDocument extractions break on real pages?

Real-world HTML is frequently invalid. Enable libxml internal errors and treat parsing as “best effort.” If you need precision, add XPath rules and fallback selectors rather than relying on a single brittle pattern.
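
One simple version of that pattern: try selectors in priority order and take the first non-empty match. The helper name and the example selectors are hypothetical.

PHP: Fallback XPath selectors (illustrative)
// Try a list of XPath expressions in order; return the first non-empty text match.
function firstMatch(DOMXPath $xpath, array $expressions) {
    foreach ($expressions as $expr) {
        $nodes = $xpath->query($expr);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);
            if ($text !== '') { return $text; }
        }
    }
    return null;
}

// Example usage (hypothetical selectors):
// $price = firstMatch($xpath, [
//     '//*[@itemprop="price"]',
//     '//*[contains(@class, "product-price")]',
//     '//meta[@property="product:price:amount"]/@content',
// ]);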

How do I handle relative links when crawling a site?

Normalize every discovered URL against the current page URL (relative → absolute), remove fragments, and deduplicate before enqueueing. Without normalization, crawlers miss pages and waste time re-downloading duplicates.

What should I do if my PHP crawler gets blocked?

Start with politeness and stability: rate limiting, caching, backoff on errors, realistic headers, and session handling. Proxies are sometimes useful, but they’re not the first or best fix for most durable crawlers.

Rule of thumb: If you can’t crawl slowly and reliably, you can’t crawl fast reliably either.

Can Potent Pages build and maintain this as a managed pipeline?

Yes. We build production crawlers that run on schedules, monitor for breakage, and deliver structured outputs (CSV, database exports, API feeds, dashboards).

Want this to run in production (monitoring included)?

If your PHP crawler is feeding a business process, research workflow, or recurring monitoring job, consider designing it like infrastructure: stable schemas, quality checks, and alerts when extraction breaks.

Next step: Send us the target sites, the fields you need, and how often you need updates — we’ll help you scope the fastest path to a durable pipeline.
David Selden-Treiman, Director of Operations at Potent Pages.

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive experience solving problems with software for dozens of clients, and he manages and optimizes dozens of servers for both Potent Pages and other clients.

