
SIMPLE PHP WEB CRAWLER
Download Pages, Follow Links, Store Structured Results

Want to crawl a website with PHP to audit titles and meta descriptions, build a small internal dataset, or power a lightweight monitoring workflow? This step-by-step tutorial shows how to build a basic PHP website crawler (a “spider”) using cURL + DOMDocument, with clean scoping, rate limiting, and a MySQL schema to store results.

  • Discover internal links
  • Extract title / description / H1
  • Store results in MySQL
  • Be polite with delays
Important: Only crawl websites you have permission to access. Always respect robots.txt, rate limits, and terms. (In a follow-up tutorial, you can add an automated robots.txt allow/deny check.)

What you’ll build (in plain English)

You’ll build a small PHP “spider” that starts from a seed URL, downloads HTML pages, extracts a few fields (<title>, meta description, and the first <h1>), discovers new internal links, and repeats until there are no new pages left to crawl.

  • Use case: SEO audits, content inventories, lightweight internal monitoring, quick site mapping.
  • Output: a MySQL table you can query, export, or extend.
  • Scope: same host (no crawling the entire internet).

Prerequisites

  • PHP 7.4+ (8.x works as well)
  • cURL enabled in PHP
  • MySQL / MariaDB access
  • CLI access preferred (so you can run long crawls without browser timeouts)
Keep it simple: We’ll use cURL to download pages and DOMDocument to parse HTML. This is a classic baseline approach for a simple PHP web crawler.

Step 1: Create the database schema

This schema stores each URL path once, remembers where it was discovered, and keeps the extracted fields. You can extend it later to store canonical URLs, status codes, content hashes, or crawl depth.

CREATE DATABASE phpCrawlerTutorial;
USE phpCrawlerTutorial;

CREATE TABLE pages (
  url          VARCHAR(768) NOT NULL,  -- 768 chars keeps the PRIMARY KEY within InnoDB's 3072-byte index limit for utf8mb4
  referer      TEXT,
  http_status  INT NULL,
  title        TEXT,
  description  TEXT,
  h1           TEXT,
  download_time TIMESTAMP NULL DEFAULT NULL,
  PRIMARY KEY (url)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
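
If you later want the extra fields mentioned above (canonical URLs, content hashes, crawl depth), one possible extension looks like this. The column names are illustrative, and the basic crawler below does not use them:

ALTER TABLE pages
  ADD COLUMN canonical_url TEXT NULL,
  ADD COLUMN content_hash  CHAR(64) NULL,  -- e.g. a SHA-256 hash of the body
  ADD COLUMN crawl_depth   INT NULL;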

Step 2: Connect to MySQL

Use your own credentials. If you plan to share this script internally, move secrets into environment variables.

$mysql_host = '';
$mysql_username = '';
$mysql_password = '';
$mysql_database = 'phpCrawlerTutorial';

$mysql_conn = mysqli_connect($mysql_host, $mysql_username, $mysql_password, $mysql_database);
if (!$mysql_conn) {
  echo "Error: Unable to connect to MySQL." . PHP_EOL;
  echo "Debugging errno: " . mysqli_connect_errno() . PHP_EOL;
  echo "Debugging error: " . mysqli_connect_error() . PHP_EOL;
  exit;
}

// Match the table's utf8mb4 charset so mysqli_real_escape_string() and storage behave correctly.
mysqli_set_charset($mysql_conn, 'utf8mb4');
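
If you prefer to keep credentials out of the script entirely, a minimal sketch using environment variables looks like the following. The variable names (MYSQL_HOST, MYSQL_USER, MYSQL_PASS, MYSQL_DB) are just examples; use whatever your deployment defines:

// Read credentials from the environment instead of hard-coding them.
// The environment variable names here are examples, not a requirement of this tutorial.
$mysql_host     = getenv('MYSQL_HOST') ?: 'localhost';
$mysql_username = getenv('MYSQL_USER') ?: '';
$mysql_password = getenv('MYSQL_PASS') ?: '';
$mysql_database = getenv('MYSQL_DB')   ?: 'phpCrawlerTutorial';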

Step 3: Download pages with cURL (with safe defaults)

A simple crawler needs predictable behavior: follow redirects, return the body, and record the HTTP status code. We also set a clear user-agent so site owners can identify your crawler.

function http_get($url, $referer = '') {
  $ch = curl_init();

  curl_setopt_array($ch, [
    CURLOPT_URL            => $url,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS      => 5,
    CURLOPT_CONNECTTIMEOUT => 15,
    CURLOPT_TIMEOUT        => 30,
    CURLOPT_USERAGENT      => 'potentpages-php-crawler-tutorial/1.0',
    CURLOPT_REFERER        => $referer,
    CURLOPT_HEADER         => false,
  ]);

  $body = curl_exec($ch);
  $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);

  if ($body === false) {
    $err = curl_error($ch);
    curl_close($ch);
    return ['status' => 0, 'body' => '', 'error' => $err];
  }

  curl_close($ch);
  return ['status' => (int)$status, 'body' => $body, 'error' => ''];
}
Politeness: In the crawl loop, add a delay (5–10 seconds is a reasonable baseline) to avoid burdening servers and getting blocked.
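
Before wiring http_get() into the crawl loop, you can sanity-check it on a single page (the URL below is just a placeholder):

// Quick standalone test of http_get()
$res = http_get('https://example.com/');
echo "Status: " . $res['status'] . PHP_EOL;
echo "Bytes:  " . strlen($res['body']) . PHP_EOL;
if ($res['error']) {
  echo "Error:  " . $res['error'] . PHP_EOL;
}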

Step 4: Parse HTML and extract title, meta description, and H1

DOMDocument lets you parse “real-world” HTML (even when it’s imperfect). We’ll extract the <title>, the meta description (name="description"), and the first <h1>.

function parse_html_fields($html) {
  $doc = new DOMDocument();
  libxml_use_internal_errors(true);
  // Hint the parser that the markup is UTF-8 so non-ASCII titles and descriptions aren't mangled.
  $doc->loadHTML('<?xml encoding="utf-8"?>' . $html);
  libxml_clear_errors();

  $title = '';
  $titleTags = $doc->getElementsByTagName('title');
  if ($titleTags->length > 0) {
    $title = trim($titleTags->item(0)->nodeValue);
  }

  $description = '';
  $metaTags = $doc->getElementsByTagName('meta');
  foreach ($metaTags as $tag) {
    if (strtolower($tag->getAttribute('name')) === 'description') {
      $description = trim($tag->getAttribute('content'));
      break;
    }
  }

  $h1 = '';
  $h1Tags = $doc->getElementsByTagName('h1');
  if ($h1Tags->length > 0) {
    $h1 = trim($h1Tags->item(0)->nodeValue);
  }

  return ['title' => $title, 'description' => $description, 'h1' => $h1, 'doc' => $doc];
}

Step 5: Discover internal links (the “spider” part)

Crawling is more than scraping: you must discover URLs and keep a queue. Below is a simple internal-link extractor: it keeps only same-host links, normalizes them, and ignores obvious non-page links.

function normalize_url($url) {
  // Remove fragment
  $hashPos = strpos($url, '#');
  if ($hashPos !== false) $url = substr($url, 0, $hashPos);
  return trim($url);
}

function extract_internal_links($doc, $baseUrl, $allowedHost) {
  $links = [];
  $aTags = $doc->getElementsByTagName('a');

  foreach ($aTags as $a) {
    $href = $a->getAttribute('href');
    if (!$href) continue;

    $href = normalize_url($href);

    // Ignore mailto/tel/javascript
    if (preg_match('/^(mailto:|tel:|javascript:)/i', $href)) continue;

    // Convert relative to absolute
    $abs = url_to_absolute($baseUrl, $href);
    if (!$abs) continue;

    $parts = parse_url($abs);
    if (!$parts || empty($parts['host'])) continue;

    // Same host only
    if (strtolower($parts['host']) !== strtolower($allowedHost)) continue;

    // Optional: only crawl http(s)
    $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : '';
    if (!in_array($scheme, ['http','https'], true)) continue;

    $clean = $scheme . '://' . $parts['host'] . (isset($parts['path']) ? $parts['path'] : '/');
    // You can keep query strings if your site requires them, but it increases crawl volume.
    $links[$clean] = true;
  }

  return array_keys($links);
}
Relative URL helper: The url_to_absolute() function converts relative links (like /about or ../pricing) into absolute URLs. Keep it as a separate utility so you can improve it later; a minimal sketch follows.
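
This version handles the common cases (absolute, protocol-relative, root-relative, and simple relative links) but is not a full RFC 3986 resolver; it does not collapse ../ segments, for example.

function url_to_absolute($baseUrl, $href) {
  // Already absolute?
  if (preg_match('#^https?://#i', $href)) return $href;

  $base = parse_url($baseUrl);
  if (!$base || empty($base['host'])) return '';

  $scheme = isset($base['scheme']) ? strtolower($base['scheme']) : 'http';

  // Protocol-relative: //example.com/page
  if (substr($href, 0, 2) === '//') return $scheme . ':' . $href;

  $origin = $scheme . '://' . $base['host'];
  if (isset($base['port'])) $origin .= ':' . $base['port'];

  // Root-relative: /path/page
  if (substr($href, 0, 1) === '/') return $origin . $href;

  // Relative to the base URL's directory
  $path = isset($base['path']) ? $base['path'] : '/';
  $dir  = substr($path, 0, strrpos($path, '/') + 1); // keep the trailing slash
  return $origin . $dir . $href;
}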

Step 6: Crawl loop (queue → download → parse → store → repeat)

This is the minimal “crawler engine.” It: (1) seeds a URL, (2) downloads, (3) extracts fields, (4) stores results, (5) discovers more links, and (6) continues.

$seed_url = 'https://example.com/';
$seed_parts = parse_url($seed_url);
if (!$seed_parts || empty($seed_parts['host'])) die("Invalid seed URL\n");
$allowed_host = $seed_parts['host'];

$queue = [$seed_url => ''];     // url => referer
$seen  = [$seed_url => true];

$max_pages = 500;              // safety cap for “simple crawler”
$delay_seconds = 8;            // politeness delay

while (!empty($queue) && $max_pages > 0) {
  $url = array_key_first($queue);
  $referer = $queue[$url];
  unset($queue[$url]);

  echo "Downloading: $url\n";
  $res = http_get($url, $referer);
  $status = (int)$res['status'];

  // Store status + download_time even on failures
  $url_esc = mysqli_real_escape_string($mysql_conn, $url);
  $ref_esc = mysqli_real_escape_string($mysql_conn, $referer);
  $q = "INSERT INTO pages (url, referer, http_status, download_time)
        VALUES ('$url_esc', '$ref_esc', $status, NOW())
        ON DUPLICATE KEY UPDATE
          referer=VALUES(referer),
          http_status=VALUES(http_status),
          download_time=VALUES(download_time)";
  mysqli_query($mysql_conn, $q);

  if ($status !== 200 || !$res['body']) {
    echo "Skip (status=$status)\n";
    sleep($delay_seconds);
    continue;
  }

  $parsed = parse_html_fields($res['body']);
  $title = mysqli_real_escape_string($mysql_conn, $parsed['title']);
  $desc  = mysqli_real_escape_string($mysql_conn, $parsed['description']);
  $h1    = mysqli_real_escape_string($mysql_conn, $parsed['h1']);

  $q2 = "UPDATE pages SET
          title='$title',
          description='$desc',
          h1='$h1'
        WHERE url='$url_esc'";
  mysqli_query($mysql_conn, $q2);

  $newLinks = extract_internal_links($parsed['doc'], $url, $allowed_host);
  foreach ($newLinks as $lnk) {
    if (!isset($seen[$lnk])) {
      $seen[$lnk] = true;
      $queue[$lnk] = $url; // set referer
    }
  }

  $max_pages--;
  sleep($delay_seconds);
}

echo "Done. Crawled " . count($seen) . " discovered URLs.\n";
SEO-friendly output: Once your crawl finishes, you can query for missing titles, duplicate titles, missing meta descriptions, or pages with empty H1s.
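
For example, a couple of starter audit queries against the pages table (adjust to taste):

-- Successfully downloaded pages missing a title or meta description
SELECT url FROM pages
WHERE http_status = 200
  AND (title IS NULL OR title = '' OR description IS NULL OR description = '');

-- Duplicate titles across pages
SELECT title, COUNT(*) AS page_count
FROM pages
WHERE title IS NOT NULL AND title <> ''
GROUP BY title
HAVING COUNT(*) > 1
ORDER BY page_count DESC;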

Common issues (and quick fixes)

  • Getting blocked? Increase the delay, set a clearer user-agent, avoid concurrency (this tutorial crawls one page at a time), and stay within scope.
  • Pages are “HTML” but empty? Some sites render content with JavaScript. For those, you’ll need a headless browser workflow instead of plain cURL.
  • Infinite crawling? Add a max_pages cap (shown) and consider ignoring calendars, search pages, or query-string variations.
  • Duplicate URLs? Normalize URLs: remove fragments, consider stripping tracking query parameters, and enforce consistent trailing-slash rules (see the sketch below).
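
A stricter normalizer might look like this sketch; the tracking-parameter list is illustrative, so adjust it for the site you’re crawling:

// Stricter normalizer: lowercases scheme/host, drops fragments, and strips common tracking parameters.
function normalize_url_strict($url) {
  $parts = parse_url(trim($url));
  if (!$parts || empty($parts['host'])) return '';

  $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : 'http';
  $host   = strtolower($parts['host']);
  $path   = isset($parts['path']) ? $parts['path'] : '/';

  // Drop common tracking parameters; keep the rest of the query string.
  $query = '';
  if (!empty($parts['query'])) {
    parse_str($parts['query'], $params);
    foreach (['utm_source','utm_medium','utm_campaign','utm_term','utm_content','fbclid','gclid'] as $p) {
      unset($params[$p]);
    }
    if ($params) $query = '?' . http_build_query($params);
  }

  return $scheme . '://' . $host . $path . $query; // fragment is discarded
}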

FAQ: PHP web crawler basics

What is the difference between a PHP web crawler and a web scraper?

A web scraper extracts data from a known page or list of pages. A web crawler discovers new pages by following links. Many real tools do both: crawl to find pages, scrape to extract fields.

Is this PHP website crawler safe to run on any site?

Only run it where you have permission. Respect robots.txt, terms, and server load. Use a delay and a safety cap. If you need a “production-grade” crawler, add monitoring, retries, and automated robots rules.
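
Retries, for instance, can be a thin wrapper around http_get(); this is a rough sketch, and the attempt count and backoff delays are arbitrary examples:

// Retry wrapper around http_get() with a simple linear backoff.
function http_get_with_retries($url, $referer = '', $maxAttempts = 3) {
  $res = ['status' => 0, 'body' => '', 'error' => 'not attempted'];
  for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
    $res = http_get($url, $referer);
    // Success, or a client error that retrying won't fix (e.g. 404): stop here.
    if ($res['status'] >= 200 && $res['status'] < 500 && $res['status'] !== 429) {
      return $res;
    }
    sleep(10 * $attempt); // back off a little longer on each failed attempt
  }
  return $res;
}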

Can this crawler download an entire website?

It can discover many internal pages, but “entire website” depends on authentication, JavaScript rendering, query-driven pages, and scope rules. For large sites, you’ll want stronger URL normalization, dedupe, and crawl policies.

How do I use this for an SEO audit?

Store title/description/H1, then query your MySQL table for missing or duplicate fields. You can also extend it to store canonical tags, status codes, response times, and word counts.

When should I hire Potent Pages instead of maintaining a script?

If your business depends on reliable data over time, a one-off script becomes expensive to babysit. Potent Pages builds durable pipelines: monitoring, repair workflows, clean delivery, and long-running operation—so your team isn’t stuck maintaining scrapers.

Want this as a durable data pipeline?

If you need a crawler that survives site changes and delivers clean data (CSV / DB / API), we can scope it quickly.

David Selden-Treiman, Director of Operations at Potent Pages.

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has solved problems with code for dozens of clients, and he manages and optimizes dozens of servers for Potent Pages and other clients.

