What you’ll build (in plain English)
You’ll build a small PHP “spider” that starts from a seed URL, downloads HTML pages, extracts a few fields (<title>, meta description, and the first <h1>), discovers new internal links, and repeats until there are no new pages left to crawl.
- Use case: SEO audits, content inventories, lightweight internal monitoring, quick site mapping.
- Output: a MySQL table you can query, export, or extend.
- Scope: same host (no crawling the entire internet).
Prerequisites
- PHP 7.4+ (8.x works as well)
- cURL enabled in PHP
- MySQL / MariaDB access
- CLI access preferred (so you can run long crawls without browser timeouts)
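Before writing any code, you can confirm your PHP version and the required extensions from the CLI. This is a minimal check sketch; the extension names are simply the ones this tutorial relies on:
if (PHP_VERSION_ID < 70400) {
    echo "Warning: PHP 7.4+ is recommended; you are running " . PHP_VERSION . PHP_EOL;
}
foreach (['curl', 'dom', 'mysqli'] as $ext) {
    echo $ext . ': ' . (extension_loaded($ext) ? 'loaded' : 'MISSING') . PHP_EOL;
}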
Step 1: Create the database schema
This schema stores each URL path once, remembers where it was discovered, and keeps the extracted fields. You can extend it later to store canonical URLs, status codes, content hashes, or crawl depth.
CREATE DATABASE phpCrawlerTutorial;
USE phpCrawlerTutorial;
CREATE TABLE pages (
url VARCHAR(768) NOT NULL, -- 768 chars * 4 bytes (utf8mb4) stays within InnoDB's 3072-byte index key limit
referer TEXT,
http_status INT NULL,
title TEXT,
description TEXT,
h1 TEXT,
download_time TIMESTAMP NULL DEFAULT NULL,
PRIMARY KEY (url)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Step 2: Connect to MySQL
Use your own credentials. If you plan to share this script internally, move secrets into environment variables.
$mysql_host = '';
$mysql_username = '';
$mysql_password = '';
$mysql_database = 'phpCrawlerTutorial';
$mysql_conn = mysqli_connect($mysql_host, $mysql_username, $mysql_password, $mysql_database);
if (!$mysql_conn) {
echo "Error: Unable to connect to MySQL." . PHP_EOL;
echo "Debugging errno: " . mysqli_connect_errno() . PHP_EOL;
echo "Debugging error: " . mysqli_connect_error() . PHP_EOL;
exit;
}
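As suggested above, the credentials can come from environment variables instead of being hard-coded. Here is a minimal sketch; the CRAWLER_DB_* variable names are just placeholders, so use whatever naming your environment already follows:
$mysql_host = getenv('CRAWLER_DB_HOST') ?: 'localhost';
$mysql_username = getenv('CRAWLER_DB_USER') ?: '';
$mysql_password = getenv('CRAWLER_DB_PASS') ?: '';
$mysql_database = getenv('CRAWLER_DB_NAME') ?: 'phpCrawlerTutorial';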
Step 3: Download pages with cURL (with safe defaults)
A simple crawler needs predictable behavior: follow redirects, return the body, and record the HTTP status code. We also set a clear user-agent so site owners can identify your crawler.
function http_get($url, $referer = '') {
$ch = curl_init();
curl_setopt_array($ch, [
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 5,
CURLOPT_CONNECTTIMEOUT => 15,
CURLOPT_TIMEOUT => 30,
CURLOPT_USERAGENT => 'potentpages-php-crawler-tutorial/1.0',
CURLOPT_REFERER => $referer,
CURLOPT_HEADER => false,
]);
$body = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
if ($body === false) {
$err = curl_error($ch);
curl_close($ch);
return ['status' => 0, 'body' => '', 'error' => $err];
}
curl_close($ch);
return ['status' => (int)$status, 'body' => $body, 'error' => ''];
}
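A quick way to sanity-check the function before wiring up the full crawler is to fetch a single page and inspect the result array it returns:
$res = http_get('https://example.com/');
if ($res['status'] === 200) {
    echo "Fetched " . strlen($res['body']) . " bytes" . PHP_EOL;
} else {
    echo "Request failed: status=" . $res['status'] . ", error=" . $res['error'] . PHP_EOL;
}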
Step 4: Parse HTML and extract title, meta description, and H1
DOMDocument lets you parse “real-world” HTML (even when it’s imperfect). We’ll extract the title, the meta name="description", and the first h1.
function parse_html_fields($html) {
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$title = '';
$titleTags = $doc->getElementsByTagName('title');
if ($titleTags->length > 0) {
$title = trim($titleTags->item(0)->nodeValue);
}
$description = '';
$metaTags = $doc->getElementsByTagName('meta');
foreach ($metaTags as $tag) {
if (strtolower($tag->getAttribute('name')) === 'description') {
$description = trim($tag->getAttribute('content'));
break;
}
}
$h1 = '';
$h1Tags = $doc->getElementsByTagName('h1');
if ($h1Tags->length > 0) {
$h1 = trim($h1Tags->item(0)->nodeValue);
}
return ['title' => $title, 'description' => $description, 'h1' => $h1, 'doc' => $doc];
}
Step 5: Discover internal links (the “spider” part)
Crawling is more than scraping: you must discover URLs and keep a queue. Below is a simple internal-link extractor: it keeps only same-host links, normalizes them, and ignores obvious non-page links.
function normalize_url($url) {
// Remove fragment
$hashPos = strpos($url, '#');
if ($hashPos !== false) $url = substr($url, 0, $hashPos);
return trim($url);
}
function extract_internal_links($doc, $baseUrl, $allowedHost) {
$links = [];
$aTags = $doc->getElementsByTagName('a');
foreach ($aTags as $a) {
$href = $a->getAttribute('href');
if (!$href) continue;
$href = normalize_url($href);
// Ignore mailto/tel/javascript
if (preg_match('/^(mailto:|tel:|javascript:)/i', $href)) continue;
// Convert relative URLs to absolute (url_to_absolute() is not a PHP built-in; a minimal version is shown after this function)
$abs = url_to_absolute($baseUrl, $href);
if (!$abs) continue;
$parts = parse_url($abs);
if (!$parts || empty($parts['host'])) continue;
// Same host only
if (strtolower($parts['host']) !== strtolower($allowedHost)) continue;
// Optional: only crawl http(s)
$scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : '';
if (!in_array($scheme, ['http','https'], true)) continue;
$clean = $scheme . '://' . $parts['host'] . (isset($parts['path']) ? $parts['path'] : '/');
// You can keep query strings if your site requires them, but it increases crawl volume.
$links[$clean] = true;
}
return array_keys($links);
}
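Note that url_to_absolute() is not a PHP built-in, so you need to supply it yourself or pull in a full URL-resolution library. Below is a minimal sketch that covers the common cases: already-absolute URLs, protocol-relative links, root-relative paths, and simple directory-relative paths. It does not resolve ../ segments, preserve ports, or implement all of RFC 3986, so treat it as a starting point.
function url_to_absolute($baseUrl, $href) {
    // Already absolute
    if (preg_match('#^https?://#i', $href)) return $href;
    $base = parse_url($baseUrl);
    if (!$base || empty($base['host'])) return false;
    $scheme = isset($base['scheme']) ? strtolower($base['scheme']) : 'http';
    // Protocol-relative: //host/path
    if (strpos($href, '//') === 0) return $scheme . ':' . $href;
    $origin = $scheme . '://' . $base['host'];
    // Root-relative: /path
    if (strpos($href, '/') === 0) return $origin . $href;
    // Directory-relative: resolve against the base URL's directory
    $path = isset($base['path']) ? $base['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}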
Step 6: Crawl loop (queue → download → parse → store → repeat)
This is the minimal “crawler engine.” It: (1) seeds a URL, (2) downloads, (3) extracts fields, (4) stores results, (5) discovers more links, and (6) continues.
$seed_url = 'https://example.com/';
$seed_parts = parse_url($seed_url);
if (!$seed_parts || empty($seed_parts['host'])) die("Invalid seed URL\n");
$allowed_host = $seed_parts['host'];
$queue = [$seed_url => '']; // url => referer
$seen = [$seed_url => true];
$max_pages = 500; // safety cap for a "simple crawler"
$delay_seconds = 8; // politeness delay
while (!empty($queue) && $max_pages > 0) {
$url = array_key_first($queue);
$referer = $queue[$url];
unset($queue[$url]);
echo "Downloading: $url\n";
$res = http_get($url, $referer);
$status = (int)$res['status'];
// Store status + download_time even on failures
$url_esc = mysqli_real_escape_string($mysql_conn, $url);
$ref_esc = mysqli_real_escape_string($mysql_conn, $referer);
$q = "INSERT INTO pages (url, referer, http_status, download_time)
VALUES ('$url_esc', '$ref_esc', $status, NOW())
ON DUPLICATE KEY UPDATE
referer=VALUES(referer),
http_status=VALUES(http_status),
download_time=VALUES(download_time)";
mysqli_query($mysql_conn, $q);
if ($status !== 200 || !$res['body']) {
echo "Skip (status=$status)\n";
sleep($delay_seconds);
continue;
}
$parsed = parse_html_fields($res['body']);
$title = mysqli_real_escape_string($mysql_conn, $parsed['title']);
$desc = mysqli_real_escape_string($mysql_conn, $parsed['description']);
$h1 = mysqli_real_escape_string($mysql_conn, $parsed['h1']);
$q2 = "UPDATE pages SET
title='$title',
description='$desc',
h1='$h1'
WHERE url='$url_esc'";
mysqli_query($mysql_conn, $q2);
$newLinks = extract_internal_links($parsed['doc'], $url, $allowed_host);
foreach ($newLinks as $lnk) {
if (!isset($seen[$lnk])) {
$seen[$lnk] = true;
$queue[$lnk] = $url; // set referer
}
}
$max_pages--;
sleep($delay_seconds);
}
echo "Done. Crawled " . count($seen) . " discovered URLs.\n";
Common issues (and quick fixes)
- Getting blocked? Increase delay, set a clearer user-agent, reduce concurrency (this tutorial uses none), and stay within scope.
- Pages are “HTML” but empty? Some sites render content with JavaScript. For those, you’ll need a headless browser workflow instead of plain cURL.
- Infinite crawling? Add a max_pages cap (shown) and consider ignoring calendars, search pages, or query-string variations.
- Duplicate URLs? Normalize URLs (remove fragments, consider removing tracking query params, enforce trailing slash rules); see the sketch after this list.
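As an example of stricter normalization, the helper below lowercases the scheme and host and strips a few common tracking parameters. The parameter list is an assumption; tune it (and any trailing-slash policy) to your own site:
function normalize_url_strict($url) {
    $parts = parse_url($url);
    if (!$parts || empty($parts['host'])) return $url;
    $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : 'http';
    $host = strtolower($parts['host']);
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $query = '';
    if (!empty($parts['query'])) {
        parse_str($parts['query'], $params);
        // Drop common tracking parameters (adjust this list as needed)
        foreach (['utm_source', 'utm_medium', 'utm_campaign', 'gclid', 'fbclid'] as $drop) {
            unset($params[$drop]);
        }
        if ($params) $query = '?' . http_build_query($params);
    }
    return $scheme . '://' . $host . $path . $query;
}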
FAQ: PHP web crawler basics
What is the difference between a PHP web crawler and a web scraper?
A web scraper extracts data from a known page or list of pages. A web crawler discovers new pages by following links. Many real tools do both: crawl to find pages, scrape to extract fields.
Is this PHP website crawler safe to run on any site?
Only run it where you have permission. Respect robots.txt, terms, and server load. Use a delay and a safety cap. If you need a “production-grade” crawler, add monitoring, retries, and automated robots rules.
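If you want the script itself to honor robots.txt, a very simplified check might look like the sketch below. It only reads Disallow lines under User-agent: * and ignores Allow rules, wildcards, and crawl-delay, so use a proper robots.txt parser for anything serious.
function is_path_disallowed($robotsTxt, $path) {
    $appliesToUs = false;
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '') continue;
        if (stripos($line, 'User-agent:') === 0) {
            $appliesToUs = (trim(substr($line, 11)) === '*');
        } elseif ($appliesToUs && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            if ($rule !== '' && strpos($path, $rule) === 0) return true;
        }
    }
    return false;
}
// Usage: $robots = http_get('https://example.com/robots.txt');
// if (is_path_disallowed($robots['body'], '/private/page')) { /* skip it */ }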
Can this crawler download an entire website?
It can discover many internal pages, but “entire website” depends on authentication, JavaScript rendering, query-driven pages, and scope rules. For large sites, you’ll want stronger URL normalization, dedupe, and crawl policies.
How do I use this for an SEO audit?
Store title/description/H1, then query your MySQL table for missing or duplicate fields. You can also extend it to store canonical tags, status codes, response times, and word counts.
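For example, a query for pages missing a title, run through the same $mysql_conn connection from Step 2 (a minimal sketch; extend the WHERE clause for descriptions and H1s the same way):
$sql = "SELECT url, http_status FROM pages
        WHERE title IS NULL OR title = ''
        ORDER BY url";
$result = mysqli_query($mysql_conn, $sql);
while ($row = mysqli_fetch_assoc($result)) {
    echo $row['url'] . ' (status ' . $row['http_status'] . ')' . PHP_EOL;
}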
When should I hire Potent Pages instead of maintaining a script?
If your business depends on reliable data over time, a one-off script becomes expensive to babysit. Potent Pages builds durable pipelines: monitoring, repair workflows, clean delivery, and long-running operation—so your team isn’t stuck maintaining scrapers.
Want this as a durable data pipeline?
If you need a crawler that survives site changes and delivers clean data (CSV / DB / API), we can scope it quickly.
