What you’ll build
We’ll create a function (think: http_get()) that downloads a webpage using PHP’s cURL extension. You’ll pass a target URL and an optional referer. The function returns a structured array with:
- status_code (e.g., 200, 301, 404, 429)
- headers (parsed into a clean associative array)
- body (HTML source)
- final_url (after redirects)
- error (if the request fails)
Why not just use file_get_contents()?
For simple scripts, file_get_contents() can work — but cURL gives you the controls that crawlers need:
cookies, redirects, timeouts, user agents, compression, and consistent access to response metadata.
The function: download a webpage using PHP and cURL
Below is a practical, “crawler-friendly” implementation. It uses a header callback so redirects and multi-header responses don’t break parsing.
PHP (copy/paste)<?php
/**
* Download a webpage via HTTP GET using cURL
* Developed By: Potent Pages, LLC (https://potentpages.com/)
*
* Notes:
* - Returns status code, headers, body, final URL, and timing/error details.
* - Uses a header callback so redirects don't break header parsing.
*/
function http_get($target, $referer = "", $cookieJar = "cookie_jar.txt", $cookieFile = "cookies.txt") {
$ch = curl_init();
// We'll accumulate headers in a normalized structure.
$headers = [];
$lastStatusLine = null;
// Parse headers reliably (supports redirects and multiple header blocks).
curl_setopt($ch, CURLOPT_HEADERFUNCTION, function($curl, $headerLine) use (&$headers, &$lastStatusLine) {
$len = strlen($headerLine);
$headerLine = trim($headerLine);
if ($headerLine === "") {
return $len;
}
// Status line (e.g., HTTP/1.1 200 OK)
if (stripos($headerLine, "HTTP/") === 0) {
$lastStatusLine = $headerLine;
$headers["http_status_line"] = $headerLine;
return $len;
}
// Normal header line (split on first ":" only)
$parts = explode(":", $headerLine, 2);
if (count($parts) === 2) {
$key = strtolower(trim($parts[0]));
$value = trim($parts[1]);
// Support repeated headers (set-cookie, etc.)
if (isset($headers[$key])) {
if (!is_array($headers[$key])) {
$headers[$key] = [$headers[$key]];
}
$headers[$key][] = $value;
} else {
$headers[$key] = $value;
}
}
return $len;
});
// Core request settings
curl_setopt($ch, CURLOPT_URL, $target);
curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Identify your crawler (set this to something meaningful)
curl_setopt($ch, CURLOPT_USERAGENT, "potentpages-tutorial-crawler/1.0 (+https://potentpages.com/)");
// Optional referer (useful when simulating normal navigation)
if (!empty($referer)) {
curl_setopt($ch, CURLOPT_REFERER, $referer);
}
// Cookies (helpful for sessions and multi-page flows)
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
// Redirect handling
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 6);
// Timeouts (critical for crawlers)
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 12);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
// Compression (saves bandwidth, common on modern sites)
curl_setopt($ch, CURLOPT_ENCODING, "");
// Safer TLS defaults
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
// Optional: send common headers
curl_setopt($ch, CURLOPT_HTTPHEADER, [
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language: en-US,en;q=0.9",
"Connection: keep-alive",
]);
$body = curl_exec($ch);
$errno = curl_errno($ch);
$error = $errno ? curl_error($ch) : null;
// Useful metadata for debugging / logging
$info = curl_getinfo($ch);
curl_close($ch);
return [
"ok" => ($errno === 0),
"status_code" => isset($info["http_code"]) ? (int)$info["http_code"] : 0,
"headers" => $headers,
"body" => $body,
"final_url" => isset($info["url"]) ? $info["url"] : $target,
"timing" => [
"total_time" => $info["total_time"] ?? null,
"namelookup_time" => $info["namelookup_time"] ?? null,
"connect_time" => $info["connect_time"] ?? null,
"starttransfer_time" => $info["starttransfer_time"] ?? null,
],
"error" => $error,
];
}
// Example usage:
$page = http_get("https://potentpages.com/", "");
if ($page["ok"]) {
echo "Status: " . $page["status_code"] . "\\n";
echo "Final URL: " . $page["final_url"] . "\\n";
echo "Body length: " . strlen($page["body"]) . "\\n";
} else {
echo "Request failed: " . $page["error"] . "\\n";
}
How it works (in plain English)
- cURL executes an HTTP GET to your target URL.
- CURLOPT_HEADERFUNCTION receives each response header line as it arrives, so parsing remains correct even after redirects.
- Cookies allow multi-page crawling sessions.
- Timeouts prevent your crawler from hanging forever on slow or broken endpoints.
- curl_getinfo() returns status code, timing, and the final resolved URL.
Common issues (403 / 429 / JS-heavy pages)
As soon as you go beyond a few pages, you’ll see real-world friction. Here’s a quick diagnostic map:
- 403 Forbidden: your request looks “non-human” (headers/UA), or the site blocks bots. Add throttling, rotate IPs carefully, and confirm permissions.
- 429 Too Many Requests: you’re hitting rate limits. Add delays, exponential backoff, and respect Retry-After if present.
- Empty content / missing data: the site renders content via JavaScript. Use a browser approach (e.g., Selenium).
Next step: turn this into a crawler
This downloader is the first building block. The next step is to repeatedly call it across discovered links, store HTML, extract data, and avoid revisiting the same pages.
FAQ: Downloading webpages with PHP & cURL
These are common questions people ask when building PHP web crawlers and download scripts.
How do I download HTML from a URL in PHP?
Use PHP’s cURL extension to send an HTTP GET request and return the response body. cURL is preferred over
file_get_contents() when you need status codes, headers, cookies, or timeouts.
How can I get the HTTP status code with cURL in PHP?
After curl_exec(), call curl_getinfo($ch) and read http_code.
This tutorial’s function returns status_code directly.
How do I follow redirects (301/302) in PHP cURL?
Enable CURLOPT_FOLLOWLOCATION and set CURLOPT_MAXREDIRS to prevent loops.
Also capture the final URL via curl_getinfo().
Why is my cURL request blocked (403 or 429)?
Many sites rate-limit or block automated requests. Use a clear user agent, slow down your crawl, implement retries with backoff, and confirm you’re allowed to access the pages.
When should I use Selenium instead of cURL?
Use Selenium when the data you need is rendered by JavaScript after the initial HTML loads (or when you need to interact with the page like clicking or scrolling).
Want a crawler that stays running?
If you need ongoing collection, monitoring, and structured delivery, we build managed web crawlers for teams that need reliability.
