
DOWNLOAD A WEBPAGE
Using PHP + cURL (GET, headers, redirects, cookies)

This tutorial gives you a small, reusable PHP function that downloads a URL via HTTP GET and returns: status code, response headers, body HTML, and final URL. It’s a practical building block for web crawlers, site monitors, and extraction scripts.

  • Reliable header parsing
  • Handles redirects + cookies
  • Safer timeouts + TLS defaults
  • Ready for crawling loops

What you’ll build

We’ll create a function (think: http_get()) that downloads a webpage using PHP’s cURL extension. You’ll pass a target URL and an optional referer. The function returns a structured array with:

  • status_code (e.g., 200, 301, 404, 429)
  • headers (parsed into a clean associative array)
  • body (HTML source)
  • final_url (after redirects)
  • error (if the request fails)
Quick warning: Only crawl pages you have permission to access. Respect robots.txt and a site’s Terms of Service, and throttle your requests so you don’t harm performance.

Why not just use file_get_contents()?

For simple scripts, file_get_contents() can work — but cURL gives you the controls that crawlers need: cookies, redirects, timeouts, user agents, compression, and consistent access to response metadata.

Rule of thumb: If you care about status codes, headers, retries, or running this repeatedly at scale, use cURL.
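For comparison, here's a minimal sketch of the same fetch done with file_get_contents() and a stream context. It works, but the status code has to be parsed by hand out of $http_response_header, and there's no cookie jar, redirect cap, or retry support built in. The URL and option values here are illustrative.

<?php
// Minimal file_get_contents() fetch, for comparison with the cURL version below.
$context = stream_context_create([
    "http" => [
        "method"          => "GET",
        "timeout"         => 30,
        "follow_location" => 1,
        "user_agent"      => "potentpages-tutorial-crawler/1.0",
        "ignore_errors"   => true, // return the body even on 4xx/5xx
    ],
]);

$body = file_get_contents("https://potentpages.com/", false, $context);

// Response headers land in the magic $http_response_header variable;
// the status code must be parsed out of the status line manually.
$statusLine = $http_response_header[0] ?? "";
preg_match('{HTTP/\S+\s(\d{3})}', $statusLine, $match);
$statusCode = isset($match[1]) ? (int)$match[1] : 0;

echo "Status: " . $statusCode . "\n";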

The function: download a webpage using PHP and cURL

Below is a practical, “crawler-friendly” implementation. It uses a header callback so redirects and multi-header responses don’t break parsing.

PHP (copy/paste):

<?php
/**
 * Download a webpage via HTTP GET using cURL
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 *
 * Notes:
 * - Returns status code, headers, body, final URL, and timing/error details.
 * - Uses a header callback so redirects don't break header parsing.
 */
function http_get($target, $referer = "", $cookieJar = "cookie_jar.txt", $cookieFile = "cookies.txt") {

    $ch = curl_init();

    // We'll accumulate headers in a normalized structure.
    $headers = [];
    $lastStatusLine = null;

    // Parse headers reliably (supports redirects and multiple header blocks).
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function($curl, $headerLine) use (&$headers, &$lastStatusLine) {
        $len = strlen($headerLine);
        $headerLine = trim($headerLine);

        if ($headerLine === "") {
            return $len;
        }

        // Status line (e.g., HTTP/1.1 200 OK)
        if (stripos($headerLine, "HTTP/") === 0) {
            $lastStatusLine = $headerLine;
            $headers["http_status_line"] = $headerLine;
            return $len;
        }

        // Normal header line (split on first ":" only)
        $parts = explode(":", $headerLine, 2);
        if (count($parts) === 2) {
            $key = strtolower(trim($parts[0]));
            $value = trim($parts[1]);

            // Support repeated headers (set-cookie, etc.)
            if (isset($headers[$key])) {
                if (!is_array($headers[$key])) {
                    $headers[$key] = [$headers[$key]];
                }
                $headers[$key][] = $value;
            } else {
                $headers[$key] = $value;
            }
        }

        return $len;
    });

    // Core request settings
    curl_setopt($ch, CURLOPT_URL, $target);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // Identify your crawler (set this to something meaningful)
    curl_setopt($ch, CURLOPT_USERAGENT, "potentpages-tutorial-crawler/1.0 (+https://potentpages.com/)");

    // Optional referer (useful when simulating normal navigation)
    if (!empty($referer)) {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    // Cookies (helpful for sessions and multi-page flows)
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);

    // Redirect handling
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 6);

    // Timeouts (critical for crawlers)
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 12);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    // Compression (saves bandwidth, common on modern sites)
    curl_setopt($ch, CURLOPT_ENCODING, "");

    // Safer TLS defaults
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);

    // Optional: send common headers
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language: en-US,en;q=0.9",
        "Connection: keep-alive",
    ]);

    $body = curl_exec($ch);

    $errno = curl_errno($ch);
    $error = $errno ? curl_error($ch) : null;

    // Useful metadata for debugging / logging
    $info = curl_getinfo($ch);
    curl_close($ch);

    return [
        "ok" => ($errno === 0),
        "status_code" => isset($info["http_code"]) ? (int)$info["http_code"] : 0,
        "headers" => $headers,
        "body" => $body,
        "final_url" => isset($info["url"]) ? $info["url"] : $target,
        "timing" => [
            "total_time" => $info["total_time"] ?? null,
            "namelookup_time" => $info["namelookup_time"] ?? null,
            "connect_time" => $info["connect_time"] ?? null,
            "starttransfer_time" => $info["starttransfer_time"] ?? null,
        ],
        "error" => $error,
    ];
}

// Example usage:
$page = http_get("https://potentpages.com/", "");
if ($page["ok"]) {
    echo "Status: " . $page["status_code"] . "\\n";
    echo "Final URL: " . $page["final_url"] . "\\n";
    echo "Body length: " . strlen($page["body"]) . "\\n";
} else {
    echo "Request failed: " . $page["error"] . "\\n";
}
Tip: If you're looking for a function that returns both the HTML and the status code, this version gives you both, and the header callback avoids brittle string-splitting of the raw header block.

How it works (in plain English)

  • cURL executes an HTTP GET to your target URL.
  • CURLOPT_HEADERFUNCTION receives each response header line as it arrives, so parsing remains correct even after redirects.
  • Cookies allow multi-page crawling sessions.
  • Timeouts prevent your crawler from hanging forever on slow or broken endpoints.
  • curl_getinfo() returns status code, timing, and the final resolved URL.
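For example, once http_get() returns, you can read individual headers straight from the parsed array. Header names are lowercased by the callback, and because headers accumulate across redirect responses, a header may come back as an array of values:

<?php
$page = http_get("https://potentpages.com/");

// A header seen more than once (across redirects, or set-cookie) is stored as an array.
$contentType = $page["headers"]["content-type"] ?? "unknown";
if (is_array($contentType)) {
    $contentType = end($contentType); // last response wins
}
echo "Content-Type: " . $contentType . "\n";

// set-cookie is commonly repeated, so treat it as a list.
foreach ((array)($page["headers"]["set-cookie"] ?? []) as $cookie) {
    echo "Cookie set: " . $cookie . "\n";
}

// Timing details from curl_getinfo() help when tuning timeouts or logging slow hosts.
echo "Total time: " . $page["timing"]["total_time"] . "s\n";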

Common issues (403 / 429 / JS-heavy pages)

As soon as you go beyond a few pages, you’ll see real-world friction. Here’s a quick diagnostic map:

  • 403 Forbidden: your request looks “non-human” (headers/UA), or the site blocks bots. Add throttling, rotate IPs carefully, and confirm permissions.
  • 429 Too Many Requests: you’re hitting rate limits. Add delays, exponential backoff, and respect Retry-After if present (a retry sketch follows below).
  • Empty content / missing data: the site renders content via JavaScript. Use a browser approach (e.g., Selenium).
If the page depends on JavaScript: Use a browser-based downloader. See: Downloading a Webpage Using Selenium & PHP.
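For 429s and transient server errors, it helps to wrap http_get() in a retry loop. Here's a rough sketch with exponential backoff that honors a numeric Retry-After header; the function name, attempt count, and delays below are illustrative defaults, not recommendations for any particular site.

<?php
/**
 * Retry wrapper around http_get() with exponential backoff (sketch only).
 */
function http_get_with_retries($target, $referer = "", $maxAttempts = 4) {
    $delay = 2; // seconds before the first retry

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $page = http_get($target, $referer);

        // Retry only on transport errors, 429s, and 5xx responses.
        $retryable = !$page["ok"]
            || $page["status_code"] == 429
            || $page["status_code"] >= 500;

        if (!$retryable || $attempt === $maxAttempts) {
            return $page;
        }

        // Honor a numeric Retry-After header if the server sent one.
        $retryAfter = $page["headers"]["retry-after"] ?? null;
        if (is_array($retryAfter)) {
            $retryAfter = end($retryAfter);
        }
        $wait = (is_numeric($retryAfter) && $retryAfter > 0) ? (int)$retryAfter : $delay;

        sleep($wait);
        $delay *= 2; // exponential backoff
    }
}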

Next step: turn this into a crawler

This downloader is the first building block. The next step is to repeatedly call it across discovered links, store HTML, extract data, and avoid revisiting the same pages.
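A stripped-down version of that loop might look like the sketch below. It assumes the http_get() function from this tutorial; the seed URL, page limit, and delay are placeholders, and a real crawler would also check robots.txt, resolve relative links, and persist its queue and results.

<?php
// Simplified crawl loop built on http_get() (sketch only).
$queue    = ["https://potentpages.com/"]; // seed URL(s)
$visited  = [];
$maxPages = 25;

while (!empty($queue) && count($visited) < $maxPages) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue; // don't revisit pages
    }
    $visited[$url] = true;

    $page = http_get($url);
    if (!$page["ok"] || $page["status_code"] !== 200 || empty($page["body"])) {
        continue;
    }

    // Store or extract data from $page["body"] here.

    // Discover new absolute links with DOMDocument and queue them.
    $dom = new DOMDocument();
    @$dom->loadHTML($page["body"]); // suppress warnings from messy real-world HTML
    foreach ($dom->getElementsByTagName("a") as $a) {
        $href = $a->getAttribute("href");
        if (strpos($href, "http") === 0 && !isset($visited[$href])) {
            $queue[] = $href; // relative URLs would need resolving against $url
        }
    }

    sleep(1); // throttle between requests
}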

FAQ: Downloading webpages with PHP & cURL

These are common questions people ask when building PHP web crawlers and download scripts.

How do I download HTML from a URL in PHP?

Use PHP’s cURL extension to send an HTTP GET request and return the response body. cURL is preferred over file_get_contents() when you need status codes, headers, cookies, or timeouts.

How can I get the HTTP status code with cURL in PHP?

After curl_exec(), call curl_getinfo($ch) and read http_code. This tutorial’s function returns status_code directly.
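If you aren't using the function from this tutorial, the raw calls look roughly like this (the URL is illustrative):

<?php
$ch = curl_init("https://potentpages.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
$statusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE); // e.g. 200, 404, 429
curl_close($ch);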

How do I follow redirects (301/302) in PHP cURL?

Enable CURLOPT_FOLLOWLOCATION and set CURLOPT_MAXREDIRS to prevent loops. Also capture the final URL via curl_getinfo().
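In option form, that's roughly the following (the URL is illustrative):

<?php
$ch = curl_init("https://potentpages.com/some-moved-page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 301/302 chains
curl_setopt($ch, CURLOPT_MAXREDIRS, 6);         // cap redirects to avoid loops
$body = curl_exec($ch);
$finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // URL after redirects
curl_close($ch);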

Why is my cURL request blocked (403 or 429)?

Many sites rate-limit or block automated requests. Use a clear user agent, slow down your crawl, implement retries with backoff, and confirm you’re allowed to access the pages.

When should I use Selenium instead of cURL?

Use Selenium when the data you need is rendered by JavaScript after the initial HTML loads (or when you need to interact with the page like clicking or scrolling).

Want a crawler that stays running?

If you need ongoing collection, monitoring, and structured delivery, we build managed web crawlers for teams that need reliability.

David Selden-Treiman, Director of Operations at Potent Pages.

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has solved problems with custom programming for dozens of clients, and he manages and optimizes dozens of servers for both Potent Pages and other clients.

