
DOWNLOAD A WEBPAGE
Using PHP + cURL (GET, headers, redirects, cookies)

This tutorial gives you a small, reusable PHP function that downloads a URL via HTTP GET and returns: status code, response headers, body HTML, and final URL. It’s a practical building block for web crawlers, site monitors, and extraction scripts.

  • Reliable header parsing
  • Handles redirects + cookies
  • Safer timeouts + TLS defaults
  • Ready for crawling loops

What you’ll build

We’ll create a function (think: http_get()) that downloads a webpage using PHP’s cURL extension. You’ll pass a target URL and an optional referer. The function returns a structured array with:

  • status_code (e.g., 200, 301, 404, 429)
  • headers (parsed into a clean associative array)
  • body (HTML source)
  • final_url (after redirects)
  • error (if the request fails)
Quick warning: Only crawl pages you have permission to access. Respect robots.txt and a site’s Terms of Service, and throttle your requests so you don’t harm performance.

Why not just use file_get_contents()?

For simple scripts, file_get_contents() can work — but cURL gives you the controls that crawlers need: cookies, redirects, timeouts, user agents, compression, and consistent access to response metadata.

Rule of thumb: If you care about status codes, headers, retries, or running this repeatedly at scale, use cURL.
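For comparison, here's a minimal sketch of the same fetch done with file_get_contents() and a stream context. It works, but the status code has to be parsed by hand out of $http_response_header, and there's no cookie jar, redirect cap, or retry support built in. The URL and option values here are illustrative.

<?php
// Minimal file_get_contents() fetch, for comparison with the cURL version below.
$context = stream_context_create([
    "http" => [
        "method"          => "GET",
        "timeout"         => 30,
        "follow_location" => 1,
        "user_agent"      => "potentpages-tutorial-crawler/1.0",
        "ignore_errors"   => true, // return the body even on 4xx/5xx
    ],
]);

$body = file_get_contents("https://potentpages.com/", false, $context);

// Response headers land in the magic $http_response_header variable;
// the status code must be parsed out of the status line manually.
$statusLine = $http_response_header[0] ?? "";
preg_match('{HTTP/\S+\s(\d{3})}', $statusLine, $match);
$statusCode = isset($match[1]) ? (int)$match[1] : 0;

echo "Status: " . $statusCode . "\n";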

The function: download a webpage using PHP and cURL

Below is a practical, “crawler-friendly” implementation. It uses a header callback so redirects and multi-header responses don’t break parsing.

PHP (copy/paste):

<?php
/**
 * Download a webpage via HTTP GET using cURL
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 *
 * Notes:
 * - Returns status code, headers, body, final URL, and timing/error details.
 * - Uses a header callback so redirects don't break header parsing.
 */
function http_get($target, $referer = "", $cookieJar = "cookie_jar.txt", $cookieFile = "cookies.txt") {

    $ch = curl_init();

    // We'll accumulate headers in a normalized structure.
    $headers = [];
    $lastStatusLine = null;

    // Parse headers reliably (supports redirects and multiple header blocks).
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function($curl, $headerLine) use (&$headers, &$lastStatusLine) {
        $len = strlen($headerLine);
        $headerLine = trim($headerLine);

        if ($headerLine === "") {
            return $len;
        }

        // Status line (e.g., HTTP/1.1 200 OK)
        if (stripos($headerLine, "HTTP/") === 0) {
            $lastStatusLine = $headerLine;
            $headers["http_status_line"] = $headerLine;
            return $len;
        }

        // Normal header line (split on first ":" only)
        $parts = explode(":", $headerLine, 2);
        if (count($parts) === 2) {
            $key = strtolower(trim($parts[0]));
            $value = trim($parts[1]);

            // Support repeated headers (set-cookie, etc.)
            if (isset($headers[$key])) {
                if (!is_array($headers[$key])) {
                    $headers[$key] = [$headers[$key]];
                }
                $headers[$key][] = $value;
            } else {
                $headers[$key] = $value;
            }
        }

        return $len;
    });

    // Core request settings
    curl_setopt($ch, CURLOPT_URL, $target);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // Identify your crawler (set this to something meaningful)
    curl_setopt($ch, CURLOPT_USERAGENT, "potentpages-tutorial-crawler/1.0 (+https://potentpages.com/)");

    // Optional referer (useful when simulating normal navigation)
    if (!empty($referer)) {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    // Cookies (helpful for sessions and multi-page flows)
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);

    // Redirect handling
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 6);

    // Timeouts (critical for crawlers)
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 12);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    // Compression (saves bandwidth, common on modern sites)
    curl_setopt($ch, CURLOPT_ENCODING, "");

    // Safer TLS defaults
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);

    // Optional: send common headers
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language: en-US,en;q=0.9",
        "Connection: keep-alive",
    ]);

    $body = curl_exec($ch);

    $errno = curl_errno($ch);
    $error = $errno ? curl_error($ch) : null;

    // Useful metadata for debugging / logging
    $info = curl_getinfo($ch);
    curl_close($ch);

    return [
        "ok" => ($errno === 0),
        "status_code" => isset($info["http_code"]) ? (int)$info["http_code"] : 0,
        "headers" => $headers,
        "body" => $body,
        "final_url" => isset($info["url"]) ? $info["url"] : $target,
        "timing" => [
            "total_time" => $info["total_time"] ?? null,
            "namelookup_time" => $info["namelookup_time"] ?? null,
            "connect_time" => $info["connect_time"] ?? null,
            "starttransfer_time" => $info["starttransfer_time"] ?? null,
        ],
        "error" => $error,
    ];
}

// Example usage:
$page = http_get("https://potentpages.com/", "");
if ($page["ok"]) {
    echo "Status: " . $page["status_code"] . "\\n";
    echo "Final URL: " . $page["final_url"] . "\\n";
    echo "Body length: " . strlen($page["body"]) . "\\n";
} else {
    echo "Request failed: " . $page["error"] . "\\n";
}
Tip: If you're looking for a function that returns both the HTML and the status code, this version gives you both, and the header callback avoids brittle string-splitting of the raw header block.

How it works (in plain English)

  • cURL executes an HTTP GET to your target URL.
  • CURLOPT_HEADERFUNCTION receives each response header line as it arrives, so parsing remains correct even after redirects.
  • Cookies allow multi-page crawling sessions.
  • Timeouts prevent your crawler from hanging forever on slow or broken endpoints.
  • curl_getinfo() returns status code, timing, and the final resolved URL.
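For example, once http_get() returns, you can read individual headers straight from the parsed array. Header names are lowercased by the callback, and because headers accumulate across redirect responses, a header may come back as an array of values:

<?php
$page = http_get("https://potentpages.com/");

// A header seen more than once (across redirects, or set-cookie) is stored as an array.
$contentType = $page["headers"]["content-type"] ?? "unknown";
if (is_array($contentType)) {
    $contentType = end($contentType); // last response wins
}
echo "Content-Type: " . $contentType . "\n";

// set-cookie is commonly repeated, so treat it as a list.
foreach ((array)($page["headers"]["set-cookie"] ?? []) as $cookie) {
    echo "Cookie set: " . $cookie . "\n";
}

// Timing details from curl_getinfo() help when tuning timeouts or logging slow hosts.
echo "Total time: " . $page["timing"]["total_time"] . "s\n";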

Common issues (403 / 429 / JS-heavy pages)

As soon as you go beyond a few pages, you’ll see real-world friction. Here’s a quick diagnostic map:

  • 403 Forbidden: your request looks “non-human” (headers/UA), or the site blocks bots. Add throttling, rotate IPs carefully, and confirm permissions.
  • 429 Too Many Requests: you’re hitting rate limits. Add delays, exponential backoff, and respect Retry-After if present (a retry sketch follows below).
  • Empty content / missing data: the site renders content via JavaScript. Use a browser approach (e.g., Selenium).
If the page depends on JavaScript: Use a browser-based downloader. See: Downloading a Webpage Using Selenium & PHP.
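For 429s and transient server errors, it helps to wrap http_get() in a retry loop. Here's a rough sketch with exponential backoff that honors a numeric Retry-After header; the function name, attempt count, and delays below are illustrative defaults, not recommendations for any particular site.

<?php
/**
 * Retry wrapper around http_get() with exponential backoff (sketch only).
 */
function http_get_with_retries($target, $referer = "", $maxAttempts = 4) {
    $delay = 2; // seconds before the first retry

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $page = http_get($target, $referer);

        // Retry only on transport errors, 429s, and 5xx responses.
        $retryable = !$page["ok"]
            || $page["status_code"] == 429
            || $page["status_code"] >= 500;

        if (!$retryable || $attempt === $maxAttempts) {
            return $page;
        }

        // Honor a numeric Retry-After header if the server sent one.
        $retryAfter = $page["headers"]["retry-after"] ?? null;
        if (is_array($retryAfter)) {
            $retryAfter = end($retryAfter);
        }
        $wait = (is_numeric($retryAfter) && $retryAfter > 0) ? (int)$retryAfter : $delay;

        sleep($wait);
        $delay *= 2; // exponential backoff
    }
}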

Next step: turn this into a crawler

This downloader is the first building block. The next step is to repeatedly call it across discovered links, store HTML, extract data, and avoid revisiting the same pages.
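A stripped-down version of that loop might look like the sketch below. It assumes the http_get() function from this tutorial; the seed URL, page limit, and delay are placeholders, and a real crawler would also check robots.txt, resolve relative links, and persist its queue and results.

<?php
// Simplified crawl loop built on http_get() (sketch only).
$queue    = ["https://potentpages.com/"]; // seed URL(s)
$visited  = [];
$maxPages = 25;

while (!empty($queue) && count($visited) < $maxPages) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue; // don't revisit pages
    }
    $visited[$url] = true;

    $page = http_get($url);
    if (!$page["ok"] || $page["status_code"] !== 200 || empty($page["body"])) {
        continue;
    }

    // Store or extract data from $page["body"] here.

    // Discover new absolute links with DOMDocument and queue them.
    $dom = new DOMDocument();
    @$dom->loadHTML($page["body"]); // suppress warnings from messy real-world HTML
    foreach ($dom->getElementsByTagName("a") as $a) {
        $href = $a->getAttribute("href");
        if (strpos($href, "http") === 0 && !isset($visited[$href])) {
            $queue[] = $href; // relative URLs would need resolving against $url
        }
    }

    sleep(1); // throttle between requests
}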

FAQ: Downloading webpages with PHP & cURL

These are common questions people ask when building PHP web crawlers and download scripts.

How do I download HTML from a URL in PHP?

Use PHP’s cURL extension to send an HTTP GET request and return the response body. cURL is preferred over file_get_contents() when you need status codes, headers, cookies, or timeouts.

How can I get the HTTP status code with cURL in PHP?

After curl_exec(), call curl_getinfo($ch) and read http_code. This tutorial’s function returns status_code directly.
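If you aren't using the function from this tutorial, the raw calls look roughly like this (the URL is illustrative):

<?php
$ch = curl_init("https://potentpages.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
$statusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE); // e.g. 200, 404, 429
curl_close($ch);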

How do I follow redirects (301/302) in PHP cURL?

Enable CURLOPT_FOLLOWLOCATION and set CURLOPT_MAXREDIRS to prevent loops. Also capture the final URL via curl_getinfo().
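In option form, that's roughly the following (the URL is illustrative):

<?php
$ch = curl_init("https://potentpages.com/some-moved-page");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 301/302 chains
curl_setopt($ch, CURLOPT_MAXREDIRS, 6);         // cap redirects to avoid loops
$body = curl_exec($ch);
$finalUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // URL after redirects
curl_close($ch);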

Why is my cURL request blocked (403 or 429)?

Many sites rate-limit or block automated requests. Use a clear user agent, slow down your crawl, implement retries with backoff, and confirm you’re allowed to access the pages.

When should I use Selenium instead of cURL?

Use Selenium when the data you need is rendered by JavaScript after the initial HTML loads (or when you need to interact with the page like clicking or scrolling).

Want a crawler that stays running?

If you need ongoing collection, monitoring, and structured delivery, we build managed web crawlers for teams that need reliability.

David Selden-Treiman, Director of Operations at Potent Pages.

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has solved problems with custom programming for dozens of clients, and he manages and optimizes dozens of servers for both Potent Pages and other clients.

