How to Download a Webpage using PHP and cURL

Downloading a Webpage using PHP and cURL

May 24, 2018 | By David Selden-Treiman | Filed in: php.

How to download a webpage using PHP and cURL: an easy-to-use function.

There is a wide range of reasons to download webpages. From parsing and storing information, to checking the status of pages, to analyzing the link structure of a website, web crawlers are quite useful.

In this tutorial, we will be creating a function that will allow you to easily download a webpage using the HTTP GET method (i.e., the way browsers request pages by default). You can then call this function repeatedly to download multiple webpages.

A Preview

When we’re done, your script should look like:

<?php
/**
* Download a Webpage via the HTTP GET Protocol using libcurl
* Developed By: Potent Pages, LLC (https://potentpages.com/)
*/
function _http ( $target, $referer ) {
    //Initialize Handle
    $handle = curl_init();

    //Define Settings
    curl_setopt ( $handle, CURLOPT_HTTPGET, true );
    curl_setopt ( $handle, CURLOPT_HEADER, true );
    curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
    curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookie_jar.txt" );
    curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
    curl_setopt ( $handle, CURLOPT_URL, $target );
    curl_setopt ( $handle, CURLOPT_REFERER, $referer );
    curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
    curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );

    //Execute Request
    $output = curl_exec ( $handle );

    //Close cURL handle
    curl_close ( $handle );

    //Separate Header and Body
    $separator = "\r\n\r\n";
    $header = substr( $output, 0, strpos( $output, $separator ) );
    $body_start = strlen( $header ) + strlen( $separator );
    $body = substr( $output, $body_start, strlen( $output ) - $body_start );

    //Parse Headers
    $header_array = Array();
    foreach ( explode ( "\r\n", $header ) as $i => $line ) {
        if($i === 0) {
            $header_array['http_code'] = $line;
            $status_info = explode( " ", $line );
            $header_array['status_info'] = $status_info;
        } else {
            //Limit the split to 2 parts so values containing ": " stay intact
            list ( $key, $value ) = explode ( ': ', $line, 2 );
            $header_array[$key] = $value;
        }
    }

    //Form Return Structure
    $ret = Array("headers" => $header_array, "body" => $body );
    return $ret;
}

$page = _http( "https://potentpages.com", "" );
$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];

Creating a Function to Download Webpages

The first step to building our simple crawler will be to create a function that we can use to download webpages. We will be using the cURL library to make HTTP GET requests, loading our pages into memory. In general, we will initialize a cURL handle, define some settings, and make the actual cURL call that will download the page. We will then return the headers and contents of the page.

Starting Our Function:

First up, we will need to define our input parameters. We will need the location that we want to download, $target, and the referring URL that our link came from, $referer.

function _http ( $target, $referer ) {
}

Initializing the cURL Object:

Next, within our function we need to initialize our cURL handle. To do this, we will call the curl_init function and store the returned handle in the $handle variable.

$handle = curl_init();

Defining Some Settings:

Now, we need to define some settings. This will be done by calling the curl_setopt function. First, we will set our request to use the HTTP GET method.

curl_setopt ( $handle, CURLOPT_HTTPGET, true );

We will need to include the response headers in the output, in case we follow a link to a page that returns a status other than 200 OK, such as 404 Not Found, a 301/302 redirect, or a 5xx server error.

curl_setopt ( $handle, CURLOPT_HEADER, true );

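To help with traversing the site, we will want to set up the cookie jar and cookie file settings. CURLOPT_COOKIEJAR names the file that cURL writes cookies to when the handle is closed, and CURLOPT_COOKIEFILE names the file it reads cookies from before the request. Cookies carry over from one call to the next only when both settings point to the same file.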

curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookie_jar.txt" );

We will need to define your user agent next. This is what identifies your crawler to a web server. You can set it to whatever you want, but I recommend setting it to something that identifies you or your project.

curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
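
If you want the user agent to point site owners back to you, one common convention (not a requirement) is a product name and version followed by a contact URL in parentheses. Here is a minimal sketch; the helper name build_user_agent(), the version number, and the example.com URL are all placeholders invented for this illustration.

```php
<?php
/**
 * Build a descriptive user agent string such as "my-crawler/1.0 (+https://...)".
 * Hypothetical helper; not part of the tutorial's _http() function.
 */
function build_user_agent ( $project, $version, $contact_url ) {
    // Replace whitespace in the product name so the token stays easy to parse
    $product = preg_replace( '/\s+/', '-', trim( $project ) );
    return $product . "/" . $version . " (+" . $contact_url . ")";
}

$user_agent = build_user_agent( "web-crawler-tutorial-test", "1.0", "https://example.com/crawler-info" );
// Pass $user_agent to curl_setopt ( $handle, CURLOPT_USERAGENT, $user_agent );
```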

Next, we need to set the URL that we want to download.

curl_setopt ( $handle, CURLOPT_URL, $target );

Now, we set the referring URL.

curl_setopt ( $handle, CURLOPT_REFERER, $referer );

If the server responds with a 3xx redirect, we want cURL to follow it automatically, up to a maximum of four redirects, rather than handling it ourselves.

curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );

Finally for our settings, we tell cURL to return the response as a string from curl_exec instead of printing it directly.

curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );

Now, we can execute the request:

$output = curl_exec ( $handle );
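
Keep in mind that curl_exec returns false when the transfer itself fails (for example a DNS failure, a refused connection, or a timeout), in which case there is nothing to parse. A minimal sketch of how you might check for that is shown here. The wrapper name fetch_or_error() and the timeout values are assumptions made for this illustration and are not part of the function we are building.

```php
<?php
/**
 * Fetch a URL and report transport-level failures instead of returning false.
 * Hypothetical wrapper; the timeout values are illustrative only.
 */
function fetch_or_error ( $target ) {
    $handle = curl_init();
    curl_setopt ( $handle, CURLOPT_URL, $target );
    curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
    // Limits so that an unresponsive host cannot hang the script
    curl_setopt ( $handle, CURLOPT_CONNECTTIMEOUT, 5 );
    curl_setopt ( $handle, CURLOPT_TIMEOUT, 15 );

    $output = curl_exec ( $handle );
    if ( $output === false ) {
        // Read the error details while the handle still exists
        $result = Array( "ok" => false, "errno" => curl_errno ( $handle ), "error" => curl_error ( $handle ) );
    } else {
        $result = Array( "ok" => true, "output" => $output );
    }

    // Release the handle
    unset ( $handle );
    return $result;
}
```

A caller can then test $result['ok'] before trying to split the headers from the body.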

Once we have our data, we can close our cURL handle.

curl_close ( $handle );

Now that we have our page, we need to separate the header and the body of the response. The two sections are separated by a blank line (a “\r\n\r\n” string), so we can separate them by identifying the location of the first instance of that string.

$separator = "\r\n\r\n";
$header = substr( $output, 0, strpos( $output, $separator ) );
$body_start = strlen( $header ) + strlen( $separator );
$body = substr( $output, $body_start, strlen( $output ) - $body_start );
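
One caveat with splitting at the first blank line: because CURLOPT_FOLLOWLOCATION is enabled, cURL includes the headers of every response in a redirect chain, so the first block may belong to a 301/302 rather than to the final page. A possible workaround is to ask cURL for the total header size with curl_getinfo ( $handle, CURLINFO_HEADER_SIZE ) before the handle is closed, and split at that offset. The sketch below assumes you have that value; the helper name split_response() is made up for this example.

```php
<?php
/**
 * Split a raw cURL response (fetched with CURLOPT_HEADER enabled) into the
 * final header block and the body.
 * $header_size is assumed to come from
 * curl_getinfo ( $handle, CURLINFO_HEADER_SIZE ), read before curl_close.
 */
function split_response ( $raw, $header_size ) {
    // Everything before $header_size is headers, one block per response in the chain
    $all_headers = substr( $raw, 0, $header_size );
    $body = (string) substr( $raw, $header_size );

    // Blocks are separated by blank lines; keep the last non-empty one
    $blocks = array_filter( array_map( 'trim', explode( "\r\n\r\n", $all_headers ) ), 'strlen' );
    $header = $blocks ? end( $blocks ) : "";

    return Array( "header" => $header, "body" => $body );
}
```

The returned header string has the same shape as $header above, so it can be parsed in the same way.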

Next up, we can parse the headers into an associative array. There’s a fair amount of code here, but basically, it parses the header section line by line. The first line is split by " " to get the status information, and every other line is split at the first ": " into [name] and [information].

$header_array = Array();
foreach ( explode ( "\r\n", $header ) as $i => $line ) {
    if($i === 0) {
        $header_array['http_code'] = $line;
        $status_info = explode( " ", $line );
        $header_array['status_info'] = $status_info;
    } else {
        //Limit the split to 2 parts so values containing ": " stay intact
        list ( $key, $value ) = explode ( ': ', $line, 2 );
        $header_array[$key] = $value;
    }
}
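
The loop above keeps only the last value when a header appears more than once (Set-Cookie commonly does), and it matches header names case-sensitively even though HTTP treats them as case-insensitive. If that matters for your crawler, one possible variation is sketched here: it lowercases the names and collects repeated headers into arrays. The function name parse_headers() is hypothetical and is not used elsewhere in this tutorial.

```php
<?php
/**
 * Parse a single header block into an array.
 * Header names are lowercased; repeated headers become arrays of values.
 */
function parse_headers ( $header ) {
    $header_array = Array();
    foreach ( explode( "\r\n", $header ) as $i => $line ) {
        if ( $i === 0 ) {
            // Status line, e.g. "HTTP/1.1 200 OK"
            $header_array['http_code'] = $line;
            $header_array['status_info'] = explode( " ", $line );
            continue;
        }
        // Skip anything that is not in "name: value" form
        if ( strpos( $line, ':' ) === false ) {
            continue;
        }
        list ( $key, $value ) = explode( ':', $line, 2 );
        $key = strtolower( trim( $key ) );
        $value = trim( $value );
        if ( !isset( $header_array[$key] ) ) {
            $header_array[$key] = $value;
        } elseif ( is_array( $header_array[$key] ) ) {
            $header_array[$key][] = $value;
        } else {
            // Second occurrence: turn the single value into a list
            $header_array[$key] = Array( $header_array[$key], $value );
        }
    }
    return $header_array;
}
```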

Finally for this function, we package the headers and the body into an array and return it:

$ret = Array("headers" => $header_array, "body" => $body );
return $ret;

The Final Function

In the end, your function should look like this:

<?php
/**
* Download a Webpage via the HTTP GET Protocol using libcurl
* Developed By: Potent Pages, LLC (https://potentpages.com/)
*/
function _http ( $target, $referer ) {
    //Initialize Handle
    $handle = curl_init();

    //Define Settings
    curl_setopt ( $handle, CURLOPT_HTTPGET, true );
    curl_setopt ( $handle, CURLOPT_HEADER, true );
    curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
    curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookie_jar.txt" );
    curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
    curl_setopt ( $handle, CURLOPT_URL, $target );
    curl_setopt ( $handle, CURLOPT_REFERER, $referer );
    curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
    curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );

    //Execute Request
    $output = curl_exec ( $handle );

    //Close cURL handle
    curl_close ( $handle );

    //Separate Header and Body
    $separator = "\r\n\r\n";
    $header = substr( $output, 0, strpos( $output, $separator ) );
    $body_start = strlen( $header ) + strlen( $separator );
    $body = substr( $output, $body_start, strlen( $output ) - $body_start );

    //Parse Headers
    $header_array = Array();
    foreach ( explode ( "\r\n", $header ) as $i => $line ) {
        if($i === 0) {
            $header_array['http_code'] = $line;
            $status_info = explode( " ", $line );
            $header_array['status_info'] = $status_info;
        } else {
            //Limit the split to 2 parts so values containing ": " stay intact
            list ( $key, $value ) = explode ( ': ', $line, 2 );
            $header_array[$key] = $value;
        }
    }

    //Form Return Structure
    $ret = Array("headers" => $header_array, "body" => $body );
    return $ret;
}

Calling our Webpage Download Function

To download a page, call the _http() function. Provide the URL you want to download in the first parameter, and the referring URL in the second parameter (you can pass an empty string if there is none). For example:

$page = _http( "https://potentpages.com", "" );

This returns the body of the page, as well as the HTTP headers:

$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];

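You’ll know that your page was successfully downloaded if $http_status_code, which holds the full status line (for example “HTTP/1.1 200 OK”), contains the code 200. The numeric code on its own is available in $headers['status_info'][1].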
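
Since $headers['http_code'] holds the whole status line rather than a bare number, a small helper can make the success check explicit. This is only a sketch; the names status_code() and is_success() are invented for the example, and treating any 2xx code as success is an assumption you may want to narrow to 200.

```php
<?php
/**
 * Extract the numeric status code from a status line such as
 * "HTTP/1.1 200 OK" or "HTTP/2 200". Returns 0 if none is found.
 */
function status_code ( $status_line ) {
    if ( preg_match( '#^HTTP/\S+\s+(\d{3})#', trim( $status_line ), $matches ) ) {
        return (int) $matches[1];
    }
    return 0;
}

/**
 * Treat any 2xx status as a successful download.
 */
function is_success ( $status_line ) {
    $code = status_code( $status_line );
    return $code >= 200 && $code < 300;
}
```

With these in place, is_success( $headers['http_code'] ) tells you whether the body is worth processing.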

The Script Code

Here’s the code for the script, in full:

<?php
/**
* Download a Webpage via the HTTP GET Protocol using libcurl
* Developed By: Potent Pages, LLC (https://potentpages.com/)
*/
function _http ( $target, $referer ) {
    //Initialize Handle
    $handle = curl_init();

    //Define Settings
    curl_setopt ( $handle, CURLOPT_HTTPGET, true );
    curl_setopt ( $handle, CURLOPT_HEADER, true );
    curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
    curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookie_jar.txt" );
    curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
    curl_setopt ( $handle, CURLOPT_URL, $target );
    curl_setopt ( $handle, CURLOPT_REFERER, $referer );
    curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
    curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );

    //Execute Request
    $output = curl_exec ( $handle );

    //Close cURL handle
    curl_close ( $handle );

    //Separate Header and Body
    $separator = "\r\n\r\n";
    $header = substr( $output, 0, strpos( $output, $separator ) );
    $body_start = strlen( $header ) + strlen( $separator );
    $body = substr( $output, $body_start, strlen( $output ) - $body_start );

    //Parse Headers
    $header_array = Array();
    foreach ( explode ( "\r\n", $header ) as $i => $line ) {
        if($i === 0) {
            $header_array['http_code'] = $line;
            $status_info = explode( " ", $line );
            $header_array['status_info'] = $status_info;
        } else {
            //Limit the split to 2 parts so values containing ": " stay intact
            list ( $key, $value ) = explode ( ': ', $line, 2 );
            $header_array[$key] = $value;
        }
    }

    //Form Return Structure
    $ret = Array("headers" => $header_array, "body" => $body );
    return $ret;
}

$page = _http( "https://potentpages.com", "" );
$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];

Expanding the Crawler

If you want to expand your web crawler a bit, we use the function in our next tutorial to download an entire website: Creating a Simple Website Crawler.

See you there!

David Selden-Treiman, Director of Operations at Potent Pages.

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive expertise solving problems using programming for dozens of clients. He also has extensive experience managing and optimizing servers, managing dozens of servers for both Potent Pages and other clients.

