Downloading a Webpage using PHP and cURL
May 24, 2018 | By David Selden-Treiman | Filed in: php
How to download a webpage using PHP and cURL: an easy-to-use function.
There are a wide range of reasons to download webpages. From parsing and storing information, to checking the status of pages, to analyzing the link structure of a website, web crawlers are quite useful.
In this tutorial, we will be creating a function that will allow you to easily download a webpage using an HTTP GET request (the way browsers request pages by default). You can then call this function repeatedly to download multiple webpages.
A Preview
When we’re done, your script should look like:
<?php
/**
* Download a Webpage via an HTTP GET Request using libcurl
* Developed By: Potent Pages, LLC (https://potentpages.com/)
*/
function _http ( $target, $referer ) {
	//Initialize Handle
	$handle = curl_init();
	//Define Settings
	curl_setopt ( $handle, CURLOPT_HTTPGET, true );
	curl_setopt ( $handle, CURLOPT_HEADER, true );
	curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
	curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
	curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
	curl_setopt ( $handle, CURLOPT_URL, $target );
	curl_setopt ( $handle, CURLOPT_REFERER, $referer );
	curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
	curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
	//Execute Request
	$output = curl_exec ( $handle );
	if ( $output === false ) {
		//Request failed entirely (DNS error, timeout, etc.)
		curl_close ( $handle );
		return false;
	}
	//Read the total header size before closing the handle
	$header_size = curl_getinfo ( $handle, CURLINFO_HEADER_SIZE );
	//Close cURL handle
	curl_close ( $handle );
	//Separate Header and Body
	$raw_headers = substr ( $output, 0, $header_size );
	$body = substr ( $output, $header_size );
	//Each followed redirect adds its own header block; keep the final one
	$header_blocks = explode ( "\r\n\r\n", trim ( $raw_headers ) );
	$header = end ( $header_blocks );
	//Parse Headers
	$header_array = array();
	foreach ( explode ( "\r\n", $header ) as $i => $line ) {
		if ( $i === 0 ) {
			//First line is the status line, e.g. "HTTP/1.1 200 OK"
			$header_array['http_code'] = $line;
			$header_array['status_info'] = explode ( " ", $line );
		} elseif ( strpos ( $line, ': ' ) !== false ) {
			//Limit the explode so values containing ": " stay intact
			list ( $key, $value ) = explode ( ': ', $line, 2 );
			$header_array[$key] = $value;
		}
	}
	//Form Return Structure
	$ret = array ( "headers" => $header_array, "body" => $body );
	return $ret;
}
$page = _http( "https://potentpages.com", "" );
$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];
Creating a Function to Download Webpages
The first step to building our simple crawler will be to create a function that we can use to download webpages. We will be using the cURL library to make HTTP GET calls, loading our pages into memory. In general, we will initialize our cURL object, define some settings, and make the actual cURL call that will download the page. We will then return the page contents.
Starting Our Function:
First up, we will need to define our input parameters. We will need the URL that we want to download, $target; and the referring URL that our link came from, $referer.
function _http ( $target, $referer ) {
}
Initializing the cURL Object:
Next, within our function we need to initialize our cURL object. To do this, we will call the curl_init function and store the returned handle in the $handle variable.
$handle = curl_init();
Defining Some Settings:
Now, we need to define some settings. This will be done by calling the curl_setopt function. We will need to set our request to use the HTTP GET method.
curl_setopt ( $handle, CURLOPT_HTTPGET, true );
We will need to include the header of the page in the output, in case we follow a link to a page that returns a status other than 200 OK, like a 404 Not Found, a 301/302 redirect, or a 50x server error.
curl_setopt ( $handle, CURLOPT_HEADER, true );
To help with traversing the site, we will want to set up the cookie jar and cookie file settings. These store the cookies that servers set in a file, and send them back on subsequent requests.
curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
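One caveat: CURLOPT_COOKIEJAR is the file cookies are written to when the handle closes, while CURLOPT_COOKIEFILE is the file cookies are read from, so with two different file names the cookies saved in one run will not be sent in the next. If you want cookies to persist across runs, a small sketch (the path here is just an example) is to point both options at the same absolute path:
//Optional: use one absolute path for both reading and writing cookies
$cookie_path = __DIR__ . "/cookies.txt";
curl_setopt ( $handle, CURLOPT_COOKIEJAR, $cookie_path );
curl_setopt ( $handle, CURLOPT_COOKIEFILE, $cookie_path );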
We will need to define our user agent next. This is what identifies your crawler to a web server. You can set it to whatever you want, but I recommend setting it to something that identifies you or your project.
curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
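For instance, a more descriptive user agent might name the project, give it a version, and include a URL where site owners can learn about your crawler (the URL here is hypothetical):
curl_setopt ( $handle, CURLOPT_USERAGENT, "WebCrawlerTutorial/1.0 (+https://example.com/crawler-info)" );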
Next, we need to set the URL that we want to download.
curl_setopt ( $handle, CURLOPT_URL, $target );
Now, we set the referring URL.
curl_setopt ( $handle, CURLOPT_REFERER, $referer );
In the event that we receive a 30x redirect code, we will want cURL to follow it automatically instead of having to handle it ourselves, up to a maximum of 4 redirects.
curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
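If you later want to know where the redirects ended up, cURL can report the final URL. As a sketch, this call would go after curl_exec and before curl_close:
//The URL the request finally landed on, after any redirects
$final_url = curl_getinfo ( $handle, CURLINFO_EFFECTIVE_URL );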
Finally for our settings, we will want curl_exec to return the transfer as a string instead of printing it directly.
curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
Now, we can execute our request. Note that curl_exec returns false if the request fails entirely (e.g. a DNS error or a timeout), so it is worth checking for that.
$output = curl_exec ( $handle );
if ( $output === false ) {
	//Request failed entirely (DNS error, timeout, etc.)
	curl_close ( $handle );
	return false;
}
Before closing the handle, we also ask cURL how large the header section is; we will need this to separate the headers from the body. Once we have our data, we can close our cURL handle.
$header_size = curl_getinfo ( $handle, CURLINFO_HEADER_SIZE );
curl_close ( $handle );
Now that we have our page, we need to separate the headers and the body of the response. cURL told us exactly how many bytes of headers it returned via CURLINFO_HEADER_SIZE, so we can split at that offset. One subtlety: because we enabled CURLOPT_FOLLOWLOCATION, each followed redirect contributes its own header block (each block ends with a “\r\n\r\n” string), so we keep only the final block, which describes the page we actually received.
$raw_headers = substr ( $output, 0, $header_size );
$body = substr ( $output, $header_size );
//Each followed redirect adds its own header block; keep the final one
$header_blocks = explode ( "\r\n\r\n", trim ( $raw_headers ) );
$header = end ( $header_blocks );
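To make the layout concrete, here is the rough shape of $output when one redirect was followed (the header names and values are illustrative; each line ends with “\r\n”, and each block with “\r\n\r\n”):
/*
HTTP/1.1 301 Moved Permanently
Location: https://example.com/new-page
...more headers...

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
...more headers...

<!DOCTYPE html>... (the body starts here)
*/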
Next up, we can parse the headers into an associative array. There’s a fair amount of code here, but basically, it parses the header section line by line: the first line is the status line, which we split by " " to get the status information, and every other line has the form [name]: [value], which we split on the first ": " to get the rest of the data.
$header_array = array();
foreach ( explode ( "\r\n", $header ) as $i => $line ) {
	if ( $i === 0 ) {
		//First line is the status line, e.g. "HTTP/1.1 200 OK"
		$header_array['http_code'] = $line;
		$header_array['status_info'] = explode ( " ", $line );
	} elseif ( strpos ( $line, ': ' ) !== false ) {
		//Limit the explode so values containing ": " stay intact
		list ( $key, $value ) = explode ( ': ', $line, 2 );
		$header_array[$key] = $value;
	}
}
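If you print_r the resulting array for a typical page, you will get something along these lines (the values are illustrative):
print_r ( $header_array );
/*
Array
(
    [http_code] => HTTP/1.1 200 OK
    [status_info] => Array ( [0] => HTTP/1.1 [1] => 200 [2] => OK )
    [Content-Type] => text/html; charset=UTF-8
    [Content-Length] => 12345
)
*/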
Finally for this function, we will assemble the headers and the body into an array and return it outside of the function:
$ret = array ( "headers" => $header_array, "body" => $body );
return $ret;
The Final Function
In the end, your function should look like this:
<?php
/**
* Download a Webpage via an HTTP GET Request using libcurl
* Developed By: Potent Pages, LLC (https://potentpages.com/)
*/
function _http ( $target, $referer ) {
	//Initialize Handle
	$handle = curl_init();
	//Define Settings
	curl_setopt ( $handle, CURLOPT_HTTPGET, true );
	curl_setopt ( $handle, CURLOPT_HEADER, true );
	curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
	curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
	curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
	curl_setopt ( $handle, CURLOPT_URL, $target );
	curl_setopt ( $handle, CURLOPT_REFERER, $referer );
	curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
	curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
	//Execute Request
	$output = curl_exec ( $handle );
	if ( $output === false ) {
		//Request failed entirely (DNS error, timeout, etc.)
		curl_close ( $handle );
		return false;
	}
	//Read the total header size before closing the handle
	$header_size = curl_getinfo ( $handle, CURLINFO_HEADER_SIZE );
	//Close cURL handle
	curl_close ( $handle );
	//Separate Header and Body
	$raw_headers = substr ( $output, 0, $header_size );
	$body = substr ( $output, $header_size );
	//Each followed redirect adds its own header block; keep the final one
	$header_blocks = explode ( "\r\n\r\n", trim ( $raw_headers ) );
	$header = end ( $header_blocks );
	//Parse Headers
	$header_array = array();
	foreach ( explode ( "\r\n", $header ) as $i => $line ) {
		if ( $i === 0 ) {
			//First line is the status line, e.g. "HTTP/1.1 200 OK"
			$header_array['http_code'] = $line;
			$header_array['status_info'] = explode ( " ", $line );
		} elseif ( strpos ( $line, ': ' ) !== false ) {
			//Limit the explode so values containing ": " stay intact
			list ( $key, $value ) = explode ( ': ', $line, 2 );
			$header_array[$key] = $value;
		}
	}
	//Form Return Structure
	$ret = array ( "headers" => $header_array, "body" => $body );
	return $ret;
}
Calling our Webpage Download Function
To download a page, just call the _http() function. Provide the URL you want to download in the first parameter, and then the referring URL in the second parameter (you can pass an empty string if there is no referrer). For example:
$page = _http( "https://potentpages.com", "" );
This will provide you with the body of the page, as well as the HTTP headers.
$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];
You’ll know that your page was successfully downloaded if the status line in $http_status_code contains a “200” code. The full line will look like “HTTP/1.1 200 OK”; the numeric code on its own is available in $headers['status_info'][1].
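For example, here is a minimal sketch that checks the numeric status code and downloads a few pages in a row (the URL list is hypothetical, and the sleep call is just a politeness delay between requests):
$urls = array ( "https://potentpages.com", "https://potentpages.com/tutorials" );
foreach ( $urls as $url ) {
	$page = _http( $url, "" );
	if ( $page !== false && $page['headers']['status_info'][1] === "200" ) {
		echo $url . " => downloaded " . strlen( $page['body'] ) . " bytes\n";
	} else {
		echo $url . " => failed\n";
	}
	sleep( 1 ); //Politeness delay between requests
}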
The Script Code
Here’s the code for the script, in full:
<?php
/**
* Download a Webpage via an HTTP GET Request using libcurl
* Developed By: Potent Pages, LLC (https://potentpages.com/)
*/
function _http ( $target, $referer ) {
	//Initialize Handle
	$handle = curl_init();
	//Define Settings
	curl_setopt ( $handle, CURLOPT_HTTPGET, true );
	curl_setopt ( $handle, CURLOPT_HEADER, true );
	curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
	curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
	curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
	curl_setopt ( $handle, CURLOPT_URL, $target );
	curl_setopt ( $handle, CURLOPT_REFERER, $referer );
	curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
	curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
	//Execute Request
	$output = curl_exec ( $handle );
	if ( $output === false ) {
		//Request failed entirely (DNS error, timeout, etc.)
		curl_close ( $handle );
		return false;
	}
	//Read the total header size before closing the handle
	$header_size = curl_getinfo ( $handle, CURLINFO_HEADER_SIZE );
	//Close cURL handle
	curl_close ( $handle );
	//Separate Header and Body
	$raw_headers = substr ( $output, 0, $header_size );
	$body = substr ( $output, $header_size );
	//Each followed redirect adds its own header block; keep the final one
	$header_blocks = explode ( "\r\n\r\n", trim ( $raw_headers ) );
	$header = end ( $header_blocks );
	//Parse Headers
	$header_array = array();
	foreach ( explode ( "\r\n", $header ) as $i => $line ) {
		if ( $i === 0 ) {
			//First line is the status line, e.g. "HTTP/1.1 200 OK"
			$header_array['http_code'] = $line;
			$header_array['status_info'] = explode ( " ", $line );
		} elseif ( strpos ( $line, ': ' ) !== false ) {
			//Limit the explode so values containing ": " stay intact
			list ( $key, $value ) = explode ( ': ', $line, 2 );
			$header_array[$key] = $value;
		}
	}
	//Form Return Structure
	$ret = array ( "headers" => $header_array, "body" => $body );
	return $ret;
}
$page = _http( "https://potentpages.com", "" );
$headers = $page['headers'];
$http_status_code = $headers['http_code'];
$body = $page['body'];
Expanding the Crawler
If you want to expand your web crawler a bit, our next tutorial uses this function to download an entire website: Creating a Simple Website Crawler.
See you there!
David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive experience solving problems with programming for dozens of clients, as well as managing and optimizing servers for both Potent Pages and other clients.