Downloading a Webpage using PHP and cURL
May 24, 2018 | By admin | Filed in: php.
How to download a webpage using PHP and cURL: an easy-to-use function.
There is a wide range of reasons to download webpages. From parsing and storing information, to checking the status of pages, to analyzing the link structure of a website, web crawlers are quite useful.
In this tutorial, we will create a function that lets you easily download a webpage using the HTTP GET method (i.e., the way browsers request pages by default). You can then call this function repeatedly to download multiple webpages.
When we’re done, your script should look like:
Creating a Function to Download Webpages
The first step to building our simple crawler will be to create a function that we can use to download webpages. We will be using the cURL library to make HTTP GET calls, loading our pages into memory. In general, we will initialize our cURL object, define some settings, and make the actual cURL call that will download the page. We will then return the page contents.
Starting Our Function:
First up, we will need to define our input parameters. We will need two: the location that we want to download, $target, and the referring URL that our link came from.
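As a rough sketch, the empty function could start out like this. The name _http() and the $target parameter come from this tutorial. The name $referer for the second parameter and its empty default are assumptions made for illustration.

```php
<?php
// Skeleton only: the body gets filled in over the following steps.
// $referer is an assumed parameter name; its default of '' is also an assumption.
function _http($target, $referer = '')
{
    // cURL setup, download, and parsing will go here.
    return null;
}
```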
Initializing the cURL Object:
Next, within our function we need to initialize our cURL object. To do this, we will call the curl_init() function and store the returned handle in the $handle variable.
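A minimal sketch of this step, shown outside the function so it runs on its own. The error handling is an addition for illustration.

```php
<?php
// Create a new cURL handle; no request is sent yet.
$handle = curl_init();

if ($handle === false) {
    // curl_init() returns false when the handle cannot be created.
    exit("Could not initialize cURL\n");
}
```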
Defining Some Settings:
Now, we need to define some settings. This is done by calling the curl_setopt() function. First, we will set our request to use the HTTP GET method.
We will need to include the response headers in the output, so that we can detect pages that return a status other than 200 OK, such as 404 Not Found, a 301/302 redirect, or a 5xx server error.
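For example, the GET setting might look like the sketch below. GET is already cURL's default method, so setting it explicitly mostly documents the intent.

```php
<?php
$handle = curl_init();

// Ask cURL to use the GET method for the request.
$ok = curl_setopt($handle, CURLOPT_HTTPGET, true);
```

curl_setopt() returns true when the option was accepted, which is why the result is captured in $ok here.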
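A sketch of the header setting, again on a fresh handle so the snippet stands alone:

```php
<?php
$handle = curl_init();

// Include the response headers at the start of the returned output.
$ok = curl_setopt($handle, CURLOPT_HEADER, true);
```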
To help with traversing the site, we will want to set up the cookie jar and cookie file settings. These will store your cookies in a file.
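One way to set this up is sketched below. The cookie file location is an assumption; any path your script can write to should work.

```php
<?php
$handle = curl_init();

// Assumed location for the cookie file; any writable path would do.
$cookie_file = sys_get_temp_dir() . '/cookies.txt';

// COOKIEJAR: the file cURL writes cookies to when the handle is closed.
// COOKIEFILE: the file cURL reads cookies from when building requests.
$ok_jar  = curl_setopt($handle, CURLOPT_COOKIEJAR, $cookie_file);
$ok_file = curl_setopt($handle, CURLOPT_COOKIEFILE, $cookie_file);
```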
We will need to define your user agent next. This is what identifies your crawler to a web server. You can set it to whatever you want, but I recommend setting it to something that identifies you or your project.
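As an illustration, with a made-up user agent string that you would replace with your own project's name and contact page:

```php
<?php
$handle = curl_init();

// Hypothetical user agent: replace it with something that identifies your project.
$user_agent = 'ExampleCrawler/1.0 (+https://www.example.com/about-this-crawler)';

$ok = curl_setopt($handle, CURLOPT_USERAGENT, $user_agent);
```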
Next, we need to set the URL that we want to download.
Now, we set the referring URL.
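These two settings could be sketched as follows. Inside the function the values would come from the parameters; the example.com URLs and the $referer variable name are placeholders.

```php
<?php
$handle = curl_init();

// Placeholder values; in the function these arrive as parameters.
$target  = 'https://www.example.com/page.html';
$referer = 'https://www.example.com/';

// The page to download.
$ok_url = curl_setopt($handle, CURLOPT_URL, $target);

// The value sent in the "Referer:" request header.
$ok_referer = curl_setopt($handle, CURLOPT_REFERER, $referer);
```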
In the event that we receive a 30x redirect code, we will want cURL to follow it automatically instead of having to handle it ourselves.
As the last setting, we want cURL to return the transferred data as a string instead of printing it directly.
Now, we can start our download by executing the cURL handle.
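A sketch of the redirect setting. The redirect cap is an extra that this tutorial does not mention, included here as an assumption about what a crawler would usually want.

```php
<?php
$handle = curl_init();

// Follow "Location:" headers sent with 3xx responses automatically.
$ok_follow = curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);

// Assumption, not from the tutorial: cap the number of redirects to avoid loops.
$ok_max = curl_setopt($handle, CURLOPT_MAXREDIRS, 10);
```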
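In code, that might look like this:

```php
<?php
$handle = curl_init();

// Make curl_exec() return the response as a string instead of echoing it.
$ok = curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
```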
Once we have our data, we can close our cURL handle.
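Running and closing the handle could be sketched like this. The URL is a placeholder, and the timeout and error message are additions for illustration.

```php
<?php
$handle = curl_init();
curl_setopt($handle, CURLOPT_URL, 'https://www.example.com/'); // placeholder URL
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_TIMEOUT, 10); // assumed limit so a dead network fails fast

// Run the transfer. Returns the response as a string, or false on failure.
$response = curl_exec($handle);

if ($response === false) {
    // Read the error message before closing the handle.
    echo 'Download failed: ' . curl_error($handle) . "\n";
}

// Release the handle. Since PHP 8.0 this call has no effect,
// because handles are freed automatically.
curl_close($handle);
```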
Now that we have our page, we need to separate the headers and the body of the response. The two sections are separated by a “\r\n\r\n” string, so we can split them at the first instance of that string.
Next up, we can parse the headers into an associative array. There’s a fair amount of code here, but basically, it parses the header section line by line: the first line is split on spaces (“ ”) to get the status information, and every other line is split as [name]: [information] to get the rest of the data.
Finally for this function, we will assemble our data and return it to the caller.
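The split can be tried out on a sample string, as in the sketch below. One caveat worth knowing: because we include headers and follow redirects, a real response can contain one header block per hop, and the first blank line only ends the first of them. Asking curl_getinfo() for CURLINFO_HEADER_SIZE, which reports the combined length of all header blocks, may be a more robust split point.

```php
<?php
// Sample response text standing in for the output of curl_exec().
$response = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><body>Hello</body></html>";

// Find the first blank line, which ends the header section.
$separator = "\r\n\r\n";
$position  = strpos($response, $separator);

if ($position === false) {
    // No separator found: treat everything as body.
    $header_text = '';
    $body        = $response;
} else {
    $header_text = substr($response, 0, $position);
    $body        = substr($response, $position + strlen($separator));
}
```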
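Here is one possible version of that parsing, wrapped in a helper so it can be tried on its own. The helper name parse_header_text() and the array key names are hypothetical. Header names are lowercased, and a header that appears more than once (such as Set-Cookie) keeps only its last value in this simplified sketch.

```php
<?php
// Hypothetical helper name and array keys, chosen for illustration.
function parse_header_text($header_text)
{
    $headers = array();
    $lines   = explode("\r\n", trim($header_text));

    foreach ($lines as $index => $line) {
        if ($index === 0) {
            // Status line, e.g. "HTTP/1.1 404 Not Found": split on spaces.
            $parts = explode(' ', $line, 3);
            $headers['http_version']     = $parts[0];
            $headers['http_status_code'] = isset($parts[1]) ? (int) $parts[1] : 0;
            $headers['http_status_text'] = isset($parts[2]) ? $parts[2] : '';
            continue;
        }

        // Remaining lines look like "Name: value".
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $headers[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }

    return $headers;
}

$headers = parse_header_text("HTTP/1.1 404 Not Found\r\nContent-Type: text/html\r\nServer: demo");
```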
The Final Function
In the end, your function should look like this:
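The original listing is not reproduced here, so what follows is a reconstruction from the steps described above, not the author's exact code. The $referer parameter name, the returned array keys, the cookie path, and the user agent string are all assumptions. It also swaps in one technique: it splits headers from body using the header size reported by curl_getinfo() rather than searching for the first “\r\n\r\n”, so that redirects with several header blocks are handled.

```php
<?php
// Reconstruction for illustration. Parameter names, array keys, cookie path,
// and user agent are assumptions, not the original tutorial code.
function _http($target, $referer = '')
{
    $result = array('http_status_code' => 0, 'headers' => array(), 'body' => '');

    $handle = curl_init();
    if ($handle === false) {
        return $result;
    }

    $cookie_file = sys_get_temp_dir() . '/cookies.txt';

    curl_setopt($handle, CURLOPT_HTTPGET, true);
    curl_setopt($handle, CURLOPT_HEADER, true);
    curl_setopt($handle, CURLOPT_COOKIEJAR, $cookie_file);
    curl_setopt($handle, CURLOPT_COOKIEFILE, $cookie_file);
    curl_setopt($handle, CURLOPT_USERAGENT, 'ExampleCrawler/1.0');
    curl_setopt($handle, CURLOPT_URL, $target);
    curl_setopt($handle, CURLOPT_REFERER, $referer);
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);

    $response    = curl_exec($handle);
    // Combined length of every header block received, including redirects.
    $header_size = curl_getinfo($handle, CURLINFO_HEADER_SIZE);
    curl_close($handle);

    if ($response === false) {
        return $result; // status code stays 0 to signal a failed transfer
    }

    $header_text    = (string) substr($response, 0, $header_size);
    $result['body'] = (string) substr($response, $header_size);

    // Keep only the last header block, which belongs to the final response.
    $blocks = explode("\r\n\r\n", trim($header_text));
    $lines  = explode("\r\n", end($blocks));

    foreach ($lines as $index => $line) {
        if ($index === 0) {
            // Status line, e.g. "HTTP/1.1 200 OK".
            $parts = explode(' ', $line, 3);
            $result['http_status_code'] = isset($parts[1]) ? (int) $parts[1] : 0;
            continue;
        }
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $result['headers'][strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }

    return $result;
}
```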
Calling our Webpage Download Function
To download a page, just call the _http() function. Provide the URL you want to download in the first parameter, and then the referring URL in the second parameter (you can pass an empty string, '', if there is none). For example:
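A call might look like the sketch below. So that the snippet runs on its own, it defines a small stand-in for _http() when the tutorial's full function is not loaded. The stand-in, the array keys, and the example.com URL are assumptions for illustration.

```php
<?php
// Stand-in so this snippet runs on its own; substitute the tutorial's full _http().
// The returned array keys are assumptions.
if (!function_exists('_http')) {
    function _http($target, $referer = '')
    {
        $handle = curl_init($target);
        curl_setopt($handle, CURLOPT_REFERER, $referer);
        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($handle, CURLOPT_TIMEOUT, 10); // assumed limit so failures return quickly
        $body = curl_exec($handle);
        // Status code of the last response, or 0 if no response was received.
        $code = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
        return array('http_status_code' => $code, 'body' => $body === false ? '' : $body);
    }
}

// First parameter: the page to download. Second: the referring URL (may be '').
$page = _http('https://www.example.com/', '');
$http_status_code = $page['http_status_code'];

if ($http_status_code === 200) {
    echo 'Downloaded ' . strlen($page['body']) . " bytes\n";
} else {
    echo "Download failed with status {$http_status_code}\n";
}
```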
This will provide you the data for the page, as well as the HTTP headers.
You’ll know that your page was downloaded successfully if the $http_status_code value is 200.
The Script Code
Here’s the code for the script, in full:
Expanding the Crawler
If you want to expand your web crawler a bit, we use this function in our next tutorial to download an entire website: Creating a Simple Website Crawler.
See you there!