Creating a Polite PHP Web Crawler: Checking robots.txt
May 31, 2018 | By David Selden-Treiman | Filed in: php

Checking a Website’s robots.txt Using PHP
When downloading sites, it’s important to make sure you have permission from the site owner to download pages. That permission is generally indicated by the site’s robots.txt file, which specifies the pages crawlers are not allowed to download. It can also list pages to crawl (although that is mainly the job of the sitemap.xml), request a minimum delay between crawls, and set other options. Rules can also be separated by crawler “user agent” name.
We will be working from our previous PHP web spider tutorial, extending it to check the robots.txt file of a site and make sure our crawler is allowed to access a specified page. To do this, we will be using the Robots.txt Parser Class PHP library. It will allow us to load in a robots.txt file, set our user agent, and check individual pages to see if we are allowed to download them or not.
First up, we’re going to start with our PHP site web crawler from before:
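If you don’t have that script handy, here is a minimal sketch of the kind of starting point this tutorial assumes. The function and variable names (_http(), $to_crawl, $crawled) are assumptions based on the steps described below, not the exact code from the earlier post:

```php
<?php
// A minimal sketch of a basic PHP crawler: a URL queue, a cURL download
// helper, and a loop that fetches pages one at a time.

function _http($url) {
    // Download a URL; return its body and HTTP status code.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "My Example Crawler"); // static string for now
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array($body, $status);
}

$seed_url = "https://www.example.com/";
$to_crawl = array($seed_url);
$crawled  = array();

while (count($to_crawl) > 0) {
    $url = array_shift($to_crawl);
    if (isset($crawled[$url])) {
        continue;
    }
    $crawled[$url] = true;

    list($html, $status) = _http($url);
    if ($status !== 200 || $html === false) {
        continue;
    }

    // Extract same-site links from $html and push them onto $to_crawl here.

    echo "Crawled: " . $url . "\n";
    sleep(1); // be polite between requests
}
```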
Installing the Library
On our server, we’re going to need to install the PHP library. To do this, we will use Composer. If you don’t already have Composer, please install it using these instructions.
After you have composer installed, you’ll need to run the following from your web crawler project’s folder:
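Assuming the library is published on Packagist as t1gor/robots-txt-parser (check the project’s README or its Packagist page for the exact package name for your version), the command looks like this:

```
composer require t1gor/robots-txt-parser
```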
This installs the package that we can call from our site crawler script. Composer will report the library being added to your project, along with any dependencies it needs.
Loading the Library
To load the library, we will need to place the standard Composer autoload line at the top of the script.
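Assuming the vendor/ directory sits alongside the crawler script, that line looks like the following (depending on the library version, you may also need a use statement for the parser’s namespace):

```php
<?php

// Make Composer-installed packages, including the robots.txt parser, available.
require __DIR__ . '/vendor/autoload.php';
```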
We will also need to specify our web crawler’s user agent name in a variable so that we can easily use and change it (we will need it in multiple places now).
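Something like the following works; the bot name and info URL here are placeholders, so substitute your own:

```php
// The user agent string our crawler identifies itself with.
// The name and URL below are placeholders; use your own.
$user_agent = "MyCrawler/1.0 (+https://www.example.com/bot-info)";
```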
In our _http() function, we will need to set CURLOPT_USERAGENT to $user_agent instead of a static string. At the top of the function, we also need to declare $user_agent as global (so we can access it from within the function’s scope), and then replace the line that sets the user agent.
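Under those assumptions, the updated function might look like this; only the global declaration and the CURLOPT_USERAGENT line change from the sketch above:

```php
function _http($url) {
    // Pull the shared user agent variable into this function's scope.
    global $user_agent;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    // Previously a static string; now the shared $user_agent variable.
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array($body, $status);
}
```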
Initialization
In our code, we will need to initialize the robots.txt parser. To do this, we download the site’s robots.txt file and hand it to the parser along with the HTTP response status code. We also need to check whether we are allowed to access the seed URL’s path; if not, we need to exit.
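Here is a sketch of that initialization, assuming the parser class is named RobotsTxtParser and exposes setHttpStatusCode(), setUserAgent(), and isDisallowed(); check the library’s README if your version uses a namespace or different method names:

```php
// Build the robots.txt URL from the seed URL's scheme and host.
$parts = parse_url($seed_url);
list($robots_txt, $robots_status) = _http($parts['scheme'] . '://' . $parts['host'] . '/robots.txt');

// Feed the file contents and HTTP status code to the parser.
$parser = new RobotsTxtParser($robots_txt !== false ? $robots_txt : '');
$parser->setHttpStatusCode($robots_status);

// robots.txt rules match on the bot's name, so passing just the short
// bot name here may work better than the full user agent string.
$parser->setUserAgent($user_agent);

// If the seed URL itself is disallowed, there is nothing we can crawl.
if ($parser->isDisallowed(parse_url($seed_url, PHP_URL_PATH) ?: '/')) {
    die("The seed URL is disallowed by robots.txt; exiting.\n");
}
```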
Checking URLs
At this point, we can start checking URLs within our while loop. To do this, we can just call the $parser->isDisallowed() function. Our new loop should look like:
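A sketch of the updated loop, using the same queue variables as the earlier sketch:

```php
while (count($to_crawl) > 0) {
    $url = array_shift($to_crawl);
    if (isset($crawled[$url])) {
        continue;
    }
    $crawled[$url] = true;

    // Skip any URL that robots.txt disallows for our user agent.
    if ($parser->isDisallowed(parse_url($url, PHP_URL_PATH) ?: '/')) {
        echo "Skipping (disallowed): " . $url . "\n";
        continue;
    }

    list($html, $status) = _http($url);
    if ($status !== 200 || $html === false) {
        continue;
    }

    // Extract same-site links from $html and push them onto $to_crawl here.

    echo "Crawled: " . $url . "\n";
    sleep(1); // be polite between requests
}
```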
The Final Script
At this point, you should have a script that you can run on other sites and that will verify whether it’s allowed to access the pages it wants to download:
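Putting the pieces above together, the full script might look something like the following sketch, with the parser class and method names assumed as noted earlier:

```php
<?php

// Load Composer's autoloader so the robots.txt parser class is available.
require __DIR__ . '/vendor/autoload.php';

// The user agent our crawler identifies itself with (placeholder values).
$user_agent = "MyCrawler/1.0 (+https://www.example.com/bot-info)";

function _http($url) {
    // Download a URL; return its body and HTTP status code.
    global $user_agent;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array($body, $status);
}

$seed_url = "https://www.example.com/";
$to_crawl = array($seed_url);
$crawled  = array();

// Download and parse the site's robots.txt.
$parts = parse_url($seed_url);
list($robots_txt, $robots_status) = _http($parts['scheme'] . '://' . $parts['host'] . '/robots.txt');

$parser = new RobotsTxtParser($robots_txt !== false ? $robots_txt : '');
$parser->setHttpStatusCode($robots_status);
$parser->setUserAgent($user_agent);

// If the seed URL itself is disallowed, there is nothing we can crawl.
if ($parser->isDisallowed(parse_url($seed_url, PHP_URL_PATH) ?: '/')) {
    die("The seed URL is disallowed by robots.txt; exiting.\n");
}

while (count($to_crawl) > 0) {
    $url = array_shift($to_crawl);
    if (isset($crawled[$url])) {
        continue;
    }
    $crawled[$url] = true;

    // Skip any URL that robots.txt disallows for our user agent.
    if ($parser->isDisallowed(parse_url($url, PHP_URL_PATH) ?: '/')) {
        echo "Skipping (disallowed): " . $url . "\n";
        continue;
    }

    list($html, $status) = _http($url);
    if ($status !== 200 || $html === false) {
        continue;
    }

    // Extract same-site links and queue any we haven't seen yet.
    if (preg_match_all('/href="([^"#]+)"/i', $html, $matches)) {
        foreach ($matches[1] as $link) {
            if (strpos($link, $seed_url) === 0 && !isset($crawled[$link])) {
                $to_crawl[] = $link;
            }
        }
    }

    echo "Crawled: " . $url . "\n";
    sleep(1); // be polite between requests
}
```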
David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive expertise solving problems using programming for dozens of clients. He also has extensive experience managing and optimizing servers, managing dozens of servers for both Potent Pages and other clients.