
Creating a Polite PHP Web Crawler: Checking robots.txt

May 31, 2018 | By David Selden-Treiman | Filed in: php.

Checking a Website’s robots.txt Using PHP

When downloading sites, it’s important to make sure that you have permission from the site owner to download its pages. This is generally indicated by the robots.txt file. This file specifies which pages crawlers aren’t allowed to download. It can also specify pages to download (although that’s mainly handled via the sitemap.xml), the minimum requested crawl delay, and other settings. Directives can also be separated by crawler “user agent” name.
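As an illustrative example (not from any particular site), a robots.txt file might look like this, with directives grouped under “User-agent” sections:

```
# Illustrative robots.txt example
User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 10

User-agent: MyExampleCrawler
Disallow: /no-bots/

Sitemap: https://www.example.com/sitemap.xml
```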

We will be working from our previous PHP web spider tutorial, extending it to check the robots.txt file of a site and make sure our crawler is allowed to access a specified page. To do this, we will be using the Robots.txt Parser Class PHP library. It will allow us to load in a robots.txt file, set our user agent, and check individual pages to see if we are allowed to download them or not.

First up, we’re going to start with our PHP site web crawler from before:
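The full script is in the previous tutorial. As a rough, simplified sketch (the helper name _http() and the queue variables here are illustrative and may differ slightly from the earlier code), it looks something like this:

```php
<?php
// Simple cURL wrapper: fetch a URL, return its body and HTTP status code.
function _http($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "My Example Crawler");
    $content = curl_exec($ch);
    $status  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array('content' => $content, 'status' => $status);
}

// Crawl queue seeded with the starting URL.
$seed_url = "https://www.example.com/";
$to_crawl = array($seed_url);
$crawled  = array();

while (count($to_crawl) > 0) {
    $url = array_shift($to_crawl);
    if (isset($crawled[$url])) {
        continue;
    }
    $crawled[$url] = true;

    $response = _http($url);
    // ... extract links from $response['content'] and add new URLs to $to_crawl ...
}
```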

Installing the Library

For our crawler, we’re going to need to install the PHP library. To do this, we will use Composer. If you don’t already have Composer, install it first by following the instructions at getcomposer.org.

After you have composer installed, you’ll need to run the following from your web crawler project’s folder:
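Assuming the library is published on Packagist as t1gor/robots-txt-parser (check the library’s README for the exact package name), the command is:

```
composer require t1gor/robots-txt-parser
```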

This installs the package so that we can call it from our site crawler script. Composer will resolve a version, add the package to your composer.json and composer.lock files, and generate the autoloader.

Loading the Library

To load the library, we will need to place the standard Composer autoload line at the top of the script:
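Assuming the script lives next to the Composer-generated vendor/ directory, that line is:

```php
<?php
// Load Composer's autoloader so the robots.txt parser class is available.
require __DIR__ . '/vendor/autoload.php';
```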

We will also need to specify our web crawler’s user agent name in a variable so that we can easily use and change it (we will need it in multiple places now):
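The value itself is up to you; the string below is just a placeholder:

```php
// Our crawler's user agent, used for both HTTP requests and robots.txt matching.
$user_agent = "My Example Crawler";
```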

In our _http() function, we will need to change the CURLOPT_USERAGENT option to use $user_agent instead of a static string. At the top of the function, we also need to declare the $user_agent variable as global so that we can access it from within the function’s scope:
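Using the hypothetical _http() sketched earlier, the updated function looks like this:

```php
function _http($url) {
    // Use the shared user agent variable from the global scope.
    global $user_agent;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Previously a static string; now our shared $user_agent variable.
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    $content = curl_exec($ch);
    $status  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array('content' => $content, 'status' => $status);
}
```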

Initialization

In our code, we will need to initialize the robots.txt parser. To do this, we need to download the robots.txt file for the site and specify the HTTP response status code. We also need to check whether we can access the seed URL’s path. If not, we need to exit.
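Here is a sketch of that initialization. It assumes the library exposes a RobotsTxtParser class with setHttpStatusCode(), setUserAgent(), and isDisallowed() methods; check the library’s README for the exact class and method names.

```php
// Build the robots.txt URL for the seed site.
$parts      = parse_url($seed_url);
$robots_url = $parts['scheme'] . "://" . $parts['host'] . "/robots.txt";

// Download robots.txt with our cURL helper.
$robots = _http($robots_url);

// Initialize the parser with the file contents, the HTTP status code,
// and the user agent our rules should be matched against.
$parser = new RobotsTxtParser($robots['content']);
$parser->setHttpStatusCode($robots['status']);
$parser->setUserAgent($user_agent);

// If the seed URL's path is disallowed, there is nothing for us to crawl.
$seed_path = parse_url($seed_url, PHP_URL_PATH) ?: "/";
if ($parser->isDisallowed($seed_path)) {
    exit("robots.txt disallows crawling the seed URL for our user agent.\n");
}
```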

Checking URLs

At this point, we can start checking URLs within our while loop. To do this, we can just call the $parser->isDisallowed() function. Our new loop should look like:
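Assuming the same queue variables as in the earlier sketch, the loop becomes something like this (note that, depending on the library version, isDisallowed() may expect a path rather than a full URL):

```php
while (count($to_crawl) > 0) {
    $url = array_shift($to_crawl);
    if (isset($crawled[$url])) {
        continue;
    }
    $crawled[$url] = true;

    // Skip any URL that robots.txt disallows for our user agent.
    $path = parse_url($url, PHP_URL_PATH) ?: "/";
    if ($parser->isDisallowed($path)) {
        continue;
    }

    $response = _http($url);
    // ... extract links from $response['content'] and add new URLs to $to_crawl ...
}
```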

The Final Script

At this point, you should now have a script that you can run on other sites and will verify whether it’s allowed to access the pages it wants to download:
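Putting the pieces above together, a complete sketch (using the same hypothetical names, placeholder user agent, and placeholder seed URL as before) might look like this:

```php
<?php
// Load Composer's autoloader so the robots.txt parser class is available.
require __DIR__ . '/vendor/autoload.php';

// The user agent string used for both HTTP requests and robots.txt matching.
$user_agent = "My Example Crawler";

// Simple cURL wrapper returning the page body and HTTP status code.
function _http($url) {
    global $user_agent;
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    $content = curl_exec($ch);
    $status  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array('content' => $content, 'status' => $status);
}

// Seed URL and crawl queue.
$seed_url = "https://www.example.com/";
$to_crawl = array($seed_url);
$crawled  = array();

// Download and parse robots.txt for the seed site.
$parts      = parse_url($seed_url);
$robots_url = $parts['scheme'] . "://" . $parts['host'] . "/robots.txt";
$robots     = _http($robots_url);

$parser = new RobotsTxtParser($robots['content']);
$parser->setHttpStatusCode($robots['status']);
$parser->setUserAgent($user_agent);

// Stop immediately if the seed URL itself is disallowed.
if ($parser->isDisallowed(parse_url($seed_url, PHP_URL_PATH) ?: "/")) {
    exit("robots.txt disallows crawling the seed URL for our user agent.\n");
}

// Crawl loop: skip anything robots.txt disallows for our user agent.
while (count($to_crawl) > 0) {
    $url = array_shift($to_crawl);
    if (isset($crawled[$url])) {
        continue;
    }
    $crawled[$url] = true;

    if ($parser->isDisallowed(parse_url($url, PHP_URL_PATH) ?: "/")) {
        continue;
    }

    $response = _http($url);
    // ... extract links from $response['content'] and add new URLs to $to_crawl ...
}
```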

David Selden-Treiman, Director of Operations at Potent Pages.

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive expertise solving problems using programming for dozens of clients. He also has extensive experience managing and optimizing servers, managing dozens of servers for both Potent Pages and other clients.


