
Creating a Simple PHP Web Crawler

May 24, 2018 | By David Selden-Treiman | Filed in: php.

How to write a simple PHP web crawler to download an entire website

 

Do you want to automatically capture information from a website, like the score of your favorite sport, the latest fashion styles, or trends from the stock market, for further processing? If the specific information you need is available on a website, you can write a simple web crawler and extract the data that you need.

The Plan

Creating a web crawler allows you to turn data from one format into another, more useful one. We can download content from a website, extract the content we’re looking for, and save it into a structured, easily accessed format (like a database).

To do this, we will use the following flow:

  1. Decide which pages to download
  2. Download pages individually
  3. Store the page sources in files
  4. Parse the pages to store the content we’re looking for into PHP variable(s)
  5. Store the extracted data into a database
  6. Repeat

Please Note: Only use this script on a page you have permission to use it on. This script doesn’t have any checks for the site’s robots.txt file, so it’s important to make sure that you do this check manually. We will implement this in our next tutorial.

Our Goal

Our objective here will be to download and store the title and first h1 tag of every page we can find on a domain. To do this, we are going to have to build what’s commonly referred to as a “spider”. A spider traverses the “web” from page to page, identifying new pages as it goes along via links.

A Preview

When we’re done, your script will follow the structure laid out in the steps below: connect to the database, download pages with cURL, parse them with DOMDocument, store the extracted data, and loop over the links we find.

Step 1: Setup & Connect to Database

We first will need to connect to our database that we will use to store our page data. We will need a database with the following structure:
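The exact layout isn’t critical, but the table needs a place for the page path, the referrer, the extracted content, and a download timestamp that starts out NULL. A minimal assumed schema (the table and column names here are my own; adjust the later queries if you use different ones) could look like:

    CREATE TABLE `pages` (
        `id`          INT UNSIGNED NOT NULL AUTO_INCREMENT,
        `path`        VARCHAR(255) NOT NULL,             -- page path relative to the host
        `referrer`    VARCHAR(255) NOT NULL DEFAULT '',  -- URL the page was linked from
        `title`       TEXT,
        `description` TEXT,
        `h1`          TEXT,
        `downloaded`  DATETIME DEFAULT NULL,             -- stays NULL until the page is fetched
        PRIMARY KEY (`id`),
        UNIQUE KEY `path` (`path`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;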

The first operation we need to do in our script is to connect to the database. Please replace the $mysql_host, $mysql_username, and $mysql_password with the settings for your own MySQL server.
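A minimal connection sketch using mysqli (the $mysql_database name is an assumption; use whichever database holds the table above):

    <?php
    // Database credentials: replace these with your own MySQL settings.
    $mysql_host     = "localhost";
    $mysql_username = "your_username";
    $mysql_password = "your_password";
    $mysql_database = "crawler"; // assumed database name

    // Connect, and stop immediately if the connection fails.
    $db = new mysqli($mysql_host, $mysql_username, $mysql_password, $mysql_database);
    if ($db->connect_error) {
        die("Database connection failed: " . $db->connect_error);
    }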

Step 2: Create a Function to Download Webpages

We will be using our function “_http()” to download webpages individually. Please copy and paste the code below. You can see a full description of how the function works in our tutorial Downloading a Webpage using PHP and cURL.
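The full implementation is covered in that tutorial; the sketch below is a simplified stand-in that returns the page body together with cURL’s info array (including the HTTP status code), which is the shape the rest of these sketches assume:

    // Download a single page with cURL, sending the referring URL along with the request.
    function _http($target, $referrer = "") {
        $ch = curl_init($target);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
        curl_setopt($ch, CURLOPT_REFERER, $referrer);    // set the Referer header
        curl_setopt($ch, CURLOPT_USERAGENT, "SimplePHPCrawler/1.0");
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        $body = curl_exec($ch);
        $info = curl_getinfo($ch);                       // includes 'http_code', 'url', etc.
        curl_close($ch);
        return array("body" => $body, "info" => $info);
    }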

Step 3: Parsing Our Page

Before we parse our page, we will need a function to convert relative links into absolute ones. To do this, we can use the following function (just copy and paste it into your script):
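The original helper isn’t reproduced here, so the version below is a simplified sketch that handles the common cases (absolute, protocol-relative, root-relative, and page-relative links); it assumes the scheme, host, and current path are passed in:

    // Convert a link found on a page into an absolute URL.
    function relativeToAbsolute($link, $scheme, $host, $path = "/") {
        if (parse_url($link, PHP_URL_SCHEME) !== null) {
            return $link;                                  // already absolute
        }
        if (substr($link, 0, 2) == "//") {
            return $scheme . ":" . $link;                  // protocol-relative
        }
        if (substr($link, 0, 1) == "/") {
            return $scheme . "://" . $host . $link;        // root-relative
        }
        $base = substr($path, 0, strrpos($path, "/") + 1); // directory of the current page
        return $scheme . "://" . $host . $base . $link;    // page-relative
    }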

We now need to get the content that we want (the title, meta-description, and the first H1 tag) as well as the links from the page so that we can know what new links to add to our table.

First up, we will create a new function:
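The function name and signature below are assumptions (the original isn’t shown); the snippets in the rest of this step all go inside its body, in order:

    // Download one page, store its data, and queue the links it contains.
    function parsePage($target, $referrer) {
        global $db, $scheme; // assumes the database handle (Step 1) and scheme (Step 4) are globals

        // ... the code from the rest of Step 3 goes here ...

        return true;
    }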

We will need to parse the components of our target URL. To do this, we will use the parse_url function. We’re looking to get the ‘path’ of the URL, so we will save that index. We will also need the “host” value later.
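A sketch of that step (the variable names are assumed):

    // Break the target URL into components: 'path' is stored in the database,
    // and 'host' is used later to filter out off-site links.
    $url_parts = parse_url($target);
    $path = isset($url_parts['path']) ? $url_parts['path'] : "/";
    $host = $url_parts['host'];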

We will now need to call our _http() function.

 

We need to check to make sure that the page returned a “200 OK” status. To do this, we will use:
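Putting the _http() call and the status check together, assuming the return format from the Step 2 sketch:

    // Download the page and only continue if it returned a 200 status code.
    $response = _http($target, $referrer);
    $body = $response['body'];
    $info = $response['info'];

    if ($info['http_code'] != 200) {
        return false; // don't parse pages that failed to load
    }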

Now, we need to parse the HTML of our page into a structured form. To do this, we will use the DOMDocument class. We will create a new DOMDocument object and use it to parse our HTML (body) contents. We also want to specify that we don’t want the HTML parsing errors passed through to the PHP error output (they just clutter up the terminal/output).
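For example:

    // Build a DOM tree from the HTML, keeping libxml's parse warnings out of our output.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($body);
    libxml_clear_errors();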

 

At this point, with our content structured, we can extract the content that we’re looking for. This part will depend on your own project, but for now, we’ll extract the title tag, the description tag, and the first H1 tag and save the info in the database table. To do this, we will use the DOMDocument::getElementsByTagName function. We will extract the elements into an array, and if a tag exists, we will save its value.
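For the title tag, that looks roughly like:

    // Grab the first <title> tag, if the page has one.
    $title = "";
    $title_tags = $dom->getElementsByTagName("title");
    if ($title_tags->length > 0) {
        $title = trim($title_tags->item(0)->textContent);
    }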

 

For the description, we will need to get all of the meta tags, and parse them to check if their “name” attribute is equal to “description”.
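For example:

    // Loop over the <meta> tags and keep the content of the one named "description".
    $description = "";
    foreach ($dom->getElementsByTagName("meta") as $meta) {
        if (strtolower($meta->getAttribute("name")) == "description") {
            $description = trim($meta->getAttribute("content"));
            break;
        }
    }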

 

Obtaining the value of the first h1 tag will follow the same pattern as the title tag:
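Along the lines of:

    // Save the text of the first <h1> tag, if there is one.
    $h1 = "";
    $h1_tags = $dom->getElementsByTagName("h1");
    if ($h1_tags->length > 0) {
        $h1 = trim($h1_tags->item(0)->textContent);
    }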

 

Now that we have the data for our page, we will need to insert it into the database if it doesn’t exist, or update it if it already does.
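A sketch using INSERT ... ON DUPLICATE KEY UPDATE against the assumed table from Step 1 (the unique key on `path` is what makes the update case work):

    // Insert the page's data, or update the existing row if the path is already known.
    $sql = "INSERT INTO `pages` (`path`, `referrer`, `title`, `description`, `h1`, `downloaded`)
            VALUES ('" . $db->real_escape_string($path) . "',
                    '" . $db->real_escape_string($referrer) . "',
                    '" . $db->real_escape_string($title) . "',
                    '" . $db->real_escape_string($description) . "',
                    '" . $db->real_escape_string($h1) . "',
                    NOW())
            ON DUPLICATE KEY UPDATE
                    `title`       = VALUES(`title`),
                    `description` = VALUES(`description`),
                    `h1`          = VALUES(`h1`),
                    `downloaded`  = NOW()";
    $db->query($sql);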

 

Next up, we need to get all of the links from our downloaded page. To do this, we will use the same getElementsByTagName function, this time obtaining all of the “a” tags, and extract their “href” attributes, just like we did when we obtained the description tags. We then need to convert each link into an absolute one using our relativeToAbsolute() function, and use array_search() to ensure that we aren’t duplicating links in our array.
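Assuming the variables from the snippets above, link collection could look like:

    // Collect every unique link on the page as an absolute URL.
    $links = array();
    foreach ($dom->getElementsByTagName("a") as $a) {
        $href = trim($a->getAttribute("href"));
        if ($href == "" || substr($href, 0, 1) == "#") {
            continue; // skip empty links and in-page anchors
        }
        $absolute = relativeToAbsolute($href, $scheme, $host, $path);
        if (array_search($absolute, $links) === false) {
            $links[] = $absolute; // only keep links we haven't already seen on this page
        }
    }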

 

Now, we need to insert each link into the database if it doesn’t already exist. Before we do, though, we need to escape the path. We also need to include the referring URL (the $target variable passed to this function) in order to know what to use as the referrer when we download the next page.
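A sketch, again using the assumed `pages` table (INSERT IGNORE leans on the unique `path` key to skip links we already have):

    // Queue each newly-discovered link on the same host, with this page as its referrer.
    foreach ($links as $link) {
        $link_parts = parse_url($link);
        if (!isset($link_parts['host']) || $link_parts['host'] != $host) {
            continue; // stay on the same domain
        }
        $link_path = isset($link_parts['path']) ? $link_parts['path'] : "/";
        $sql = "INSERT IGNORE INTO `pages` (`path`, `referrer`)
                VALUES ('" . $db->real_escape_string($link_path) . "',
                        '" . $db->real_escape_string($target) . "')";
        $db->query($sql);
    }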

 

Finally, we can return true, indicating that the page download was successful.

 

Step 4: Traversing the Site

Before we do anything else, we need to define our starting point (our seed URL). This will be the main URL of the site, e.g. http://domain.com/, or another URL of your choosing. We will also need to parse the URL to get the scheme (http or https) and the host of the URL. Combining these will form the base for the rest of our URLs.
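For example ($start, $scheme, $host, and $base are the names used throughout these sketches):

    // The seed URL and its components. $base is the prefix used to rebuild full
    // URLs from the paths stored in the database.
    $start = "http://domain.com/"; // replace with a site you have permission to crawl
    $start_parts = parse_url($start);
    $scheme = $start_parts['scheme'];
    $host   = $start_parts['host'];
    $base   = $scheme . "://" . $host;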

 

We can also start the download of our seed URL automatically.
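Using the parsing function sketched in Step 3:

    // Download and parse the seed URL first so that its links get queued in the database.
    parsePage($start, "");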

 

At this point, we just need to loop through all of the pages on the site. To do this, we need to have a master loop that will keep looping until we tell it to break from the cycle.
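In outline (the body is filled in by the next two snippets):

    // The master crawl loop; we break out of it once there is nothing left to download.
    $retries = array(); // counts attempts per URL, used to skip links that keep failing
    while (true) {
        // select, download, and parse the remaining pages here (next snippet)
    }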

 

Next, we will need to place a SELECT query and its corresponding execution code inside the loop to select all rows with a NULL download time (indicating that they haven’t been loaded yet). We will need the target path and the referrer. We can then form the full target URL and run the parsing function. Additionally, we need a counter to catch invalid links that aren’t being updated for one reason or another.
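Inside the loop, that could look like the following (the retry limit of 3 is an arbitrary choice):

    // Fetch every page that still has a NULL download time.
    $result = $db->query("SELECT `path`, `referrer` FROM `pages` WHERE `downloaded` IS NULL");
    if ($result === false || $result->num_rows == 0) {
        break; // nothing left to download
    }

    $attempted = 0;
    while ($row = $result->fetch_assoc()) {
        $target = $base . $row['path']; // rebuild the full URL from the stored path

        // Count attempts so broken links don't keep the loop running forever.
        if (!isset($retries[$target])) {
            $retries[$target] = 0;
        }
        if (++$retries[$target] > 3) {
            continue;
        }
        $attempted++;

        parsePage($target, $row['referrer']);
        // ... crawl delay goes here (next snippet) ...
    }

    if ($attempted == 0) {
        break; // every remaining link has already failed too many times
    }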

 

VERY IMPORTANT! We also need to introduce a delay between downloads to make sure we don’t burden the hosting web server and get our IP address banned. I recommend a delay of at least 5 seconds, preferably 10.
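Placed right after the parsePage() call inside the loop:

    sleep(10); // wait between downloads; 5 seconds minimum, 10 is safer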

 

The Final Script

The final script you can use is simply the combination of the pieces above, in order: the database connection from Step 1, the _http() and relativeToAbsolute() helpers, the page-parsing function from Step 3, and the crawl loop from Step 4.

David Selden-Treiman, Director of Operations at Potent Pages.

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive expertise solving problems using programming for dozens of clients. He also has extensive experience managing and optimizing servers, managing dozens of servers for both Potent Pages and other clients.


