Creating a Simple PHP Web Crawler
May 24, 2018 | By PotentPages | Filed in: php.
How to write a simple PHP web crawler to download an entire website
Do you want to automatically capture information from a website, such as the score of your favorite sports team, the latest fashion styles, or stock market trends, for further processing? If the specific information you need is available on a website, you can write a simple web crawler and extract the data that you need.
Creating a web crawler allows you to turn data from one format into another, more useful one. We can download content from a website, extract the content we’re looking for, and save it into a structured, easily accessed format (like a database.)
To do this, we will use the following flow:
- Decide which pages to download
- Download pages individually
- Store the page sources in files
- Parse the pages to store the content we’re looking for into PHP variable(s)
- Store the extracted data into a database
Please Note: Only use this script on pages you have permission to crawl. This script doesn’t check the site’s robots.txt file, so it’s important to make sure that you do this check manually. We will implement this check in our next tutorial.
Our objective here will be to download and store the title, meta description, and first h1 tag of every page we can find on a domain. To do this, we are going to have to build what’s commonly referred to as a “spider”. A spider traverses the “web” from page to page, identifying new pages as it goes along via links.
When we’re done, your script should look like:
Step 1: Setup & Connect to Database
We first will need to connect to our database that we will use to store our page data. We will need a database with the following structure:
The first operation we need to do in our script is to connect to the database. Please replace the $mysql_host, $mysql_username, and $mysql_password with the settings for your own MySQL server.
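The table layout itself is not reproduced here, so the following is only a sketch of what it could look like, based on the fields this tutorial mentions (path, referrer, title, description, h1, and a download time). The table name "pages", the column names, the constant PAGES_TABLE_SQL, the function connectToDatabase(), and the $mysql_database variable are all assumptions made for illustration.

```php
<?php
// Assumed table layout: the column names are illustrative, not the original schema.
const PAGES_TABLE_SQL = <<<SQL
CREATE TABLE IF NOT EXISTS pages (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    path VARCHAR(255) NOT NULL,
    referrer VARCHAR(255) DEFAULT NULL,
    title TEXT,
    description TEXT,
    h1 TEXT,
    download_time DATETIME DEFAULT NULL,
    UNIQUE KEY path_unique (path)
)
SQL;

// Hypothetical helper: open a mysqli connection or fail loudly.
function connectToDatabase(string $host, string $username, string $password, string $database): mysqli
{
    $db = new mysqli($host, $username, $password, $database);
    if ($db->connect_error) {
        throw new RuntimeException('Database connection failed: ' . $db->connect_error);
    }
    $db->set_charset('utf8mb4');
    return $db;
}

// Usage (replace the values with the settings for your own MySQL server):
// $db = connectToDatabase($mysql_host, $mysql_username, $mysql_password, $mysql_database);
// $db->query(PAGES_TABLE_SQL);
```

A NULL download_time marks a page that has been discovered but not downloaded yet, and the unique key on path keeps the same page from being queued twice.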
Step 2: Create a Function to Download Webpages
We will be using our function “_http()” to download webpages individually. Please copy and paste the code below. You can see a full description of how the function works in our tutorial Downloading a Webpage using PHP and cURL.
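The original function is not reproduced here, so below is a minimal sketch of what a cURL-based _http() might look like. The return shape (an array with "status", "body", and "error" keys), the timeouts, and the user agent string are assumptions, not the version from the referenced tutorial.

```php
<?php
// Sketch of a download helper. The return shape is an assumption for illustration.
function _http(string $target, string $referrer = ''): array
{
    $handle = curl_init($target);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_REFERER        => $referrer,
        CURLOPT_USERAGENT      => 'SimplePHPCrawler/0.1', // hypothetical name
    ]);

    $body   = curl_exec($handle);
    $status = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
    $error  = curl_error($handle);

    return [
        'status' => $status,                      // 0 if the request never completed
        'body'   => $body === false ? '' : $body,
        'error'  => $error,
    ];
}

// Usage:
// $response = _http('http://domain.com/', '');
// if ($response['status'] === 200) { /* parse $response['body'] */ }
```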
Step 3: Parsing Our Page
Before we parse our page, we will need a function to convert relative links into absolute ones. To do this, we can use the following function (just copy and paste it into your script):
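Since the function body is not shown here, the following is one possible sketch of relativeToAbsolute(). The parameter order ($relative, $base) is an assumption, and the function only covers the common cases (absolute paths, relative paths with "." and ".." segments, protocol-relative links, and query-only links) rather than the full RFC 3986 resolution rules.

```php
<?php
// Sketch: resolve $relative against the absolute URL $base.
function relativeToAbsolute(string $relative, string $base): string
{
    $relative = trim($relative);
    if ($relative === '') {
        return $base;
    }
    // Links that already carry a scheme (http:, https:, mailto:, ...) are left alone.
    if (is_string(parse_url($relative, PHP_URL_SCHEME))) {
        return $relative;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $root   = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    $basePath = $parts['path'] ?? '/';

    // Protocol-relative link: reuse the scheme of the base URL.
    if (strpos($relative, '//') === 0) {
        return $scheme . ':' . $relative;
    }

    if ($relative[0] === '/') {
        $path = $relative;                       // path from the site root
    } elseif ($relative[0] === '?' || $relative[0] === '#') {
        $path = $basePath . $relative;           // same page, new query or fragment
    } else {
        // Relative to the directory of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;
    }

    // Split off the query/fragment so dot segments are only resolved in the path.
    $cut    = strcspn($path, '?#');
    $suffix = substr($path, $cut);
    $path   = substr($path, 0, $cut);

    $segments = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            if (count($segments) > 1) {
                array_pop($segments);            // never pop the leading empty segment
            }
        } elseif ($segment !== '.') {
            $segments[] = $segment;
        }
    }

    return $root . implode('/', $segments) . $suffix;
}
```

For example, resolving "../img/a.png" against "http://example.com/blog/2018/post.html" gives "http://example.com/blog/img/a.png".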
We now need to get the content that we want (the title, meta-description, and the first H1 tag) as well as the links from the page so that we can know what new links to add to our table.
First up, we will create a new function:
We will need to parse the components of our target URL. To do this, we will use the parse_url function. We’re looking to get the ‘path’ of the URL, so we will save that index. We will also need the “host” value later.
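As a quick illustration of what parse_url returns, here is a small sketch. The example URL and the fallback values for a missing path or host are assumptions.

```php
<?php
// Example target URL (hypothetical).
$target = 'http://domain.com/blog/post.html?page=2';

// parse_url splits the URL into its components.
$parts = parse_url($target);

// The path identifies the page within the site; default to the root if absent.
$path = $parts['path'] ?? '/';

// The host is kept for later use.
$host = $parts['host'] ?? '';
```

Note that the query string ("page=2") is returned under its own "query" index and is not part of the path.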
We will now need to call our _http() function.
We need to check to make sure that the page returned a “200 OK” status. To do this, we will use:
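How the check looks depends on what _http() returns, which is not shown here. As one possibility, assuming the raw status line of the response (e.g. "HTTP/1.1 200 OK") is available as a string, a hypothetical helper such as statusCode() could extract the numeric code:

```php
<?php
// Hypothetical helper: pull the three-digit status code out of an HTTP status line.
// Returns null when the line does not look like an HTTP status line.
function statusCode(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $statusLine, $matches)) {
        return (int) $matches[1];
    }
    return null;
}

// Usage inside the parsing function:
// if (statusCode($statusLine) !== 200) {
//     return false; // skip pages that did not load correctly
// }
```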
Now, we need to structure the HTML in our page. To do this, we will use the DOMDocument class. We will create a new DOMDocument object and use it to parse our HTML (body) contents. We also want to specify that we don’t want the HTML parsing errors passed through to the PHP error output (it just clutters up the terminal/output).
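A minimal sketch of this step is shown below. It uses libxml_use_internal_errors() to keep parser warnings out of the output, which is one way to do it; the original script may have used a different approach. The function name loadDom() is an assumption.

```php
<?php
// Sketch: parse an HTML string into a DOMDocument without printing parser warnings.
function loadDom(string $html): DOMDocument
{
    $doc = new DOMDocument();
    if ($html === '') {
        return $doc; // loadHTML rejects empty input, so return an empty document
    }

    // Collect libxml errors internally instead of sending them to the PHP error output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}
```

Real-world pages are rarely valid HTML, so without this setting almost every page would produce a stream of warnings.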
At this point, with our content structured, we can extract the content that we’re looking for. This part will depend on your own project, but for now, we’ll extract the title tag, the description tag, and the first H1 tag and save the info in the database table. To do this, we will use the DOMDocument::getElementsByTagName function. It returns the matching elements as a list (a DOMNodeList), and if a tag exists, we will save its value.
For the description, we will need to get all of the meta tags, and parse them to check if their “name” attribute is equal to “description”.
Obtaining the value of the first h1 tag will follow the same pattern as the title tag:
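The three extractions described above can be sketched together as follows. The function name extractPageData() and the keys of the returned array are assumptions; the DOM calls themselves are standard.

```php
<?php
// Sketch: pull the title, meta description, and first h1 out of an HTML string.
function extractPageData(string $html): array
{
    $data = ['title' => null, 'description' => null, 'h1' => null];
    if ($html === '') {
        return $data;
    }

    libxml_use_internal_errors(true); // keep HTML parsing warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Title: take the first <title> element if there is one.
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $data['title'] = trim($titles->item(0)->textContent);
    }

    // Description: scan the <meta> tags for name="description".
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $data['description'] = trim($meta->getAttribute('content'));
            break;
        }
    }

    // First h1: same pattern as the title.
    $headings = $doc->getElementsByTagName('h1');
    if ($headings->length > 0) {
        $data['h1'] = trim($headings->item(0)->textContent);
    }

    return $data;
}
```

Any field that is missing from the page stays null, so it can be stored as NULL in the database.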
Now that we have the data for our page, we will need to insert it into the database if it doesn’t exist, or update it if it already does.
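One way to do the insert-or-update in a single statement is MySQL's INSERT ... ON DUPLICATE KEY UPDATE. The sketch below assumes a table named "pages" with a unique key on "path" and the column names used here; none of these names come from the original script, and savePage() is a hypothetical helper.

```php
<?php
// Sketch: insert a page row, or update it if the path is already in the table.
// Assumes a unique key on pages.path.
function savePage(mysqli $db, string $path, ?string $title, ?string $description, ?string $h1): bool
{
    $sql = 'INSERT INTO pages (path, title, description, h1, download_time)
            VALUES (?, ?, ?, ?, NOW())
            ON DUPLICATE KEY UPDATE
                title = VALUES(title),
                description = VALUES(description),
                h1 = VALUES(h1),
                download_time = NOW()';

    $stmt = $db->prepare($sql);
    if ($stmt === false) {
        return false;
    }

    // Bound parameters are sent separately from the SQL, so no manual escaping is needed.
    $stmt->bind_param('ssss', $path, $title, $description, $h1);
    $ok = $stmt->execute();
    $stmt->close();

    return $ok;
}
```

Setting download_time here is what later tells the main loop that the page no longer needs to be downloaded. Recent MySQL 8 releases deprecate the VALUES() form in favor of row aliases, but it is still accepted.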
Next up, we need to get all of the links from our downloaded page. To do this, we will use the same getElementsByTagName function, this time obtaining all of the “a” tags. We will read each tag’s “href” attribute, just like we read the attributes of the meta tags when we obtained the description. Before storing each link, we need to convert it into an absolute one using our relativeToAbsolute() function. We will also use array_search() to ensure that we aren’t duplicating links in our array.
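A sketch of the link extraction is shown below. To keep the example self-contained, the conversion to absolute URLs is passed in as a callable ($toAbsolute), which in the crawler would wrap relativeToAbsolute(). The function name extractLinks() and the decision to skip empty and fragment-only links are assumptions.

```php
<?php
// Sketch: collect the unique absolute URLs of all <a href="..."> tags in a page.
function extractLinks(string $html, callable $toAbsolute): array
{
    if ($html === '') {
        return [];
    }

    libxml_use_internal_errors(true); // keep HTML parsing warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));

        // Skip anchors without a target and links that only jump within the page.
        if ($href === '' || $href[0] === '#') {
            continue;
        }

        $absolute = $toAbsolute($href);

        // array_search returns false when the link is not in the list yet.
        if (array_search($absolute, $links, true) === false) {
            $links[] = $absolute;
        }
    }

    return $links;
}

// Usage in the crawler (assuming relativeToAbsolute() is defined):
// $links = extractLinks($body, function ($href) use ($target) {
//     return relativeToAbsolute($href, $target);
// });
```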
Now, we need to insert each link into the database if it doesn’t already exist. Before we do, we need to escape the path though. We also need to include the referring URL (the $target variable passed to this function) in order to know what to use as the referrer when we download the next page.
Finally, we can return true, indicating that the page download was successful.
Step 4: Traversing the Site
Before we do anything else, we need to define our starting point (our seed URL.) This will be the main URL of the site, e.g. http://domain.com/ , or another URL of your choosing. We will also need to parse the URL to get the scheme (http or https), and the host of the URL. This will define our base for the rest of our URLs. Combining these together will make the start of our URLs.
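As a sketch of this step, the scheme and host of the seed URL can be combined like this. The function name baseFromSeed() is an assumption.

```php
<?php
// Sketch: build the base (scheme + host) that all crawled paths are appended to.
function baseFromSeed(string $seed): string
{
    $parts = parse_url($seed);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Seed URL needs a scheme and a host: $seed");
    }
    return $parts['scheme'] . '://' . $parts['host'];
}

// Usage:
// $seed = 'http://domain.com/';
// $base = baseFromSeed($seed);   // 'http://domain.com'
// $full = $base . '/about.html'; // base + a path stored in the database
```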
We can also start the download of our seed URL automatically.
At this point, we just need to loop through all of the pages on the site. To do this, we need to have a master loop that will keep looping until we tell it to break from the cycle.
Next, inside the loop, we will need to place a select query, along with the code to execute it, that selects all rows with a NULL download time (indicating that they haven’t been downloaded yet). We will need the target and the referrer from each row. We can then form the full URL and run the parsing function. Additionally, we need a counter to detect when we keep loading invalid links whose rows aren’t being updated for one reason or another.
VERY IMPORTANT! We also need to introduce a delay between downloads to make sure we don’t burden the hosting web server and get our IP address banned. I recommend a delay of at least 5 seconds, preferably 10.
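The master loop, the failure counter, and the delay can be sketched as follows. To keep the example independent of the database, the select query and the parsing function are passed in as callables; the names crawlPending(), $nextPending, and $crawl, the row keys "path" and "referrer", and the limit of 3 consecutive failures are all assumptions.

```php
<?php
// Sketch of the master loop.
// $nextPending(): returns the next row with a NULL download time, or null when none remain.
// $crawl($path, $referrer): downloads and parses one page, returns true on success.
function crawlPending(callable $nextPending, callable $crawl, int $delaySeconds = 5, int $maxFailures = 3): int
{
    $downloaded = 0;
    $failures   = 0;

    while (true) {
        $row = $nextPending();
        if ($row === null) {
            break; // nothing left to download
        }

        if ($crawl($row['path'], $row['referrer'])) {
            $downloaded++;
            $failures = 0;
        } else {
            // A failed page keeps its NULL download time, so it would be selected again.
            // Stop after too many consecutive failures instead of looping forever.
            $failures++;
            if ($failures >= $maxFailures) {
                break;
            }
        }

        // Be polite: wait between requests so we don't burden the server.
        sleep($delaySeconds);
    }

    return $downloaded;
}
```

With the defaults, the loop waits 5 seconds after each request; pass 10 as the third argument for a more conservative crawl.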
The Final Script
The final script you can use should look like: