How to create a simple PHP web crawler to download a website

Creating a Simple PHP Web Crawler

May 24, 2018 | By David Selden-Treiman | Filed in: php.

How to write a simple PHP web crawler to download an entire website

Do you want to automatically capture information from a website, such as the score of your favorite sports team, the latest fashion trends, or stock market data, for further processing? If the information you need is available on a website, you can write a simple web crawler to extract it.

The Plan

Creating a web crawler allows you to turn data from one format into another, more useful one. We can download content from a website, extract the content we’re looking for, and save it into a structured, easily accessed format (like a database).

To do this, we will use the following flow:

Decide which pages to download
Download pages individually
Store the page sources in files
Parse the pages to store the content we’re looking for into PHP variable(s)
Store the extracted data into a database
Repeat

Please Note: Only use this script on a site you have permission to crawl. This script doesn’t check the site’s robots.txt file, so it’s important to make sure that you do this check manually. We will implement this check in our next tutorial.
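Until that check is automated, a manual check can be approximated in code. The following is a minimal sketch, not a complete robots.txt parser: it assumes only "User-agent: *" groups and plain-prefix "Disallow" rules matter, and it ignores "Allow" rules and wildcards. The function name isPathDisallowed() is hypothetical and not part of the script in this tutorial.

```php
<?php
/**
 * Simplified robots.txt check (illustrative only).
 * Assumption: only "User-agent: *" groups and plain-prefix "Disallow" rules are honoured;
 * "Allow" rules, wildcards, and crawler-specific groups are ignored.
 */
function isPathDisallowed( string $robotsTxt, string $path ): bool {
	$appliesToUs = false;   // Are we inside a group that targets all crawlers?
	$lastWasAgent = false;  // Was the previous rule line a User-agent line?
	foreach ( preg_split( '/\r\n|\r|\n/', $robotsTxt ) as $line ) {
		// Strip comments and surrounding whitespace
		$line = trim( preg_replace( '/#.*$/', '', $line ) );
		if ( $line === '' || strpos( $line, ':' ) === false ) {
			continue;
		}
		list( $field, $value ) = array_map( 'trim', explode( ':', $line, 2 ) );
		$field = strtolower( $field );
		if ( $field === 'user-agent' ) {
			// Consecutive User-agent lines share one group
			$appliesToUs = ( $lastWasAgent && $appliesToUs ) || $value === '*';
			$lastWasAgent = true;
			continue;
		}
		$lastWasAgent = false;
		if ( $field === 'disallow' && $appliesToUs && $value !== '' ) {
			// A rule matches when the path starts with the rule's value
			if ( strpos( $path, $value ) === 0 ) {
				return true;
			}
		}
	}
	return false;
}
```

You would download the site’s /robots.txt once, pass its contents as $robotsTxt, and skip any path for which the function returns true.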

Our Goal

Our objective here will be to download and store the title, meta description, and first h1 tag of every page we can find on a domain. To do this, we are going to have to build what’s commonly referred to as a “spider”. A spider traverses the “web” from page to page, identifying new pages as it goes along via links.
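To make the idea of a spider concrete before we involve HTTP or a database, here is a small sketch of the traversal on its own. It assumes a hypothetical in-memory map ($site) of each path to the paths it links to, standing in for pages a real crawler would download. The function name crawlOrder() is illustrative.

```php
<?php
/**
 * Illustrative spider traversal over an in-memory "site".
 * Assumption: $siteLinks is a hypothetical map of path => array of linked paths.
 */
function crawlOrder( array $siteLinks, string $seed ): array {
	$queue = array( $seed );         // Pages discovered but not yet visited
	$seen = array( $seed => true );  // Every page discovered so far
	$visited = array();              // Pages in the order they were visited
	while ( count( $queue ) > 0 ) {
		$page = array_shift( $queue );
		$visited[] = $page;
		$links = array_key_exists( $page, $siteLinks ) ? $siteLinks[$page] : array();
		foreach ( $links as $link ) {
			// Only queue pages we have not discovered before
			if ( !array_key_exists( $link, $seen ) ) {
				$seen[$link] = true;
				$queue[] = $link;
			}
		}
	}
	return $visited;
}

$site = array(
	'/' => array( '/about', '/blog' ),
	'/about' => array( '/' ),
	'/blog' => array( '/blog/post-1', '/about' ),
);
print_r( crawlOrder( $site, '/' ) );
```

The script in this tutorial follows the same pattern, except that the database table plays the role of both the queue (rows with a NULL download time) and the list of pages already seen (the primary key on the path).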

A Preview

When we’re done, your script should look like:

$mysql_host = '';
$mysql_username = '';
$mysql_password = '';
$mysql_database = 'phpCrawlerTutorial';
$mysql_conn = mysqli_connect( $mysql_host, $mysql_username, $mysql_password, $mysql_database );
if ( !$mysql_conn ) {
	echo "Error: Unable to connect to MySQL." . PHP_EOL;
	echo "Debugging errno: " . mysqli_connect_errno() . PHP_EOL;
	echo "Debugging error: " . mysqli_connect_error() . PHP_EOL;
	exit;
}
/**
 * Download a Webpage via the HTTP GET Protocol using libcurl
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 */
function _http ( $target, $referer ) {
	//Initialize Handle
	$handle = curl_init();
	//Define Settings
	curl_setopt ( $handle, CURLOPT_HTTPGET, true );
	curl_setopt ( $handle, CURLOPT_HEADER, true );
	curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
	curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
	curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
	curl_setopt ( $handle, CURLOPT_URL, $target );
	curl_setopt ( $handle, CURLOPT_REFERER, $referer );
	curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
	curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
	//Execute Request
	$output = curl_exec ( $handle );
	//Close cURL handle
	curl_close ( $handle );
	//Separate Header and Body
	$separator = "\r\n\r\n";
	$header = substr( $output, 0, strpos( $output, $separator ) );
	$body_start = strlen( $header ) + strlen( $separator );
	$body = substr( $output, $body_start, strlen( $output ) - $body_start );
	//Parse Headers
	$header_array = Array();
	foreach ( explode ( "\r\n", $header ) as $i => $line ) {
		if($i === 0) {
			$header_array['http_code'] = $line;
			$status_info = explode( " ", $line );
			$header_array['status_info'] = $status_info;
		} else {
			//Split on the first ': ' only, so values containing ': ' stay intact
			$parts = explode ( ': ', $line, 2 );
			if( count( $parts ) == 2 ) {
				$header_array[$parts[0]] = $parts[1];
			}
		}
	}
	//Form Return Structure
	$ret = Array("headers" => $header_array, "body" => $body );
	return $ret;
}
/**
 * Convert Relative to Absolute URL
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 * From: https://potentpages.com/web-crawler-development/tutorials/php/simple-php-web-spider
 * Based On: https://stackoverflow.com/questions/4444475/transfrom-relative-path-into-absolute-url-using-php
 */
function relativeToAbsolute( $relative, $base )
{
	if($relative == "" || $base == "") return "";
	//Check Base
	$base_parsed = parse_url($base);
	if( !array_key_exists( 'scheme', $base_parsed ) || !array_key_exists( 'host', $base_parsed ) || !array_key_exists( 'path', $base_parsed ) ) {
		echo "Base Path \"$base\" Not Absolute Link\n";
		return "";
	}
	//Parse Relative
	$relative_parsed = parse_url($relative);
	//If relative URL already has a scheme, it's already absolute
	if( array_key_exists( 'scheme', $relative_parsed ) && $relative_parsed['scheme'] != '' ) {
		return $relative;
	}
	//If only a query or a fragment, return base (without any fragment or query) + relative
	if( !array_key_exists( 'scheme', $relative_parsed ) && !array_key_exists( 'host', $relative_parsed ) && !array_key_exists( 'path', $relative_parsed ) ) {
		return $base_parsed['scheme']. '://'. $base_parsed['host']. $base_parsed['path']. $relative;
	}
	//Remove non-directory portion from path
	$path = preg_replace( '#/[^/]*$#', '', $base_parsed['path'] );
	//If relative path already points to root, remove base return absolute path
	if( $relative[0] == '/' ) {
		$path = '';
	}
	//Working Absolute URL
	$abs = '';
	//If user in URL
	if( array_key_exists( 'user', $base_parsed ) ) {
		$abs .= $base_parsed['user'];
		//If password in URL as well
		if( array_key_exists( 'pass', $base_parsed ) ) {
			$abs .= ':'. $base_parsed['pass'];
		}
		//Append location prefix
		$abs .= '@';
	}
	//Append Host
	$abs .= $base_parsed['host'];
	//If port in URL
	if( array_key_exists( 'port', $base_parsed ) ) {
		$abs .= ':'. $base_parsed['port'];
	}
	//Append New Relative Path
	$abs .= $path. '/'. $relative;
	//Replace any '//' or '/./' or '/foo/../' with '/'
	$regex = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
	for( $n=1; $n>0; $abs = preg_replace( $regex, '/', $abs, -1, $n ) ) {}
	//Return Absolute URL
	return $base_parsed['scheme']. '://'. $abs;
}
function parsePage( $target, $referer ) {
	global $mysql_conn;
	//Parse URL and get Components
	$url_components = parse_url( $target );
	if($url_components === false) {
		die( 'Unable to Parse URL' );
	}
	$url_host = $url_components['host'];
	$url_path = '';
	if(array_key_exists( 'path', $url_components ) == false) {
		//If not a valid path, mark as done
		$query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
		if( !mysqli_query($mysql_conn, $query) ) {
			die( "Error: Unable to perform Download Time Update Query (path)\n" );
		}
		return false;
	} else {
		$url_path = $url_components['path'];
	}
	//Download Page
	echo "Downloading: $target\n";
	$contents = _http ( $target, $referer );
	echo "Done\n";
	//Check Status
	if( $contents['headers']['status_info'][1] != 200 ) {
		//If not ok, mark as downloaded but skip
		$query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
		if( !mysqli_query($mysql_conn, $query) ) {
			die( "Error: Unable to perform Download Time Update Query (http status)\n" );
		}
		return false;
	}
	//Parse Contents
	$doc = new DOMDocument();
	libxml_use_internal_errors( true );
	$doc->loadHTML( $contents['body'] );
	//Get title
	$title = '';
	$titleTags = $doc->getElementsByTagName('title');
	if( count( $titleTags ) > 0 ) {
		$title = mysqli_real_escape_string( $mysql_conn, $titleTags[0]->nodeValue );
	}
	//Get Description
	$description = '';
	$metaTags = $doc->getElementsByTagName('meta');
	foreach( $metaTags as $tag ) {
		if( $tag->getAttribute('name') == 'description' ) {
			$description = mysqli_real_escape_string( $mysql_conn, $tag->getAttribute( 'content' ) );
		}
	}
	//Get first h1
	$h1 = '';
	$h1Tags = $doc->getElementsByTagName('h1');
	if( count( $h1Tags ) > 0 ) {
		$h1 = mysqli_real_escape_string( $mysql_conn, $h1Tags[0]->nodeValue );
	}
	//Insert/Update Page Data
	$query = "INSERT INTO pages (path, title, description, h1, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", \"$title\", \"$description\", \"$h1\", NOW()) ON DUPLICATE KEY UPDATE title=\"$title\", description=\"$description\", h1=\"$h1\", download_time=NOW()";
	if( !mysqli_query($mysql_conn, $query) ) {
		die( "Error: Unable to perform Insert Query\n" );
	}
	//Get Links
	$links = Array();
	$link_tags = $doc->getElementsByTagName( 'a' );
	foreach( $link_tags as $tag ) {
		if( ($href_value = $tag->getAttribute( 'href' ))) {
			$link_absolute = relativeToAbsolute( $href_value, $target );
			$link_parsed = parse_url( $link_absolute );
			if($link_parsed === null || $link_parsed === false) {
				die( 'Unable to Parse Link URL' );
			}
			if(( !array_key_exists( 'host', $link_parsed ) || $link_parsed['host'] == "" || $link_parsed['host'] == $url_host ) && array_key_exists( 'path', $link_parsed ) && $link_parsed['path'] != "" && array_search( $link_parsed['path'], $links ) === false) {
				$links[] = $link_parsed['path'];
			}
		}
	}
	//Insert Links
	foreach($links as $link) {
		$link_escaped = mysqli_real_escape_string( $mysql_conn, $link );
		$query = "INSERT IGNORE INTO pages (path, referer, download_time) VALUES (\"$link_escaped\", \"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NULL)";
		if( !mysqli_query($mysql_conn, $query) ) {
			die( "Error: Unable to perform Insert Link Value Query\n" );
		}
	}
	return true;
}
//Define Seed Settings
$seed_url = "https://potentpages.com/";
$seed_components = parse_url( $seed_url );
if($seed_components === false) {
	die( 'Unable to Seed Parse URL' );
}
$seed_scheme = $seed_components['scheme'];
$seed_host = $seed_components['host'];
$url_start = $seed_scheme. '://'. $seed_host;
//Download Seed URL
parsePage( $seed_url, "" );
//Loop through all pages on site.
while(1) {
	$counter = 0;
	$select_query = "SELECT * FROM pages WHERE download_time IS NULL";
	if(($select_result = mysqli_query( $mysql_conn, $select_query )) !== false) {
		if( ($rowCount = mysqli_num_rows($select_result)) > 0 ) {
			for( $i = 0; $i < $rowCount; $i++ ) {
				if( ( $row = mysqli_fetch_assoc($select_result) ) !== false ) {
					$path = $row['path'];
					$referer = $row['referer'];
					//Skip paths that don't start with a '/'
					if( $path[0] != '/' ) {
						continue;
					}
					if( parsePage( $url_start. $path, $referer ) ) {
						$counter++;
					}
					//Wait between requests so we don't burden the server
					sleep(10);
				}
			}
		} else {
			break;
		}
	} else {
		die( "Unable to select un-downloaded pages\n" );
	}
	if($counter == 0) {
		break;
	}
}

Step 1: Setup & Connect to Database

First, we need a database to store our page data, and our script needs to connect to it. The database should have the following structure:

CREATE DATABASE phpCrawlerTutorial;
USE phpCrawlerTutorial;
CREATE TABLE `pages` (
  `path` varchar(512) NOT NULL,
  `referer` text,
  `title` text,
  `description` text,
  `h1` text,
  `download_time` timestamp NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
ALTER TABLE `pages`
  ADD PRIMARY KEY (`path`);
COMMIT;

The first operation we need to do in our script is to connect to the database. Please replace the $mysql_host, $mysql_username, and $mysql_password with the settings for your own MySQL server.

$mysql_host = '';
$mysql_username = '';
$mysql_password = '';
$mysql_database = 'phpCrawlerTutorial';
$mysql_conn = mysqli_connect( $mysql_host, $mysql_username, $mysql_password, $mysql_database );
if ( !$mysql_conn ) {
       echo "Error: Unable to connect to MySQL." . PHP_EOL;
       echo "Debugging errno: " . mysqli_connect_errno() . PHP_EOL;
       echo "Debugging error: " . mysqli_connect_error() . PHP_EOL;
       exit;
}

Step 2: Create a Function to Download Webpages

We will be using our function “_http()” to download webpages individually. Please copy and paste the code below. You can see a full description of how the function works in our tutorial Downloading a Webpage using PHP and cURL.

/**
 * Download a Webpage via the HTTP GET Protocol using libcurl
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 */
function _http ( $target, $referer ) {
	//Initialize Handle
	$handle = curl_init();
	//Define Settings
	curl_setopt ( $handle, CURLOPT_HTTPGET, true );
	curl_setopt ( $handle, CURLOPT_HEADER, true );
	curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
	curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
	curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
	curl_setopt ( $handle, CURLOPT_URL, $target );
	curl_setopt ( $handle, CURLOPT_REFERER, $referer );
	curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
	curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
	//Execute Request
	$output = curl_exec ( $handle );
	//Close cURL handle
	curl_close ( $handle );
	//Separate Header and Body
	$separator = "\r\n\r\n";
	$header = substr( $output, 0, strpos( $output, $separator ) );
	$body_start = strlen( $header ) + strlen( $separator );
	$body = substr( $output, $body_start, strlen( $output ) - $body_start );
	//Parse Headers
	$header_array = Array();
	foreach ( explode ( "\r\n", $header ) as $i => $line ) {
		if($i === 0) {
			$header_array['http_code'] = $line;
			$status_info = explode( " ", $line );
			$header_array['status_info'] = $status_info;
		} else {
			//Split on the first ': ' only, so values containing ': ' stay intact
			$parts = explode ( ': ', $line, 2 );
			if( count( $parts ) == 2 ) {
				$header_array[$parts[0]] = $parts[1];
			}
		}
	}
	//Form Return Structure
	$ret = Array("headers" => $header_array, "body" => $body );
	return $ret;
}
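One detail worth knowing: when CURLOPT_HEADER and CURLOPT_FOLLOWLOCATION are both enabled, cURL’s output contains one header block per response in the redirect chain, so the first block may describe a redirect rather than the final page. The sketch below shows one way to keep only the last header block. It works on a raw response string, so it can be tried without any network access. The function name splitResponse() and the 'status_code' index are illustrative, and it assumes the body itself does not begin with "HTTP/".

```php
<?php
/**
 * Split a raw HTTP response (as returned by cURL with CURLOPT_HEADER enabled)
 * into the headers of the final response and the body.
 * Assumption: header blocks end with "\r\n\r\n" and each block starts with "HTTP/".
 */
function splitResponse( string $raw ): array {
	$separator = "\r\n\r\n";
	$header = '';
	$body = $raw;
	// Skip past every leading header block (one per redirect hop)
	while ( strncmp( $body, 'HTTP/', 5 ) === 0 && ( $pos = strpos( $body, $separator ) ) !== false ) {
		$header = substr( $body, 0, $pos );
		$body = (string) substr( $body, $pos + strlen( $separator ) );
	}
	$headers = array( 'status_code' => 0 );
	foreach ( explode( "\r\n", $header ) as $i => $line ) {
		if ( $i === 0 ) {
			// Status line, e.g. "HTTP/1.1 200 OK"
			$parts = explode( ' ', $line );
			$headers['status_code'] = isset( $parts[1] ) ? (int) $parts[1] : 0;
		} elseif ( strpos( $line, ':' ) !== false ) {
			// Limit to 2 parts so values containing ':' stay intact
			list( $key, $value ) = explode( ':', $line, 2 );
			$headers[ strtolower( trim( $key ) ) ] = trim( $value );
		}
	}
	return array( 'headers' => $headers, 'body' => $body );
}
```

Header names are lower-cased here because HTTP header names are case-insensitive, which makes later lookups predictable.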

Step 3: Parsing Our Page

Before we parse our page, we will need a function to convert relative links into absolute ones. To do this, we can use the following function (just copy and paste it into your script):

/**
 * Convert Relative to Absolute URL
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 * From: https://potentpages.com/web-crawler-development/tutorials/php/creating-a-simple-website-crawler
 * Based On: https://stackoverflow.com/questions/4444475/transfrom-relative-path-into-absolute-url-using-php
 */
function relativeToAbsolute( $relative, $base )
{
	if($relative == "" || $base == "") return "";
	//Check Base
	$base_parsed = parse_url($base);
	if( !array_key_exists( 'scheme', $base_parsed ) || !array_key_exists( 'host', $base_parsed ) || !array_key_exists( 'path', $base_parsed ) ) {
		echo "Base Path \"$base\" Not Absolute Link\n";
		return "";
	}
	//Parse Relative
	$relative_parsed = parse_url($relative);
	//If relative URL already has a scheme, it's already absolute
	if( array_key_exists( 'scheme', $relative_parsed ) && $relative_parsed['scheme'] != '' ) {
		return $relative;
	}
	//If only a query or a fragment, return base (without any fragment or query) + relative
	if( !array_key_exists( 'scheme', $relative_parsed ) && !array_key_exists( 'host', $relative_parsed ) && !array_key_exists( 'path', $relative_parsed ) ) {
		return $base_parsed['scheme']. '://'. $base_parsed['host']. $base_parsed['path']. $relative;
	}
	//Remove non-directory portion from path
	$path = preg_replace( '#/[^/]*$#', '', $base_parsed['path'] );
	//If relative path already points to root, remove base return absolute path
	if( $relative[0] == '/' ) {
		$path = '';
	}
	//Working Absolute URL
	$abs = '';
	//If user in URL
	if( array_key_exists( 'user', $base_parsed ) ) {
		$abs .= $base_parsed['user'];
		//If password in URL as well
		if( array_key_exists( 'pass', $base_parsed ) ) {
			$abs .= ':'. $base_parsed['pass'];
		}
		//Append location prefix
		$abs .= '@';
	}
	//Append Host
	$abs .= $base_parsed['host'];
	//If port in URL
	if( array_key_exists( 'port', $base_parsed ) ) {
		$abs .= ':'. $base_parsed['port'];
	}
	//Append New Relative Path
	$abs .= $path. '/'. $relative;
	//Replace any '//' or '/./' or '/foo/../' with '/'
	$regex = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
	for( $n=1; $n>0; $abs = preg_replace( $regex, '/', $abs, -1, $n ) ) {}
	//Return Absolute URL
	return $base_parsed['scheme']. '://'. $abs;
}
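The last step of relativeToAbsolute(), which collapses '//', '/./', and '/foo/../' sequences, is easier to follow in isolation. The sketch below reuses the same two regular expressions on a bare path. The function name collapsePath() is illustrative, and the empty for loop has been rewritten as a do/while loop purely for readability.

```php
<?php
/**
 * Collapse '//', '/./' and '/dir/../' sequences in a path.
 * Uses the same two patterns as relativeToAbsolute(); the function name is illustrative.
 */
function collapsePath( string $path ): string {
	$regex = array( '#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#' );
	do {
		// $count receives the number of replacements made in this pass
		$path = preg_replace( $regex, '/', $path, -1, $count );
	} while ( $count > 0 );
	return $path;
}

echo collapsePath( '/a/b/../c/./d//e' ), "\n";   // prints "/a/c/d/e"
```

The loop is needed because one pass can expose a new sequence: '/blog/2018/../../index.html' first becomes '/blog/../index.html' and only then '/index.html'.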

We now need to get the content that we want (the title, meta-description, and the first H1 tag) as well as the links from the page so that we can know what new links to add to our table.

First up, we will create a new function:

function parsePage( $target, $referer ) {
        global $mysql_conn;
}

We will need to parse the components of our target URL. To do this, we will use the parse_url function. We’re looking to get the ‘path’ of the URL, so we will save that index. We will also need the “host” value later.

$url_components = parse_url( $target );
if($url_components === false) {
        die( 'Unable to Parse URL' );
}
$url_host = $url_components['host'];
$url_path = '';
if(array_key_exists( 'path', $url_components ) == false) {
        //If not a valid path, mark as done
        $query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
        if( !mysqli_query($mysql_conn, $query) ) {
                die( "Error: Unable to perform Download Time Update Query (path)\n" );
        }
        return false;
} else {
        $url_path = $url_components['path'];
}
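If you haven’t used parse_url() before, this short example shows which indexes it returns. The domain example.com is a placeholder, not a site from this tutorial.

```php
<?php
// Break an example URL into its components (example.com is a placeholder domain)
$components = parse_url( 'https://example.com/blog/post.html?id=3#comments' );
print_r( $components );

// A URL without a path yields no 'path' index at all,
// which is why the script checks array_key_exists( 'path', ... ) first
$no_path = parse_url( 'https://example.com' );
var_dump( array_key_exists( 'path', $no_path ) );
```

The first call returns the 'scheme', 'host', 'path', 'query', and 'fragment' indexes; the second returns only 'scheme' and 'host'.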

We will now need to call our _http() function.

//Download Page
echo "Downloading: $target\n";
$contents = _http ( $target, $referer );
echo "Done\n";

We need to check to make sure that the page returned a “200 OK” status. To do this, we will use:

//Check Status
if( $contents['headers']['status_info'][1] != 200 ) {
         //If not ok, mark as downloaded but skip
         $query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
         if( !mysqli_query($mysql_conn, $query) ) {
                die( "Error: Unable to perform Download Time Update Query\n" );
         }
         return false;
}

Now, we need to structure the HTML in our page. To do this, we will use the DOMDocument class. We will create a new DOMDocument object and use it to parse our HTML (body) contents. We also want to specify that we don’t want the HTML parsing errors passed through to the PHP error output (it just clutters up the terminal/output).

$doc = new DOMDocument();
libxml_use_internal_errors( true );
$doc->loadHTML( $contents['body'] );
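As a quick illustration of what libxml_use_internal_errors() does, the sketch below parses a deliberately malformed snippet. The HTML string is made up for the example. The parsing problems are collected internally instead of being printed, and the title can still be read. How many problems are recorded depends on the libxml version, so the example only prints the count.

```php
<?php
// Malformed HTML: an unclosed <p> and an unknown <foo> tag (made-up example input)
$html = '<html><head><title>Test Page</title></head><body><p>Hello<foo>bar</body></html>';

// Collect parsing problems internally instead of emitting PHP warnings
libxml_use_internal_errors( true );
$doc = new DOMDocument();
$doc->loadHTML( $html );

// Read the collected problems, then empty the buffer so it doesn't grow page after page
$errors = libxml_get_errors();
libxml_clear_errors();

echo 'Parse problems recorded: ' . count( $errors ) . "\n";
echo 'Title: ' . $doc->getElementsByTagName( 'title' )->item( 0 )->nodeValue . "\n";
```

In a long-running crawler, calling libxml_clear_errors() after each page is a reasonable habit, since the internal error buffer otherwise keeps accumulating entries.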

At this point, with our content structured, we can extract the content that we’re looking for. This part will depend on your own project, but for now, we’ll extract the title tag, the meta description, and the first H1 tag and save the info in the database table. To do this, we will use the DOMDocument::getElementsByTagName function. It returns a list of matching elements (a DOMNodeList), and if a tag exists, we will save the value of the first one.

$title = '';
$titleTags = $doc->getElementsByTagName('title');
if( count( $titleTags ) > 0 ) {
        $title = mysqli_real_escape_string( $mysql_conn, $titleTags[0]->nodeValue );
}

For the description, we will need to get all of the meta tags, and parse them to check if their “name” attribute is equal to “description”.

$description = '';
$metaTags = $doc->getElementsByTagName('meta');
foreach( $metaTags as $tag ) {
        if( $tag->getAttribute('name') == 'description' ) {
                $description = mysqli_real_escape_string( $mysql_conn, $tag->getAttribute( 'content' ) );
        }
}

Obtaining the value of the first h1 tag will follow the same pattern as the title tag:

$h1 = '';
$h1Tags = $doc->getElementsByTagName('h1');
if( count( $h1Tags ) > 0 ) {
        $h1 = mysqli_real_escape_string( $mysql_conn, $h1Tags[0]->nodeValue );
}
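The three extractions can also be tried together without a database. The sketch below is a standalone variant, not the tutorial’s parsePage() function: extractPageFields() and its return structure are hypothetical, it skips the MySQL escaping, and it compares the meta name case-insensitively on the assumption that some pages write name="Description".

```php
<?php
/**
 * Extract the title, meta description and first h1 from an HTML string.
 * The function name and return structure are illustrative, not part of the tutorial's script.
 */
function extractPageFields( string $html ): array {
	libxml_use_internal_errors( true );
	$doc = new DOMDocument();
	$doc->loadHTML( $html );
	libxml_clear_errors();
	$fields = array( 'title' => '', 'description' => '', 'h1' => '' );
	// Title: first <title> element, if any
	$titleTags = $doc->getElementsByTagName( 'title' );
	if ( $titleTags->length > 0 ) {
		$fields['title'] = trim( $titleTags->item( 0 )->nodeValue );
	}
	// Description: first <meta> whose name attribute is "description"
	foreach ( $doc->getElementsByTagName( 'meta' ) as $tag ) {
		if ( strtolower( $tag->getAttribute( 'name' ) ) === 'description' ) {
			$fields['description'] = trim( $tag->getAttribute( 'content' ) );
			break;
		}
	}
	// H1: first <h1> element, if any
	$h1Tags = $doc->getElementsByTagName( 'h1' );
	if ( $h1Tags->length > 0 ) {
		$fields['h1'] = trim( $h1Tags->item( 0 )->nodeValue );
	}
	return $fields;
}

$sample = '<html><head><title> My Page </title><meta name="Description" content="A short summary."></head>'
	. '<body><h1>First</h1><h1>Second</h1></body></html>';
print_r( extractPageFields( $sample ) );
```

Keeping the extraction separate from the database code like this makes it easy to test against saved page sources before running a full crawl.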

Now that we have the data for our page, we will need to insert it into the database if it doesn’t exist, or update it if it already does.

//Insert/Update Page Data
$query = "INSERT INTO pages (path, title, description, h1, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", \"$title\", \"$description\", \"$h1\", NOW()) ON DUPLICATE KEY UPDATE title=\"$title\", description=\"$description\", h1=\"$h1\", download_time=NOW()";
if( !mysqli_query($mysql_conn, $query) ) {
        die( "Error: Unable to perform Insert Query\n" );
}
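As an alternative to building the query string by hand, the same insert-or-update can be written with a prepared statement, which keeps the values out of the SQL text entirely. This is a sketch under stated assumptions: savePage() is a hypothetical helper, $conn is an open mysqli connection to a database containing the pages table from Step 1, and the values passed in must be the raw strings, not ones already run through mysqli_real_escape_string().

```php
<?php
/**
 * Insert or update a page row using a prepared statement.
 * Assumption: $conn is an open mysqli connection to a database containing the
 * `pages` table from Step 1. The function name is illustrative.
 */
function savePage( mysqli $conn, string $path, string $title, string $description, string $h1 ): bool {
	// VALUES(col) refers to the value that the INSERT part attempted to store
	$sql = "INSERT INTO pages (path, title, description, h1, download_time)
		VALUES (?, ?, ?, ?, NOW())
		ON DUPLICATE KEY UPDATE title = VALUES(title), description = VALUES(description),
			h1 = VALUES(h1), download_time = NOW()";
	$stmt = mysqli_prepare( $conn, $sql );
	if ( $stmt === false ) {
		return false;
	}
	// 'ssss' = four string parameters, bound in order
	mysqli_stmt_bind_param( $stmt, 'ssss', $path, $title, $description, $h1 );
	$ok = mysqli_stmt_execute( $stmt );
	mysqli_stmt_close( $stmt );
	return $ok;
}
```

Inside parsePage(), a call such as savePage( $mysql_conn, $url_path, $title, $description, $h1 ) could stand in for the query above, provided the three values are left unescaped.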

Next up, we need to get all of the links from our downloaded page. To do this, we will use the same getElementsByTagName function, this time obtaining all of the “a” tags. For each tag, we will read the “href” attribute (just like we read the “content” attribute of the description tag), convert it into an absolute URL using our relativeToAbsolute() function, and keep only the links that stay on the same host. We will also use the array_search() function to ensure that we aren’t duplicating links in our array.

//Get Links
$links = Array();
$link_tags = $doc->getElementsByTagName( 'a' );
foreach( $link_tags as $tag ) {
        if( ($href_value = $tag->getAttribute( 'href' ))) {
                $link_absolute = relativeToAbsolute( $href_value, $target );
                $link_parsed = parse_url( $link_absolute );
                if($link_parsed === false) {
                        die( 'Unable to Parse Link URL' );
                }
                if(( !array_key_exists( 'host', $link_parsed ) || $link_parsed['host'] == "" || $link_parsed['host'] == $url_host ) && array_key_exists( 'path', $link_parsed ) && $link_parsed['path'] != "" && array_search( $link_parsed['path'], $links ) === false) {
                        $links[] = $link_parsed['path'];
                }
        }
}
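The filtering condition above is long, so here is the same idea as a standalone sketch that can be run on a plain list of URLs. It is slightly stricter than the tutorial’s condition: it assumes the URLs are already absolute and skips anything without a host (such as mailto: links). The function name sameHostPaths() and the example.com URLs are illustrative.

```php
<?php
/**
 * Keep the unique paths of links that stay on the given host.
 * Assumption: $urls already holds absolute URLs; the function name is illustrative.
 */
function sameHostPaths( array $urls, string $host ): array {
	$paths = array();
	foreach ( $urls as $url ) {
		$parsed = parse_url( $url );
		if ( $parsed === false || !isset( $parsed['host'], $parsed['path'] ) ) {
			continue;   // Skip malformed URLs and links such as mailto:
		}
		// Host names are case-insensitive
		if ( strcasecmp( $parsed['host'], $host ) !== 0 ) {
			continue;
		}
		// Using the path as the array key removes duplicates
		$paths[ $parsed['path'] ] = true;
	}
	return array_keys( $paths );
}

$found = sameHostPaths( array(
	'https://example.com/a',
	'https://example.com/a#top',
	'https://EXAMPLE.com/b?x=1',
	'https://other.example/c',
	'mailto:someone@example.com',
), 'example.com' );
print_r( $found );
```

Using the path as an array key is an alternative to array_search() for removing duplicates, and it avoids rescanning the whole array for every link.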

Now, we need to insert each link into the database if it doesn’t already exist. Before we do, we need to escape the path though. We also need to include the referring URL (the $target variable passed to this function) in order to know what to use as the referrer when we download the next page.

//Insert Links
foreach($links as $link) {
        $link_escaped = mysqli_real_escape_string( $mysql_conn, $link );
        $query = "INSERT IGNORE INTO pages (path, referer, download_time) VALUES (\"$link_escaped\", \"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NULL)";
        if( !mysqli_query($mysql_conn, $query) ) {
                die( "Error: Unable to perform Insert Link Value Query\n" );
        }
}

Finally, we can return true, indicating that the page download was successful.

return true;

Step 4: Traversing the Site

Before we do anything else, we need to define our starting point (our seed URL). This will be the main URL of the site, e.g. http://domain.com/ , or another URL of your choosing. We will also need to parse the URL to get the scheme (http or https) and the host of the URL. Combining these two gives us the prefix that we will put in front of every path we download.

//Define Seed Settings
$seed_url = "https://potentpages.com/";
$seed_components = parse_url( $seed_url );
if($seed_components === false) {
        die( 'Unable to Seed Parse URL' );
}
$seed_scheme = $seed_components['scheme'];
$seed_host = $seed_components['host'];
$url_start = $seed_scheme. '://'. $seed_host;

We can also start the download of our seed URL right away.

//Download Seed URL
parsePage( $seed_url, "" );

At this point, we just need to loop through all of the pages on the site. To do this, we need to have a master loop that will keep looping until we tell it to break from the cycle.

//Loop through all pages on site.
while(1)
{
}

Next, inside the loop, we will place a select query (and the code to execute it) that selects all rows with a NULL download time, indicating that they haven’t been downloaded yet. From each row we need the path and the referrer. We can then form the full page URL and run the parsing function. Additionally, we keep a counter of successfully parsed pages: if a full pass through the results parses none (for example, because the only remaining rows are invalid links that never get updated), we break out of the loop instead of cycling forever.

VERY IMPORTANT! We also need to introduce a delay between the downloading to make sure we don’t burden the hosting web server and get the IP address banned. I recommend a delay of at least 5 seconds, preferably 10.

$counter = 0;
$select_query = "SELECT * FROM pages WHERE download_time IS NULL";
if(($select_result = mysqli_query( $mysql_conn, $select_query )) !== false) {
	if( ($rowCount = mysqli_num_rows($select_result)) > 0 ) {
		for( $i = 0; $i < $rowCount; $i++ ) {
			if( ( $row = mysqli_fetch_assoc($select_result) ) !== false ) {
				$path = $row['path'];
				$referer = $row['referer'];
				//Skip paths that don't start with a '/'
				if( $path[0] != '/' ) {
					continue;
				}
				if( parsePage( $url_start. $path, $referer ) ) {
					$counter++;
				}
				sleep(10);
			}
		}
	} else {
		break;
	}
} else {
	die( "Unable to select un-downloaded pages\n" );
}
if($counter == 0) {
	break;
}
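The sleep(10) call above always waits the full ten seconds, on top of however long the download and parsing took. If you would rather keep requests a fixed interval apart, one option is to wait only for the remainder. The helper below is a sketch of that idea; remainingDelay() and its parameters are hypothetical names, and the usage lines use a tiny delay so the example finishes quickly.

```php
<?php
/**
 * Seconds still to wait so that requests are at least $minDelay seconds apart.
 * The function name and parameters are illustrative.
 */
function remainingDelay( float $lastRequestAt, float $now, float $minDelay ): float {
	$elapsed = $now - $lastRequestAt;
	// Never return a negative wait
	return max( 0.0, $minDelay - $elapsed );
}

// Usage: wait only for the part of the delay that the work has not already used up
$last = microtime( true );
// ... download and parse a page here ...
$wait = remainingDelay( $last, microtime( true ), 0.01 );   // tiny delay so the example runs quickly
usleep( (int) round( $wait * 1000000 ) );
```

In the crawler loop, $minDelay would be the 5 to 10 seconds recommended above rather than 0.01.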

The Final Script

The final script you can use should look like:

$mysql_host = '';
$mysql_username = '';
$mysql_password = '';
$mysql_database = 'phpCrawlerTutorial';
$mysql_conn = mysqli_connect( $mysql_host, $mysql_username, $mysql_password, $mysql_database );
if ( !$mysql_conn ) {
	echo "Error: Unable to connect to MySQL." . PHP_EOL;
	echo "Debugging errno: " . mysqli_connect_errno() . PHP_EOL;
	echo "Debugging error: " . mysqli_connect_error() . PHP_EOL;
	exit;
}
/**
 * Download a Webpage via the HTTP GET Protocol using libcurl
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 */
function _http ( $target, $referer ) {
	//Initialize Handle
	$handle = curl_init();
	//Define Settings
	curl_setopt ( $handle, CURLOPT_HTTPGET, true );
	curl_setopt ( $handle, CURLOPT_HEADER, true );
	curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
	curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
	curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
	curl_setopt ( $handle, CURLOPT_URL, $target );
	curl_setopt ( $handle, CURLOPT_REFERER, $referer );
	curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
	curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );
	//Execute Request
	$output = curl_exec ( $handle );
	//Close cURL handle
	curl_close ( $handle );
	//Separate Header and Body
	$separator = "\r\n\r\n";
	$header = substr( $output, 0, strpos( $output, $separator ) );
	$body_start = strlen( $header ) + strlen( $separator );
	$body = substr( $output, $body_start, strlen( $output ) - $body_start );
	//Parse Headers
	$header_array = Array();
	foreach ( explode ( "\r\n", $header ) as $i => $line ) {
		if($i === 0) {
			$header_array['http_code'] = $line;
			$status_info = explode( " ", $line );
			$header_array['status_info'] = $status_info;
		} else {
			//Split on the first ': ' only, so values containing ': ' stay intact
			$parts = explode ( ': ', $line, 2 );
			if( count( $parts ) == 2 ) {
				$header_array[$parts[0]] = $parts[1];
			}
		}
	}
	//Form Return Structure
	$ret = Array("headers" => $header_array, "body" => $body );
	return $ret;
}
/**
 * Convert Relative to Absolute URL
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 * From: https://potentpages.com/web-crawler-development/tutorials/php/simple-php-web-spider
 * Based On: https://stackoverflow.com/questions/4444475/transfrom-relative-path-into-absolute-url-using-php
 */
function relativeToAbsolute( $relative, $base )
{
	if($relative == "" || $base == "") return "";
	//Check Base
	$base_parsed = parse_url($base);
	if( !array_key_exists( 'scheme', $base_parsed ) || !array_key_exists( 'host', $base_parsed ) || !array_key_exists( 'path', $base_parsed ) ) {
		echo "Base Path \"$base\" Not Absolute Link\n";
		return "";
	}
	//Parse Relative
	$relative_parsed = parse_url($relative);
	//If relative URL already has a scheme, it's already absolute
	if( array_key_exists( 'scheme', $relative_parsed ) && $relative_parsed['scheme'] != '' ) {
		return $relative;
	}
	//If only a query or a fragment, return base (without any fragment or query) + relative
	if( !array_key_exists( 'scheme', $relative_parsed ) && !array_key_exists( 'host', $relative_parsed ) && !array_key_exists( 'path', $relative_parsed ) ) {
		return $base_parsed['scheme']. '://'. $base_parsed['host']. $base_parsed['path']. $relative;
	}
	//Remove non-directory portion from path
	$path = preg_replace( '#/[^/]*$#', '', $base_parsed['path'] );
	//If relative path already points to root, remove base return absolute path
	if( $relative[0] == '/' ) {
		$path = '';
	}
	//Working Absolute URL
	$abs = '';
	//If user in URL
	if( array_key_exists( 'user', $base_parsed ) ) {
		$abs .= $base_parsed['user'];
		//If password in URL as well
		if( array_key_exists( 'pass', $base_parsed ) ) {
			$abs .= ':'. $base_parsed['pass'];
		}
		//Append location prefix
		$abs .= '@';
	}
	//Append Host
	$abs .= $base_parsed['host'];
	//If port in URL
	if( array_key_exists( 'port', $base_parsed ) ) {
		$abs .= ':'. $base_parsed['port'];
	}
	//Append New Relative Path
	$abs .= $path. '/'. $relative;
	//Replace any '//' or '/./' or '/foo/../' with '/'
	$regex = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
	for( $n=1; $n>0; $abs = preg_replace( $regex, '/', $abs, -1, $n ) ) {}
	//Return Absolute URL
	return $base_parsed['scheme']. '://'. $abs;
}
function parsePage( $target, $referer ) {
	global $mysql_conn;
	//Parse URL and get Components
	$url_components = parse_url( $target );
	if($url_components === false) {
		die( 'Unable to Parse URL' );
	}
	$url_host = $url_components['host'];
	$url_path = '';
	if(array_key_exists( 'path', $url_components ) == false) {
		//If not a valid path, mark as done
		$query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
		if( !mysqli_query($mysql_conn, $query) ) {
			die( "Error: Unable to perform Download Time Update Query (path)\n" );
		}
		return false;
	} else {
		$url_path = $url_components['path'];
	}
	//Download Page
	echo "Downloading: $target\n";
	$contents = _http ( $target, $referer );
	echo "Done\n";
	//Check Status
	if( $contents['headers']['status_info'][1] != 200 ) {
		//If not ok, mark as downloaded but skip
		$query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
		if( !mysqli_query($mysql_conn, $query) ) {
			die( "Error: Unable to perform Download Time Update Query (http status)\n" );
		}
		return false;
	}
	//Parse Contents
	$doc = new DOMDocument();
	libxml_use_internal_errors( true );
	$doc->loadHTML( $contents['body'] );
	//Get title
	$title = '';
	$titleTags = $doc->getElementsByTagName('title');
	if( count( $titleTags ) > 0 ) {
		$title = mysqli_real_escape_string( $mysql_conn, $titleTags[0]->nodeValue );
	}
	//Get Description
	$description = '';
	$metaTags = $doc->getElementsByTagName('meta');
	foreach( $metaTags as $tag ) {
		if( $tag->getAttribute('name') == 'description' ) {
			$description = mysqli_real_escape_string( $mysql_conn, $tag->getAttribute( 'content' ) );
		}
	}
	//Get first h1
	$h1 = '';
	$h1Tags = $doc->getElementsByTagName('h1');
	if( count( $h1Tags ) > 0 ) {
		$h1 = mysqli_real_escape_string( $mysql_conn, $h1Tags[0]->nodeValue );
	}
	//Insert/Update Page Data
	$query = "INSERT INTO pages (path, title, description, h1, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", \"$title\", \"$description\", \"$h1\", NOW()) ON DUPLICATE KEY UPDATE title=\"$title\", description=\"$description\", h1=\"$h1\", download_time=NOW()";
	if( !mysqli_query($mysql_conn, $query) ) {
		die( "Error: Unable to perform Insert Query\n" );
	}
	//Get Links
	$links = Array();
	$link_tags = $doc->getElementsByTagName( 'a' );
	foreach( $link_tags as $tag ) {
		if( ($href_value = $tag->getAttribute( 'href' ))) {
			$link_absolute = relativeToAbsolute( $href_value, $target );
			$link_parsed = parse_url( $link_absolute );
			if($link_parsed === null || $link_parsed === false) {
				die( 'Unable to Parse Link URL' );
			}
			if(( !array_key_exists( 'host', $link_parsed ) || $link_parsed['host'] == "" || $link_parsed['host'] == $url_host ) && array_key_exists( 'path', $link_parsed ) && $link_parsed['path'] != "" && array_search( $link_parsed['path'], $links ) === false) {
				$links[] = $link_parsed['path'];
			}
		}
	}
	//Insert Links
	foreach($links as $link) {
		$link_escaped = mysqli_real_escape_string( $mysql_conn, $link );
		$query = "INSERT IGNORE INTO pages (path, referer, download_time) VALUES (\"$link_escaped\", \"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NULL)";
		if( !mysqli_query($mysql_conn, $query) ) {
			die( "Error: Unable to perform Insert Link Value Query\n" );
		}
	}
	return true;
}
//Define Seed Settings
$seed_url = "https://potentpages.com/";
$seed_components = parse_url( $seed_url );
if($seed_components === false) {
	die( 'Unable to Seed Parse URL' );
}
$seed_scheme = $seed_components['scheme'];
$seed_host = $seed_components['host'];
$url_start = $seed_scheme. '://'. $seed_host;
//Download Seed URL
parsePage( $seed_url, "" );
//Loop through all pages on site.
while(1) {
	$counter = 0;
	$select_query = "SELECT * FROM pages WHERE download_time IS NULL";
	if(($select_result = mysqli_query( $mysql_conn, $select_query )) !== false) {
		if( ($rowCount = mysqli_num_rows($select_result)) > 0 ) {
			for( $i = 0; $i < $rowCount; $i++ ) {
				if( ( $row = mysqli_fetch_assoc($select_result) ) !== false ) {
					$path = $row['path'];
					$referer = $row['referer'];
					//Skip paths that don't start with a '/'
					if( $path[0] != '/' ) {
						continue;
					}
					if( parsePage( $url_start. $path, $referer ) ) {
						$counter++;
					}
					//Wait between requests so we don't burden the server
					sleep(10);
				}
			}
		} else {
			break;
		}
	} else {
		die( "Unable to select un-downloaded pages\n" );
	}
	if($counter == 0) {
		break;
	}
}

David Selden-Treiman

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive expertise solving problems using programming for dozens of clients. He also has extensive experience managing and optimizing servers, managing dozens of servers for both Potent Pages and other clients.

Tags: cURL PHP7 Tutorial Web Crawler
