How to create a polite PHP web crawler using robots.txt.

Creating a Polite PHP Web Crawler: Checking robots.txt

May 31, 2018 | By David Selden-Treiman | Filed in: php.

Checking a Website’s robots.txt Using PHP

When downloading sites, it’s important to make sure that you have the site owner’s permission to download pages. This is generally indicated by the robots.txt file, which specifies which pages crawlers aren’t allowed to download. It can also explicitly allow pages, point to the site’s sitemap.xml, request a minimum delay between crawler requests, and set other options. Rules can also be grouped by crawler “user agent” name, so that different crawlers can be given different rules.
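To make the file format concrete, here is a minimal sketch that groups the directives of a robots.txt file by user agent. The sample file, its paths, and the groupRobotsTxt() helper are all made up for illustration; the library used later in this tutorial does this work for us, and its parsing is likely more thorough than this.

```php
<?php
// Hypothetical sample robots.txt; the paths are invented for illustration.
$sample = <<<TXT
User-agent: *
Disallow: /admin/
Crawl-delay: 5

User-agent: web-crawler-tutorial-test
Disallow: /private/
Allow: /private/public-page.html

Sitemap: https://example.com/sitemap.xml
TXT;

/**
 * Group robots.txt directives by user agent (simplified illustration).
 * Returns ['agents' => [agent => [[directive, value], ...]], 'sitemaps' => [...]].
 */
function groupRobotsTxt(string $text): array {
    $result = ['agents' => [], 'sitemaps' => []];
    $current = [];      // user agents the current group applies to
    $inRules = false;   // true once the current group has at least one rule

    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        // Strip comments and surrounding whitespace
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'sitemap') {
            // Sitemap lines are not tied to any user agent
            $result['sitemaps'][] = $value;
        } elseif ($field === 'user-agent') {
            // A user-agent line that follows rules starts a new group
            if ($inRules) {
                $current = [];
                $inRules = false;
            }
            $agent = strtolower($value);
            $current[] = $agent;
            if (!isset($result['agents'][$agent])) {
                $result['agents'][$agent] = [];
            }
        } elseif ($current !== []) {
            // Any other directive belongs to every agent in the current group
            $inRules = true;
            foreach ($current as $agent) {
                $result['agents'][$agent][] = [$field, $value];
            }
        }
    }
    return $result;
}

$groups = groupRobotsTxt($sample);
print_r($groups['agents']['web-crawler-tutorial-test']);
```

In this sample, the rules under "User-agent: *" apply to any crawler without a more specific group, while our tutorial crawler gets its own set of rules.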

We will be working from our previous PHP web spider tutorial, extending it to check the robots.txt file of a site and make sure our crawler is allowed to access a specified page. To do this, we will be using the Robots.txt Parser Class PHP library. It will allow us to load in a robots.txt file, set our user agent, and check individual pages to see if we are allowed to download them or not.

First up, we’re going to start with our PHP site web crawler from before:
$mysql_host = '';
$mysql_username = '';
$mysql_password = '';
$mysql_database = 'phpCrawlerTutorial';
$mysql_conn = mysqli_connect( $mysql_host, $mysql_username, $mysql_password, $mysql_database );
if ( !$mysql_conn ) {
        echo "Error: Unable to connect to MySQL." . PHP_EOL;
        echo "Debugging errno: " . mysqli_connect_errno() . PHP_EOL;
        echo "Debugging error: " . mysqli_connect_error() . PHP_EOL;
        exit;
}

/**
 * Download a Webpage via the HTTP GET Protocol using libcurl
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 */
function _http ( $target, $referer ) {
        //Initialize Handle
        $handle = curl_init();

        //Define Settings
        curl_setopt ( $handle, CURLOPT_HTTPGET, true );
        curl_setopt ( $handle, CURLOPT_HEADER, true );
        curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" );
        curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" );
        curl_setopt ( $handle, CURLOPT_USERAGENT, "web-crawler-tutorial-test" );
        curl_setopt ( $handle, CURLOPT_URL, $target );
        curl_setopt ( $handle, CURLOPT_REFERER, $referer );
        curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true );
        curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 );
        curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true );

        //Execute Request
        $output = curl_exec ( $handle );

        //Close cURL handle
        curl_close ( $handle );

        //Separate Header and Body
        $separator = "\r\n\r\n";
        $header = substr( $output, 0, strpos( $output, $separator ) );
        $body_start = strlen( $header ) + strlen( $separator );
        $body = substr( $output, $body_start, strlen( $output ) - $body_start );

        //Parse Headers
        $header_array = Array();
        foreach ( explode ( "\r\n", $header ) as $i => $line ) {
                if($i === 0) {
                        $header_array['http_code'] = $line;
                        $status_info = explode( " ", $line );
                        $header_array['status_info'] = $status_info;
                } else {
                        list ( $key, $value ) = explode ( ': ', $line );
                        $header_array[$key] = $value;
                }
        }

        //Form Return Structure
        $ret = Array("headers" => $header_array, "body" => $body );
        return $ret;
}

/**
 * Convert Relative to Absolute URL
 * Developed By: Potent Pages, LLC (https://potentpages.com/)
 * From: https://potentpages.com/web-crawler-development/tutorials/php/simple-php-web-spider
 * Based On: https://stackoverflow.com/questions/4444475/transfrom-relative-path-into-absolute-url-using-php
 */
function relativeToAbsolute( $relative, $base ) {
        if($relative == "" || $base == "") return "";

        //Check Base
        $base_parsed = parse_url($base);
        if( !array_key_exists( 'scheme', $base_parsed ) || !array_key_exists( 'host', $base_parsed ) || !array_key_exists( 'path', $base_parsed ) ) {
                echo "Base Path \"$base\" Not Absolute Link\n";
                return "";
        }

        //Parse Relative
        $relative_parsed = parse_url($relative);

        //If relative URL already has a scheme, it's already absolute
        if( array_key_exists( 'scheme', $relative_parsed ) && $relative_parsed['scheme'] != '' ) {
                return $relative;
        }

        //If only a query or a fragment, return base (without any fragment or query) + relative
        if( !array_key_exists( 'scheme', $relative_parsed ) && !array_key_exists( 'host', $relative_parsed ) && !array_key_exists( 'path', $relative_parsed ) ) {
                return $base_parsed['scheme']. '://'. $base_parsed['host']. $base_parsed['path']. $relative;
        }

        //Remove non-directory portion from path
        $path = preg_replace( '#/[^/]*$#', '', $base_parsed['path'] );

        //If relative path already points to root, remove base return absolute path
        if( $relative[0] == '/' ) {
                $path = '';
        }

        //Working Absolute URL
        $abs = '';

        //If user in URL
        if( array_key_exists( 'user', $base_parsed ) ) {
                $abs .= $base_parsed['user'];

                //If password in URL as well
                if( array_key_exists( 'pass', $base_parsed ) ) {
                        $abs .= ':'. $base_parsed['pass'];
                }

                //Append location prefix
                $abs .= '@';
        }

        //Append Host
        $abs .= $base_parsed['host'];

        //If port in URL
        if( array_key_exists( 'port', $base_parsed ) ) {
                $abs .= ':'. $base_parsed['port'];
        }

        //Append New Relative Path
        $abs .= $path. '/'. $relative;

        //Replace any '//' or '/./' or '/foo/../' with '/'
        $regex = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
        for( $n=1; $n>0; $abs = preg_replace( $regex, '/', $abs, -1, $n ) ) {}

        //Return Absolute URL
        return $base_parsed['scheme']. '://'. $abs;
}

function parsePage( $target, $referer ) {
        global $mysql_conn;

        //Parse URL and get Components
        $url_components = parse_url( $target );
        if($url_components === false) {
                die( 'Unable to Parse URL' );
        }
        $url_host = $url_components['host'];
        $url_path = '';
        if(array_key_exists( 'path', $url_components ) == false) {
                //If not a valid path, mark as done
                $query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
                if( !mysqli_query($mysql_conn, $query) ) {
                        die( "Error: Unable to perform Download Time Update Query (path)\n" );
                }
                return false;
        } else {
                $url_path = $url_components['path'];
        }

        //Download Page
        echo "Downloading: $target\n";
        $contents = _http ( $target, $referer );
        echo "Done\n";

        //Check Status
        if( $contents['headers']['status_info'][1] != 200 ) {
                //If not ok, mark as downloaded but skip
                $query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
                if( !mysqli_query($mysql_conn, $query) ) {
                        die( "Error: Unable to perform Download Time Update Query (http status)\n" );
                }
                return false;
        }

        //Parse Contents
        $doc = new DOMDocument();
        libxml_use_internal_errors( true );
        $doc->loadHTML( $contents['body'] );

        //Get title
        $title = '';
        $titleTags = $doc->getElementsByTagName('title');
        if( count( $titleTags ) > 0 ) {
                $title = mysqli_real_escape_string( $mysql_conn, $titleTags[0]->nodeValue );
        }

        //Get Description
        $description = '';
        $metaTags = $doc->getElementsByTagName('meta');
        foreach( $metaTags as $tag ) {
                if( $tag->getAttribute('name') == 'description' ) {
                        $description = mysqli_real_escape_string( $mysql_conn, $tag->getAttribute( 'content' ) );
                }
        }

        //Get first h1
        $h1 = '';
        $h1Tags = $doc->getElementsByTagName('h1');
        if( count( $h1Tags ) > 0 ) {
                $h1 = mysqli_real_escape_string( $mysql_conn, $h1Tags[0]->nodeValue );
        }

        //Insert/Update Page Data
        $query = "INSERT INTO pages (path, title, description, h1, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", \"$title\", \"$description\", \"$h1\", NOW()) ON DUPLICATE KEY UPDATE title=\"$title\", description=\"$description\", h1=\"$h1\", download_time=NOW()";
        if( !mysqli_query($mysql_conn, $query) ) {
                die( "Error: Unable to perform Insert Query\n" );
        }

        //Get Links
        $links = Array();
        $link_tags = $doc->getElementsByTagName( 'a' );
        foreach( $link_tags as $tag ) {
                if( ($href_value = $tag->getAttribute( 'href' ))) {
                        $link_absolute = relativeToAbsolute( $href_value, $target );
                        $link_parsed = parse_url( $link_absolute );
                        if($link_parsed === null || $link_parsed === false) {
                                die( 'Unable to Parse Link URL' );
                        }
                        if(( !array_key_exists( 'host', $link_parsed ) || $link_parsed['host'] == "" || $link_parsed['host'] == $url_host ) && array_key_exists( 'path', $link_parsed ) && $link_parsed['path'] != "" && array_search( $link_parsed['path'], $links ) === false) {
                                $links[] = $link_parsed['path'];
                        }
                }
        }

        //Insert Links
        foreach($links as $link) {
                $link_escaped = mysqli_real_escape_string( $mysql_conn, $link );
                $query = "INSERT IGNORE INTO pages (path, referer, download_time) VALUES (\"$link_escaped\", \"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NULL)";
                if( !mysqli_query($mysql_conn, $query) ) {
                        die( "Error: Unable to perform Insert Link Value Query\n" );
                }
        }
        return true;
}

//Define Seed Settings
$seed_url = "https://potentpages.com/";
$seed_components = parse_url( $seed_url );
if($seed_components === false) {
        die( 'Unable to Seed Parse URL' );
}
$seed_scheme = $seed_components['scheme'];
$seed_host = $seed_components['host'];
$url_start = $seed_scheme. '://'. $seed_host;

//Download Seed URL
parsePage( $seed_url, "" );

//Loop through all pages on site.
while(1) {
        $counter = 0;
        $select_query = "SELECT * FROM pages WHERE download_time IS NULL";
        if(($select_result = mysqli_query( $mysql_conn, $select_query )) !== false) {
                if( ($rowCount = mysqli_num_rows($select_result)) > 0 ) {
                        for( $i = 0; $i < $rowCount; $i++ ) {
                                if( ( $row = mysqli_fetch_assoc($select_result) ) !== false ) {
                                        $path = $row['path'];
                                        $referer = $row['referer'];
                                        //Check if first character isn't a '/'
                                        if( $path[0] != '/' ) {
                                                continue;
                                        }
                                        if( parsePage( $url_start. $path, $referer ) ) {
                                                $counter++;
                                        }
                                        sleep(1);
                                }
                        }
                } else {
                        break;
                }
        } else {
                die( "Unable to select un-downloaded pages\n" );
        }
        if($counter == 0) {
                break;
        }
}

Installing the Library

Next, we need to install the PHP library. To do this, we will use Composer. If you don’t already have Composer, please install it by following the instructions on the Composer website (getcomposer.org).

After you have Composer installed, you’ll need to run the following from your web crawler project’s folder:
composer require t1gor/robots-txt-parser

This installs the package that we can call from our site crawler script. You should see something like:
Using version ^0.2.3 for t1gor/robots-txt-parser
./composer.json has been created
Loading composer repositories with package information
Updating dependencies (including require-dev)
Package operations: 1 install, 0 updates, 0 removals
  - Installing t1gor/robots-txt-parser (v0.2.3): Downloading (100%)
Writing lock file
Generating autoload files

Loading the Library

To load the library, we will need to place the standard Composer autoload line at the top of the script:
require 'vendor/autoload.php';

We will also need to specify our web crawler’s user agent name in a variable so that we can easily use and change it (we will now need it in multiple places).
//Crawler Settings
$user_agent = "web-crawler-tutorial-test/1.0";

In our _http() function, we will need to change the CURLOPT_USERAGENT to $user_agent instead of a static string. At the top of the function though, we need to specify that we’re treating the $user_agent variable as global (so we can access it from within the function’s scope).
function _http ( $target, $referer ) {
        //Global Variables
        global $user_agent;

        //Initialize Handle
        //...
}

… and replace the existing CURLOPT_USERAGENT line with the following:

curl_setopt ( $handle, CURLOPT_USERAGENT, $user_agent );

Initialization

In our code, we will need to initialize the robots.txt parser. To do this, we need to download the robots.txt file for the site, then give the parser both the file’s contents and the HTTP response status code of the download. We also need to check whether we are allowed to access the seed URL’s path. If not, we need to exit.

//Define Seed Settings
//...
//Initialize robots.txt File Check
$robots_txt_url = $url_start. "/robots.txt";
echo "Downloading: $robots_txt_url\n";
$robots_txt = _http($robots_txt_url, "");
$parser = new RobotsTxtParser($robots_txt['body']);
$parser->setHttpStatusCode($robots_txt['headers']['status_info'][1]);
$parser->setUserAgent($user_agent);
//Check if path is allowed
if( $parser->isDisallowed( $seed_components['path'] ?? '/' ) ) {
        die("Robots.txt: Disallowed Seed URL");
}
//Download Seed URL
//...
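The status code matters because a missing robots.txt is conventionally treated very differently from a server error. The sketch below follows the guidance in RFC 9309: a 4xx response means no restrictions apply, while a 5xx response means the crawler should assume everything is disallowed. Both robotsTxtUrl() and robotsPolicyForStatus() are hypothetical helpers written for this illustration; the library may handle some status codes differently, and treating every other code as "disallow" is a conservative assumption of this sketch.

```php
<?php
/**
 * Build the robots.txt URL for the site that hosts $url.
 * Unlike $url_start in the tutorial script, this keeps an explicit port.
 */
function robotsTxtUrl(string $url): string {
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';
    return $parts['scheme'] . '://' . $parts['host'] . $port . '/robots.txt';
}

/**
 * Decide how to treat a site based on the robots.txt response status.
 * Returns 'parse', 'allow_all' or 'disallow_all'.
 */
function robotsPolicyForStatus(int $status): string {
    if ($status >= 200 && $status < 300) {
        return 'parse';         // file retrieved: apply its rules
    }
    if ($status >= 400 && $status < 500) {
        return 'allow_all';     // no robots.txt available: nothing is restricted
    }
    return 'disallow_all';      // server errors and anything unexpected: stay away
}

echo robotsTxtUrl('https://example.com:8080/blog/post?id=1'), "\n";
echo robotsPolicyForStatus(404), "\n";
echo robotsPolicyForStatus(503), "\n";
```

Passing the real status code to the parser, as the tutorial code does, lets the library apply this kind of decision for us instead of parsing an error page as if it were a rule file.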

Checking URLs

At this point, we can start checking URLs within our while loop. To do this, we can just call the $parser->isAllowed() function and only download a page when it returns true. Our new loop should look like:

//Loop through all pages on site.
while(1) {
        $counter = 0;
        $select_query = "SELECT * FROM pages WHERE download_time IS NULL";
        if(($select_result = mysqli_query( $mysql_conn, $select_query )) !== false) {
                if( ($rowCount = mysqli_num_rows($select_result)) > 0 ) {
                        for( $i = 0; $i < $rowCount; $i++ ) {
                                if( ( $row = mysqli_fetch_assoc($select_result) ) !== false ) {
                                        $path = $row['path'];
                                        $referer = $row['referer'];
                                        //Check if first character isn't a '/'
                                        if( $path[0] != '/' ) {
                                                continue;
                                        }
                                        //Check if we're allowed to download the page.
                                        if( $parser->isAllowed( $path ) ) {
                                                if( parsePage( $url_start. $path, $referer ) ) {
                                                        $counter++;
                                                }
                                                sleep(1);
                                        }
                                }
                         }
                 } else {
                         break;
                 }
         } else {
                 die( "Unable to select un-downloaded pages\n" );
         }
         if($counter == 0) {
                 break;
         }
}
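The library takes care of rule matching for us, but it helps to have a rough mental model of what an allow/disallow check does. The sketch below uses plain prefix matching where the longest matching rule wins and "allow" wins a tie, which is the convention described in RFC 9309. It is not the library’s implementation: pathIsAllowed() is a hypothetical helper, the rules are invented, and wildcards are deliberately left out.

```php
<?php
/**
 * Simplified allow/disallow check: plain prefix matching, longest rule wins,
 * and "allow" wins a tie. Wildcards (* and $) are not handled.
 *
 * @param array  $rules list of [directive, pathPrefix] pairs for our user agent
 * @param string $path  URL path to check, e.g. "/private/notes.html"
 */
function pathIsAllowed(array $rules, string $path): bool {
    $bestLength = -1;
    $allowed = true;    // no matching rule means the path is allowed

    foreach ($rules as list($directive, $prefix)) {
        $directive = strtolower($directive);
        if ($directive !== 'allow' && $directive !== 'disallow') {
            continue;
        }
        // An empty prefix (e.g. "Disallow:") matches nothing
        if ($prefix === '' || strpos($path, $prefix) !== 0) {
            continue;
        }
        $length = strlen($prefix);
        if ($length > $bestLength || ($length === $bestLength && $directive === 'allow')) {
            $bestLength = $length;
            $allowed = ($directive === 'allow');
        }
    }
    return $allowed;
}

// Invented rules for illustration
$rules = [
    ['disallow', '/private/'],
    ['allow', '/private/public-page.html'],
];

var_dump(pathIsAllowed($rules, '/blog/'));                     // bool(true)
var_dump(pathIsAllowed($rules, '/private/notes.html'));        // bool(false)
var_dump(pathIsAllowed($rules, '/private/public-page.html'));  // bool(true)
```

The last call shows why rule length matters: the more specific "allow" rule overrides the broader "disallow" rule for that one page.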

The Final Script

At this point, you should have a script that you can run on other sites and that checks whether it’s allowed to access each page before downloading it:

<?php 
require 'vendor/autoload.php'; 

//Crawler Settings 
$user_agent = "web-crawler-tutorial-test/1.0"; 

//Connect to Database 
$mysql_host = ''; 
$mysql_username = ''; 
$mysql_password = ''; 
$mysql_database = 'phpCrawlerTutorial'; 
$mysql_conn = mysqli_connect( $mysql_host, $mysql_username, $mysql_password, $mysql_database ); 
if ( !$mysql_conn ) { 
        echo "Error: Unable to connect to MySQL." . PHP_EOL; 
        echo "Debugging errno: " . mysqli_connect_errno() . PHP_EOL; 
        echo "Debugging error: " . mysqli_connect_error() . PHP_EOL; 
        exit; 
} 

/** 
 * Download a Webpage via the HTTP GET Protocol using libcurl 
 * Developed By: Potent Pages, LLC (https://potentpages.com/) 
 */ 
function _http ( $target, $referer ) { 
        //Global Variables 
        global $user_agent; 

        //Initialize Handle 
        $handle = curl_init(); 

        //Define Settings 
        curl_setopt ( $handle, CURLOPT_HTTPGET, true ); 
        curl_setopt ( $handle, CURLOPT_HEADER, true ); 
        curl_setopt ( $handle, CURLOPT_COOKIEJAR, "cookie_jar.txt" ); 
        curl_setopt ( $handle, CURLOPT_COOKIEFILE, "cookies.txt" ); 
        curl_setopt ( $handle, CURLOPT_USERAGENT, $user_agent ); 
        curl_setopt ( $handle, CURLOPT_URL, $target ); 
        curl_setopt ( $handle, CURLOPT_REFERER, $referer ); 
        curl_setopt ( $handle, CURLOPT_FOLLOWLOCATION, true ); 
        curl_setopt ( $handle, CURLOPT_MAXREDIRS, 4 ); 
        curl_setopt ( $handle, CURLOPT_RETURNTRANSFER, true ); 

        //Execute Request 
        $output = curl_exec ( $handle ); 

        //Close cURL handle 
        curl_close ( $handle ); 

        //Separate Header and Body 
        $separator = "\r\n\r\n"; 
        $header = substr( $output, 0, strpos( $output, $separator ) ); 
        $body_start = strlen( $header ) + strlen( $separator ); 
        $body = substr( $output, $body_start, strlen( $output ) - $body_start ); 

        //Parse Headers 
        $header_array = Array(); 
        foreach ( explode ( "\r\n", $header ) as $i => $line ) {
                if($i === 0) {
                        $header_array['http_code'] = $line;
                        $status_info = explode( " ", $line );
                        $header_array['status_info'] = $status_info;
                } else {
                        list ( $key, $value ) = explode ( ': ', $line );
                        $header_array[$key] = $value;
                }
        }
        //Form Return Structure
        $ret = Array("headers" => $header_array, "body" => $body );
        return $ret;
}
/**
* Convert Relative to Absolute URL
* Developed By: Potent Pages, LLC (https://potentpages.com/)
* From: https://potentpages.com/web-crawler-development/tutorials/php/simple-php-web-spider
* Based On: https://stackoverflow.com/questions/4444475/transfrom-relative-path-into-absolute-url-using-php
*/
function relativeToAbsolute( $relative, $base )
{
        if($relative == "" || $base == "") return "";
        //Check Base
        $base_parsed = parse_url($base);
        if( !array_key_exists( 'scheme', $base_parsed ) || !array_key_exists( 'host', $base_parsed ) || !array_key_exists( 'path', $base_parsed ) ) {
                echo "Base Path \"$base\" Not Absolute Link\n";
                return "";
        }
        //Parse Relative
        $relative_parsed = parse_url($relative);
        //If relative URL already has a scheme, it's already absolute
        if( array_key_exists( 'scheme', $relative_parsed ) && $relative_parsed['scheme'] != '' ) {
                return $relative;
        }
        //If only a query or a fragment, return base (without any fragment or query) + relative
        if( !array_key_exists( 'scheme', $relative_parsed ) && !array_key_exists( 'host', $relative_parsed ) && !array_key_exists( 'path', $relative_parsed ) ) {
                return $base_parsed['scheme']. '://'. $base_parsed['host']. $base_parsed['path']. $relative;
        }
        //Remove non-directory portion from path
        $path = preg_replace( '#/[^/]*$#', '', $base_parsed['path'] );
        //If relative path already points to root, remove base return absolute path
        if( $relative[0] == '/' ) {
                $path = '';
        }
        //Working Absolute URL
        $abs = '';
        //If user in URL
        if( array_key_exists( 'user', $base_parsed ) ) {
                $abs .= $base_parsed['user'];
                //If password in URL as well
                if( array_key_exists( 'pass', $base_parsed ) ) {
                        $abs .= ':'. $base_parsed['pass'];
                }
                //Append location prefix
                $abs .= '@';
         }
         //Append Host
         $abs .= $base_parsed['host'];
         //If port in URL
         if( array_key_exists( 'port', $base_parsed ) ) {
                 $abs .= ':'. $base_parsed['port'];
         }
         //Append New Relative Path
         $abs .= $path. '/'. $relative;
         //Replace any '//' or '/./' or '/foo/../' with '/'
         $regex = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
         for( $n=1; $n>0; $abs = preg_replace( $regex, '/', $abs, -1, $n ) ) {}
         //Return Absolute URL
         return $base_parsed['scheme']. '://'. $abs;
}
function parsePage( $target, $referer ) {
        global $mysql_conn;
        //Parse URL and get Components
        $url_components = parse_url( $target );
        if($url_components === false) {
                die( 'Unable to Parse URL' );
        }
        $url_host = $url_components['host'];
        $url_path = '';
        if(array_key_exists( 'path', $url_components ) == false) {
                //If not a valid path, mark as done
                $query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
                if( !mysqli_query($mysql_conn, $query) ) {
                        die( "Error: Unable to perform Download Time Update Query (path)\n" );
                }
                return false;
        } else {
                $url_path = $url_components['path'];
        }
        //Download Page
        echo "Downloading: $target\n";
        $contents = _http ( $target, $referer );
        echo "Done\n";
        //Check Status
        if( $contents['headers']['status_info'][1] != 200 ) {
                //If not ok, mark as downloaded but skip
                $query = "INSERT INTO pages (path, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", NOW()) ON DUPLICATE KEY UPDATE download_time=NOW()";
                if( !mysqli_query($mysql_conn, $query) ) {
                        die( "Error: Unable to perform Download Time Update Query (http status)\n" );
                }
                return false;
        }
        //Parse Contents
        $doc = new DOMDocument();
        libxml_use_internal_errors( true );
        $doc->loadHTML( $contents['body'] );
        //Get title
        $title = '';
        $titleTags = $doc->getElementsByTagName('title');
        if( count( $titleTags ) > 0 ) {
                $title = mysqli_real_escape_string( $mysql_conn, $titleTags[0]->nodeValue );
        }
        //Get Description
        $description = '';
        $metaTags = $doc->getElementsByTagName('meta');
        foreach( $metaTags as $tag ) {
                if( $tag->getAttribute('name') == 'description' ) {
                        $description = mysqli_real_escape_string( $mysql_conn, $tag->getAttribute( 'content' ) );
                }
         }
         //Get first h1
         $h1 = '';
         $h1Tags = $doc->getElementsByTagName('h1');
         if( count( $h1Tags ) > 0 ) {
                 $h1 = mysqli_real_escape_string( $mysql_conn, $h1Tags[0]->nodeValue );
         }
         //Insert/Update Page Data
         $query = "INSERT INTO pages (path, title, description, h1, download_time) VALUES (\"". mysqli_real_escape_string( $mysql_conn, $url_path ). "\", \"$title\", \"$description\", \"$h1\", NOW()) ON DUPLICATE KEY UPDATE title=\"$title\", description=\"$description\", h1=\"$h1\", download_time=NOW()";
         if( !mysqli_query($mysql_conn, $query) ) {
                 die( "Error: Unable to perform Insert Query\n" );
         }
         //Get Links
         $links = Array();
         $link_tags = $doc->getElementsByTagName( 'a' );
         foreach( $link_tags as $tag ) {
                 if( ($href_value = $tag->getAttribute( 'href' ))) {
                         $link_absolute = relativeToAbsolute( $href_value, $target );
                         $link_parsed = parse_url( $link_absolute );
                         if($link_parsed === null || $link_parsed === false) {
                                 die( 'Unable to Parse Link URL' );
                         }
                         if(( !array_key_exists( 'host', $link_parsed ) || $link_parsed['host'] == "" || $link_parsed['host'] == $url_host ) && array_key_exists( 'path', $link_parsed ) && $link_parsed['path'] != "" && array_search( $link_parsed['path'], $links ) === false) {
                                $links[] = $link_parsed['path'];
                         }
                  }
          }
          //Insert Links
          foreach($links as $link) {
                  $link_escaped = mysqli_real_escape_string( $mysql_conn, $link );
                  $query = "INSERT IGNORE INTO pages (path, referer, download_time) VALUES (\"$link_escaped\", \"". mysqli_real_escape_string( $mysql_conn, $target ). "\", NULL)";
                  if( !mysqli_query($mysql_conn, $query) ) {
                          die( "Error: Unable to perform Insert Link Value Query\n" );
                  }
          }
          return true;
}
//Define Seed Settings
$seed_url = "https://potentpages.com/";
$seed_components = parse_url( $seed_url );
if($seed_components === false) {
        die( 'Unable to Seed Parse URL' );
}
$seed_scheme = $seed_components['scheme'];
$seed_host = $seed_components['host'];
$url_start = $seed_scheme. '://'. $seed_host;
//Initialize robots.txt File Check
$robots_txt_url = $url_start. "/robots.txt";
echo "Downloading: $robots_txt_url\n";
$robots_txt = _http($robots_txt_url, "");
$parser = new RobotsTxtParser($robots_txt['body']);
$parser->setHttpStatusCode($robots_txt['headers']['status_info'][1]);
$parser->setUserAgent($user_agent);
//Check if path is allowed
if( $parser->isDisallowed( $seed_components['path'] ?? '/' ) ) {
        die("Robots.txt: Disallowed Seed URL");
}
//Download Seed URL
parsePage( $seed_url, "" );
//Loop through all pages on site.
while(1) {
        $counter = 0;
        $select_query = "SELECT * FROM pages WHERE download_time IS NULL";
        if(($select_result = mysqli_query( $mysql_conn, $select_query )) !== false) {
                if( ($rowCount = mysqli_num_rows($select_result)) > 0 ) {
                        for( $i = 0; $i < $rowCount; $i++ ) {
                                if( ( $row = mysqli_fetch_assoc($select_result) ) !== false ) {
                                        $path = $row['path'];
                                        $referer = $row['referer'];
                                        //Check if first character isn't a '/'
                                        if( $path[0] != '/' ) {
                                                continue;
                                        }
                                        //Check if we're allowed to download the page.
                                        if( $parser->isAllowed( $path ) ) {
                                                if( parsePage( $url_start. $path, $referer ) ) {
                                                        $counter++;
                                                }
                                                sleep(1);
                                        }
                                 }
                         }
                 } else {
                         break;
                 }
         } else {
                 die( "Unable to select un-downloaded pages\n" );
         }
         if($counter == 0) {
                 break;
         }
}
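The loop above always pauses for one second between downloads. Since robots.txt can also request a minimum crawl delay, a politer crawler could use that value when it is larger. The sketch below only shows the arithmetic: politeDelaySeconds() is a hypothetical helper, the default and the upper cap are arbitrary choices, and how the crawl-delay value is read from the parsed file is left as an assumption.

```php
<?php
/**
 * Pick how long to pause between requests, in seconds.
 * $crawlDelay is whatever value was read from robots.txt (null if none).
 * The default and the cap are arbitrary choices for this sketch.
 */
function politeDelaySeconds($crawlDelay, float $default = 1.0, float $max = 60.0): float {
    if (!is_numeric($crawlDelay) || (float) $crawlDelay < 0) {
        return $default;
    }
    // Never go faster than our own default, never wait longer than the cap
    return min(max((float) $crawlDelay, $default), $max);
}

$delay = politeDelaySeconds('5');
echo "Pause between requests: $delay seconds\n";
```

In the crawl loop, the result could replace the fixed sleep(1) call, for example with usleep( (int) round( $delay * 1000000 ) ) so that fractional delays are respected.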

David Selden-Treiman

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive expertise solving problems using programming for dozens of clients. He also has extensive experience managing and optimizing servers, managing dozens of servers for both Potent Pages and other clients.

Tags: cURL PHP7 Robots.txt Tutorial Web Crawler
