
PHP WEB CRAWLER
Development Techniques That Hold Up in Real Crawls

If you’re building a PHP web crawler, the hard part isn’t “extract a tag once” — it’s extracting reliably across messy HTML, changing layouts, relative URLs, and long-running crawl schedules. This guide collects practical techniques we use to build sturdier crawlers: link discovery, image extraction, JSON-LD parsing, URL normalization, and block-avoidance the responsible way.

  • Extract links + images safely
  • Parse JSON-LD for product data
  • Normalize URLs & dedupe
  • Reduce breakage & blocks

What this page covers (and why it matters)

Most “PHP scraping” examples work on a single clean page. Real web crawling is different: you’ll see malformed HTML, inconsistent URLs, anti-bot defenses, and content that moves into JSON payloads or JSON-LD. The techniques below focus on repeatable crawling — code patterns that reduce breakage over time.

Reminder: Only crawl pages you have permission to access and always respect site policies and robots rules. “Avoid getting blocked” should start with being polite (rate limiting, caching, and correct headers) — not trying to overpower a site.

A simple crawler pipeline (mental model)

1. Download: Fetch HTML/JSON with cURL (headers, cookies, timeouts, retries).
2. Parse: DOMDocument/XPath for HTML, json_decode for APIs and JSON-LD.
3. Extract + normalize: Turn pages into stable fields (absolute URLs, canonical IDs, clean text).
4. Store + monitor: Save structured outputs, detect drift, alert on failures and anomalies.

If you need a downloader function, see: Downloading a Webpage Using PHP and cURL.
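
If you just want something minimal to experiment with, here is a bare-bones sketch. The function name, timeout values, and retry count are illustrative placeholders rather than recommended settings.

PHP: Minimal cURL downloader (illustrative)
// Bare-bones downloader: timeouts plus simple retries. All values are placeholders.
function fetchPage($url, $maxRetries = 3) {
    for ($attempt = 1; $attempt <= $maxRetries; $attempt++) {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
        curl_setopt($handle, CURLOPT_CONNECTTIMEOUT, 10);    // connection timeout (seconds)
        curl_setopt($handle, CURLOPT_TIMEOUT, 30);           // total request timeout (seconds)
        curl_setopt($handle, CURLOPT_ENCODING, '');          // accept gzip/deflate transparently

        $body   = curl_exec($handle);
        $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        curl_close($handle);

        if ($body !== false && $status >= 200 && $status < 300) {
            return $body;
        }

        sleep($attempt); // brief pause before retrying
    }
    return null;
}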

Normalize URLs (relative → absolute, dedupe, clean)

URL normalization is one of the easiest ways to improve crawl quality. It reduces duplicate downloads and prevents missing pages due to relative paths. This helper resolves relative links against the current page URL.

PHP: Normalize a URL against a base URL
// Note: uses str_starts_with(), so PHP 8.0+ is assumed.
function normalizeUrl($maybeRelative, $baseUrl) {
    $maybeRelative = trim($maybeRelative);

    // Ignore anchors, javascript:, mailto:, tel:
    if ($maybeRelative === '' ||
        str_starts_with($maybeRelative, '#') ||
        preg_match('/^(javascript:|mailto:|tel:)/i', $maybeRelative)) {
        return null;
    }

    // If it's already absolute, return it (minus fragment)
    if (preg_match('/^https?:\/\//i', $maybeRelative)) {
        return preg_replace('/#.*$/', '', $maybeRelative);
    }

    $base = parse_url($baseUrl);
    if (!$base || empty($base['scheme']) || empty($base['host'])) { return null; }

    $scheme = $base['scheme'];
    $host   = $base['host'];
    $port   = isset($base['port']) ? ':' . $base['port'] : '';
    $path   = isset($base['path']) ? $base['path'] : '/';

    // Build base directory (strip filename)
    $dir = preg_replace('/\/[^\/]*$/', '/', $path);

    // Handle protocol-relative URLs: //cdn.example.com/x
    if (str_starts_with($maybeRelative, '//')) {
        return preg_replace('/#.*$/', '', $scheme . ':' . $maybeRelative);
    }

    // Root-relative: /x/y
    if (str_starts_with($maybeRelative, '/')) {
        $abs = $scheme . '://' . $host . $port . $maybeRelative;
        return preg_replace('/#.*$/', '', $abs);
    }

    // Path-relative: x/y
    $abs = $scheme . '://' . $host . $port . $dir . $maybeRelative;

    // Clean /./ and /../ segments (simple normalization).
    // Collapse only "/./" here so the "//" in the scheme stays intact.
    while (strpos($abs, '/./') !== false) {
        $abs = str_replace('/./', '/', $abs);
    }
    while (preg_match('#/(?!\.\.)[^/]+/\.\./#', $abs)) {
        $abs = preg_replace('#/(?!\.\.)[^/]+/\.\./#', '/', $abs);
    }

    return preg_replace('/#.*$/', '', $abs);
}
  • Deduplicate: store normalized URLs in a set (hash map) before enqueueing (see the sketch after this list).
  • Canonicalize: optionally strip tracking parameters (utm_*) if they cause duplicates.
  • Scope control: restrict to a domain or path prefix to prevent “infinite crawl.”
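
Putting those pieces together, a sketch of link discovery that normalizes, dedupes against a seen set, and stays on one host might look like this. The $seen/$queue arrays and the host check are illustrative, not a full frontier implementation.

PHP: Link discovery with normalization, dedupe, and scope control (illustrative)
// Assumes $contents holds the page HTML and $pageUrl is its URL (see normalizeUrl above).
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

$seen  = [];  // set of already-enqueued URLs (url => true); persist this across pages in a real crawl
$queue = [];  // crawl frontier

$allowedHost = parse_url($pageUrl, PHP_URL_HOST);

foreach ($document->getElementsByTagName('a') as $a) {
    $abs = normalizeUrl($a->getAttribute('href'), $pageUrl);
    if ($abs === null) { continue; }

    // Scope control: stay on the same host.
    if (parse_url($abs, PHP_URL_HOST) !== $allowedHost) { continue; }

    // Deduplicate before enqueueing.
    if (isset($seen[$abs])) { continue; }
    $seen[$abs] = true;
    $queue[]    = $abs;
}

libxml_clear_errors();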

Extract images from a webpage in PHP

For images, you usually care about src, alt, and sometimes srcset (responsive images). Like links, image URLs often need normalization.

PHP: Extract images (src, alt, srcset)
// Assuming your HTML is in $contents and current URL is $pageUrl
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

$imgs = $document->getElementsByTagName('img');

foreach ($imgs as $node) {
    $src    = trim($node->getAttribute('src'));
    $alt    = trim($node->getAttribute('alt'));
    $srcset = trim($node->getAttribute('srcset'));

    // OPTIONAL: normalize URL
    // $srcAbs = normalizeUrl($src, $pageUrl);

    // Store or process: $src / $alt / $srcset
}

libxml_clear_errors();
Tip: Many sites lazy-load images. If src is empty, check attributes like data-src, data-original, or data-lazy.
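
A short sketch of that fallback, to drop inside the foreach loop above. The attribute names are common lazy-loading conventions, not a standard; each site may use its own.

PHP: Fallback for lazy-loaded image attributes (illustrative)
// If src is empty, try common lazy-load attributes (names vary by site/library).
if ($src === '') {
    foreach (['data-src', 'data-original', 'data-lazy'] as $attr) {
        $candidate = trim($node->getAttribute($attr));
        if ($candidate !== '') {
            $src = $candidate;
            break;
        }
    }
}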

Extract JSON from a webpage (use JSON-LD first)

If you’re crawling e-commerce pages, job pages, or listings, the cleanest data is often embedded as JSON-LD: <script type="application/ld+json">. Parse that first. Regex should be a fallback, not your primary method.

PHP: Extract JSON-LD blocks (recommended)
// Returns an array of decoded JSON-LD objects found in the HTML
function extractJsonLd($contents) {
    libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

    $out = [];
    $scripts = $document->getElementsByTagName('script');

    foreach ($scripts as $s) {
        $type = strtolower(trim($s->getAttribute('type')));
        if ($type !== 'application/ld+json') { continue; }

        $raw = trim($s->textContent);
        if ($raw === '') { continue; }

        $decoded = json_decode($raw, true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $out[] = $decoded;
        }
    }

    libxml_clear_errors();
    return $out;
}
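
Usage sketch: once blocks are decoded, you still need to find the entity you care about. The helper below walks the decoded output of extractJsonLd() looking for schema.org Product entries, handling single objects, lists, and @graph wrappers. The function name and the Product/name/offers fields assume standard schema.org markup.

PHP: Find Product entries in decoded JSON-LD (illustrative)
// Collect schema.org Product entries from the output of extractJsonLd().
function findProducts(array $jsonLdBlocks) {
    $products = [];

    foreach ($jsonLdBlocks as $block) {
        if (!is_array($block)) { continue; }

        // A block may be a single entity, a list of entities, or an @graph wrapper.
        $candidates = isset($block['@graph']) ? $block['@graph']
                    : (isset($block[0]) ? $block : [$block]);
        if (!is_array($candidates)) { continue; }

        foreach ($candidates as $entity) {
            if (!is_array($entity)) { continue; }
            $type  = $entity['@type'] ?? null;
            $types = is_array($type) ? $type : [$type];
            if (in_array('Product', $types, true)) {
                $products[] = $entity;
            }
        }
    }
    return $products;
}

// Example:
// $products = findProducts(extractJsonLd($contents));
// $name  = $products[0]['name'] ?? null;
// $price = $products[0]['offers']['price'] ?? null;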
PHP: Regex JSON fallback (use carefully)
// Regex-based JSON extraction is fragile; prefer JSON-LD when available.
function extractJsonFallback($contents) {
    $matches = [];
    preg_match_all('/\{(?:[^{}]|(?R))*\}/', $contents, $matches);

    $valid = [];
    foreach ($matches[0] as $candidate) {
        $decoded = json_decode($candidate, true);
        if (json_last_error() === JSON_ERROR_NONE) {
            $valid[] = $decoded;
        }
    }
    return $valid;
}
Practical strategy: In production, you’ll get better results by targeting known JSON containers: JSON-LD, “window.__STATE__”, “__NEXT_DATA__”, or documented API endpoints — rather than scanning the entire HTML document with regex.
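
For example, Next.js sites typically embed their page state in a <script id="__NEXT_DATA__" type="application/json"> tag. A targeted sketch is below; the id is a Next.js convention and the payload structure varies between sites and versions, so treat this as an assumption to verify per target.

PHP: Extract __NEXT_DATA__ (targeted, not universal)
// Pull the Next.js page-state JSON if present; returns null when not found or invalid.
function extractNextData($contents) {
    libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

    $xpath = new DOMXPath($document);
    $nodes = $xpath->query('//script[@id="__NEXT_DATA__"]');
    libxml_clear_errors();

    if ($nodes === false || $nodes->length === 0) { return null; }

    $decoded = json_decode(trim($nodes->item(0)->textContent), true);
    return json_last_error() === JSON_ERROR_NONE ? $decoded : null;
}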

Avoid getting blocked (the responsible ladder)

If your PHP crawler is getting blocked, start by assuming the site is protecting itself from high request volume or “bot-like” behavior. You’ll get farther by being polite and consistent than by immediately rotating infrastructure.

1. Rate limit + backoff: Slow down. Add random jitter. Use exponential backoff on 429/503 responses (see the sketch after this list).
2. Send realistic headers: Set User-Agent, Accept, Accept-Language. Maintain cookies if the site expects sessions.
3. Cache and avoid re-downloading: Store responses. Use ETags/If-Modified-Since when possible to reduce load.
4. Robots + scope control: Check robots.txt and keep your frontier bounded (domain/path rules) to prevent accidental hammering.
5. Infrastructure choices (when permitted): For recurring crawls, a stable IP strategy can improve reliability and debugging more than constant rotation.
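
A sketch combining steps 1 through 3: delays with jitter, exponential backoff on 429/503, realistic headers, and a cookie jar. The delay values, retry cap, user agent, and file path are illustrative placeholders; tune them to each site’s tolerance and your permissions.

PHP: Polite fetching with jitter and backoff (illustrative)
// Fetch with a base delay, random jitter, and exponential backoff on 429/503.
function politeFetch($url, $baseDelaySeconds = 2, $maxAttempts = 5) {
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        // Base delay plus random jitter so requests don't land on a fixed cadence.
        sleep($baseDelaySeconds);
        usleep(mt_rand(0, 500000)); // up to 0.5s of jitter

        $handle = curl_init($url);
        curl_setopt_array($handle, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 30,
            CURLOPT_USERAGENT      => 'MyCrawler/1.0 (+https://example.com/contact)', // placeholder
            CURLOPT_HTTPHEADER     => [
                'Accept: text/html,application/xhtml+xml',
                'Accept-Language: en-US,en;q=0.9',
            ],
            CURLOPT_COOKIEJAR      => '/tmp/crawler-cookies.txt', // placeholder path; keeps session cookies
            CURLOPT_COOKIEFILE     => '/tmp/crawler-cookies.txt',
        ]);

        $body   = curl_exec($handle);
        $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        curl_close($handle);

        // Back off exponentially when the site signals overload or rate limiting.
        if ($status === 429 || $status === 503) {
            sleep((int) pow(2, $attempt + 1));
            continue;
        }

        return ($body !== false && $status >= 200 && $status < 300) ? $body : null;
    }
    return null;
}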

PHP: Proxy settings in cURL (only when appropriate)
// If you are permitted to use a proxy for your use case:
curl_setopt($handle, CURLOPT_PROXY, $proxyIpAddress);
curl_setopt($handle, CURLOPT_PROXYPORT, $proxyPort);
Important: Proxies aren’t a magic fix. The best crawlers reduce blocks by behaving well (rate limiting, caching, sessions) and by engineering for durability.

Questions About PHP Web Crawlers

Common questions we see when teams build PHP web crawlers for link discovery, e-commerce extraction, monitoring, and recurring data collection.

What’s the best way to extract data in a PHP web crawler?

For HTML pages, start with DOMDocument and (when needed) XPath selectors. For structured fields on modern sites, look for JSON-LD or API endpoints that return clean JSON. The “best” method depends on how stable the page is.
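
For instance, a DOMDocument + XPath pass over a page might look like the following; the XPath expressions are placeholders, and real selectors depend on the target markup.

PHP: DOMDocument + XPath extraction (illustrative selectors)
// Parse the HTML and pull fields with XPath; the selectors below are examples only.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($contents, LIBXML_NOWARNING | LIBXML_NOERROR);

$xpath = new DOMXPath($document);

// Example field: page title from the first <h1>.
$titleNodes = $xpath->query('//h1');
$title = $titleNodes->length > 0 ? trim($titleNodes->item(0)->textContent) : null;

// Example field: a price element identified by a (placeholder) class name.
$priceNodes = $xpath->query('//*[contains(@class, "price")]');
$price = $priceNodes->length > 0 ? trim($priceNodes->item(0)->textContent) : null;

libxml_clear_errors();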

Why do my DOMDocument extractions break on real pages?

Real-world HTML is frequently invalid. Enable libxml internal errors and treat parsing as “best effort.” If you need precision, add XPath rules and fallback selectors rather than relying on a single brittle pattern.
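
One simple version of that pattern: try selectors in priority order and take the first non-empty match. The helper name and the example selectors are hypothetical.

PHP: Fallback XPath selectors (illustrative)
// Try a list of XPath expressions in order; return the first non-empty text match.
function firstMatch(DOMXPath $xpath, array $expressions) {
    foreach ($expressions as $expr) {
        $nodes = $xpath->query($expr);
        if ($nodes !== false && $nodes->length > 0) {
            $text = trim($nodes->item(0)->textContent);
            if ($text !== '') { return $text; }
        }
    }
    return null;
}

// Example usage (hypothetical selectors):
// $price = firstMatch($xpath, [
//     '//*[@itemprop="price"]',
//     '//*[contains(@class, "product-price")]',
//     '//meta[@property="product:price:amount"]/@content',
// ]);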

How do I handle relative links when crawling a site?

Normalize every discovered URL against the current page URL (relative → absolute), remove fragments, and deduplicate before enqueueing. Without normalization, crawlers miss pages and waste time re-downloading duplicates.

What should I do if my PHP crawler gets blocked?

Start with politeness and stability: rate limiting, caching, backoff on errors, realistic headers, and session handling. Proxies are sometimes useful, but they’re not the first or best fix for most durable crawlers.

Rule of thumb: If you can’t crawl slowly and reliably, you can’t crawl fast reliably either.

Can Potent Pages build and maintain this as a managed pipeline?

Yes. We build production crawlers that run on schedules, monitor for breakage, and deliver structured outputs (CSV, database exports, API feeds, dashboards).

Want this to run in production (monitoring included)?

If your PHP crawler is feeding a business process, research workflow, or recurring monitoring job, consider designing it like infrastructure: stable schemas, quality checks, and alerts when extraction breaks.

Next step: Send us the target sites, the fields you need, and how often you need updates — we’ll help you scope the fastest path to a durable pipeline.
David Selden-Treiman, Director of Operations at Potent Pages.

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive experience solving problems with software for dozens of clients, and he manages and optimizes dozens of servers for both Potent Pages and other clients.

