Web Crawler Development Techniques

July 15, 2019 | By David Selden-Treiman | Filed in: php.

Looking for some copy-and-paste PHP code for quickly creating a web crawler? Here is some code we use at Potent Pages to create web crawlers.

Extract Links from a Page in PHP

To extract the links from the HTML you’ve downloaded, you can use PHP’s DOMDocument class. We’ll parse the contents stored in a variable and extract all of the anchor tags (the a elements that define links).

Looping through all of the links in your HTML document, you can save each link’s location in the $link_href variable and its text in the $link_text variable.
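A minimal sketch of this link loop might look like the following. The sample $html string is a placeholder for a page you have already downloaded, and the $links array is an assumption added here to collect the results.

```php
<?php
// Placeholder HTML standing in for a page you have already downloaded.
$html = '<html><body><p><a href="https://example.com/a">First link</a> and '
      . '<a href="/b">Second link</a></p></body></html>';

// Collect libxml warnings internally, since real-world HTML is often malformed.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$links = []; // hypothetical container for the extracted links
foreach ($dom->getElementsByTagName('a') as $link) {
    $link_href = $link->getAttribute('href'); // where the link points
    $link_text = trim($link->textContent);    // the visible link text
    $links[] = ['href' => $link_href, 'text' => $link_text];
}

print_r($links);
```

Relative locations such as /b are returned as written, so a crawler would still need to resolve them against the URL of the page before requesting them.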

Extract Images from a Webpage in PHP

To extract the images from HTML, we will use code similar to the link extractor above. We will use the DOMDocument class to extract all of the img tags from the document and loop through them.
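As a rough sketch, the image loop could be written as below. The variable names $image_src and $image_alt and the $images array are illustrative assumptions, and the sample $html again stands in for a downloaded page.

```php
<?php
// Placeholder HTML standing in for a page you have already downloaded.
$html = '<html><body><img src="/images/logo.png" alt="Logo">'
      . '<img src="https://example.com/photo.jpg" alt="Photo"></body></html>';

// Collect libxml warnings internally, since real-world HTML is often malformed.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$images = []; // hypothetical container for the extracted images
foreach ($dom->getElementsByTagName('img') as $image) {
    $image_src = $image->getAttribute('src'); // image location
    $image_alt = $image->getAttribute('alt'); // alternative text, empty if absent
    $images[] = ['src' => $image_src, 'alt' => $image_alt];
}

print_r($images);
```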

Extracting JSON from a Webpage in PHP

To extract JSON from a webpage, we’ll use a different technique: instead of parsing the HTML, we’ll search the page source for any patterns that look like JSON.

This technique is often very useful when dealing with e-commerce sites. Online stores tend to provide their information to the browser in JSON format for easier parsing. Therefore, the easiest way to extract product information from their website is often to just pull the JSON.

To identify patterns that match JSON, we will use a regular expression. PHP has a function called preg_match_all that looks for a pattern and returns all matches.
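One possible sketch of this approach uses a recursive pattern that matches balanced curly braces and then keeps only the matches that json_decode accepts. The pattern, the sample $html, and the $json_objects array are assumptions chosen for illustration, not necessarily the expression used in production.

```php
<?php
// Placeholder page source containing one JSON object and one non-JSON brace pair.
$html = '<html><head><script>var product = {"name": "Widget", "price": 9.99, '
      . '"tags": {"color": "red"}};</script></head>'
      . '<body><p>Not JSON: {broken}</p></body></html>';

// Assumed pattern: a balanced {...} block, where (?R) recurses for nested braces.
$pattern = '~\{(?:[^{}]|(?R))*\}~';
preg_match_all($pattern, $html, $matches);

$json_objects = []; // hypothetical container for successfully decoded objects
foreach ($matches[0] as $candidate) {
    $decoded = json_decode($candidate, true);
    // Keep only candidates that decode into an array, i.e. valid JSON objects.
    if (is_array($decoded)) {
        $json_objects[] = $decoded;
    }
}

print_r($json_objects);
```

Filtering through json_decode discards brace-delimited text that only looks like JSON, such as CSS rules or JavaScript function bodies. A brace inside a JSON string value can still unbalance the match, so treat the results as candidates rather than a guaranteed complete list.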

Avoid Getting Blocked

If you’re constantly getting blocked when you try crawling a website, you may need to use a free proxy. To use one, set the proxy options on your cURL request, passing the IP address of your proxy as $proxy_ipAddress and the port of your proxy as $proxy_port.
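A hedged sketch of the proxy settings is shown below, wrapped in a hypothetical helper named create_proxied_handle. The timeout values and the placeholder address in the usage comments are assumptions, while CURLOPT_PROXY and CURLOPT_PROXYPORT are the standard cURL options for this purpose.

```php
<?php
// Hypothetical helper: builds a cURL handle that sends its request through a proxy.
function create_proxied_handle(string $url, string $proxy_ipAddress, int $proxy_port)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the body as a string
    curl_setopt($ch, CURLOPT_PROXY, $proxy_ipAddress); // IP address of your proxy
    curl_setopt($ch, CURLOPT_PROXYPORT, $proxy_port);  // port of your proxy
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);      // assumed value, in seconds
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);             // assumed value, in seconds
    return $ch;
}

// Example usage with placeholder values (replace them with your proxy's details):
// $ch = create_proxied_handle('https://example.com/', '203.0.113.10', 8080);
// $html = curl_exec($ch);
// if ($html === false) {
//     echo 'Request failed: ' . curl_error($ch);
// }
// curl_close($ch);
```

The timeouts are included because free proxies are frequently slow or offline, and checking the result of curl_exec lets the crawler move on to another proxy instead of stalling.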

