Web Crawler Development Techniques
July 15, 2019 | By David Selden-Treiman | Filed in: php
Looking for some copy-and-paste PHP code for quickly creating a web crawler? Here is some code we use at Potent Pages to create web crawlers.
Extract Links from a Page in PHP
To extract the links from the HTML you’ve downloaded, you can use PHP’s DOMDocument class. We’ll parse the contents stored in a variable and extract all of the anchor tags (a elements, which define links).
The approach is to loop through all of the links in your HTML document, saving each link’s location in a $link_href variable and the text of the link in a $link_text variable.
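A minimal sketch of that loop might look like the following. The sample HTML string and the $links array are assumptions added for illustration; in a real crawler, $html would hold the page you downloaded.

```php
<?php
// Sample HTML standing in for a page you have already downloaded (illustrative only).
$html = '<html><body><a href="https://example.com/about">About us</a> '
      . '<a href="/contact">Contact</a></body></html>';

$dom = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// $links is a hypothetical container for the results.
$links = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $link_href = $link->getAttribute('href');  // empty string if the tag has no href
    $link_text = trim($link->textContent);     // visible text of the link
    $links[] = ['href' => $link_href, 'text' => $link_text];
}
```

Note that relative links such as /contact are returned as written; you would still need to resolve them against the page URL before crawling them.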
Extract Images from a Webpage in PHP
To extract the images from HTML, we will use code similar to the link extractor above. We will use the DOMDocument class to extract all of the img tags from the document and loop through them.
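As a rough sketch, the image version could look like this. The sample HTML and the $images array are assumptions for illustration, and reading the alt attribute is an optional extra rather than something the technique requires.

```php
<?php
// Sample HTML standing in for a downloaded page (illustrative only).
$html = '<html><body><img src="/images/logo.png" alt="Logo">'
      . '<img src="https://example.com/photo.jpg"></body></html>';

$dom = new DOMDocument();
// Suppress warnings caused by imperfect markup.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// $images is a hypothetical container for the results.
$images = [];
foreach ($dom->getElementsByTagName('img') as $image) {
    $images[] = [
        'src' => $image->getAttribute('src'),  // image location
        'alt' => $image->getAttribute('alt'),  // empty string when no alt text is set
    ];
}
```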
Extracting JSON from a Webpage in PHP
To extract JSON from a webpage, we’ll use a slightly different technique: we’ll search the page for any patterns that look like JSON.
This technique is often very useful when dealing with e-commerce sites. Online stores tend to provide their information to the browser in JSON format for easier parsing. Therefore, the easiest way to extract product information from their website is often to just pull the JSON.
To identify patterns that match JSON, we will use a regular expression. PHP’s preg_match_all function searches a string for a pattern and returns all of the matches.
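One possible sketch of this idea is shown below. It is not necessarily the exact pattern used at Potent Pages: the function name extract_json_candidates, the recursive pattern, and the sample HTML are all assumptions. The pattern matches brace-balanced blocks, and json_decode then filters out anything that is not valid JSON (CSS rules or JavaScript function bodies, for example).

```php
<?php
// Hypothetical helper: find brace-balanced blocks and keep those that decode as JSON.
function extract_json_candidates(string $html): array
{
    $results = [];
    // (?R) recurses into the whole pattern, so nested objects match as one block.
    // The possessive ++ limits backtracking on long runs of non-brace characters.
    if (preg_match_all('/\{(?:[^{}]++|(?R))*\}/', $html, $matches)) {
        foreach ($matches[0] as $candidate) {
            $decoded = json_decode($candidate, true);
            if (is_array($decoded)) {
                $results[] = $decoded;
            }
        }
    }
    return $results;
}

// Sample page fragment (illustrative only).
$html = '<style>p {color: red}</style>'
      . '<script>var product = {"name": "Widget", "price": 9.99, "tags": {"color": "red"}};</script>';
$found = extract_json_candidates($html);
```

This sketch has known limits: a brace inside a JSON string value can throw off the balancing, top-level JSON arrays are not matched, and valid JSON nested inside an invalid outer block is skipped. For many pages it is still a quick way to get at embedded product data.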
Avoid Getting Blocked
If you’re constantly getting blocked when you try crawling a website, you may need to use a free proxy. To use a free proxy, you’ll want to add the following code to your cURL request. Make sure you replace the $proxy_ipAddress with the IP address of your proxy, and $proxy_port with the port of your proxy.
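A minimal sketch of the proxy settings follows. The wrapper function build_proxy_request, the target URL, the timeout, and the placeholder address and port are assumptions for illustration; CURLOPT_PROXY and CURLOPT_PROXYPORT are the standard cURL options for this.

```php
<?php
// Hypothetical helper: prepare a cURL handle that sends its request through a proxy.
function build_proxy_request(string $url, string $proxy_ipAddress, int $proxy_port)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_PROXY, $proxy_ipAddress);  // IP address of your proxy
    curl_setopt($ch, CURLOPT_PROXYPORT, $proxy_port);   // port of your proxy
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);       // assumed timeout; proxies can be slow
    return $ch;
}

// Placeholder values: 203.0.113.10 is a documentation-only address, not a working proxy.
$proxy_ipAddress = '203.0.113.10';
$proxy_port = 8080;
$ch = build_proxy_request('https://example.com/', $proxy_ipAddress, $proxy_port);

// To send the request once real proxy details are filled in:
// $html = curl_exec($ch);
// if ($html === false) { echo curl_error($ch); }
```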
David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive expertise solving problems using programming for dozens of clients. He also has extensive experience managing and optimizing servers, managing dozens of servers for both Potent Pages and other clients.