Web Crawler Development Techniques

July 15, 2019 | By David Selden-Treiman | Filed in: PHP.

Looking for some copy-and-paste PHP code for quickly creating a web crawler? Here is some code we use at Potent Pages to create web crawlers.

Extract Links from a Page in PHP

To extract the links from the HTML you’ve downloaded, you can use PHP’s DOMDocument class. We’ll parse the contents stored in a variable and extract all of the anchor (a) tags.

This code will loop through all of the links in your HTML document. It will save the link location in the $link_href variable and the text of the link in the $link_text variable.
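A minimal sketch of that loop, using an inline sample in $html (in a real crawler, this variable would hold the HTML you downloaded with cURL):

```php
<?php
// Sample HTML standing in for a downloaded page.
$html = '<html><body><a href="https://potentpages.com">Potent Pages</a></body></html>';

$doc = new DOMDocument();
// Suppress warnings caused by imperfect real-world HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Loop through every anchor tag in the document.
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $link_href = $anchor->getAttribute('href'); // the link target
    $link_text = trim($anchor->textContent);    // the visible link text
    echo $link_href . " => " . $link_text . "\n";
}
```

Relative URLs (e.g. "/about") will come through as-is, so you’ll generally want to resolve them against the page’s URL before queueing them for crawling.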

Extract Images from a Webpage in PHP

To extract the images from HTML, we will use a similar code set to the link extractor above. We will use the DOMDocument class to extract all of the img tags from the document and loop through them.
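A sketch of the image extractor, again assuming the downloaded HTML is already in $html:

```php
<?php
// Sample HTML standing in for a downloaded page.
$html = '<html><body><img src="/logo.png" alt="Logo"></body></html>';

$doc = new DOMDocument();
// Suppress warnings caused by imperfect real-world HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Loop through every img tag in the document.
foreach ($doc->getElementsByTagName('img') as $img) {
    $img_src = $img->getAttribute('src'); // the image URL
    $img_alt = $img->getAttribute('alt'); // the alt text, if any
    echo $img_src . "\n";
}
```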

Extracting JSON from a Webpage in PHP

To extract JSON from a webpage, we’ll use a different technique: searching the page for any patterns that look like JSON.

This technique is often very useful when dealing with e-commerce sites. Online stores tend to provide their information to the browser in JSON format for easier parsing, so the easiest way to extract product information from their pages is often to pull the JSON directly.

To identify patterns that match JSON, we will use a regular expression. PHP has a function called preg_match_all that returns every match of a pattern. Here’s some sample code:
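One way to sketch this is a recursive PCRE pattern that matches balanced {...} blocks, validating each candidate with json_decode so only real JSON is kept:

```php
<?php
// Sample page source containing an embedded JSON object.
$html = '<script>var product = {"name":"Widget","price":9.99};</script>';

// (?R) recurses into the whole pattern, so nested braces match too.
preg_match_all('/\{(?:[^{}]|(?R))*\}/', $html, $matches);

$json_objects = [];
foreach ($matches[0] as $candidate) {
    $decoded = json_decode($candidate, true);
    if ($decoded !== null) {
        $json_objects[] = $decoded; // keep only candidates that parse as JSON
    }
}

print_r($json_objects);
```

For a known site, you can usually narrow the pattern further, for example matching only the contents of script tags, to cut down on false candidates.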

Avoid Getting Blocked

If you’re constantly getting blocked when you try crawling a website, you may need to use a proxy. To do so, add the following code to your cURL request, replacing $proxy_ipAddress with the IP address of your proxy and $proxy_port with its port.
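A sketch of those cURL proxy settings (the IP address and port below are placeholders from the documentation range; substitute your proxy’s real details):

```php
<?php
$proxy_ipAddress = '203.0.113.5'; // placeholder: your proxy's IP address
$proxy_port      = 8080;          // placeholder: your proxy's port

$ch = curl_init('https://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_PROXY, $proxy_ipAddress);
curl_setopt($ch, CURLOPT_PROXYPORT, $proxy_port);
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
// A realistic user agent also helps avoid blocks.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyCrawler/1.0)');

// Calling curl_exec($ch) here would fetch the page through the proxy.
```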

David Selden-Treiman, Director of Operations at Potent Pages.

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive experience solving client problems with code, as well as managing and optimizing dozens of servers for both Potent Pages and other clients.


Web Crawlers

Data Collection

There is a lot of data you can collect with a web crawler. Often, XPaths will be the easiest way to identify that info. However, you may also need to deal with AJAX-based data.
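A minimal XPath sketch using PHP’s DOMXPath (the class name "product-title" here is a hypothetical example):

```php
<?php
// Sample HTML standing in for a downloaded product page.
$html = '<html><body><h1 class="product-title">Widget</h1></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Query the parsed document with an XPath expression.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//h1[@class="product-title"]');
$product_name = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
```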

Web Crawler Industries

There are a lot of uses for web crawlers across a wide range of industries.

Legality of Web Crawlers

Web crawlers are generally legal when used properly and respectfully, for example by honoring a site's robots.txt and terms of service.

Development

Deciding whether to build in-house or to hire a contractor will depend on your skill set and requirements. If you do decide to hire, there are a number of considerations you'll want to take into account.

It's important to understand the lifecycle of a web crawler development project, whomever you decide to hire.

Building Your Own

If you're looking to build your own web crawler, we have the best tutorials for your preferred programming language: Java, Node, PHP, and Python. We also track tutorials for Apache Nutch, Cheerio, and Scrapy.

GPT & Web Crawlers

GPTs like GPT-4 are an excellent addition to web crawlers. GPT-4 is more capable than GPT-3.5, but not as cost-effective, especially in a large-scale web crawling context.

There are a number of ways to use GPT-3.5 and GPT-4 in web crawlers, but the most common use for us is data analysis. GPTs can also help address some of the issues with large-scale web crawling.
