Web Crawler Development Techniques
July 15, 2019 | By admin | Filed in: php.
Extract Links from a Page in PHP
To extract the links from the HTML you’ve downloaded, you can use PHP’s DOMDocument class. We’ll parse the contents stored in a variable and pull out all of the anchor tags (the a elements that make up links).
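Here is a minimal sketch of that approach. The $html string is a hypothetical stand-in for a page you have already downloaded, and the $links array is only there to collect the results for later use; neither comes from the original article.

```php
<?php
// Hypothetical sample HTML standing in for a downloaded page.
$html = '<html><body><a href="/about">About us</a> '
      . '<a href="https://example.com/">Example</a></body></html>';

$dom = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings quietly
// instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $link_href = $anchor->getAttribute('href'); // link location
    $link_text = trim($anchor->textContent);    // visible link text
    $links[] = ['href' => $link_href, 'text' => $link_text];
    echo $link_href . ' => ' . $link_text . PHP_EOL;
}
```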
This code will loop through all of the links in your HTML document. It will save the link location in the $link_href variable and the text of the link in the $link_text variable.
Extract Images from a Webpage in PHP
To extract the images from HTML, we will use code similar to the link extractor above. We will use the DOMDocument class to find all of the img tags in the document and loop through them.
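A short sketch of the image loop might look like the following. The sample $html and the variable names $image_src and $image_alt are assumptions chosen for illustration, not names from the original article.

```php
<?php
// Hypothetical sample HTML standing in for a downloaded page.
$html = '<html><body><img src="/img/logo.png" alt="Logo">'
      . '<img src="photo.jpg"></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from imperfect markup
$dom->loadHTML($html);
libxml_clear_errors();

$images = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $image_src = $img->getAttribute('src'); // image location
    $image_alt = $img->getAttribute('alt'); // empty string when no alt is set
    $images[] = ['src' => $image_src, 'alt' => $image_alt];
    echo $image_src . ' (' . $image_alt . ')' . PHP_EOL;
}
```

Note that src values are often relative paths, so you will usually need to resolve them against the page URL before downloading the images.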
Extracting JSON from a Webpage in PHP
To extract JSON from a webpage, we’ll use a different technique: instead of parsing the HTML structure, we’ll search the page source for any patterns that look like JSON.
This technique is often very useful when dealing with e-commerce sites. Online stores tend to provide their information to the browser in JSON format for easier parsing. Therefore, the easiest way to extract product information from their website is often to just pull the JSON.
To identify patterns that match JSON, we will use a regular expression. PHP has a function called preg_match_all, which searches for a pattern and returns all of the matches. Here’s some sample code:
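The sketch below is one possible way to do this, not necessarily the pattern the original post used. It assumes a recursive regular expression that matches balanced curly braces, then keeps only the matches that json_decode accepts. The function name extract_json_candidates is hypothetical.

```php
<?php
/**
 * Find brace-delimited blocks in a page and return those that decode as JSON.
 * Hypothetical helper for illustration.
 */
function extract_json_candidates(string $html): array
{
    $decoded = [];
    // \{ ... \} with (?R) recursing into nested braces, so nested objects
    // are captured as part of their outermost block.
    if (preg_match_all('/\{(?:[^{}]++|(?R))*\}/', $html, $matches)) {
        foreach ($matches[0] as $candidate) {
            $data = json_decode($candidate, true);
            // Keep only blocks that are valid JSON objects; CSS rules and
            // JavaScript function bodies also match the pattern but fail here.
            if (is_array($data) && json_last_error() === JSON_ERROR_NONE) {
                $decoded[] = $data;
            }
        }
    }
    return $decoded;
}

// Usage with a hypothetical page fragment.
$page = '<script>var product = {"name":"Widget","price":9.99,"tags":{"a":1}};</script>'
      . '<style>p { color: red }</style>';
$found = extract_json_candidates($page);
print_r($found);
```

This approach has limits: a curly brace inside a JSON string value can throw off the brace matching, and JSON nested inside a larger non-JSON block (such as a function body) is swallowed by the outer match and discarded. For pages like that, a narrower pattern targeted at the specific script tag tends to work better.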
Avoid Getting Blocked
If you’re constantly getting blocked when you try crawling a website, you may need to use a free proxy. To use one, add the proxy settings to your cURL request, replacing $proxy_ipAddress with the IP address of your proxy and $proxy_port with the port of your proxy.
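As a rough sketch, the proxy settings can be applied with cURL’s CURLOPT_PROXY and CURLOPT_PROXYPORT options. The helper names proxy_curl_options and fetch_via_proxy, the timeout value, and the example address are assumptions for illustration; substitute the details of your own proxy.

```php
<?php
/**
 * Build the cURL options that route a request through a proxy.
 * Hypothetical helper for illustration.
 */
function proxy_curl_options(string $proxy_ipAddress, int $proxy_port): array
{
    return [
        CURLOPT_PROXY     => $proxy_ipAddress, // IP address of your proxy
        CURLOPT_PROXYPORT => $proxy_port,      // port of your proxy
    ];
}

/**
 * Download a URL through the given proxy.
 * Returns the response body, or false on failure.
 */
function fetch_via_proxy(string $url, string $proxy_ipAddress, int $proxy_port)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // assumed timeout in seconds
    curl_setopt_array($ch, proxy_curl_options($proxy_ipAddress, $proxy_port));

    $html = curl_exec($ch);
    if ($html === false) {
        error_log('cURL error: ' . curl_error($ch));
    }
    return $html;
}

// Placeholder values: 203.0.113.10 is a reserved documentation address.
$options = proxy_curl_options('203.0.113.10', 8080);
```

Calling fetch_via_proxy('https://example.com/', $proxy_ipAddress, $proxy_port) would then send the request through the proxy. Free proxies are frequently slow or offline, so checking for a false return value is worthwhile.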