Web Crawlers and Proxies
July 16, 2019 | By admin | Filed in: web-crawler-development.
What is a proxy?
A proxy is a program on another computer that takes your network traffic and forwards it to its destination. It acts as an intermediary, so the website you’re downloading from sees the proxy’s IP address rather than your own.
Why Would You Use a Proxy for Web Crawlers?
Proxies are useful in certain circumstances. Some websites block all requests from certain IP address ranges. If you need to access such a website, a proxy lets you do so without revealing your own IP address. This is especially relevant if you’re attempting to crawl a website from a commercial IP address range (like a cloud platform).
Side Note: All of the IP addresses on the internet (at least the IPv4 ones) are allocated to different organizations, and this is public information. This allows people to see which hosting provider, ISP, or other organization owns the IP address block that you’re crawling from. If you’re attempting to crawl from an IP address owned by a hosting provider, the owner of the website can see that your IP address belongs to a hosting provider instead of a domestic ISP (a server farm instead of a residential address).
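As a rough illustration of the kind of check described above, here is a minimal sketch that tests whether an IPv4 address falls inside a known address block. The function names ipInCidr and findOwner are hypothetical, the owner labels are made up, and the address blocks are reserved documentation ranges rather than real provider allocations. It also assumes 64-bit PHP.

```php
<?php
// Hypothetical helper: true when $ip lies inside the IPv4 block $cidr (e.g. "192.0.2.0/24").
function ipInCidr(string $ip, string $cidr): bool
{
    $parts = explode('/', $cidr, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[1])) {
        return false; // not in network/prefix form
    }
    $prefix  = (int) $parts[1];
    $ipLong  = ip2long($ip);
    $netLong = ip2long($parts[0]);
    if ($ipLong === false || $netLong === false || $prefix > 32) {
        return false; // invalid IPv4 address or prefix length
    }
    // Netmask with the top $prefix bits set (assumes 64-bit integers).
    $mask = $prefix === 0 ? 0 : ((~0 << (32 - $prefix)) & 0xFFFFFFFF);
    return ($ipLong & $mask) === ($netLong & $mask);
}

// Hypothetical lookup: returns the label of the first owner whose blocks contain $ip.
function findOwner(string $ip, array $blocksByOwner): ?string
{
    foreach ($blocksByOwner as $owner => $cidrs) {
        foreach ($cidrs as $cidr) {
            if (ipInCidr($ip, $cidr)) {
                return $owner;
            }
        }
    }
    return null; // no known block matched
}

// Illustrative data only: reserved documentation ranges with invented labels.
$blocksByOwner = [
    'example-cloud' => ['198.51.100.0/24'],
    'example-isp'   => ['203.0.113.0/24'],
];
```

A real deployment would load the block list from public allocation data instead of hard-coding it.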
Most websites won’t do this filtering on their own. Instead, they will use a service like Cloudflare that monitors for proxy-like traffic patterns and allows website owners to block IP addresses that don’t look like regular users.
Another consideration is that your IP address can reveal your country, region (state, province, etc.), and even your city. Some site owners will attempt to block crawlers and visitors based on the location their IP address shows. Alternatively, some sites show different content to visitors from different locations. As a result, you may want to download a website from multiple IP addresses to get an accurate representation of how the site appears worldwide.
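One simple way to spot location-dependent content, once the same page has been downloaded through proxies in several places, is to group the downloaded bodies by a content hash. The sketch below assumes the pages are already in memory, keyed by a location label of your choosing. The function name groupByContent is hypothetical. Exact hashing is crude, because pages that embed timestamps or session tokens will differ on every request.

```php
<?php
// Group response bodies by content hash so that locations which received
// byte-identical content end up in the same bucket.
function groupByContent(array $bodiesByLocation): array
{
    $groups = [];
    foreach ($bodiesByLocation as $location => $body) {
        $groups[sha1($body)][] = $location;
    }
    // Drop the hash keys; each element is a list of locations with the same content.
    return array_values($groups);
}

// Usage with placeholder data: 'us' and 'ca' saw the same page, 'de' saw another.
$groups = groupByContent([
    'us' => '<html>Version A</html>',
    'de' => '<html>Version B</html>',
    'ca' => '<html>Version A</html>',
]);
```

More than one group in the result suggests the site serves different content depending on where the request comes from.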
How Do I Use a Proxy with a PHP Web Crawler?
Using a proxy is relatively simple, at least when using cURL in PHP. To specify the IP address of your proxy, use the CURLOPT_PROXY option; to specify the proxy’s port, use the CURLOPT_PROXYPORT option.
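A minimal sketch of a proxied request with PHP’s cURL extension follows. The function names buildProxyOptions and fetchViaProxy, the timeout values, and the proxy address in the usage comment are assumptions made for illustration, not part of any particular crawler.

```php
<?php
// Build the cURL options for fetching $url through an HTTP proxy.
function buildProxyOptions(string $url, string $proxyHost, int $proxyPort): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_PROXY          => $proxyHost, // IP address or hostname of the proxy
        CURLOPT_PROXYPORT      => $proxyPort, // port the proxy listens on
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,         // seconds allowed for connecting (assumed value)
        CURLOPT_TIMEOUT        => 30,         // seconds allowed for the whole request (assumed value)
    ];
}

// Fetch $url through the proxy; returns the response body, or null on failure.
function fetchViaProxy(string $url, string $proxyHost, int $proxyPort): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildProxyOptions($url, $proxyHost, $proxyPort));
    $body = curl_exec($ch);
    // The handle is released automatically when $ch goes out of scope.
    return $body === false ? null : $body;
}

// Usage (placeholder proxy address from a documentation range):
// $html = fetchViaProxy('https://example.com/', '203.0.113.10', 8080);
```

Keeping the option list in its own function makes it easy to inspect or extend, for example with proxy credentials, without touching the request logic.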
Where Do I Get a List of Proxies I Can Use?
To get a list of free HTTP proxies, you can use a number of resources.
Internally at Potent Pages, we maintain a list of free proxies, and keep it updated with their status (e.g. whether the proxy is up or not, and whether it gets blocked). We will be publishing this list soon.
A simple Google search for “free http proxy” will also return proxy lists, some paid and some free. In our experience, you’ll most likely want to combine the lists and check that your proxies are actually working before you make a bunch of requests through them.
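Combining lists and checking proxies could be sketched roughly as follows. This assumes each list is an array of "host:port" strings. The function names mergeProxyLists and proxyIsWorking, the test URL, and the ten-second timeout are all illustrative choices.

```php
<?php
// Merge several proxy lists of "host:port" strings into one de-duplicated list.
function mergeProxyLists(array ...$lists): array
{
    $seen = [];
    foreach ($lists as $list) {
        foreach ($list as $entry) {
            $entry = strtolower(trim($entry));
            // Keep only entries shaped like host:port with a port between 1 and 65535.
            if (preg_match('/^[^:\s]+:(\d{1,5})$/', $entry, $m)
                && (int) $m[1] >= 1 && (int) $m[1] <= 65535) {
                $seen[$entry] = true;
            }
        }
    }
    return array_keys($seen);
}

// Send a HEAD-style request through the proxy and report whether it succeeded.
function proxyIsWorking(string $proxy, string $testUrl, int $timeoutSeconds = 10): bool
{
    $ch = curl_init($testUrl);
    curl_setopt_array($ch, [
        CURLOPT_PROXY          => $proxy, // cURL accepts "host:port" in this option
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_NOBODY         => true,   // headers only; the body is not needed
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $ok     = curl_exec($ch) !== false;
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return $ok && $status >= 200 && $status < 400;
}

// Usage (placeholder addresses; the check itself needs network access):
// $proxies = mergeProxyLists($listA, $listB);
// $working = array_filter($proxies, fn ($p) => proxyIsWorking($p, 'https://example.com/'));
```

Free proxies tend to go offline without notice, so it may be worth re-running the check periodically rather than once.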