Web Crawlers and Proxies: How to Use Proxies with PHP Web Crawlers

Shield Your IP Address: Web Crawlers and Proxies

July 16, 2019 | By David Selden-Treiman | Filed in: web-crawler-development.

What is a proxy?

A proxy is a program running on another computer that receives your network traffic and forwards it to its destination. It acts as an intermediary, so the website you’re downloading from sees the proxy’s IP address instead of your own.

Why Would You Use a Proxy for Web Crawlers?

Proxies are useful when a website blocks download requests from certain IP address ranges. If you need to access such a website, a proxy lets you do so without revealing your own IP address. This is especially relevant if you’re attempting to crawl a website from a commercial IP address range (like a cloud platform).

Side Note: All of the IP addresses on the internet (at least the IPv4 ones) are allocated to different organizations, and this is public information. This allows people to see which hosting provider, ISP, or other organization owns the IP address block that you’re crawling from. If you’re attempting to crawl from an IP address owned by a hosting provider, the owner of the website can see that your IP address belongs to a hosting provider instead of a residential ISP (a server farm instead of a residential address).
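To illustrate how an address can be matched against a published block, here is a minimal sketch of an IPv4 range check. The function name ipv4InCidr is hypothetical, the addresses are reserved documentation ranges rather than any real provider's allocation, and the bit arithmetic assumes a 64-bit PHP build.

```php
<?php
// Return true if the IPv4 address $ip falls inside the CIDR block $cidr
// (for example "203.0.113.0/24"). Hypothetical helper for illustration.
function ipv4InCidr(string $ip, string $cidr): bool
{
    if (strpos($cidr, '/') === false) {
        return false; // not in "network/prefix" form
    }
    [$network, $prefix] = explode('/', $cidr, 2);
    $prefix = (int) $prefix;

    $ipLong = ip2long($ip);
    $networkLong = ip2long($network);
    if ($ipLong === false || $networkLong === false || $prefix < 0 || $prefix > 32) {
        return false; // malformed address or prefix length
    }

    // Build the netmask for the prefix; a /0 prefix matches every address.
    $mask = $prefix === 0 ? 0 : (~0 << (32 - $prefix)) & 0xFFFFFFFF;

    // The address is in the block if its network bits match the block's.
    return ($ipLong & $mask) === ($networkLong & $mask);
}
```

A website (or a service acting on its behalf) could run a check like this against blocks it associates with hosting providers, which is one reason a crawler running on a cloud platform may be treated differently from a home connection.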

Most websites won’t do this filtering on their own. Instead, they will use a service like Cloudflare that monitors for proxy-like traffic patterns and allows website owners to block IP addresses that don’t look like regular users.

Another consideration is that an IP address can reveal your country, region (state, province, etc.), and even your city. Some website owners will attempt to block crawlers and visitors based on the location their IP address indicates. Others will show different content to visitors from different locations. As a result, you may want to download a website from multiple IP addresses to get an accurate representation of how the site appears worldwide.

How Do I Use a Proxy with a PHP Web Crawler?

Using a proxy is relatively simple, at least when using cURL in PHP. To specify the IP address of your proxy, you can use the CURLOPT_PROXY option, and to specify the port of the proxy, you can use the CURLOPT_PROXYPORT option. Assuming $handle is a cURL handle created with curl_init(), your code might look like:

curl_setopt($handle, CURLOPT_PROXY, $proxy_ipAddress);
curl_setopt($handle, CURLOPT_PROXYPORT, $proxy_port);
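For context, the two options above could sit inside a complete request along these lines. This is a minimal sketch, not a definitive implementation: the function names buildProxyOptions and fetchThroughProxy, the timeout values, and the example address are assumptions for illustration, and the code assumes the PHP cURL extension is installed.

```php
<?php
// Build the cURL options for a request routed through an HTTP proxy.
// Hypothetical helper; the timeouts are illustrative defaults.
function buildProxyOptions(string $url, string $proxyIpAddress, int $proxyPort): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_PROXY          => $proxyIpAddress, // proxy host or IP address
        CURLOPT_PROXYPORT      => $proxyPort,      // proxy port
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,              // seconds allowed to connect
        CURLOPT_TIMEOUT        => 30,              // seconds allowed for the whole request
    ];
}

// Download $url through the proxy. Returns the response body, or null on failure.
function fetchThroughProxy(string $url, string $proxyIpAddress, int $proxyPort): ?string
{
    $handle = curl_init();
    curl_setopt_array($handle, buildProxyOptions($url, $proxyIpAddress, $proxyPort));

    // curl_exec() returns false when the request fails (for example, a dead proxy).
    $body = curl_exec($handle);

    // The handle is released when it goes out of scope at the end of the function.
    return $body === false ? null : $body;
}

// Example usage (203.0.113.10 is a placeholder documentation address):
// $html = fetchThroughProxy('https://example.com/', '203.0.113.10', 8080);
```

Keeping the options in a separate function makes it easy to inspect or adjust them per proxy without touching the request logic.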

Where Do I Get a List of Proxies I Can Use?

To get a list of free HTTP proxies, you can use a number of resources.

Internally at Potent Pages, we maintain a list of free proxies, and keep it updated with their status (e.g. whether the proxy is up or not, and whether it gets blocked). We will be publishing this list soon.

A simple Google search for “free http proxy” will also return lists of proxies. Some of the lists are paid and some are free. In our experience, you’ll most likely want to combine several lists and check that each proxy is actually working before you make a bunch of requests through it.
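Combining lists and checking proxies could be sketched as follows. The function names, the "ip:port" string format, the test URL, and the rule that only an HTTP 200 response counts as working are all assumptions made for illustration; the code assumes the PHP cURL extension and PHP 7.4 or later.

```php
<?php
// Merge several proxy lists ("ip:port" strings) into one list without duplicates.
function mergeProxyLists(array ...$lists): array
{
    $merged = [];
    foreach ($lists as $list) {
        foreach ($list as $proxy) {
            $proxy = trim($proxy);
            if ($proxy !== '') {
                $merged[$proxy] = true; // array keys deduplicate entries
            }
        }
    }
    return array_keys($merged);
}

// Return true if a test request through the proxy comes back with HTTP 200.
function proxyIsWorking(string $proxy, string $testUrl, int $timeoutSeconds = 10): bool
{
    $handle = curl_init($testUrl);
    curl_setopt_array($handle, [
        CURLOPT_PROXY          => $proxy,          // cURL accepts "host:port" here
        CURLOPT_RETURNTRANSFER => true,            // do not print the response body
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // give up on slow proxies
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $body = curl_exec($handle);
    $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    return $body !== false && $status === 200;
}

// Keep only the proxies for which the supplied check returns true.
function filterWorkingProxies(array $proxies, callable $isWorking): array
{
    return array_values(array_filter($proxies, $isWorking));
}

// Example usage (placeholder addresses and test URL):
// $all = mergeProxyLists($listFromSiteA, $listFromSiteB);
// $working = filterWorkingProxies($all, fn ($p) => proxyIsWorking($p, 'https://example.com/'));
```

Passing the check as a callable keeps the filtering step independent of the network, so a stricter test (for example, confirming the target site does not block the proxy) can be swapped in later.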

David Selden-Treiman

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive expertise solving problems using programming for dozens of clients. He also has extensive experience managing and optimizing servers, managing dozens of servers for both Potent Pages and other clients.

