
Website Crawler Tutorials

Whether you are looking to obtain data from a website, track changes on the internet, or use a website API, website crawlers are a great way to get the data you need. While they have many components, crawlers fundamentally use a simple process: download the raw data, process and extract it, and, if desired, store the data in a file or database. There are many ways to do this, and many languages you can build your spider or crawler in.
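
The download/extract/store cycle above can be sketched with nothing but Python's standard library. This is an illustrative minimal example, not code from any tutorial below; the function and table names are our own:

```python
import sqlite3
import urllib.request
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Extract the <title> text from a raw HTML document."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

def crawl(url: str, db_path: str = "pages.db") -> str:
    # 1. Download the raw data.
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    # 2. Process and extract the piece you care about.
    title = extract_title(html)
    # 3. Store the result in a database.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")
    conn.execute("INSERT INTO pages VALUES (?, ?)", (url, title))
    conn.commit()
    conn.close()
    return title
```

Every crawler in the tutorials below is some elaboration of these three steps.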

There are many libraries and add-ons that can make building a crawler easier: from building the HTML document object model (DOM) so that pages are easy to traverse and content is easy to extract (Cheerio), to running JavaScript on the server so the same language can drive the whole crawler (Node.js). Building a web crawler doesn’t have to be hard.

These tutorials are arranged by subject and language/technology/libraries used. To view more tutorials for a particular area, just click the title or the link at the end. This will take you to a fuller list of available tutorials.

PHP Web Crawler Tutorials

Downloading a Webpage Using PHP and cURL

How to Download a Webpage using PHP and cURL

Looking to automatically download webpages? Here’s how to download a page using PHP and cURL.

Quick PHP Web Crawler Techniques

Web Crawler Development Techniques
Techniques in PHP for building web crawlers.

Looking to have your web crawler do something specific? Try this page. We have some code that we regularly use for PHP web crawler development, including extracting images, links, and JSON from HTML documents.

Creating a Simple PHP Web Crawler

How to create a simple PHP web crawler to download a website

Looking to download a site or multiple webpages? Interested in examining all of the titles and descriptions for a site? We created a quick tutorial on building a script to do this in PHP. Learn how to download webpages and follow links to download an entire website.

Creating a Polite PHP Web Crawler: Checking robots.txt

How to create a polite PHP web crawler using robots.txt.

In this tutorial, we create a PHP website spider that uses the robots.txt file to know which pages we’re allowed to download. We continue from our previous tutorials, expanding our web spider to check for crawling permissions before downloading.
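
The robots.txt check itself is small in any language. As an illustration, Python's standard library ships a parser for it; the robots.txt body below is a made-up example:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt body, inlined here for illustration. In a real crawler
# you would fetch it from https://<host>/robots.txt first.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler asks before every download.
print(rp.can_fetch("*", "https://example.com/index.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/x"))   # False
```

The same parse-then-ask pattern is what the PHP tutorial implements by hand.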

Getting Blocked? Use a Free Proxy

Web Crawlers and Proxies: How to Use Proxies with PHP Web Crawlers
How to use free proxies with PHP web crawlers.

If you’re tired of getting blocked when using your web crawlers, we recommend using a free proxy. In this article, we go over what proxies are, how to use them, and where to find free ones.
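
As a sketch of the idea using Python's standard library (the proxy address below is a placeholder, not a real server):

```python
import urllib.request

# Hypothetical proxy address: substitute one from a free proxy list.
proxy = urllib.request.ProxyHandler({
    "http": "http://203.0.113.5:8080",
    "https": "http://203.0.113.5:8080",
})

# All requests made through this opener are routed via the proxy,
# so the target site sees the proxy's IP address instead of yours.
opener = urllib.request.build_opener(proxy)

# html = opener.open("https://example.com").read()  # network call, commented out
```

Rotating through several such openers, one per proxy, is the usual next step once a single address starts getting blocked too.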

Python Web Crawler Tutorials

How to make a Web Crawler in under 50 lines of Python code

This is a tutorial made by Stephen from Net Instructions on how to make a web crawler using Python.

A Basic 12 Line Website Crawler in Python

This is a tutorial made by Mr Falkreath about creating a basic website crawler in just 12 lines of Python code. It explains the logic behind the crawler and walks through creating the Python code.
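
A crawler that small is mostly link extraction. Here is a sketch of that core step using only Python's standard library (not Mr Falkreath's code, just the same idea):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL.
                    self.links.append(urljoin(self.base_url, value))

parser = LinkParser("https://example.com/")
parser.feed('<a href="/about">About</a> <a href="contact.html">Contact</a>')
print(parser.links)
# ['https://example.com/about', 'https://example.com/contact.html']
```

Feed each downloaded page into a parser like this, push the resulting links onto a queue, and you have the skeleton of every crawler on this page.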

Crawl a website with scrapy

This is a tutorial about building a website crawler using Python with the Scrapy library, PyMongo, and item pipelines. It includes URL patterns, code for building the spider, and instructions for extracting and retrieving the data stored in MongoDB.

Scraping Web Pages with Scrapy – Michael Herman

This is a tutorial posted by Michael Herman about crawling web pages in Python with the Scrapy library. It includes code for the central item class, the spider code that performs the downloading, and instructions for storing the data once it is obtained.

Java Web Crawler Tutorials

How to Write a Web Crawler in Java

This is a tutorial written by Viral Patel on how to develop a website crawler using Java.

How to make a Web Crawler using Java

This is a tutorial made by Program Creek on how to make a prototype web crawler using Java. The guide covers setting up MySQL, creating the database and table, and provides sample code for building a simple web crawler.

Grandiloquent Musings: My solution to the Go Tutorial Web Crawler

This is a tutorial posted by Kim Mason on solving the Go tutorial's web crawler exercise: a parallelized crawler that fetches each URL only once, with no duplicate downloading. The tutorial starts from the original script and modifies it to implement parallelization.

How to create a Web Crawler and storing data using Java – MrBool

This is a tutorial made by Anurag Jain on how to create a web crawler and efficiently store data using Java. It covers setting up the database, creating a front-end page interface for usability, describes the functionality performed, and explains how the database system relates to the final crawler.

Node.js Web Crawler Tutorials

Node.js is a server-side JavaScript runtime that can serve data in a traditional AJAX-like manner as well as do stand-alone processing. It is designed to be quick and efficient: each process runs a single-threaded event loop, handling everything with event handlers rather than spawning an operating-system process per connection, and it can scale across multiple cores by running one process per core.

Use Node.js to Extract Data from the Web for Fun and Profit

This is a tutorial posted by John Robinson on using Node.js and the Cheerio library to extract website data.

A Quick Introduction to Node-Wit Modules For Node.js

This is a tutorial made by Wit.ai on how to use the node-wit module in a Node.js server application. It covers creating a Node.js app, adding and installing dependencies, sending audio, creating an index.js file, and starting the app.

simplecrawler

This is the official documentation and tutorial for the simplecrawler library. The library is designed to provide a simple API for creating crawlers with Node.js. It includes code for both simple and advanced modes, as well as a list of configuration options.

Scraping the Web With Node.js

This is a tutorial made by Adnan Kukic about using Node.js and jQuery-style syntax to build a website crawler. It includes code for the setup, for traversing the HTML DOM to find the desired content, and instructions on formatting and extracting data from the downloaded website.

Scrapy Web Crawler Tutorials

Scraping Web Pages with Scrapy – Michael Herman

This is a tutorial posted by Michael Herman about crawling web pages in Python with the Scrapy library. It includes code for the central item class, the spider code that performs the downloading, and instructions for storing the data once it is obtained.

Scrapy Tutorial — Scrapy 0.24.5 documentation

This is the official tutorial for building a web crawler using the Scrapy library, written in Python. The tutorial walks through creating a project, defining the Item class that holds the scraped data, and writing a spider: downloading pages, extracting information, and storing it.

Build a Python Web Crawler with Scrapy – DevX

This is a tutorial made by Alessandro Zanni on how to build a Python-based web crawler using the Scrapy library. It describes the tools needed, the Python installation process, the scraper code, and the testing portion.

Web Scraping with Scrapy and MongoDB – Real Python

This is a tutorial published on Real Python about building a web crawler using Python, Scrapy, and MongoDB. This provides instruction on installing the Scrapy library and PyMongo for use with the MongoDB database; creating the spider; extracting the data; and storing the data in the MongoDB database.
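
The Scrapy-plus-MongoDB pattern these tutorials describe boils down to an item pipeline whose `process_item` hook inserts each scraped item into a collection. This sketch uses our own illustrative class names, not Real Python's code, and injects the collection so the same logic works with a real pymongo collection or a stand-in:

```python
class MongoPipeline:
    """Scrapy-style item pipeline that writes each scraped item to a
    MongoDB collection. The collection object is injected, so in a
    real project you would pass in pymongo's db["items"] here."""
    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider):
        # Scrapy calls this once per scraped item; returning the item
        # passes it along to any later pipelines.
        self.collection.insert_one(dict(item))
        return item

# A stand-in mimicking pymongo's insert_one interface, for illustration.
class FakeCollection:
    def __init__(self):
        self.docs = []
    def insert_one(self, doc):
        self.docs.append(doc)

pipeline = MongoPipeline(FakeCollection())
pipeline.process_item({"title": "Example", "url": "https://example.com"}, spider=None)
```

In a real Scrapy project the pipeline is registered in `ITEM_PIPELINES` in settings.py, and Scrapy drives `process_item` for you.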

PhantomJS Web Crawler Tutorials

Web scraping with Node.js – Matt’s Hacking Blog

This is a tutorial made by Matt Hacklings about web scraping and building a crawler using JavaScript, PhantomJS, Node.js, and Ajax. It includes code for creating a JavaScript crawler function and for limiting the maximum number of concurrent browser sessions performing the downloading.

Getting started with Selenium Webdriver for node.js

This is a tutorial made by Max Edmands about using the selenium-webdriver library with Node.js and PhantomJS to build a website crawler. It includes steps for setting up the run environment, building the driver, visiting the page, verifying it, querying the HTML DOM to obtain the desired content, and interacting with the page once the HTML has been downloaded and parsed.

Crawl your website including login form with PhantomJS – Adaltas

This is a tutorial made by Adaltas about crawling a website that requires a login form, using jQuery-based JavaScript, PhantomJS to run the JavaScript, and Node.js on the server side. It breaks the crawler into multiple scripts: the login action, the function action, the action runner, and the pilot that controls the system.


