Downloading a Webpage Using Selenium & PHP

October 6, 2023 | By David Selden-Treiman | Filed in: php.

How can you control Chrome using PHP? It’s a skill I’ve worked for years to master. I use these techniques to build web crawlers professionally all the time.

In this video, I’m going to show you a quick, condensed version that you can use today. We’ll go over how to set up a Selenium instance, as well as how to go to a page.

If you stick around to the end, I’ll show you a technique to extract all of the text from the page. This even includes the hard-to-reach text created with JavaScript.

A Preview

Here’s the script we’ll be building:

<?php
use \Facebook\WebDriver\Chrome\ChromeOptions;
use \Facebook\WebDriver\Remote\DesiredCapabilities;
use \Facebook\WebDriver\Remote\RemoteWebDriver;
use \Facebook\WebDriver\WebDriverBy;
use \Facebook\WebDriver\WebDriverExpectedCondition;

require('vendor/autoload.php');

//Settings
$seleniumHost = 'http://127.0.0.1:4444/wd/hub';
$windowSize = '1280,1024';
$userAgent = "Selenium Test Browser";
$seleniumConnectionTimeout = 10*1000; // 10 seconds
$seleniumCompletionTimeout = 60*1000; // 60 seconds

//Chromium Arguments
$args = Array();
$args[] = '--window-size='. $windowSize;
$args[] = '--user-agent='. $userAgent;

//Set Up Chrome Options
$chromeOptions = new ChromeOptions();
$chromeOptions->setExperimentalOption('w3c', false);
$chromeOptions->addArguments($args);

//Set Capabilities
$seleniumCapabilities = DesiredCapabilities::chrome();
$seleniumCapabilities->setCapability(ChromeOptions::CAPABILITY, $chromeOptions);

//Create Selenium Connection
$driver = RemoteWebDriver::create($seleniumHost, $seleniumCapabilities, $seleniumConnectionTimeout, $seleniumCompletionTimeout);

//Go To URL
$url = "https://potentpages.com/";
$driver->navigate()->to($url);

//Get Text
$script = file_get_contents("getPageText.js");
$text = $driver->executeScript($script);
echo $text. "\n";

//Exit
$driver->quit();

Requirements

To begin with, you’ll need Selenium Grid installed and running on a machine. If you don’t already have this, we have a tutorial on how to set up Selenium Grid on your own VPS.

You’ll need the command-line version of PHP (php-cli) installed on your computer, since PHP running through a web server can time out before your request is done.
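If you want the script itself to guard against being run through a web server, a small check like the one below can help. This is only a sketch: the isCommandLine() helper is a name I made up for illustration, and it relies on PHP’s built-in PHP_SAPI constant.

```php
<?php
// Hypothetical helper (not part of the original script): report whether
// PHP is running from the command line rather than through a web server.
function isCommandLine(): bool
{
    // PHP_SAPI is a built-in constant; it is 'cli' for the php-cli binary.
    return PHP_SAPI === 'cli';
}

if (!isCommandLine()) {
    // Stop early with a clear message instead of timing out mid-request.
    echo "Please run this script from the command line, e.g. php script.php\n";
    exit(1);
}
```

You could place this near the top of script.php, before any Selenium code runs.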

If you don’t already have Composer installed, you can download it with this command:

On Linux: curl -sS https://getcomposer.org/installer | php && sudo mv -f composer.phar /usr/bin/composer

On Windows: curl -sS https://getcomposer.org/installer | php

PHP Library Requirements

The main library we will be using here is php-webdriver. This allows us to connect via our PHP script to Selenium in order to control a Chrome browser. To require it, run the following in your terminal:

composer require php-webdriver/webdriver

This will put the library in the “vendor” folder so that we can use it with our PHP script.

Starting The Script

To begin our code, we’ll want to create a file for our script. I’ll be using script.php here.

We need to add our opening <?php tag to the top of the script.

Next, we’ll need some “use” lines. Here’s what we’ll be adding:

use \Facebook\WebDriver\Chrome\ChromeOptions;
use \Facebook\WebDriver\Remote\DesiredCapabilities;
use \Facebook\WebDriver\Remote\RemoteWebDriver;
use \Facebook\WebDriver\WebDriverBy;
use \Facebook\WebDriver\WebDriverExpectedCondition;

ChromeOptions lets us set some options for our Chrome browser.
DesiredCapabilities allows us to send our options to Selenium for the browser.
RemoteWebDriver is the main class that creates our connection variable that we will be using to control our web browser.
WebDriverBy will allow us to use XPath expressions to access elements on the webpages we download.
WebDriverExpectedCondition allows us to wait for conditions on the page, such as an element located with WebDriverBy being present.

We also need to include our libraries that we installed with Composer. We do that with a require line:

require('vendor/autoload.php');

Defining Some Settings

Next, we’re going to define some settings.

Selenium Host

We’re going to create a $seleniumHost variable that will set our Selenium host. This will usually be the IP address of your Selenium Grid server followed by the port, :4444, or localhost:4444 if you have Selenium Grid installed on your local machine.

This is then followed by the path “/wd/hub”.

$seleniumHost = 'http://127.0.0.1:4444/wd/hub';
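If you switch between a local Selenium Grid and a remote one, it can be handy to assemble the host string from its parts. Here’s a minimal sketch; buildSeleniumHost() is a hypothetical helper of my own naming, and it assumes a plain http connection as in the line above.

```php
<?php
// Hypothetical helper: assemble the Selenium endpoint from its parts.
// Assumes plain http, the default port 4444, and the "/wd/hub" path.
function buildSeleniumHost(string $host, int $port = 4444, string $path = '/wd/hub'): string
{
    // ltrim() lets callers pass the path with or without a leading slash.
    return 'http://' . $host . ':' . $port . '/' . ltrim($path, '/');
}

$seleniumHost = buildSeleniumHost('127.0.0.1');
echo $seleniumHost . "\n"; // http://127.0.0.1:4444/wd/hub
```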

Window Size

We’re going to create a $windowSize variable that will set the size of our Chromium browser window. This is just the desired width in pixels, a comma, and then the desired height.

$windowSize = '1280,1024';

User Agent

We’re going to create a $userAgent variable next that will be sent in place of the default chromium user agent. This can be changed to whatever you’d like.

$userAgent = "Selenium Test Browser";

Connection Timeout

We’re going to want to define the maximum number of milliseconds our connection to Selenium should take. I use a default of 10 seconds.

$seleniumConnectionTimeout = 10*1000; //10 Seconds

Completion Timeout

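We’re also going to want to define the maximum number of milliseconds a request to Selenium can take to complete. php-webdriver applies this as the timeout for each request it sends, so it needs to be long enough for your slowest page load. I use 60 seconds for shorter scripts.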

$seleniumCompletionTimeout = 60*1000; //60 Seconds

Chromium Arguments

First, we need to create an array with our Chromium arguments. These are the same as the command line arguments provided here: https://peter.sh/experiments/chromium-command-line-switches/.

The individual arguments all start with “--”, just as they would if you used them on the command line.

We start by defining an $args array variable.

$args = Array();

Next, we’ll add in our window size. This is set with the “--window-size” argument.

$args[] = '--window-size='. $windowSize;

Finally, we’ll add in our user agent. This is set with the “--user-agent” argument.

$args[] = '--user-agent='. $userAgent;
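As the list of switches grows, typing the “--” and “=” by hand gets error-prone. One option is to build the array from name and value pairs, as in this sketch. The buildChromiumArgs() helper is hypothetical (it is not part of php-webdriver), and the headless entry is only there to show how an on/off switch could be handled.

```php
<?php
// Hypothetical helper: turn an associative array of switch names and values
// into the "--name=value" strings Chromium expects.
function buildChromiumArgs(array $switches): array
{
    $args = [];
    foreach ($switches as $name => $value) {
        if ($value === false || $value === null) {
            continue; // switch disabled, leave it out entirely
        }
        if ($value === true) {
            $args[] = '--' . $name; // on/off switch with no value
            continue;
        }
        $args[] = '--' . $name . '=' . $value;
    }
    return $args;
}

$args = buildChromiumArgs([
    'window-size' => '1280,1024',
    'user-agent'  => 'Selenium Test Browser',
    'headless'    => false, // set to true to add the --headless switch
]);
print_r($args);
```

The resulting $args array can be passed to addArguments() in the same way as the hand-built one.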

Setting Up Chrome Options

Now, we’re going to set up the options for our Chrome browser. To start, we’re going to want to create a new ChromeOptions object.

$chromeOptions = new ChromeOptions();

Disable w3c Mode

We’re going to want to disable w3c mode. In my experience, leaving w3c mode on can cause some errors, so I recommend disabling it.

$chromeOptions->setExperimentalOption('w3c', false);

Add Arguments

Next, we’re going to want to add our Chromium arguments to our $chromeOptions variable.

$chromeOptions->addArguments($args);

Setting Our Desired Capabilities

We need to set our desired Selenium capabilities. We start this by creating a DesiredCapabilities object for Chrome.

$seleniumCapabilities = DesiredCapabilities::chrome();

Next, we set our $chromeOptions variable as our desired Chrome capabilities.

$seleniumCapabilities->setCapability(ChromeOptions::CAPABILITY, $chromeOptions);

Creating Our Selenium Instance

At this point, we can create our Selenium instance. This will be a RemoteWebDriver object. We will send our host, the desired capabilities, the connection timeout, and the completion timeout as parameters.

$driver = RemoteWebDriver::create($seleniumHost, $seleniumCapabilities, $seleniumConnectionTimeout, $seleniumCompletionTimeout);
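If Selenium is still starting up, the first connection attempt can fail. One way to handle that is to retry a few times before giving up. The sketch below is an assumption on my part rather than something php-webdriver provides: connectWithRetry() is a hypothetical helper that takes any callable, so the real RemoteWebDriver::create() call would go inside the closure you pass in.

```php
<?php
// Hypothetical helper: call $factory until it succeeds or we run out of attempts.
// $factory is any callable that returns a driver or throws on failure.
function connectWithRetry(callable $factory, int $maxAttempts = 3, int $delayMs = 1000)
{
    $lastError = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return $factory();
        } catch (Exception $e) {
            $lastError = $e;
            if ($attempt < $maxAttempts) {
                usleep($delayMs * 1000); // usleep() takes microseconds
            }
        }
    }
    // Keep the last failure attached so the original cause is not lost.
    throw new RuntimeException(
        'Could not connect after ' . $maxAttempts . ' attempts.',
        0,
        $lastError
    );
}

// Example wiring (assumes the variables defined earlier in script.php):
// $driver = connectWithRetry(function () use ($seleniumHost, $seleniumCapabilities, $seleniumConnectionTimeout, $seleniumCompletionTimeout) {
//     return RemoteWebDriver::create($seleniumHost, $seleniumCapabilities, $seleniumConnectionTimeout, $seleniumCompletionTimeout);
// });
```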

Going to a Webpage

Now that we have our driver variable configured, we are ready to actually go to a webpage using our Chrome browser. We’ll set our $url variable to the URL we want to go to. For this tutorial, we’ll use the Potent Pages homepage, but you can use whatever URL you’d like.

$url = "https://potentpages.com/";

To actually navigate to the webpage, we’ll call the navigate() method and chain a to($url) call onto it. The line should look something like this:

$driver->navigate()->to($url);
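If your URLs come from a list or from user input, it may be worth checking them before handing them to the browser. Here’s a minimal sketch using PHP’s built-in filter_var() and parse_url(). The isCrawlableUrl() name and the choice to allow only http and https are my own assumptions.

```php
<?php
// Hypothetical helper: accept only well-formed http(s) URLs.
function isCrawlableUrl(string $url): bool
{
    // FILTER_VALIDATE_URL rejects strings without a scheme, e.g. "potentpages.com".
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

$url = "https://potentpages.com/";
if (!isCrawlableUrl($url)) {
    echo "Skipping invalid URL: " . $url . "\n";
}
```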

Getting the Visible Text On Our Page

With our page loaded, we can run JavaScript on our page. We can simply run:

$value = $driver->executeScript($script);

where the $script variable contains our script. Whatever our script returns will be stored in the $value variable.

For example, we can save the following script into a separate file, in this case getPageText.js:

function getVisibleText(element) {
    window.getSelection().removeAllRanges();
    
    let visibleText = '';
    let range = document.createRange();
    range.selectNode(element);
    window.getSelection().addRange(range);
    visibleText = window.getSelection().toString().trim();
    window.getSelection().removeAllRanges();
    
    return visibleText;
}

var text = getVisibleText(document.getElementsByTagName('body')[0]);
return text;

Next, in our PHP script, we will load this file into the $script variable.

$script = file_get_contents("getPageText.js");
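Note that file_get_contents() returns false if the file can’t be read, and that value would then be passed along as the script. If you’d rather fail with a clear message, a small wrapper like this sketch can help. The loadScript() helper is hypothetical.

```php
<?php
// Hypothetical helper: load a JavaScript file or fail with a clear error.
function loadScript(string $path): string
{
    if (!is_readable($path)) {
        throw new RuntimeException('Cannot read script file: ' . $path);
    }
    $script = file_get_contents($path);
    if ($script === false || trim($script) === '') {
        throw new RuntimeException('Script file is empty or unreadable: ' . $path);
    }
    return $script;
}

// Example use (assumes getPageText.js sits next to script.php):
// $script = loadScript(__DIR__ . '/getPageText.js');
```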

We can then run this script using the executeScript method.

$text = $driver->executeScript($script);

Finally, we can show that text with a quick echo command.

echo $text. "\n";

Running this script will display all of the visible text on the page, including any text generated with JavaScript.
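The text that comes back often has blank lines and uneven spacing. If you want something tidier to store or search, you could clean it up on the PHP side. This is only a sketch of one approach; cleanPageText() is a hypothetical helper, and the sample string stands in for whatever executeScript() returned.

```php
<?php
// Hypothetical helper: split page text into trimmed, non-empty lines.
function cleanPageText(string $text): array
{
    // \R matches any line break; fall back to "\n" if the text is not valid UTF-8.
    $lines = preg_split('/\R/u', $text);
    if ($lines === false) {
        $lines = explode("\n", $text);
    }
    $clean = [];
    foreach ($lines as $line) {
        // Collapse runs of spaces and tabs, then trim the ends.
        $line = trim(preg_replace('/[ \t]+/', ' ', $line));
        if ($line !== '') {
            $clean[] = $line;
        }
    }
    return $clean;
}

// Sample input standing in for the $text returned by executeScript().
$text = "  Home \n\n  About   Us\t\r\nContact ";
print_r(cleanPageText($text));
```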

Ending Our Script

At the end of the script, we’ll need to add a quit() call to end the session. If we don’t, the browser can stay open on the Selenium server after our script finishes, typically until the server times the session out. To avoid this, we run the following command:

$driver->quit();
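If something throws an error partway through the script, the quit() line at the bottom never runs. One way to guard against that is a try/finally block. In this sketch, runWithDriver() is a hypothetical helper; it only assumes that the object you pass in has a quit() method, as our $driver does.

```php
<?php
// Hypothetical helper: run $work with the driver and always quit afterwards.
function runWithDriver($driver, callable $work)
{
    try {
        return $work($driver);
    } finally {
        // Runs whether $work returned normally or threw an exception.
        $driver->quit();
    }
}

// Example use (assumes $driver, $url, and $script from earlier in script.php):
// $text = runWithDriver($driver, function ($driver) use ($url, $script) {
//     $driver->navigate()->to($url);
//     return $driver->executeScript($script);
// });
```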

Our Completed Script

Here’s the completed PHP script:

<?php
use \Facebook\WebDriver\Chrome\ChromeOptions;
use \Facebook\WebDriver\Remote\DesiredCapabilities;
use \Facebook\WebDriver\Remote\RemoteWebDriver;
use \Facebook\WebDriver\WebDriverBy;
use \Facebook\WebDriver\WebDriverExpectedCondition;

require('vendor/autoload.php');

//Settings
$seleniumHost = 'http://127.0.0.1:4444/wd/hub';
$windowSize = '1280,1024';
$userAgent = "Selenium Test Browser";
$seleniumConnectionTimeout = 10*1000; // 10 seconds
$seleniumCompletionTimeout = 60*1000; // 60 seconds

//Chromium Arguments
$args = Array();
$args[] = '--window-size='. $windowSize;
$args[] = '--user-agent='. $userAgent;

//Set Up Chrome Options
$chromeOptions = new ChromeOptions();
$chromeOptions->setExperimentalOption('w3c', false);
$chromeOptions->addArguments($args);

//Set Capabilities
$seleniumCapabilities = DesiredCapabilities::chrome();
$seleniumCapabilities->setCapability(ChromeOptions::CAPABILITY, $chromeOptions);

//Create Selenium Connection
$driver = RemoteWebDriver::create($seleniumHost, $seleniumCapabilities, $seleniumConnectionTimeout, $seleniumCompletionTimeout);

//Go To URL
$url = "https://potentpages.com/";
$driver->navigate()->to($url);

//Get Text
$script = file_get_contents("getPageText.js");
$text = $driver->executeScript($script);
echo $text. "\n";

//Exit
$driver->quit();

Completed!

At this point, you should have a working PHP script that downloads a page and extracts the visible text for you to use.

David Selden-Treiman

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive expertise solving problems using programming for dozens of clients. He also has extensive experience managing and optimizing servers, managing dozens of servers for both Potent Pages and other clients.

Tags: PHP Web Crawler


