Downloading a Webpage Using Selenium & PHP
October 6, 2023 | By David Selden-Treiman | Filed in: php.
How can you control Chrome using PHP? It’s a skill I’ve worked for years to master. I use these techniques to build web crawlers professionally all the time.
In this video, I’m going to show you a quick, condensed version that you can use today. We’ll go over how to set up a Selenium instance, as well as how to go to a page.
If you stick around to the end, I’ll show you a technique to extract all of the text from the page. This even includes the hard-to-reach text created with JavaScript.
A Preview
Here’s the script that we will be building:
<?php
use \Facebook\WebDriver\Chrome\ChromeOptions;
use \Facebook\WebDriver\Remote\DesiredCapabilities;
use \Facebook\WebDriver\Remote\RemoteWebDriver;
use \Facebook\WebDriver\WebDriverBy;
use \Facebook\WebDriver\WebDriverExpectedCondition;
require('vendor/autoload.php');
//Settings
$seleniumHost = 'http://127.0.0.1:4444/wd/hub';
$windowSize = '1280,1024';
$userAgent = "Selenium Test Browser";
$seleniumConnectionTimeout = 10*1000; // 10 seconds
$seleniumCompletionTimeout = 60*1000; // 60 seconds
//Chromium Arguments
$args = Array();
$args[] = '--window-size='. $windowSize;
$args[] = '--user-agent='. $userAgent;
//Set Up Chrome Options
$chromeOptions = new ChromeOptions();
$chromeOptions->setExperimentalOption('w3c', false);
$chromeOptions->addArguments($args);
//Set Capabilities
$seleniumCapabilities = DesiredCapabilities::chrome();
$seleniumCapabilities->setCapability(ChromeOptions::CAPABILITY, $chromeOptions);
//Create Selenium Connection
$driver = RemoteWebDriver::create($seleniumHost, $seleniumCapabilities, $seleniumConnectionTimeout, $seleniumCompletionTimeout);
//Go To URL
$url = "https://potentpages.com/";
$driver->navigate()->to($url);
//Get Text
$script = file_get_contents("getPageText.js");
$text = $driver->executeScript($script);
echo $text. "\n";
//Exit
$driver->quit();
Requirements
To begin with, you’ll need Selenium Grid installed and running on a machine. If you don’t already have this, we have a tutorial on how to set up Selenium Grid on your own VPS.
You’ll need the command-line version of PHP (php-cli) installed on your computer, since PHP running under a web server can time out before your request is done.
If you don’t already have Composer installed, you can download it with one of these commands:
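As a minimal sketch of how a script could enforce the command-line requirement, the check below compares PHP's built-in PHP_SAPI constant against 'cli'. The helper name isCliSapi is an assumption made for illustration; it is not part of the original script.

```php
<?php
// Hypothetical helper: returns true when the given SAPI name is the command line.
function isCliSapi(string $sapi): bool
{
    return $sapi === 'cli';
}

// PHP_SAPI is a built-in constant holding the name of the current server API.
if (!isCliSapi(PHP_SAPI)) {
    // Under a web server, the request could time out before Selenium finishes.
    echo "Please run this script from the command line (php script.php).\n";
    exit(1);
}
```

Placing a guard like this near the top of script.php stops the script early instead of letting it fail partway through a long crawl.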
On Linux: sudo curl -sS https://getcomposer.org/installer | php && sudo mv -f composer.phar /usr/bin/composer
On Windows: curl -sS https://getcomposer.org/installer | php
PHP Library Requirements
The main library we will be using here is php-webdriver. This allows us to connect via our PHP script to Selenium in order to control a Chrome browser. To require it, run this in your terminal:
composer require php-webdriver/webdriver
This will put the library in the “vendor” folder so that we can use it with our PHP script.
Starting The Script
To begin our code, we’ll want to create a file for our script. I’ll be using script.php here.
We need to add our starting <?php directive to the top of the script.
Next, we’ll need some “use” lines. Here’s what we’ll be adding:
use \Facebook\WebDriver\Chrome\ChromeOptions;
use \Facebook\WebDriver\Remote\DesiredCapabilities;
use \Facebook\WebDriver\Remote\RemoteWebDriver;
use \Facebook\WebDriver\WebDriverBy;
use \Facebook\WebDriver\WebDriverExpectedCondition;
- We need the ChromeOptions to set some options for our Chrome browser.
- The DesiredCapabilities allows us to send our options to Selenium for the browser.
- RemoteWebDriver is the main class that creates our connection variable that we will be using to control our web browser.
- WebDriverBy will allow us to locate elements on the webpages we download, for example with XPaths.
- WebDriverExpectedCondition allows us to wait for a condition on the page, such as an element located with WebDriverBy being present.
We also need to include our libraries that we installed with Composer. We do that with a require line:
require('vendor/autoload.php');
Defining Some Settings
Next, we’re going to define some settings.
Selenium Host
We’re going to create a $seleniumHost variable that will set our Selenium host. This will usually be the IP address of your Selenium Grid server followed by port 4444 (:4444), or localhost:4444 if you have Selenium Grid installed on your local machine.
This is then followed by the path “/wd/hub”.
$seleniumHost = 'http://127.0.0.1:4444/wd/hub';
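If you switch between a local and a remote Selenium Grid, the host string can be assembled from its parts. This is only a sketch: buildSeleniumHost is a hypothetical helper, and 203.0.113.10 is a placeholder address from the documentation range, not a real server.

```php
<?php
// Hypothetical helper: builds the Selenium hub URL from a host and a port.
function buildSeleniumHost(string $host = '127.0.0.1', int $port = 4444): string
{
    return 'http://' . $host . ':' . $port . '/wd/hub';
}

// Local Selenium Grid (same value as in the tutorial).
$seleniumHost = buildSeleniumHost();

// Remote Selenium Grid; the IP address here is a placeholder.
$remoteHost = buildSeleniumHost('203.0.113.10');

echo $seleniumHost . "\n";
echo $remoteHost . "\n";
```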
Window Size
We’re going to create a $windowSize variable that will set the size of our Chromium browser window. This is just the desired width, a comma, and the desired height.
$windowSize = '1280,1024';
User Agent
We’re going to create a $userAgent variable next that will be sent in place of the default Chromium user agent. This can be changed to whatever you’d like.
$userAgent = "Selenium Test Browser";
Connection Timeout
We’re going to want to define the maximum number of milliseconds our connection to Selenium should take. I use a default of 10 seconds.
$seleniumConnectionTimeout = 10*1000; //10 Seconds
Completion Timeout
We’re also going to want to define the maximum number of milliseconds each request to Selenium is allowed to take before it times out. I use 60 seconds for shorter scripts.
$seleniumCompletionTimeout = 60*1000; //60 Seconds
Chromium Arguments
First, we need to create an array with our Chromium arguments. These are the same as the command line arguments provided here: https://peter.sh/experiments/chromium-command-line-switches/.
The individual arguments all start with “--”, just like if you were to use them on the command line.
We start by defining an $args array variable.
$args = Array();
Next, we’ll add in our window size. This is set with the “--window-size” argument.
$args[] = '--window-size='. $windowSize;
Finally, we’ll add in our user agent. This is set with the “--user-agent” argument.
$args[] = '--user-agent='. $userAgent;
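The argument list can also be built in one place, which makes it harder to forget the “=” between a switch and its value. The sketch below assumes a hypothetical helper named buildChromeArgs. The optional --headless switch is a standard Chromium argument, but it is not used in this tutorial's script.

```php
<?php
// Hypothetical helper: collects the Chromium command line switches in one array.
function buildChromeArgs(string $windowSize, string $userAgent, bool $headless = false): array
{
    $args = [];
    $args[] = '--window-size=' . $windowSize;
    $args[] = '--user-agent=' . $userAgent;
    if ($headless) {
        // Optional extra switch; not part of the original script.
        $args[] = '--headless';
    }
    return $args;
}

$args = buildChromeArgs('1280,1024', 'Selenium Test Browser');
print_r($args);
```

The resulting $args array can be passed to addArguments() in the same way as the hand-built one.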
Setting Up Chrome Options
Now, we’re going to set up the options for our Chrome browser. To start, we’re going to want to create a new ChromeOptions object.
$chromeOptions = new ChromeOptions();
Disable w3c Mode
We’re going to want to disable w3c mode. In my experience, leaving it enabled can cause some errors, so I recommend disabling it.
$chromeOptions->setExperimentalOption('w3c', false);
Add Arguments
Next, we’re going to want to add our Chromium arguments to our $chromeOptions variable.
$chromeOptions->addArguments($args);
Setting Our Desired Capabilities
We need to set our desired Selenium capabilities. We start by creating a DesiredCapabilities object for Chrome.
$seleniumCapabilities = DesiredCapabilities::chrome();
Next, we set our $chromeOptions variable as our desired Chrome capabilities.
$seleniumCapabilities->setCapability(ChromeOptions::CAPABILITY, $chromeOptions);
Creating Our Selenium Instance
At this point, we can create our Selenium instance. This will be a RemoteWebDriver object. We will send our host, the desired capabilities, the connection timeout, and the completion timeout as parameters.
$driver = RemoteWebDriver::create($seleniumHost, $seleniumCapabilities, $seleniumConnectionTimeout, $seleniumCompletionTimeout);
Going to a Webpage
Now that we have our driver variable configured, we are ready to actually go to a webpage using our Chrome browser. We’ll set our $url variable to the URL we want to go to. For this tutorial, we’ll use the Potent Pages homepage, but you can use whatever URL you’d like.
$url = "https://potentpages.com/";
To actually navigate to the webpage, we’ll call the navigate() method and chain a to($url) call onto it. The line should look something like this:
$driver->navigate()->to($url);
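If the URL comes from user input or a file rather than a hard-coded string, it may be worth checking it before handing it to the browser. This is a sketch using PHP's built-in filter_var() and parse_url(); the helper name isNavigableUrl and the choice to allow only http and https are assumptions for illustration.

```php
<?php
// Hypothetical helper: true when $url is a syntactically valid http(s) URL.
function isNavigableUrl(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

$url = 'https://potentpages.com/';
if (!isNavigableUrl($url)) {
    echo 'Refusing to navigate to an invalid URL: ' . $url . "\n";
    exit(1);
}
```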
Getting the Visible Text On Our Page
With our page loaded, we can run JavaScript on our page. We can simply run:
$value = $driver->executeScript($script);
where the $script variable contains our script. Whatever our script returns will be stored in the $value variable.
For example, we can save the following script into a separate file, in this case getPageText.js:
function getVisibleText(element) {
window.getSelection().removeAllRanges();
let visibleText = '';
let range = document.createRange();
range.selectNode(element);
window.getSelection().addRange(range);
visibleText = window.getSelection().toString().trim();
window.getSelection().removeAllRanges();
return visibleText;
}
var text = getVisibleText(document.getElementsByTagName('body')[0]);
return text;
Next, in our PHP script, we will load this file into the $script variable.
$script = file_get_contents("getPageText.js");
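Note that file_get_contents() returns false when the file cannot be read, which would silently send an empty script to the browser. A slightly more defensive loader might look like the sketch below; loadScript is a hypothetical helper, not part of the original script.

```php
<?php
// Hypothetical helper: reads a JavaScript file, or throws if it cannot be read.
function loadScript(string $path): string
{
    if (!is_readable($path)) {
        throw new RuntimeException('Cannot read script file: ' . $path);
    }
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException('Failed to load script file: ' . $path);
    }
    return $contents;
}
```

With this in place, the loading line would become $script = loadScript("getPageText.js"); and a missing file produces a clear error instead of an empty result.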
We can then run this script using the executeScript method.
$text = $driver->executeScript($script);
Finally, we can show that text with a quick echo command.
echo $text. "\n";
Running this script will display all of the visible text on the page, including any text generated with JavaScript.
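Once the text is back in PHP, you will often want to clean it up before storing or searching it. The following sketch splits the text into trimmed, non-empty lines. The helper name textToLines and the sample string are made up for illustration; in the real script, you would pass in the $text value returned by executeScript().

```php
<?php
// Hypothetical helper: turns the raw value from executeScript() into clean lines.
function textToLines($text): array
{
    if (!is_string($text)) {
        // Anything other than a string (for example null) is treated as no text.
        return [];
    }
    // Split on Windows, old Mac, and Unix line endings.
    $lines = preg_split('/\r\n|\r|\n/', $text);
    $lines = array_map('trim', $lines);
    // Drop empty lines and reindex the array.
    return array_values(array_filter($lines, function ($line) {
        return $line !== '';
    }));
}

// Sample input standing in for the text returned by the browser.
$sample = "Home\n\n  About Us \r\nContact";
print_r(textToLines($sample));
```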
Ending Our Script
At the end of the script, we’ll need to add a quit() call to end the session. If we don’t, the browser session can stay open on the Selenium server until it times out. To avoid this, we run the following command:
$driver->quit();
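If something throws an exception before the script reaches quit(), the session is left open anyway. One way to guard against that is a try/finally block. The sketch below uses a stand-in FakeDriver class so it can run without a Selenium server; FakeDriver and runWithDriver are hypothetical names, and in the real script you would pass in the RemoteWebDriver object instead.

```php
<?php
// Stand-in for RemoteWebDriver so this sketch runs without a Selenium server.
// FakeDriver is hypothetical; only the quit() method name mirrors the real driver.
class FakeDriver
{
    public $quitCalled = false;

    public function quit(): void
    {
        $this->quitCalled = true;
    }
}

// Hypothetical helper: runs $work with the driver and always ends the session.
function runWithDriver($driver, callable $work)
{
    try {
        return $work($driver);
    } finally {
        // The finally block runs whether $work returned normally or threw.
        $driver->quit();
    }
}

$driver = new FakeDriver();
try {
    runWithDriver($driver, function ($d) {
        throw new RuntimeException('Page failed to load');
    });
} catch (RuntimeException $e) {
    echo 'Error: ' . $e->getMessage() . "\n";
}
var_dump($driver->quitCalled); // quit() was still called despite the exception
```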
Our Completed Script
Here’s the completed PHP script:
<?php
use \Facebook\WebDriver\Chrome\ChromeOptions;
use \Facebook\WebDriver\Remote\DesiredCapabilities;
use \Facebook\WebDriver\Remote\RemoteWebDriver;
use \Facebook\WebDriver\WebDriverBy;
use \Facebook\WebDriver\WebDriverExpectedCondition;
require('vendor/autoload.php');
//Settings
$seleniumHost = 'http://127.0.0.1:4444/wd/hub';
$windowSize = '1280,1024';
$userAgent = "Selenium Test Browser";
$seleniumConnectionTimeout = 10*1000; // 10 seconds
$seleniumCompletionTimeout = 60*1000; // 60 seconds
//Chromium Arguments
$args = Array();
$args[] = '--window-size='. $windowSize;
$args[] = '--user-agent='. $userAgent;
//Set Up Chrome Options
$chromeOptions = new ChromeOptions();
$chromeOptions->setExperimentalOption('w3c', false);
$chromeOptions->addArguments($args);
//Set Capabilities
$seleniumCapabilities = DesiredCapabilities::chrome();
$seleniumCapabilities->setCapability(ChromeOptions::CAPABILITY, $chromeOptions);
//Create Selenium Connection
$driver = RemoteWebDriver::create($seleniumHost, $seleniumCapabilities, $seleniumConnectionTimeout, $seleniumCompletionTimeout);
//Go To URL
$url = "https://potentpages.com/";
$driver->navigate()->to($url);
//Get Text
$script = file_get_contents("getPageText.js");
$text = $driver->executeScript($script);
echo $text. "\n";
//Exit
$driver->quit();
Completed!
At this point, you should have a working PHP script that downloads a page and extracts the visible text for you to use.
David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has extensive expertise solving problems using programming for dozens of clients. He also has extensive experience managing and optimizing servers, managing dozens of servers for both Potent Pages and other clients.