What you’ll build
You’ll create a small PHP script that connects to a Selenium server, launches Chrome, navigates to a target URL, and runs a JavaScript helper that returns all visible text on the page (including content rendered by JavaScript).
Quickstart (copy/paste)
If you already have Selenium running, this is the simplest working script; we’ll break down each piece below.
<?php

use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

require __DIR__ . '/vendor/autoload.php';

// ===== Settings =====
$seleniumHost = 'http://127.0.0.1:4444/wd/hub'; // common default (Grid/Standalone)
$windowSize   = '1280,1024';
$userAgent    = 'Selenium PHP Crawler';
$connTimeout  = 10 * 1000; // connection timeout, ms
$reqTimeout   = 60 * 1000; // request timeout, ms

// ===== Chrome args =====
$args = [
    '--window-size=' . $windowSize,
    '--user-agent=' . $userAgent,
    // Optional hardening flags (often helpful on servers/in containers):
    // '--headless=new',
    // '--disable-gpu',
    // '--no-sandbox',
    // '--disable-dev-shm-usage',
];

// ===== Options + capabilities =====
// Note: the legacy 'w3c' => false experimental option is intentionally omitted;
// recent ChromeDriver releases only support the W3C protocol and reject it.
$chromeOptions = new ChromeOptions();
$chromeOptions->addArguments($args);

$capabilities = DesiredCapabilities::chrome();
$capabilities->setCapability(ChromeOptions::CAPABILITY, $chromeOptions);

// ===== Run =====
$driver = null;
try {
    $driver = RemoteWebDriver::create($seleniumHost, $capabilities, $connTimeout, $reqTimeout);

    $url = 'https://potentpages.com/';
    $driver->navigate()->to($url);

    // Wait a beat if needed (simple approach). For production, prefer explicit waits.
    usleep(350 * 1000);

    $script = file_get_contents(__DIR__ . '/getPageText.js');
    $text   = $driver->executeScript($script);

    echo $text . PHP_EOL;
} finally {
    if ($driver) {
        $driver->quit();
    }
}
Next you’ll create getPageText.js (provided below) and run the script from the CLI.
Requirements
- PHP CLI (recommended). Running via the web server can time out on slow pages.
- Composer for dependency installation.
- Selenium (Grid or standalone) running locally or on a VPS.
- Chrome available to Selenium (usually via Selenium’s Docker images).
Step 1 — Install php-webdriver
The main library we’ll use is php-webdriver, which lets PHP connect to Selenium and control Chrome. In your project folder:
composer require php-webdriver/webdriver
This creates a vendor/ directory and an autoloader you’ll include in your script.
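To confirm the install, a quick sanity check like the one below should print bool(true) once vendor/autoload.php exists (the check-install.php filename is just an example):

<?php
// check-install.php (example name): run it from the project root.
// Loads Composer's autoloader and confirms the webdriver classes resolve.
require __DIR__ . '/vendor/autoload.php';

var_dump(class_exists(\Facebook\WebDriver\Remote\RemoteWebDriver::class)); // expect: bool(true)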
Step 2 — Start Selenium
You can run Selenium locally or on a server. For quick local testing, Selenium’s Docker images are the simplest path. Here’s a common example (choose an approach that matches your environment):
# Example (Docker): Selenium standalone Chrome
docker run -d --rm -p 4444:4444 --shm-size="2g" selenium/standalone-chrome
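Before writing any crawler code, it can help to confirm Selenium is actually reachable. This small PHP check is a sketch; the /status path is standard for Selenium 4, while older setups may expose /wd/hub/status instead.

<?php
// Ask Selenium's status endpoint whether the server is up and ready.
$response = @file_get_contents('http://127.0.0.1:4444/status');

if ($response === false) {
    exit("Selenium is not reachable on port 4444\n");
}

$status = json_decode($response, true);
echo ($status['value']['ready'] ?? false) ? "Selenium is ready\n" : "Selenium is up but not ready yet\n";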
Step 3 — Create the PHP script
Create a file
Create script.php in your project folder.
Add imports + autoload
Include Composer’s autoloader so the webdriver classes resolve.
Set Chrome options
Window size + user agent make debugging easier and reduce “it works on my machine” differences.
Create RemoteWebDriver
Connect to Selenium at http://host:4444/wd/hub (typical default).
Navigate to a URL
Use $driver->navigate()->to($url) to load the page in Chrome.
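Taken together, these sub-steps amount to the condensed sketch below. The getTitle() and getCurrentURL() calls aren’t required; they’re just a quick way to confirm the session works before adding text extraction.

<?php
use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

require __DIR__ . '/vendor/autoload.php';

$chromeOptions = new ChromeOptions();
$chromeOptions->addArguments(['--window-size=1280,1024', '--user-agent=Selenium PHP Crawler']);

$capabilities = DesiredCapabilities::chrome();
$capabilities->setCapability(ChromeOptions::CAPABILITY, $chromeOptions);

// Timeouts are in milliseconds: 10 s to connect, 60 s per request.
$driver = RemoteWebDriver::create('http://127.0.0.1:4444/wd/hub', $capabilities, 10000, 60000);

$driver->navigate()->to('https://potentpages.com/');
echo $driver->getTitle() . ' | ' . $driver->getCurrentURL() . PHP_EOL;

$driver->quit();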
Step 4 — Extract visible text (including JavaScript-rendered content)
Once the page is loaded, Selenium can run JavaScript in the browser context. We’ll use a small helper that selects the body and returns the browser’s “visible text” representation.
Create getPageText.js
function getVisibleText(element) {
    window.getSelection().removeAllRanges();
    const range = document.createRange();
    range.selectNode(element);
    window.getSelection().addRange(range);
    const visibleText = window.getSelection().toString().trim();
    window.getSelection().removeAllRanges();
    return visibleText;
}
return getVisibleText(document.body);
Run it from PHP
$script = file_get_contents(__DIR__ . '/getPageText.js');
$text = $driver->executeScript($script);
echo $text . PHP_EOL;
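If you’d rather skip the separate JS file, php-webdriver can also return an element’s visible text directly via getText(); whitespace handling differs slightly from the selection-based helper, but for many pages the result is equivalent. A sketch, assuming the $driver from the quickstart:

use Facebook\WebDriver\WebDriverBy;

// Let WebDriver compute the <body> element's visible text.
$bodyText = $driver->findElement(WebDriverBy::tagName('body'))->getText();
echo $bodyText . PHP_EOL;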
Step 5 — Run the crawler
From your project folder:
php script.php
You should see the page’s visible text printed in your terminal.
Troubleshooting
- “Could not connect to Selenium”: confirm Selenium is reachable on port 4444 and that your host URL is correct.
- Blank/partial text: the page may still be rendering; add an explicit wait or a short delay before executing JS (see the wait sketch after this list).
- Docker Chrome crashes: increase --shm-size or add --disable-dev-shm-usage.
- Hanging sessions: ensure $driver->quit() runs in a finally block (as in the quickstart).
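For reference, here’s a minimal explicit-wait sketch using php-webdriver’s built-in wait helpers. The #content selector is only a placeholder; wait on whatever element signals that your target page has finished rendering. It assumes $driver is the RemoteWebDriver instance from the quickstart.

<?php
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

// Block for up to 10 seconds, polling every 250 ms, until the element exists.
$driver->wait(10, 250)->until(
    WebDriverExpectedCondition::presenceOfElementLocated(
        WebDriverBy::cssSelector('#content')
    )
);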
Next steps (turn this into a real crawler)
This tutorial is a “first win.” Production crawlers usually add:
- Explicit waits (wait for specific DOM elements instead of sleeping)
- Retries + error capture (screenshots/HTML dumps on failure; see the sketch after this list)
- Extraction (parse specific fields instead of full text)
- Scheduling (daily/weekly runs) + monitoring/alerts
- Compliance + safety (robots, rate limits, polite behavior)
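As an example of the error-capture point above, a minimal sketch (assuming $driver and $script from the quickstart; the failure-* filenames are arbitrary) might save a screenshot and an HTML dump whenever the extraction step throws:

<?php
// On failure, capture a screenshot and the current DOM, then re-throw.
try {
    $text = $driver->executeScript($script);
} catch (\Throwable $e) {
    $stamp = date('Ymd-His');
    $driver->takeScreenshot(__DIR__ . "/failure-{$stamp}.png"); // saves a PNG
    file_put_contents(__DIR__ . "/failure-{$stamp}.html", $driver->getPageSource());
    throw $e;
}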
Need a maintained Selenium pipeline?
If your team needs repeatable collection (finance/law/enterprise), Potent Pages builds monitored crawlers that deliver structured outputs on schedule.
FAQ: Selenium + PHP Web Crawling
Common questions about downloading pages with Selenium in PHP, especially when content is rendered with JavaScript.
Why use Selenium instead of cURL for PHP web crawling?
Use cURL when the HTML response contains the data you need. Use Selenium when the page renders content in the browser after load (client-side apps, “load more”, dynamic pricing/availability, authenticated flows).
Can Selenium extract text created by JavaScript?
Yes. Selenium drives a real browser, so it can read DOM content after JavaScript runs. In this tutorial, we run a small JavaScript snippet to return the page’s visible text.
What’s the easiest way to run Selenium for this tutorial?
For most developers, Docker + a Selenium standalone Chrome image is the quickest path for local testing. On a VPS, you’ll typically run Selenium/Grid and point your PHP script at that host.
How do I wait for the page to finish rendering?
The best approach is an explicit wait: wait until a specific element exists or until a known selector contains text. For a quick demo, a short delay can work, but explicit waits are much more reliable.
Can Potent Pages build this as a maintained data pipeline?
Yes—especially when you need scheduled runs, monitoring, alerting, and structured outputs (CSV/DB/API). That’s where “tutorial code” becomes production infrastructure.
