What robots.txt is (and what it isn’t)
A robots.txt file tells crawlers which URLs they are allowed to access on a site and is commonly used to reduce load on servers. It is not an authentication system and it’s not a reliable way to keep private content secret.
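For reference, a robots.txt file is just a plain-text list of per-crawler rules served at the site root. The directives below are purely illustrative:

User-agent: *
Disallow: /admin/
Disallow: /search
Allow: /search/help
Crawl-delay: 2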
What you’ll build
By the end of this tutorial you’ll have a simple, repeatable pattern you can use in any PHP crawler:
- Download /robots.txt for a host
- Parse the file using a maintained PHP robots.txt parser library
- Set your crawler’s User-agent and check each URL path before downloading it
- Throttle requests so your crawler behaves politely
Prerequisites
- PHP with cURL enabled
- Composer installed
- An existing crawler loop (or a basic downloader + link queue)
If you’re following the Potent Pages PHP crawler series, this tutorial is designed to drop into your existing “download + parse + queue links” workflow.
Step 1 — Install the robots.txt parser
We’ll use the t1gor/robots-txt-parser Composer package, which supports common directives such as User-agent, Allow, Disallow, and Crawl-delay.
composer require t1gor/robots-txt-parser
Then load Composer’s autoloader near the top of your script:
<?php
require __DIR__ . '/vendor/autoload.php';
Step 2 — Use a real, descriptive User-agent
A polite crawler identifies itself clearly. Use a name + version, and ideally include a URL or email so site owners can contact you if needed. This is simple, but it reduces confusion and makes it easier to resolve issues quickly.
// Crawler identity (be specific + include a contact URL if possible)
$user_agent = "YourCrawlerName/1.0 (+https://potentpages.com/contact/)";
Step 3 — Download robots.txt safely
robots.txt is typically located at https://example.com/robots.txt. Your crawler should:
- Fetch it once per host
- Handle 404 (no robots.txt) without crashing
- Store the HTTP status code along with the file body
- Cache it (don’t re-download it on every URL; a caching sketch follows the download helper below)
function http_get($url, $referer = "", $user_agent = ""){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 4);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    if($referer !== ""){
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    $raw = curl_exec($ch);
    if($raw === false){
        $err = curl_error($ch);
        curl_close($ch);
        return ["status" => 0, "headers_raw" => "", "body" => "", "error" => $err];
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);

    // CURLOPT_HEADER is enabled, so split the raw response into headers and body
    $header_text = substr($raw, 0, $header_size);
    $body = substr($raw, $header_size);

    return ["status" => $status, "headers_raw" => $header_text, "body" => $body, "error" => ""];
}
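To cover the "cache it" point above, the simplest approach is an in-memory map keyed by host so robots.txt is downloaded at most once per host per run. The helper name fetch_robots_cached below is hypothetical (it is not part of any library), and a production crawler would persist the cache with an expiry instead of keeping it in an array:

// Hypothetical per-host cache: download robots.txt at most once per host per run.
// In production, persist this (database or file) with an expiration time.
function fetch_robots_cached($base, $user_agent, &$cache){
    if(!isset($cache[$base])){
        $cache[$base] = http_get($base . "/robots.txt", "", $user_agent);
    }
    return $cache[$base]; // same ["status" => ..., "body" => ...] shape as http_get()
}

// Usage:
$robots_cache = [];
$robots_res = fetch_robots_cached("https://example.com", $user_agent, $robots_cache);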
Step 4 — Parse robots.txt and validate your seed URL
Initialize the parser with the robots.txt contents, provide the HTTP status code, and set your crawler’s User-agent. Then validate your seed path before you start crawling.
use t1gor\RobotsTxtParser\RobotsTxtParser;
// Example seed
$seed_url = "https://example.com/some/path";
$seed_parts = parse_url($seed_url);
$base = $seed_parts['scheme'] . "://" . $seed_parts['host'];
$seed_path = $seed_parts['path'] ?? "/";
// Download robots.txt for this host
$robots_url = $base . "/robots.txt";
$robots_res = http_get($robots_url, "", $user_agent);
// Create parser (even if body is empty)
$parser = new RobotsTxtParser($robots_res['body']);
$parser->setHttpStatusCode((int)$robots_res['status']);
$parser->setUserAgent($user_agent);
// Stop early if your seed path is disallowed
if($parser->isDisallowed($seed_path)){
    die("Robots.txt: seed URL path is disallowed for this User-agent\n");
}
Step 5 — Check every URL before downloading it
The rule is simple: never request a page you already know is disallowed. In your crawl loop, check the URL path (not the full URL string) and skip disallowed pages.
// Example crawl loop sketch:
// $queue contains absolute URLs OR paths (depending on your crawler design)
while(!empty($queue)){
    $url = array_shift($queue);

    // Normalize to a path for robots.txt checks
    $parts = parse_url($url);
    $path = $parts['path'] ?? "/";

    // Polite behavior: skip disallowed paths
    if($parser->isDisallowed($path)){
        continue;
    }

    // If allowed, download the page
    $res = http_get($url, $base, $user_agent);

    // TODO: parse page, extract links, push new URLs into $queue
    // TODO: store results in DB, etc.

    // Throttle requests (see next step)
    sleep(1);
}
Tip: normalize paths before checking them (for example, strip URL fragments like #...) and be consistent about trailing slashes so your crawler’s behavior is predictable.
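A minimal normalization sketch is below. The helper name normalize_path is hypothetical; it keeps paths exactly as given and maps a bare host to "/", which is just one reasonable convention. The important thing is to pick one convention and apply it everywhere:

// Hypothetical helper: turn a URL into the path string used for robots.txt checks.
function normalize_path($url){
    $parts = parse_url($url);           // parse_url() already separates the #fragment
    $path = $parts['path'] ?? "/";
    if($path === ""){
        $path = "/";                    // the bare host always maps to "/"
    }
    return $path;
}

// Usage:
// normalize_path("https://example.com/blog/")   -> "/blog/"
// normalize_path("https://example.com#top")     -> "/"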
Step 6 — Throttle responsibly (crawl-delay + rate limiting)
Many sites publish crawl-rate guidance to reduce load. Not all directives are universally standardized, but you should still:
- Apply a conservative default delay (ex: 1–2 seconds per request)
- If the robots file includes a crawl-delay directive and your parser supports it, honor it
- Back off on errors (429/503), and avoid aggressive retry loops (a small backoff sketch follows the delay example below)
// Example: use a conservative default delay (seconds)
$default_delay = 1;
// If your workflow extracts a crawl delay from the parser,
// apply max(parserDelay, defaultDelay). (Method names can differ by library/version.)
$delay = $default_delay;
// Pseudocode (keep your implementation simple and test it):
// $delay_from_robots = $parser->getCrawlDelay(); // if supported
// if(is_numeric($delay_from_robots)) $delay = max($default_delay, (int)$delay_from_robots);
sleep($delay);
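For the error-handling point above, one simple pattern is to grow the delay after each 429 or 503 response and reset it after a success. The sketch below is illustrative only; the doubling factor, the 60-second cap, and the re-queue behavior are arbitrary choices you should tune:

// Minimal backoff sketch: double the wait after each 429/503, reset on success.
// (The robots.txt check is omitted here to keep the sketch focused on backoff.)
$delay = $default_delay;              // set once, before the crawl loop

while(!empty($queue)){
    $url = array_shift($queue);
    $res = http_get($url, $base, $user_agent);

    if($res['status'] === 429 || $res['status'] === 503){
        $delay = min($delay * 2, 60); // grow the wait, capped at 60 seconds
        $queue[] = $url;              // optionally retry later; cap retries per URL in practice
    } else {
        $delay = $default_delay;      // back to the normal delay after a success
        // ...handle the response, extract links, etc....
    }

    sleep($delay);
}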
A cleaner “final script” you can adapt
Below is a compact example that pulls the pieces together. In a production crawler you’d also: cache robots.txt per host, persist a queue, and add monitoring/logging.
<?php
require __DIR__ . '/vendor/autoload.php';
use t1gor\RobotsTxtParser\RobotsTxtParser;
$user_agent = "YourCrawlerName/1.0 (+https://potentpages.com/contact/)";
$default_delay = 1;
function http_get($url, $referer = "", $user_agent = ""){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 4);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    if($referer !== "") curl_setopt($ch, CURLOPT_REFERER, $referer);

    $raw = curl_exec($ch);
    if($raw === false){
        $err = curl_error($ch);
        curl_close($ch);
        return ["status" => 0, "body" => "", "error" => $err];
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);
    $body = substr($raw, $header_size);

    return ["status" => $status, "body" => $body, "error" => ""];
}
// Seed
$seed_url = "https://example.com/";
$seed_parts = parse_url($seed_url);
$base = $seed_parts['scheme'] . "://" . $seed_parts['host'];
// robots.txt
$robots = http_get($base . "/robots.txt", "", $user_agent);
$parser = new RobotsTxtParser($robots['body']);
$parser->setHttpStatusCode((int)$robots['status']);
$parser->setUserAgent($user_agent);
// Simple queue (replace with DB-backed queue in production)
$queue = [$seed_url];
$seen = [];
while(!empty($queue)){
    $url = array_shift($queue);
    if(isset($seen[$url])) continue;
    $seen[$url] = true;

    $parts = parse_url($url);
    $path = $parts['path'] ?? "/";
    if($parser->isDisallowed($path)){
        continue;
    }

    $res = http_get($url, $base, $user_agent);
    if($res['status'] === 429 || $res['status'] === 503){
        // Basic backoff
        sleep(max(5, $default_delay));
        continue;
    }

    // TODO: Extract links and add to $queue (stay on-host, normalize URLs, etc.)
    // TODO: Store content/metadata

    sleep($default_delay);
}
echo "Done\\n";
FAQ: robots.txt, crawl politeness, and PHP crawlers
These are common questions teams ask when they’re moving from a quick script to a crawler that runs reliably and responsibly.
Does robots.txt give me “permission” to scrape?
robots.txt provides crawler access guidance (which paths a crawler should and shouldn’t request) and is often used to reduce server load. It is not a security mechanism, and it does not replace legal review or a site’s terms of service where applicable.
What if a website has no robots.txt (404)?
A missing robots.txt is common. Your crawler should handle it gracefully and still use conservative rate limiting. Treat “missing file” differently from “explicit disallow rules.”
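A minimal sketch of that distinction, reusing the fetch helper and parser from the tutorial (exact behavior on 4xx vs. 5xx responses can vary by parser version, so treat this as illustrative):

$robots = http_get($base . "/robots.txt", "", $user_agent);

if($robots['status'] === 404){
    // No robots.txt: nothing is explicitly disallowed, but keep throttling conservatively.
    $parser = new RobotsTxtParser("");
} else {
    $parser = new RobotsTxtParser($robots['body']);
}
$parser->setHttpStatusCode((int)$robots['status']);
$parser->setUserAgent($user_agent);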
Should I check robots.txt on every request?
No. Download it once per host and cache it for a reasonable period. Re-check periodically (or when a crawl runs long), but don’t fetch robots.txt for every page.
What’s the difference between Allow and Disallow?
Disallow blocks paths for a given User-agent, while Allow carves out exceptions; rules are evaluated per User-agent, so a file can disallow a broad path and still allow specific sub-paths under it.
Using a parser library helps you avoid implementing edge cases incorrectly.
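For example (the paths are made up, and the expected results assume the parser gives the more specific Allow rule precedence, as most modern crawlers do):

$rules = "User-agent: *\nDisallow: /private/\nAllow: /private/annual-report.pdf\n";

$parser = new RobotsTxtParser($rules);
$parser->setUserAgent("YourCrawlerName/1.0");

var_dump($parser->isDisallowed("/private/internal.html"));     // expect: true
var_dump($parser->isDisallowed("/private/annual-report.pdf")); // expect: false (Allow exception)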
When should I move beyond a tutorial crawler?
If your crawler needs to run continuously, hit many pages, or feed business-critical decisions, you’ll want: caching, monitoring, alerting, backoff logic, queue persistence, and change/breakage handling.
Need a compliant crawler pipeline (not just a script)?
We build, run, and maintain web crawlers with monitoring, alerts, and structured delivery—so your team can focus on outcomes.
