POLITE PHP WEB CRAWLER
How to Check robots.txt Before Downloading Pages

If you’re building a web crawler (or web scraping tool) in PHP, the first step toward doing it responsibly is checking robots.txt and honoring crawler rules like User-agent, Allow/Disallow, and crawl-rate guidance. This tutorial shows a clean, production-friendly pattern you can drop into your crawler.

  • Check robots.txt per host
  • Identify your crawler with a real UA
  • Skip disallowed URLs automatically
  • Throttle requests responsibly

What robots.txt is (and what it isn’t)

A robots.txt file tells crawlers which URLs they are allowed to access on a site and is commonly used to reduce load on servers. It is not an authentication system and it’s not a reliable way to keep private content secret.

Important: robots.txt is primarily about crawler behavior and server load; it is not a security boundary. If a site wants to restrict access, it needs authentication or other controls.
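
For reference, robots.txt is just a short plain-text file of per-crawler rules. A typical file looks something like this (the paths and values here are illustrative):

User-agent: *
Disallow: /admin/
Disallow: /search
Crawl-delay: 2

User-agent: ExampleBot
Disallow: /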

What you’ll build

By the end of this tutorial you’ll have a simple, repeatable pattern you can use in any PHP crawler:

  • Download /robots.txt for a host
  • Parse the file using a maintained PHP robots.txt parser library
  • Set your crawler User-agent and check each URL path before downloading it
  • Throttle requests so your crawler behaves politely

Prerequisites

  • PHP with cURL enabled
  • Composer installed
  • An existing crawler loop (or a basic downloader + link queue)

If you’re following the Potent Pages PHP crawler series, this tutorial is designed to drop into your existing “download + parse + queue links” workflow.
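
Not sure whether cURL is enabled? A quick PHP sanity check (this snippet is just a preflight check, not part of the crawler itself):

<?php
// Confirm the cURL extension is loaded before running the crawler.
if(!extension_loaded('curl')){
    die("The cURL extension is not enabled in this PHP installation.\n");
}
echo "cURL is available.\n";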

Step 1 — Install the robots.txt parser

We’ll use the t1gor/robots-txt-parser Composer package, which supports common directives such as User-agent, Allow, Disallow, and Crawl-delay.

composer require t1gor/robots-txt-parser

Then load Composer’s autoloader near the top of your script:

<?php
require __DIR__ . '/vendor/autoload.php';

Step 2 — Use a real, descriptive User-agent

A polite crawler identifies itself clearly. Use a name + version, and ideally include a URL or email so site owners can contact you if needed. This is simple, but it reduces confusion and makes it easier to resolve issues quickly.

// Crawler identity (be specific + include a contact URL if possible)
$user_agent = "YourCrawlerName/1.0 (+https://potentpages.com/contact/)";
Tip: Keep the user agent consistent across robots.txt checks and page downloads. Otherwise you may check permissions for one UA and crawl as another.

Step 3 — Download robots.txt safely

robots.txt is typically located at https://example.com/robots.txt. Your crawler should:

  • Fetch it once per host
  • Handle 404 (no robots.txt) without crashing
  • Store the HTTP status code along with the file body
  • Cache it per host (don’t re-download it for every URL; see the caching sketch after the function below)

function http_get($url, $referer = "", $user_agent = ""){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 4);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

    // Avoid hanging indefinitely on slow or unresponsive hosts
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);

    if($referer !== ""){
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    $raw = curl_exec($ch);
    if($raw === false){
        $err = curl_error($ch);
        curl_close($ch);
        return ["status" => 0, "headers" => [], "body" => "", "error" => $err];
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);

    $header_text = substr($raw, 0, $header_size);
    $body = substr($raw, $header_size);

    return ["status" => $status, "headers_raw" => $header_text, "body" => $body, "error" => ""];
}
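
Since robots.txt only needs to be fetched once per host, a small in-memory cache on top of http_get() keeps repeat lookups cheap. This is a minimal sketch under that assumption; get_robots_txt() and $robots_cache are illustrative names, not part of any library:

// Illustrative per-host cache: fetch robots.txt at most once per host per run.
$robots_cache = [];

function get_robots_txt($base, $user_agent, &$robots_cache){
    // $base is "scheme://host", e.g. "https://example.com"
    if(!isset($robots_cache[$base])){
        $robots_cache[$base] = http_get($base . "/robots.txt", "", $user_agent);
    }

    // Same shape as http_get(): ["status" => ..., "body" => ..., ...]
    return $robots_cache[$base];
}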

Step 4 — Parse robots.txt and validate your seed URL

Initialize the parser with the robots.txt contents, provide the HTTP status code, and set your crawler’s User-agent. Then validate your seed path before you start crawling.

use t1gor\RobotsTxtParser\RobotsTxtParser;

// Example seed
$seed_url = "https://example.com/some/path";
$seed_parts = parse_url($seed_url);
$base = $seed_parts['scheme'] . "://" . $seed_parts['host'];
$seed_path = $seed_parts['path'] ?? "/";

// Download robots.txt for this host
$robots_url = $base . "/robots.txt";
$robots_res = http_get($robots_url, "", $user_agent);

// Create parser (even if body is empty)
$parser = new RobotsTxtParser($robots_res['body']);
$parser->setHttpStatusCode((int)$robots_res['status']);
$parser->setUserAgent($user_agent);

// Stop early if your seed path is disallowed
if($parser->isDisallowed($seed_path)){
    die("Robots.txt: seed URL path is disallowed for this User-agent\\n");
}
Why do we pass the HTTP status? Some robots rules depend on whether the file was successfully retrieved. Your crawler should treat “missing robots.txt” differently than “robots.txt explicitly disallows a path.”
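
One way to make that distinction explicit is to branch on the status code you stored in Step 3. The policy below is an illustrative sketch (the thresholds and messages are assumptions, not parser behavior):

// Illustrative policy based on the robots.txt fetch status:
$robots_status = (int)$robots_res['status'];

if($robots_status === 0 || $robots_status >= 500){
    // Network failure or server error: we don't actually know the rules,
    // so a cautious crawler stops here (or retries later) instead of assuming "allow all".
    die("Could not retrieve robots.txt reliably; try again later.\n");
}

if($robots_status === 404){
    // No robots.txt published: there are no explicit rules to honor,
    // but keep your conservative default delay in place.
}

// 2xx: $parser was initialized above with the real file body, so isDisallowed()
// reflects the site's published rules.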

Step 5 — Check every URL before downloading it

The rule is simple: never request a page you already know is disallowed. In your crawl loop, check the URL path (not the full URL string) and skip disallowed pages.

// Example crawl loop sketch:
// $queue contains absolute URLs OR paths (depending on your crawler design)

while(!empty($queue)){
    $url = array_shift($queue);

    // Normalize to a path for robots.txt checks
    $parts = parse_url($url);
    $path = $parts['path'] ?? "/";

    // Polite behavior: skip disallowed paths
    if($parser->isDisallowed($path)){
        continue;
    }

    // If allowed, download the page
    $res = http_get($url, $base, $user_agent);

    // TODO: parse page, extract links, push new URLs into $queue
    // TODO: store results in DB, etc.

    // Throttle requests (see next step)
    sleep(1);
}
Path hygiene: Remove fragments (#...) and be consistent about trailing slashes so your crawler’s behavior is predictable.
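
If you want that hygiene in one place, a small helper keeps the robots.txt check consistent (normalize_path_for_robots() is an illustrative name, not a library function):

// Illustrative helper: reduce a queued URL to the path used for robots.txt checks.
function normalize_path_for_robots($url){
    $parts = parse_url($url);

    // parse_url() separates the fragment (#...) out, so it never reaches the path.
    $path = (isset($parts['path']) && $parts['path'] !== "") ? $parts['path'] : "/";

    // Optional: some Disallow rules also match query strings, so keep the query if present.
    if(isset($parts['query']) && $parts['query'] !== ""){
        $path .= "?" . $parts['query'];
    }

    return $path;
}

// Usage in the crawl loop:
// if($parser->isDisallowed(normalize_path_for_robots($url))) continue;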

Step 6 — Throttle responsibly (crawl-delay + rate limiting)

Many sites publish crawl-rate guidance to reduce load. Not all directives are universally standardized, but you should still:

  • Apply a conservative default delay (ex: 1–2 seconds per request)
  • If the robots file includes a crawl-delay directive and your parser supports it, honor it
  • Back off on errors (429/503) and avoid aggressive retry loops (see the backoff sketch below)

// Example: use a conservative default delay (seconds)
$default_delay = 1;

// If your workflow extracts a crawl delay from the parser,
// apply max(parserDelay, defaultDelay). (Method names can differ by library/version.)
$delay = $default_delay;

// Pseudocode (keep your implementation simple and test it):
// $delay_from_robots = $parser->getCrawlDelay(); // if supported
// if(is_numeric($delay_from_robots)) $delay = max($default_delay, (int)$delay_from_robots);

sleep($delay);
Operational reality: Politeness is not only about “not getting blocked.” It’s about building a crawler that can run for months without causing incidents.
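
If you track consecutive 429/503 responses, a simple exponential backoff keeps retries gentle. The sketch below is illustrative (backoff_delay() and the 60-second cap are assumptions, not a standard):

// Illustrative exponential backoff: double the wait after each consecutive
// 429/503 response, capped at 60 seconds. Reset the counter after any success.
function backoff_delay($default_delay, $consecutive_errors){
    if($consecutive_errors <= 0){
        return $default_delay;
    }
    return min(60, $default_delay * (2 ** $consecutive_errors));
}

// Example: with a 1-second default and 3 consecutive errors, wait 8 seconds.
// sleep(backoff_delay(1, 3));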

A cleaner “final script” you can adapt

Below is a compact example that pulls the pieces together. In a production crawler you’d also: cache robots.txt per host, persist a queue, and add monitoring/logging.

<?php
require __DIR__ . '/vendor/autoload.php';

use t1gor\RobotsTxtParser\RobotsTxtParser;

$user_agent = "YourCrawlerName/1.0 (+https://potentpages.com/contact/)";
$default_delay = 1;

function http_get($url, $referer = "", $user_agent = ""){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 4);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

    // Avoid hanging indefinitely on slow or unresponsive hosts
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    if($referer !== "") curl_setopt($ch, CURLOPT_REFERER, $referer);

    $raw = curl_exec($ch);
    if($raw === false){
        $err = curl_error($ch);
        curl_close($ch);
        return ["status" => 0, "body" => "", "error" => $err];
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);

    $body = substr($raw, $header_size);
    return ["status" => $status, "body" => $body, "error" => ""];
}

// Seed
$seed_url = "https://example.com/";
$seed_parts = parse_url($seed_url);
$base = $seed_parts['scheme'] . "://" . $seed_parts['host'];

// robots.txt
$robots = http_get($base . "/robots.txt", "", $user_agent);
$parser = new RobotsTxtParser($robots['body']);
$parser->setHttpStatusCode((int)$robots['status']);
$parser->setUserAgent($user_agent);

// Simple queue (replace with DB-backed queue in production)
$queue = [$seed_url];
$seen = [];

while(!empty($queue)){
    $url = array_shift($queue);
    if(isset($seen[$url])) continue;
    $seen[$url] = true;

    $parts = parse_url($url);
    $path = $parts['path'] ?? "/";

    if($parser->isDisallowed($path)){
        continue;
    }

    $res = http_get($url, $base, $user_agent);
    if($res['status'] === 429 || $res['status'] === 503){
        // Basic backoff: slow down after a rate-limit / unavailable response.
        // (A production crawler would re-queue the URL and retry with a longer delay.)
        sleep(max(5, $default_delay));
        continue;
    }

    // TODO: Extract links and add to $queue (stay on-host, normalize URLs, etc.)
    // TODO: Store content/metadata

    sleep($default_delay);
}

echo "Done\\n";
Next steps: Add URL normalization, host restrictions, caching robots.txt per host, and monitoring/alerts for breakage.
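
For the link-extraction TODO above, one starting point is PHP's built-in DOMDocument. This is a rough sketch under a few assumptions (extract_links() is an illustrative name; it keeps only root-relative and absolute same-host links and skips everything else):

// Illustrative helper: pull same-host links out of an HTML body for the queue.
function extract_links($html, $base){
    $links = [];
    if(trim($html) === ""){
        return $links;
    }

    $doc = new DOMDocument();
    // Suppress warnings from imperfect real-world HTML.
    @$doc->loadHTML($html);

    $base_host = parse_url($base, PHP_URL_HOST);

    foreach($doc->getElementsByTagName('a') as $a){
        $href = trim($a->getAttribute('href'));
        if($href === "" || $href[0] === "#"){
            continue;
        }

        if($href[0] === "/" && substr($href, 0, 2) !== "//"){
            // Root-relative link: resolve against the base
            $links[] = rtrim($base, "/") . $href;
        } elseif(parse_url($href, PHP_URL_HOST) === $base_host){
            // Absolute link on the same host
            $links[] = $href;
        }
        // Relative paths, other hosts, mailto:, javascript:, etc. are skipped here.
    }

    return array_values(array_unique($links));
}

// Usage in the loop, after a successful download:
// foreach(extract_links($res['body'], $base) as $link){ $queue[] = $link; }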

FAQ: robots.txt, crawl politeness, and PHP crawlers

These are common questions teams ask when they’re moving from a quick script to a crawler that runs reliably and responsibly.

Does robots.txt give me “permission” to scrape?

robots.txt provides crawler access guidance (what paths a crawler should access) and is often used to reduce server load. It is not a security mechanism, and it does not replace legal review or site terms where applicable.

What if a website has no robots.txt (404)?

A missing robots.txt is common. Your crawler should handle it gracefully and still use conservative rate limiting. Treat “missing file” differently than “explicit disallow rules.”

Should I check robots.txt on every request?

No. Download it once per host and cache it for a reasonable period. Re-check periodically (or when a crawl runs long), but don’t fetch robots.txt for every page.

What’s the difference between Allow and Disallow?

robots rules are evaluated per User-agent. A file can disallow broad paths and allow specific exceptions. Using a parser library helps you avoid implementing edge cases incorrectly.
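
As an illustration (the rules and paths below are made up, and exact precedence handling can vary by parser version, so test against your installed version):

<?php
require __DIR__ . '/vendor/autoload.php';

use t1gor\RobotsTxtParser\RobotsTxtParser;

// Illustrative rules: a broad Disallow with a narrow Allow exception.
$rules = "User-agent: *\nDisallow: /private/\nAllow: /private/press-kit/\n";

$parser = new RobotsTxtParser($rules);
$parser->setUserAgent("YourCrawlerName/1.0");

var_dump($parser->isDisallowed("/private/accounts/"));  // expected: true (matches the broad Disallow)
var_dump($parser->isDisallowed("/private/press-kit/")); // expected: false (the more specific Allow wins)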

When should I move beyond a tutorial crawler?

If your crawler needs to run continuously, hit many pages, or feed business-critical decisions, you’ll want: caching, monitoring, alerting, backoff logic, queue persistence, and change/breakage handling.

Need a compliant crawler pipeline (not just a script)?

We build, run, and maintain web crawlers with monitoring, alerts, and structured delivery—so your team can focus on outcomes.

David Selden-Treiman, Director of Operations at Potent Pages.

David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has solved problems with code for dozens of clients, and he manages and optimizes dozens of servers for both Potent Pages and other clients.
