
EXTRACT DATA FROM AJAX PAGES
The Practical Methods That Work on JavaScript-Heavy Sites

If your scraper “works” but returns empty HTML, you’re probably hitting a JavaScript-rendered page. This guide breaks down the best methods to extract data from AJAX pages — from finding hidden JSON endpoints to rendering with headless browsers — and how to choose the most reliable approach for your project.

  • Fastest: call the underlying XHR/JSON
  • Most robust: render with Playwright/Selenium
  • At scale: browser pools + monitoring
  • Hard mode: login + sessions + paging

TL;DR: Can web crawlers extract data from JavaScript / AJAX pages?

Yes — but the “best method” depends on how the page loads data. Many AJAX-driven pages fetch content via background requests (XHR/fetch) and then render it in the browser. If your crawler only downloads HTML, it may never see the real data.

Rule of thumb: If you can capture a stable JSON endpoint, do that first (it’s faster and cheaper). If the page requires complex UI behavior, render it with a headless browser (Playwright/Selenium).

What “AJAX pages” actually means (and why scrapers fail)

“AJAX pages” are usually just pages where the visible content is loaded after the initial HTML response. Instead of shipping everything in server-rendered HTML, the page makes background requests for data — often JSON — and then JavaScript inserts that data into the DOM.

  • Client-side rendering (CSR): HTML is a shell; content appears after JS runs.
  • XHR/fetch requests: API-style calls return JSON used to build the page.
  • GraphQL: a single endpoint serves many queries.
  • Infinite scroll: paging happens via background requests as the user scrolls.
  • WebSockets: live updates stream data (less common, but important for certain sites).

Limitations of traditional web crawlers on JavaScript-heavy sites

Traditional crawlers download a URL and parse the returned HTML. That works well for server-rendered pages, but it fails when the content is generated in the browser.

  • Empty HTML: your parser can’t find elements because they don’t exist yet.
  • Missing pagination: “Next page” is often an API call, not a link.
  • Stateful UI: filters and sorting can be client-side and require interaction.
  • Authentication: data may require sessions, tokens, or per-request signatures.

Good news: the data usually exists somewhere — either as a network response (JSON) or as a rendered DOM. The job is choosing the extraction layer that’s most stable.

The best methods to extract data from AJAX pages

Most AJAX extraction strategies fall into three buckets. The best choice depends on the site’s complexity, how often you need data, and how sensitive the workflow is to breakage.

1) Call the underlying JSON/XHR
  • Best for: Catalogs, listings, search results, paging APIs
  • Pros: Fast, low compute, easy to scale, stable output
  • Tradeoffs: May require auth/tokens; endpoints can change

2) Render with a headless browser
  • Best for: Complex UI, client-only rendering, heavy JS frameworks
  • Pros: High fidelity; handles interactions and dynamic DOM
  • Tradeoffs: Higher cost; needs ops for scaling and reliability

3) Use a rendering service / browser pool
  • Best for: High-volume crawling of JS sites, production pipelines
  • Pros: Centralized rendering; consistent environments; parallelization
  • Tradeoffs: More infrastructure; requires monitoring and tuning

Most teams get the best ROI by starting with network/JSON extraction and only using full rendering when the site truly requires it.

How to choose the right approach (a practical workflow)

If you want to scrape a JavaScript website reliably, treat it like a debugging process. You’re trying to locate the “truth layer” where data is easiest to capture consistently.

1. Inspect the Network tab (find the data source)

Open DevTools → Network → filter by XHR/fetch. Reload the page and look for JSON responses that contain the values you need (prices, names, IDs, pagination cursors).
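
If you spot a promising JSON response, try replicating that request outside the browser before building anything else. Here is a minimal sketch in Python using requests; the endpoint, parameters, and headers are hypothetical placeholders, so substitute the ones you copied from DevTools.

```python
# Minimal sketch: replicate an XHR request found in the Network tab.
# The URL, parameters, and headers below are hypothetical placeholders;
# copy the real ones from the request you identified in DevTools.
import requests

resp = requests.get(
    "https://example.com/api/search",       # hypothetical endpoint
    params={"q": "widgets", "page": 1},     # parameters observed in DevTools
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
print(list(data.keys()))  # inspect the structure before writing extraction code
```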

2. Validate pagination and parameters

Change a filter, sort option, or page number and watch what requests change. Stable parameters (page, offset, cursor, category ID) are ideal for production extraction.
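
Once you know which parameter drives paging, you can iterate it directly. A sketch under the assumption of a simple page-number parameter (the endpoint and field names are hypothetical):

```python
# Sketch: walk a page-number parameter until the API returns no more items.
# Endpoint and parameter names are hypothetical; use the ones you confirmed
# by changing pages/filters in the browser and watching the Network tab.
import requests

items = []
page = 1
while True:
    resp = requests.get(
        "https://example.com/api/search",
        params={"q": "widgets", "page": page},
        timeout=30,
    )
    resp.raise_for_status()
    batch = resp.json().get("results", [])
    if not batch:
        break  # an empty page usually means the end of the result set
    items.extend(batch)
    page += 1

print(f"collected {len(items)} items across {page - 1} pages")
```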

3. Check authentication and session requirements

If requests rely on cookies, bearer tokens, or per-request headers, you’ll need a session strategy: login automation, token refresh, and safe storage of credentials.
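
If the site uses a cookie-based login, a persistent HTTP session is often enough. The sketch below assumes a simple form login that sets a session cookie; the URLs and form field names are hypothetical, and token-based APIs will need their own refresh logic.

```python
# Sketch: cookie-based session handling, assuming a simple form login.
# URLs and form field names are hypothetical; adapt them to the real site.
import os
import requests

session = requests.Session()

# Log in once; the Session object stores cookies for subsequent requests.
login = session.post(
    "https://example.com/login",
    data={
        "username": os.environ["CRAWLER_USER"],  # keep credentials out of code
        "password": os.environ["CRAWLER_PASS"],
    },
    timeout=30,
)
login.raise_for_status()

# Later API calls reuse the authenticated cookies automatically.
resp = session.get("https://example.com/api/account/orders", timeout=30)
resp.raise_for_status()
orders = resp.json()
```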

4. Only render the page if you must

If there’s no usable endpoint — or content is computed purely in the browser — use a headless browser (Playwright, Selenium, Puppeteer) to render and extract from the final DOM.

5. Engineer for durability: monitoring + change handling

Production crawlers need more than extraction. Add retries, rate limiting, schema checks, alerts, and repair workflows so site changes don’t silently break your data pipeline.
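
As a rough illustration, retries and a schema check can be only a few lines. This is a minimal sketch rather than a full pipeline, and the required fields are hypothetical.

```python
# Sketch: retries with exponential backoff plus a basic schema check, so a
# silent site change surfaces as an error instead of polluting downstream data.
import time
import requests

REQUIRED_FIELDS = {"id", "name", "price"}  # hypothetical expected fields

def fetch_with_retries(url, params, attempts=3):
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, params=params, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == attempts:
                raise  # give up and let monitoring/alerting take over
            time.sleep(2 ** attempt)  # simple exponential backoff

def validate(records):
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"schema drift: missing fields {missing}")
    return records
```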

Method 1: Render JavaScript with a headless browser

Rendering means launching a real browser engine (Chromium/Firefox/WebKit), letting JavaScript run, and extracting data from the final DOM. This is the most direct way to extract data from AJAX pages when content only exists after client-side rendering.

  • Playwright: excellent default for modern JS pages; strong reliability features.
  • Selenium: mature ecosystem; common in many stacks; good when you already use it.
  • Puppeteer: popular in Node.js; great for Chromium-only workflows.

When rendering is the right call: highly interactive pages, heavy client frameworks, UI-only pagination, or content that never appears as a clean network response.
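
A minimal Playwright sketch of this approach, assuming hypothetical CSS selectors for the rendered content:

```python
# Sketch: render the page with Playwright, wait for network activity to
# settle, then extract from the final DOM. Selectors are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products", wait_until="networkidle")

    # Read values from the rendered DOM, not the initial HTML response.
    names = page.locator(".product-card .name").all_inner_texts()
    prices = page.locator(".product-card .price").all_inner_texts()
    for name, price in zip(names, prices):
        print(name, price)

    browser.close()
```

The same pattern applies in Selenium or Puppeteer; the key detail is waiting for the AJAX requests to finish before reading the DOM.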

Method 2: Extract from the underlying XHR/JSON (often the best method)

Many “AJAX pages” are actually front-ends for an API. The browser downloads JSON, then renders it. If you can request that same JSON directly, you get a faster crawler with less infrastructure.

  • Speed: no browser startup costs, no rendering time.
  • Scale: easy parallelization without running thousands of browser instances.
  • Quality: data arrives structured (JSON), reducing brittle DOM selectors.

Pro tip: If the network response includes stable IDs (product_id, company_id), store them — IDs make change tracking and deduping dramatically easier.
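
Building on that tip, here is a small sketch that pulls the JSON directly and keeps stable IDs for deduping across crawls; the endpoint and field names such as product_id are hypothetical.

```python
# Sketch: extract structured records from a JSON endpoint and dedupe on a
# stable ID. Endpoint and field names (results, product_id) are hypothetical.
import requests

seen_ids = set()  # in production this would live in a database or key-value store
records = []

resp = requests.get("https://example.com/api/products", params={"page": 1}, timeout=30)
resp.raise_for_status()

for item in resp.json().get("results", []):
    pid = item["product_id"]
    if pid in seen_ids:
        continue  # already captured on a previous crawl
    seen_ids.add(pid)
    records.append({
        "product_id": pid,
        "name": item.get("name"),
        "price": item.get("price"),
    })
```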

Method 3: Use a rendering service / browser pool for production-scale crawling

If you need to crawl lots of JavaScript pages on a schedule (daily/weekly/continuous), a single machine running a few headless browsers usually isn’t enough. Production pipelines often use containerized browsers, distributed workers, and centralized monitoring; a simplified worker-pool sketch follows the list below.

  • Browser pools: run many isolated browser workers in parallel.
  • Queue-based crawling: enqueue URLs/jobs for consistent throughput.
  • Observability: alert on failures, selector breakage, and output anomalies.
  • Replayability: keep raw snapshots to re-extract if schemas evolve.
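
As a single-machine stand-in for this pattern, a queue of URLs fanned out to a pool of workers looks roughly like the sketch below. A production deployment would replace the thread pool with distributed workers, containerized browsers, and real monitoring; the URLs here are hypothetical.

```python
# Sketch: a URL queue fanned out to a small worker pool. A thread pool stands
# in for the distributed workers / browser pool a production pipeline would use.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return url, resp.json()

urls = [f"https://example.com/api/products?page={n}" for n in range(1, 51)]

results = {}
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(fetch, url) for url in urls]
    for future in as_completed(futures):
        try:
            url, payload = future.result()
            results[url] = payload
        except requests.RequestException as exc:
            print(f"fetch failed: {exc}")  # in production: log, retry, and alert
```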

Other considerations: reliability, ethics, and long-run maintenance

Extracting data from AJAX pages is usually straightforward once the method is right. The real difficulty is keeping a crawler healthy over time as sites change.

  • Rate limiting & politeness: avoid overloading a site; crawl at a stable cadence.
  • Schema enforcement: validate outputs so silent failures don’t pollute downstream analysis.
  • Change detection: site layouts and endpoints change; monitor and repair quickly.
  • Compliance: align collection with applicable terms, permissions, and governance requirements.

Production reality: The best crawler is the one that keeps delivering clean data six months from now. Plan for monitoring, fixes, and evolution — not just “first extraction.”

FAQ: Extracting data from AJAX and JavaScript pages

These are common questions teams ask when their scrapers fail on dynamic websites or JavaScript-heavy apps.

How do you scrape AJAX content without Selenium?

The most common approach is to extract data from the underlying XHR/fetch requests. Open your browser’s Network tab, find the request that returns the data (often JSON), and replicate that request in your crawler.

Why it’s often best: JSON extraction is faster, cheaper to run, and usually less brittle than DOM scraping.

Is Playwright better than Selenium for scraping?

Playwright is a popular choice for modern JavaScript-heavy sites because it’s built for automation reliability, supports multiple browser engines, and provides strong controls for waiting on network/idleness and handling dynamic pages. Selenium remains a solid option, especially in stacks that already standardize on it.

How do I find the JSON endpoint for an AJAX page?
  1. Open DevTools → Network
  2. Filter by XHR/fetch
  3. Reload the page and click the request that returns data
  4. Check “Response” for JSON; note query params and headers
  5. Test paging/sorting to confirm which parameters drive results

How do you scrape infinite scroll pages?

Infinite scroll is usually just pagination via background requests. The best method is to capture the paging calls (offset/cursor) and iterate them directly. If that’s not possible, a headless browser can scroll and extract — but it’s typically more expensive to run.
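
For example, a cursor-based feed can be paged with a loop like this; the endpoint and the next_cursor field are hypothetical, so confirm the real names by scrolling the page with DevTools open.

```python
# Sketch: paging an infinite-scroll feed via its cursor parameter.
# Endpoint and field names (items, next_cursor) are hypothetical.
import requests

items = []
cursor = None
while True:
    params = {"limit": 50}
    if cursor:
        params["cursor"] = cursor
    resp = requests.get("https://example.com/api/feed", params=params, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    items.extend(payload.get("items", []))
    cursor = payload.get("next_cursor")
    if not cursor:
        break  # no cursor returned: end of the feed

print(f"collected {len(items)} items")
```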

Can you scrape logged-in pages or authenticated APIs?

Often yes, but it requires session management: automating login, storing cookies/tokens securely, and refreshing credentials when they expire. The right approach depends on how the site handles authentication.

Need a web crawler developed?

If you need a crawler that reliably extracts data from JavaScript / AJAX pages — and keeps working as sites evolve — Potent Pages builds and runs managed crawlers for law firms, financial firms, and enterprises.

  • Scoping: confirm feasibility (XHR vs rendering) and define the schema
  • Build: durable extraction + scheduling + retries
  • Operate: monitoring, alerts, and repairs when sites change

Next step: Tell us the site(s), the fields you need, and how often you need updates. We’ll recommend the approach and delivery format that fits your workflow.

    Contact Us








    David Selden-Treiman, Director of Operations at Potent Pages.

    David Selden-Treiman is Director of Operations and a project manager at Potent Pages. He specializes in custom web crawler development, website optimization, server management, web application development, and custom programming. Working at Potent Pages since 2012 and programming since 2003, David has solved problems with custom programming for dozens of clients, and he manages and optimizes dozens of servers for Potent Pages and other clients.
