How Scanning Works

A detailed look at the Brokenly crawl pipeline — from sitemap fetch to link status assignment.

Understanding the crawl pipeline helps you get the most out of Brokenly and explains why a given link has the status it has.

The Crawl Pipeline

1. Sitemap Fetch

Brokenly fetches your sitemap XML. If the URL is a sitemap index (a file pointing to multiple sitemaps), we fetch each sub-sitemap and merge the page list.
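
As an illustration of this step, here is a minimal Python sketch of expanding a sitemap index into one merged page list. It is not Brokenly's actual code: the function name collect_page_urls and the injected fetch callable are hypothetical, and the sketch assumes the standard sitemaps.org XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Set

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    _seen: Optional[Set[str]] = None,
) -> list:
    """Return page URLs from a sitemap, expanding sitemap indexes recursively.

    `fetch` is a hypothetical callable that returns the XML text for a URL.
    """
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen:  # guard against index files that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))

    if root.tag == NS + "sitemapindex":
        pages = []
        for loc in root.findall(NS + "sitemap/" + NS + "loc"):
            if loc.text:
                pages.extend(collect_page_urls(loc.text.strip(), fetch, seen))
        # Merge the sub-sitemaps, dropping duplicates but keeping order.
        return list(dict.fromkeys(pages))

    # A regular sitemap: one <url><loc> entry per page.
    return [
        loc.text.strip()
        for loc in root.findall(NS + "url/" + NS + "loc")
        if loc.text
    ]
```

Passing the fetcher in as a parameter keeps the parsing logic separate from networking, so the merge behaviour can be tried against in-memory XML before pointing it at a real site.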

2. Page Crawl

Each page in the sitemap is visited and its HTML is downloaded. Brokenly tracks how many pages it has crawled so far — you'll see this in the live crawl progress banner.

3. Link Extraction

Every outbound <a> tag on the page is extracted. Brokenly classifies a link as an affiliate link when it matches a known affiliate network URL pattern (Amazon Associates, ShareASale, CJ, Impact, and others) or a common affiliate redirect domain.
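
To make the idea concrete, the sketch below extracts <a> links from a page and filters them with a tiny pattern list. Everything in it is an assumption for illustration: the names LinkCollector, is_affiliate and extract_affiliate_links are hypothetical, and the three patterns shown (the amzn.to short domain, shareasale.com, and an Amazon URL carrying a tag parameter) are a small sample, not Brokenly's real rule set.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlsplit

# Illustrative sample only; a real list would be far longer.
AFFILIATE_HOSTS = {"amzn.to", "shareasale.com"}


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, href))


def is_affiliate(url):
    """Heuristic check against the sample patterns above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in AFFILIATE_HOSTS:
        return True
    # Assumption: an Amazon link with a `tag` query parameter is an Associates link.
    return host.startswith("amazon.") and "tag" in parse_qs(parts.query)


def extract_affiliate_links(page_url, html):
    """Return outbound links on the page that look like affiliate links."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    return [
        url
        for url in collector.links
        if urlsplit(url).hostname != page_host and is_affiliate(url)
    ]
```

Because this kind of extraction reads only the downloaded HTML, it never sees links that a script adds after the page loads.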

4. Link Check

Each identified affiliate link is checked with an HTTP request. Brokenly follows redirects, records the final destination URL and the HTTP status code, and classifies the result:

  • 2xx → Healthy
  • 4xx / 5xx → Broken
  • 403 specifically → Blocked
  • Redirects ending at the merchant's homepage → Redirects to Homepage
  • Redirects ending at the merchant's search page → Redirects to Search
  • Timeouts, rate limits, or unexpected errors → Could Not Verify
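
The rules above can be sketched as a single decision function. This is a hedged illustration rather than Brokenly's implementation: the name classify_link and its parameters are hypothetical, treating HTTP 429 as a rate limit is an assumption, and the homepage and search-page tests are deliberately simple path heuristics.

```python
from typing import Optional
from urllib.parse import urlsplit


def classify_link(
    status: Optional[int],
    original_url: str,
    final_url: str,
    error: bool = False,
) -> str:
    """Map the outcome of one link check to a status label."""
    # Timeouts and unexpected errors are passed in as error=True or status=None.
    # Assumption: a 429 (Too Many Requests) response counts as a rate limit.
    if error or status is None or status == 429:
        return "Could Not Verify"
    # 403 is checked before the general 4xx rule so it gets its own label.
    if status == 403:
        return "Blocked"
    if 400 <= status <= 599:
        return "Broken"
    if 200 <= status <= 299:
        if final_url != original_url:  # the request was redirected
            path = urlsplit(final_url).path.rstrip("/").lower()
            if path == "":
                return "Redirects to Homepage"
            # Assumption: search pages live under /search.
            if path == "/search" or path.startswith("/search/"):
                return "Redirects to Search"
        return "Healthy"
    # Anything else (for example an unresolved 3xx) is treated as unexpected.
    return "Could Not Verify"
```

The order of the checks matters: the specific cases (errors, 429, 403) come before the broad 4xx / 5xx range, so a blocked or rate-limited link is never reported as Broken.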

5. Amazon Availability Check

Amazon can return 200 OK for a product page even when the product is Out of Stock or Unavailable. For links Brokenly identifies as Amazon products, an extra availability check runs to detect these cases. See Amazon Health Check.

Crawl Duration

Crawl time depends on:

  • Number of pages in your sitemap
  • Number of affiliate links per page
  • Response times of the affiliate networks being checked

Most sites finish a full crawl in a few minutes. You'll see live progress — pages crawled, links found, links checked — while it runs.

Plan Limits

Your plan sets a maximum number of links that can be checked per cycle, from 500 on Starter to 50,000+ on Agency. If a crawl finds more links than your remaining quota allows, Brokenly checks as many as it can and carries the rest over to the next crawl, so no link is skipped permanently.
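
The carry-over behaviour amounts to splitting the list of found links at the remaining quota. The sketch below shows that arithmetic only. The function name apply_quota is hypothetical, and checking links in the order they were found is an assumption, since the order Brokenly uses is not stated here.

```python
def apply_quota(pending_links, remaining_quota):
    """Split links into (checked this crawl, carried over to the next crawl)."""
    allowed = max(remaining_quota, 0)  # a used-up quota checks nothing
    return pending_links[:allowed], pending_links[allowed:]


# Example: 5 links found, 3 checks left in this cycle.
checked, carried = apply_quota(["a", "b", "c", "d", "e"], 3)
# checked == ["a", "b", "c"], carried == ["d", "e"]
```

The carried-over list would then be queued for the next crawl, which is why a link that misses one crawl is still checked later.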

What Brokenly Does Not Index

  • Pages not listed in your sitemap
  • Links inside iframes
  • Links injected dynamically by JavaScript after page load
  • Password-protected pages

If you need Brokenly to check pages that aren't in your sitemap, contact us.