Crawl Order and Disorder
Published on: 2025-05-27 14:07:01
A problem the search engine’s crawler has struggled with for some time is that it takes a long time to finish, usually spending several days on the final few domains.
This has become more pressing recently: the migration to slop crawl data has dropped the crawler’s memory requirements by something like 80%, so I’ve been able to increase the number of crawling tasks. That has led to a bizarre situation where 99.9% of the crawling is done in 4 days, and the remaining 0.1% takes a week.
This happens for a few reasons, in part because the sizes of websites seem to follow a Pareto distribution and some sites are just very large, but also because the crawler limits how many concurrent crawl tasks are allowed per common domain name.
This limit is to avoid accidentally exceeding crawl rates by crawling the same site via different aliases. It’s also flat necessary to avoid getting blocked by anti-crawler software on some domains, especially in academia which ten
... Read full article.
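To make the per-domain limit concrete, here is a minimal sketch of how such a cap could work. It is not the crawler's actual implementation: the names `common_domain` and `DomainLimiter`, the cap of 2, and the rule of grouping hosts by their last two labels are all assumptions made for illustration. A real grouping rule would need something like the Public Suffix List to handle domains such as `example.co.uk`.

```python
import threading


def common_domain(host: str) -> str:
    """Hypothetical grouping key: the last two labels of the hostname.

    This is a naive rule for illustration only; it mishandles
    multi-label public suffixes such as "co.uk".
    """
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


class DomainLimiter:
    """Caps how many crawl tasks may run at once per common domain."""

    def __init__(self, max_per_domain: int = 2):
        self._max = max_per_domain
        self._lock = threading.Lock()
        self._semaphores: dict[str, threading.BoundedSemaphore] = {}

    def _semaphore_for(self, host: str) -> threading.BoundedSemaphore:
        key = common_domain(host)
        # Guard the dict so two threads never create separate
        # semaphores for the same common domain.
        with self._lock:
            if key not in self._semaphores:
                self._semaphores[key] = threading.BoundedSemaphore(self._max)
            return self._semaphores[key]

    def try_acquire(self, host: str) -> bool:
        """Return True if a crawl task for this host may start now."""
        return self._semaphore_for(host).acquire(blocking=False)

    def release(self, host: str) -> None:
        """Mark a crawl task for this host as finished."""
        self._semaphore_for(host).release()


limiter = DomainLimiter(max_per_domain=2)
for host in ["www.example.edu", "cs.example.edu", "math.example.edu"]:
    started = limiter.try_acquire(host)
    print(host, "started" if started else "deferred")
```

With a cap like this, the third subdomain of `example.edu` has to wait for one of the first two to finish, even when plenty of crawler threads are idle. That is the mechanism by which a handful of large, many-aliased sites can stretch out the tail of a crawl.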