ZDNET's key takeaways

- Web scraping powers pricing, SEO, security, AI, and research industries.
- AI scraping threatens site survival by bypassing traffic return.
- Companies fight back with licensing, paywalls, and crawler blocks.

In the world of industrial web scraping, there are a few major players. Oh, you didn't know there was a world of industrial web scraping? Have I got a story for you.

Let's start by defining web scraping. Web scraping is the practice of extracting data from live web pages, the pages the public sees when visiting a website.

Also: Fed up with AI scraping your content? This open-source bot blocker can help - here's how

This is different from getting data via programmatic API (application programming interface) calls that the provider of the web page makes available, from a database, or from other downloadable sources. Web scraping is extracting data that the web page owner has not officially made available for analysis and, in some cases, actively does not want made available for external analysis.

Web scraping example

Let's look at an example. Say you're a vendor with 200 individual products you sell online. Your products are fairly price-sensitive, which is to say that if a competitor starts selling a similar product at a lower price, you need to be able to respond and lower your price as well. You need to react to market forces fairly quickly, so tasking a bunch of employees with constantly refreshing hundreds of web pages and noting the results in a spreadsheet just won't do. You need an automated process.

Also: Perplexity says Cloudflare's accusations of 'stealth' AI scraping are based on embarrassing errors

Let's further assume your products, as well as those of your competitors, are sold at popular online marketplaces like Amazon and Walmart. Both of these resellers provide tracking data on your products, but they won't share your competitors' data with you. Yet you need that data.

The solution is web scraping: using an automated process to visit the web pages containing your competitors' products and extract current pricing information from the underlying HTML structure of each page. That data can then be fed into your internal databases, and your systems can update your prices accordingly. This scanning cycle might happen daily or a few times a week, keeping your products competitively priced and your customers happy.
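To make that concrete, here's a minimal sketch of such a price check in Python, using the requests and BeautifulSoup libraries. The product URLs and the CSS selector are hypothetical stand-ins; a real marketplace page would need its own selector, and a production scraper would also have to deal with getting blocked, which we'll get to shortly.

```python
# A minimal sketch of a competitor price check. The product URLs and
# the CSS selector below are hypothetical; every marketplace lays out
# its HTML differently, so you would inspect the real page to find yours.
import requests
from bs4 import BeautifulSoup

COMPETITOR_PRODUCTS = {
    "widget-pro": "https://marketplace.example.com/listing/12345",
    "widget-lite": "https://marketplace.example.com/listing/67890",
}

def fetch_price(url: str) -> float | None:
    """Retrieve one product page and pull the price out of its HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    tag = soup.select_one("span.price")  # hypothetical selector
    if tag is None:
        return None
    # Strip currency symbols and commas, e.g. "$1,299.99" -> 1299.99
    return float(tag.get_text().strip().lstrip("$").replace(",", ""))

for sku, url in COMPETITOR_PRODUCTS.items():
    print(f"{sku}: {fetch_price(url)}")
```

From there, the prices would flow into the vendor's database so a repricing system can react.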
Other web scraping applications

Industrial web scraping, where businesses scrape the web for data, is done for a variety of reasons. We just saw an application where a company uses web scraping for competitive information that drives business insights and informed decision-making. In addition to dynamic pricing, companies might want a clear view of competitors' available inventory and even their new product listings. They might also want to keep an eye on top products, reviews, and more.

Some businesses use web scraping to provide data as a service, whether that's real estate market data, sales leads, or any other aggregation of data that other companies find useful. If you've ever used an SEO monitoring tool or keyword ranking tool, you've probably been a consumer of web-scraped data provided as a service. The companies providing these services have to scan live sites (like Google) and pull down information that is then categorized and processed into up-to-date SEO analytics.

Also: How to get rid of AI Overviews in Google Search: 4 easy ways

There are also security and intellectual property protection applications for web scraping. For companies with valuable brands, there is justification in scanning the live pages of commerce sites (as well as other classes of websites) for inappropriate or illegal use of those brands. The US Department of Commerce says counterfeiting is the "largest criminal enterprise in the world," putting estimates of pirated and counterfeited goods at an almost incomprehensible $1.7 to $4.5 trillion per year. Unfortunately, the government cannot stop this behavior, which leaves it up to individual brand owners to mount their own defense. An important use of web scraping in this context is identifying counterfeit product offerings and then initiating the process to get those products removed from the market.

Other web scraping uses include threat intelligence, phishing protection, flight and hotel price tracking, aggregating trend data for market research, and even gathering data for AI training and academic research.

Two sides of the scraping coin: search and AI

Web scraping is not new. In fact, it's just about as old as the web. Think about search engines. In order for you to type something into Google and get back a list of web pages that cover the topic you're searching for, the search engine has to have already spidered, scraped, and indexed the sites it points you to.

Let's talk about helminths (intestinal worms) for a moment. That's a hard transition, but I promise it's relevant. When my dog eats poop, we have to give him deworming medicine so he doesn't get sick. But as Helena Helmby shows in the journal BMC Immunology, beneficial parasitical worm species like Trichuris trichiura or Necator americanus can help treat autoimmune disorders like Crohn's disease and ulcerative colitis.

Search engines are essentially beneficial parasites living off the work of individual website providers. They're beneficial because although they scrape the web, they send traffic back to the sites they scrape. The entire world of SEO became a thing because of how much traffic Google search sends to websites.

Also: AI bots scraping your data? This free tool gives those pesky crawlers the run-around

But then there's AI. AI is a lot like the parasitical sea lamprey (Petromyzon marinus), an agnathan (basically, a jawless fish). Sea lampreys can grow up to four feet long. They attach themselves to large fish with a suction mouth, scrape away a hole in the host's skin, and feed on blood and bodily fluids. These creatures devastated Great Lakes fisheries in the early 20th century. Control techniques, including poisons, barriers, and trapping, have since reduced the problem considerably.

AI scraping is parasitical behavior that's devastating website traffic. The AIs pull in information (like from this article) and then, instead of sending readers to the site where the author wrote the piece, simply present that information before anyone visits the site. I wrote a lot about this phenomenon, and some of the protections that are starting to be deployed, in How AI companies are secretly collecting training data from the web (and why it matters). That will bring you up to speed on the issue in more depth.
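If you run a website and wonder how much of your traffic is AI crawlers, one low-tech starting point is to count their visits in your own server logs. Here's a rough sketch that tallies requests whose User-Agent matches a few AI crawler names commonly reported in the wild; the log path and log format are assumptions you'd adapt to your own server, and keep in mind that stealthier crawlers may not identify themselves at all.

```python
# A rough sketch: tally hits from self-identified AI crawlers in a web
# server access log. Assumes the common "combined" log format, where the
# User-Agent is the last quoted field; adjust the path and parsing for
# your own server.
import re
from collections import Counter

# A few AI-related crawler names commonly reported in access logs.
AI_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot", "PerplexityBot", "Bytespider"]

hits = Counter()
with open("/var/log/nginx/access.log") as log:  # placeholder path
    for line in log:
        match = re.search(r'"([^"]*)"\s*$', line)  # last quoted field
        if not match:
            continue
        user_agent = match.group(1)
        for bot in AI_CRAWLERS:
            if bot in user_agent:
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")
```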
Both search and AI use the results of absolutely ginormous scraping and spidering operations, but one provides benefits to the scrapees, while the other profits enormously from the work of others while simultaneously destroying their motivation to keep doing that work.

(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

The challenges of web scraping

Web scraping on an industrial level requires large-scale data acquisition efforts. This generally involves an automated bot that retrieves web pages for analysis and curation. Unfortunately, at least from the point of view of web scrapers, most web servers detect and block repeated page accesses, even to publicly facing pages.

If the e-commerce company from our case study needs to update pricing on 200 products, it will probably need to generate a few thousand web page retrieval requests. That volume of retrieval would likely be blocked by any web server receiving those requests, which makes it fairly difficult for individual companies to do their own web scraping in-house.

Instead, a small cadre of companies has formed to offer web scraping as a service. At their core is the ability to split web scraping requests among thousands of individual computers, using them as proxies for data retrieval. While some scrapers do use data center-based proxy servers, that practice is often defeated at the website level, because all those scraping requests come from one IP address cluster or geolocation. Instead, the gold-standard practice is to use individual residential computers spread across a targeted geography (often homes in the US).

Also: How ChatGPT actually works (and why it's been so game-changing)

Scraping requests are then distributed among the home computers. Each computer retrieves a web page and returns it to servers at the scraping-as-a-service provider, which then manages the data for its customers.

This leads to another obvious challenge. How, exactly, do you get thousands to hundreds of thousands of home computers to work in concert to do web scraping? And how do you do it legally and ethically, with the consent of the home computer owners?

First of all, it's not always done legally or ethically. Malware plays a large part in distributing bots to thousands or even millions of end-user computers, which can then be "mind-controlled" into doing searches and scraping activities at scale.

There are, however, some companies that do web scraping legally and ethically, while also processing data in great volume. These companies pay a small stipend to end users who voluntarily give up a few cycles of processing power and a few bytes of bandwidth to scraper client programs, which feed the results back to central repositories. I spotlighted one such ethical scraper in This proxy provider I tested is the best for web scraping - and it's not IPRoyal or MarsProxies.
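However the proxy pool is sourced, the mechanics on the scraper's side look roughly the same. Here's a stripped-down sketch of round-robin proxy rotation with the requests library; the proxy addresses are placeholders, and a real service would rotate across thousands of residential endpoints and layer on retries, sessions, and rate limiting.

```python
# A stripped-down sketch of round-robin proxy rotation. The proxy
# addresses below are placeholders (from the reserved TEST-NET range);
# a real operation manages pools of thousands of residential endpoints.
from itertools import cycle

import requests

PROXY_POOL = cycle([
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
])

def fetch_via_proxy(url: str) -> str:
    """Fetch one page, routing the request through the next proxy in the pool."""
    proxy = next(PROXY_POOL)
    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )
    response.raise_for_status()
    return response.text

# Spread page retrievals across the pool so no single IP hammers the site.
pages = [f"https://marketplace.example.com/listing/{i}" for i in range(1, 6)]
for page_url in pages:
    print(page_url, len(fetch_via_proxy(page_url)))
```

From the target site's perspective, each request arrives from a different, ordinary-looking IP address, which is exactly what makes residential proxy networks so hard to block.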
Where do we go from here?

While scraping will likely always be part of data acquisition practice, some companies have opted to make their data available officially, and for a fee. Reddit, for example, is giving OpenAI access to its enormous library of fanbois screaming into the wind about this or that topic. Rather than scrape Reddit without approval, OpenAI will be able to use an API to retrieve data more efficiently. Of course, whether we want our AIs to base their knowledge on data from Reddit is another thing entirely.

Also: Reddit blocks the Internet Archive from crawling its data - here's why

Reddit is not alone, of course. Many companies have started to license their data to the AIs. While this doesn't reduce the scraping or the traffic erosion, it does provide something of an alternative revenue stream for the previous victims of scraping activity. This is not an issue that's going away.

One other approach to defending against malicious scraping has been implemented by edge traffic monitor Cloudflare, through whose servers about 20% of Internet traffic flows. Cloudflare now blocks AI web crawlers by default (unless they pay up, 'natch).

The bottom line is that web scraping is all about money. Whether money is spent bypassing restrictions to hoover up someone else's work, or money is spent to block that activity, or money is spent to get permission to extract that data and thereby reduce the overall value of the property, it's all about money. Lots and lots of money. Those of us who toil to create the content consumed by these robots are merely caught in the crossfire.

How do you feel about the growing use of web scraping by AI companies compared to search engines? Do you think licensing deals like Reddit's are a fair solution, or do they just legitimize the loss of site traffic? Should web scraping be more tightly regulated, or is it an unavoidable part of the modern Internet? Let us know in the comments below.