How web scraping actually works - and why AI changes everything

ZDNET's key takeaways

Web scraping powers pricing, SEO, security, AI, and research industries.
AI scraping threatens site survival by bypassing traffic return.
Companies fight back with licensing, paywalls, and crawler blocks.

In the world of industrial web scraping, there are a few big players. Oh, you did not know there was a world of industrial web scraping? Have I got a story for you.

Let's start by defining web scraping. Web scraping is the practice of extracting data from live web pages, the pages the public sees when visiting a website.

This is different from getting data via programmatic API (application programming interface) calls that the provider of the web page makes available, or from a database, or other downloadable data.

Web scraping is extracting data that the web page owner has not officially made available for data analysis, and, in some cases, actively does not want to make available for external data analysis.

Web scraping example

Let's look at an example. Let's say you're a vendor with 200 individual products you sell online.

Your products are fairly price sensitive, which is to say that if a competitor starts selling a similar product at a lower price, you need to be able to respond and lower your price as well. You need to be able to respond to market forces fairly quickly, so tasking a bunch of employees to constantly refresh hundreds of web pages and put results in a spreadsheet just will not do.

You need an automated process.

Let's further assume your products, as well as those of your competitor, are sold at popular online marketplaces like Amazon and Walmart. Both of these resellers provide tracking data on your products, but they will not share your competitors' data with you.

Yet you need that data. The solution is web scraping, using an automated process to visit the web pages containing your competitors' products and extracting current pricing information from the underlying HTML structure of the page. That data can then be fed into your internal databases, and your internal systems can then update your prices accordingly.
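
To make the "underlying HTML structure" part concrete, here is a minimal sketch in Python using only the standard library. It assumes, purely for illustration, that a product page marks its price with a class named "price"; real marketplace pages use their own markup, and the sample HTML, the class name, and the function names are all hypothetical. It also assumes the price element holds plain text like "$1,249.99" with no nested tags.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collects the text of elements whose class attribute includes 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Assumption: price elements contain no nested tags.
        self._in_price = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and text:
            # Drop the currency symbol and thousands separators.
            self.prices.append(float(text.lstrip("$").replace(",", "")))


def extract_prices(html):
    """Return every price found in the given HTML as a float."""
    parser = PriceExtractor()
    parser.feed(html)
    return parser.prices


# Hypothetical fragment of a competitor's product page.
SAMPLE_HTML = """
<div class="product">
  <h2>Competitor Widget</h2>
  <span class="price">$1,249.99</span>
</div>
"""

print(extract_prices(SAMPLE_HTML))
```

The fragile part is the markup assumption. When a site changes its page layout, a scraper like this quietly stops finding prices, which is one reason scraping takes ongoing maintenance.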

This scanning cycle might happen daily or a few times a week, keeping your products competitively priced and your customers happy.
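
What "responding" looks like depends entirely on the business, but a simple repricing rule can be sketched in a few lines of Python. The rule below is my own illustration, not a description of any real pricing system: it matches a lower competitor price but never goes below a floor price. The floor (say, cost plus a minimum margin) and the function name are assumptions.

```python
def reprice(our_price, competitor_price, floor_price):
    """Return the new price for one product.

    Matches a lower competitor price, but never drops below floor_price
    (a hypothetical minimum) and never raises the current price.
    """
    target = max(competitor_price, floor_price)
    return min(our_price, target)


# Competitor undercuts us, and we can afford to match.
print(reprice(our_price=20.00, competitor_price=18.00, floor_price=15.00))
# Competitor undercuts us below our floor, so we stop at the floor.
print(reprice(our_price=20.00, competitor_price=12.00, floor_price=15.00))
```

Run once per scanning cycle against each scraped competitor price, a rule like this keeps prices competitive without a person refreshing web pages.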

Other web scraping applications

Industrial web scraping, where businesses scrape the web for data, is done for a variety of reasons. We just saw an application where a company uses web scraping for competitive information that drives business insights and informed decision-making.

In addition to dynamic pricing, companies might want to have a clear view of available inventory and even new product listings from competitors. They might also want to keep an eye on top products, reviews, and more.

Some businesses use web scraping to provide data as a service, whether that is real estate market data, sales leads, or any other aggregate of data that other companies find useful.

If you've ever used an SEO monitoring tool or keyword ranking tool, you've most likely been a consumer of web-scraped data provided as a service. The companies providing these services have to scan live sites (like Google) and pull down information that is then categorized and processed to provide up-to-date SEO analytics.

There are also data and intellectual property protection applications for web scraping. For those with valuable brands, there is justification in scanning live web pages of commerce sites (as well as other classes of websites) for inappropriate or illegal use of your brands.

The US Department of Commerce says counterfeiting is the "largest criminal enterprise in the world," putting estimates of pirated and counterfeited goods at an almost incomprehensible $1.7 to $4.5 trillion per year.

Unfortunately, the government cannot stop this behavior, which leaves it up to individual brand owners to mount their own defense. An important use of web scraping in this context is identifying counterfeit product offerings, and then initiating the process to get those counterfeit products removed from the market.

Other web scraping uses include threat intelligence, phishing protection, flight and hotel pricing information, aggregating data on trends for market research, and even data used for AI training and academic research.

Two sides of the scraping coin: search and AI

Web scraping is not new. In fact, it's just about as old as the web. Think about search engines. In order for you to type something into Google and get back a list of web pages that include the topic you're searching for, the search engine has to have already spidered, scraped, and indexed the sites it points you to.

Let's talk about helminths (intestinal worms) for a moment. That's a hard transition, but I promise it's relevant.

When my dog eats poop, we have to give him deworming medicine so he does not get sick. But as Helena Helmby shows in the journal BMC Immunology, beneficial parasitic worm species like Trichuris trichiura or Necator americanus can help treat autoimmune disorders like Crohn's disease and ulcerative colitis.

Search engines are essentially beneficial parasites living off the work of individual website providers. They're beneficial because although they scrape the web, they send traffic back to the sites they scrape. The whole world of SEO became a thing because of how much traffic Google search sends to websites.

But then there's AI. AI is a lot like the parasitic sea lamprey (Petromyzon marinus), an agnathan (basically a jawless fish). Sea lampreys can grow up to 4 feet long. They attach themselves to other large fish with a suction mouth, scrape away a hole in the host's skin, and feed on blood and bodily fluids. These creatures devastated Great Lakes fisheries in the early 20th century. Later techniques, including poison, barriers, and trapping, have reduced the problem considerably.

AI scraping is parasitic behavior that's devastating website traffic. The AIs pull in information (like from this article) and then, instead of sending readers to the site where an author wrote the piece, simply present that information before anyone visits a site. I wrote a lot about this phenomenon and some of the protections that are starting to be deployed in How AI companies are secretly collecting training data from the web (and why it matters). That will bring you up to speed on the issue in more depth.

Both search and AI use the results of absolutely ginormous scraping and spidering operations, but one provides benefits to the scrapees, while the other profits enormously from the work of others while simultaneously destroying their incentive to keep doing the work.

(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

The challenges of web scraping

Web scraping on an industrial level requires large-scale data acquisition efforts. This generally involves the use of an automated bot that retrieves web pages for analysis and curation. Unfortunately, at least from the point of view of web scrapers, most web servers detect and block repeated page accesses, even to publicly facing pages.

If the e-commerce company from our case study needs to update pricing on 200 products, it will most likely need to make a few thousand web page retrieval requests. This volume of retrieval would likely be blocked by any web server receiving those requests.
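
To see why a few thousand requests from one machine gets noticed, here is a rough sketch in Python of one common blocking technique, a per-IP sliding-window rate limit. This is an illustration under assumed numbers, not how any particular site does it: the 60-requests-per-minute threshold, the class and function names, and the sample IP address (taken from a range reserved for documentation) are all hypothetical. Times are integer milliseconds to keep the arithmetic exact.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most max_requests per client IP within any window_ms span."""

    def __init__(self, max_requests, window_ms):
        self.max_requests = max_requests
        self.window_ms = window_ms
        self._hits = defaultdict(deque)  # client IP -> recent request times

    def allow(self, client_ip, now_ms):
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the window.
        while hits and now_ms - hits[0] >= self.window_ms:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: block this request
        hits.append(now_ms)
        return True


def simulate(total_requests, interval_ms, limiter, client_ip="203.0.113.7"):
    """Count how many evenly spaced requests from one IP get through."""
    return sum(
        limiter.allow(client_ip, now_ms=i * interval_ms)
        for i in range(total_requests)
    )


# Hypothetical policy: 60 requests per minute per IP.
limiter = SlidingWindowLimiter(max_requests=60, window_ms=60_000)
# A scraper firing 2,000 requests, ten per second, from a single address.
allowed = simulate(total_requests=2000, interval_ms=100, limiter=limiter)
print(f"{allowed} of 2000 requests allowed")
```

Under these assumed numbers, only 240 of the 2,000 requests get through. Slowing down would help, but then the scan no longer finishes quickly enough to respond to the market.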

This makes it fairly hard for individual companies to do their own web scraping in-house. Instead, a small cadre of companies has formed to offer web scraping as a service. At their core is the ability to split web scraping requests among thousands of individual computers, using them as proxies for data retrieval.

While some scrapers do use data center-based proxy servers, the practice is often defeated at the website level, because all those scraping requests come from one IP address cluster or geolocation. Instead, the gold-standard practice is to use individual residential computers all across a targeted geography (often homes in the US).

Scraping requests are then distributed among the home computers. Each computer retrieves a web page. Then all those computers return those retrieved pages back to servers at the scraping-as-a-service provider, who then manages the data for customers.
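
The distribution step can be sketched in Python as a simple round-robin assignment. Treat this as a toy model: real providers presumably weigh geography, availability, and bandwidth, none of which I'm modeling here. The proxy names, URLs, and function name are made up for illustration, and no network requests are made.

```python
from collections import defaultdict
from itertools import cycle


def assign_round_robin(urls, proxies):
    """Map each proxy to the list of URLs it should fetch."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    assignments = defaultdict(list)
    # cycle() repeats the proxy list; zip() stops when the URLs run out.
    for url, proxy in zip(urls, cycle(proxies)):
        assignments[proxy].append(url)
    return dict(assignments)


# Hypothetical workload: 2,000 page requests spread over 1,000 home computers.
urls = [f"https://example.com/product/{n}" for n in range(2000)]
proxies = [f"home-pc-{n:04d}" for n in range(1000)]
plan = assign_round_robin(urls, proxies)

busiest = max(len(batch) for batch in plan.values())
print(f"Most requests sent from any one address: {busiest}")
```

Spread this way, no single address makes more than two requests, so the target site sees what looks like a trickle of ordinary visitors and not one hungry bot.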

This leads to another obvious challenge. How, exactly, do you get thousands to hundreds of thousands of home computers to work in concert to do web scraping? And how do you do it legally and ethically, with the consent of the home computer owners?

First of all, it's not always done legally or ethically. Malware plays a large part in distributing bots to thousands or even millions of end-user computers, which can then be "mind-controlled" into doing searches and scraping activities at scale.

There are, however, some companies that do web scraping legally and ethically, while also processing data in large volume. These companies pay a small stipend to end users who voluntarily give up a few cycles of processing power and a few bytes of bandwidth to scraper client programs, which feed the results back to central repositories. We spotlighted one such ethical scraper in my article, This proxy provider I tested is the best for web scraping -- and it's not IPRoyal or MarsProxies.

Where do we go from here?

While scraping will likely always be a part of data acquisition practice, some companies have opted to make their data available officially and for a fee. Reddit, for example, is giving OpenAI access to its tremendous library of fanbois screaming into the wind about this or that topic.

Rather than scrape Reddit without approval, OpenAI will be able to use an API (application programming interface) to retrieve data more efficiently.

Of course, whether we want our AIs to base their knowledge on data from Reddit is another thing entirely.

Reddit is not alone, of course. Many companies have started to license their data to the AIs. While this does not reduce the scraping or the traffic erosion, it does provide something of an alternative revenue stream for the former victims of scraping activity.

This is not an issue that's going away. One other approach to protect against malicious scraping has been implemented by edge traffic monitor Cloudflare. About 20% of Internet traffic flows through its servers. Cloudflare is blocking AI web crawlers by default (unless they get paid, 'natch).
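
I have not described how Cloudflare's blocking works, and the sketch below is not it. But the general idea of treating crawlers differently by identity can be shown with robots.txt, a long-standing voluntary convention in which a site states which user agents may fetch which paths. Python's standard library can evaluate such a policy. The crawler names "ExampleAIBot" and "ExampleSearchBot" are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: shut out one AI crawler, welcome everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a policy object we can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)
article = "https://example.com/articles/web-scraping"

for agent in ("ExampleAIBot", "ExampleSearchBot"):
    verdict = "allowed" if policy.can_fetch(agent, article) else "blocked"
    print(f"{agent}: {verdict}")
```

The catch is that robots.txt is purely advisory. A crawler that ignores the file, or lies about its user agent, sails right through, which is why enforcement at the network level is the more serious defense.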

The bottom line is that web scraping is all about money. Whether money is spent bypassing restrictions to hoover up someone else's work, or money is spent to block that activity, or money is spent to get permission to extract that data and thereby reduce the overall value of the property, it's all about money. Lots and lots of money.

Those of us who toil to create the content consumed by these robots are simply caught in the crossfire.

How do you feel about the growing use of web scraping by AI companies compared to search engines? Do you think licensing deals like Reddit's are a fair solution, or do they just legitimize the loss of site traffic? Should web scraping be more tightly regulated, or is it an unavoidable part of the modern Internet? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, on Bluesky at @DavidGewirtz.com, and on YouTube at YouTube.com/DavidGewirtzTV.