Large-scale crawling doesn’t have to be expensive. Here’s how we proved it.

When facing the task of collecting hundreds of millions of pages, the first question is always the same: how much will this cost? Standard cloud instances, whether on GCP, AWS, or Azure, quickly generate bills that put a hard ceiling on ambition. That’s why, for a project that required indexing 400 million pages, we bet on Google Cloud Preemptible VMs (also known as Spot VMs). The result: $2 per million pages crawled.

Here’s why this works so well for crawling and what you need to know before deploying.


What are Preemptible (Spot) VMs?

A Google Cloud Preemptible VM runs on GCP’s surplus compute capacity. In exchange for allowing Google to reclaim the instance at any time, you get a machine of the same type and performance as a standard VM, at up to 91% off.

The legacy “preemptible” label refers to the older variant with a 24-hour maximum runtime. The newer form, the Spot VM, removes that limit, and Google officially recommends using Spot VMs going forward. Both variants share the same pricing model.

Key characteristics:

  • Fixed price within a given day (changes no more than once per day, in practice far less often)
  • Google can stop the instance at any time, giving 30 seconds for a graceful shutdown
  • If an instance is stopped within the first minute, you pay nothing for that time
  • All machine types available, including GPU and Local SSD configurations
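
To make the 30-second shutdown window concrete, here is a minimal Python sketch of a worker that stops cleanly when the VM is reclaimed. It is an illustration, not our production code. It assumes the crawler runs as a systemd service and therefore receives SIGTERM when the guest OS starts shutting down; the names stop_requested, request_stop, and crawl_batch are hypothetical.

```python
import signal
import threading

# Set when the VM is shutting down; the worker checks it between URLs.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only flips a flag, so it is cheap and safe to run."""
    stop_requested.set()


def install_shutdown_handler():
    # Assumption: the process is stopped with SIGTERM during VM shutdown
    # (for example, as a systemd service). Must be called from the main thread.
    signal.signal(signal.SIGTERM, request_stop)


def crawl_batch(urls, fetch):
    """Fetch URLs one by one.

    Returns (done, not_started) so the caller can hand unfinished
    URLs back to the queue before the instance disappears.
    """
    done = []
    for i, url in enumerate(urls):
        if stop_requested.is_set():
            return done, urls[i:]
        fetch(url)
        done.append(url)
    return done, []
```

Because a single fetch takes seconds, checking the flag between URLs leaves plenty of the 30 seconds to return the leftovers to the queue.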

Why crawling and preemptible are a perfect match

Crawling is inherently a fault-tolerant, massively parallelizable workload. Every URL to fetch is an independent task. If a worker dies mid-batch, the URL queue doesn’t disappear; another worker picks up the work. There’s no shared state to protect.

This is precisely the type of workload preemptible instances were designed for:

Every task is atomic. Fetching a page takes seconds. Even aggressive preemption wastes almost no completed work – at worst, a handful of URLs return to the queue.

Horizontal scaling is natural. Instead of one large machine, you run dozens of small ones. At preemptible pricing, the same aggregate compute power costs several times less.

Interruptions don’t affect the final output. You don’t care about any particular instance staying alive; you care about the URL queue being progressively drained. Preemption only slows the rate; it doesn’t corrupt data.

The work pattern matches spot availability. Crawling is rarely a real-time task: a few hours’ delay in processing a batch is acceptable. This gives a buffer for any temporary capacity constraints in the spot pool.
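
One common way to get the “URLs return to the queue” behavior is a lease with a timeout: a URL handed to a worker goes back to the queue if nobody confirms it in time. The sketch below is a hypothetical in-memory illustration of that idea in Python. The LeaseQueue class and its method names are our own inventions for this example, and a real deployment would keep this state in the queue broker rather than in process memory.

```python
from collections import deque


class LeaseQueue:
    """In-memory stand-in for a queue broker with lease timeouts (hypothetical)."""

    def __init__(self, urls, lease_seconds=300):
        self.pending = deque(urls)
        self.in_flight = {}  # url -> time at which the lease expires
        self.lease_seconds = lease_seconds

    def lease(self, n, now):
        """Hand out up to n URLs and record when each lease expires."""
        batch = []
        while self.pending and len(batch) < n:
            url = self.pending.popleft()
            self.in_flight[url] = now + self.lease_seconds
            batch.append(url)
        return batch

    def ack(self, url):
        """Mark a URL as done so it is never handed out again."""
        self.in_flight.pop(url, None)

    def requeue_expired(self, now):
        """Return URLs whose worker vanished (lease expired) to the queue."""
        expired = [u for u, deadline in self.in_flight.items() if deadline <= now]
        for url in expired:
            del self.in_flight[url]
            self.pending.append(url)
        return expired
```

If a worker leases two URLs, acknowledges one, and is then preempted, a later requeue_expired call puts only the unacknowledged URL back. Nothing is lost; the page is simply fetched a little later by another worker.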


The result: 400 million pages at $2 per million

We ran a fleet of small instances in preemptible mode. Each worker pulled URLs from a central Redis queue, wrote results to GCS, and marked the task complete.

Total compute cost for 400 million pages: ~$800. That’s $2 per million pages.

For comparison: the same volume on standard on-demand instances would have cost 8–12x more.
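
The arithmetic behind these figures is simple enough to check in a few lines of Python. This snippet only restates the numbers above; it reads the 8–12x comparison as a multiplier on the spot total, which is our interpretation rather than a measured on-demand bill.

```python
PAGES = 400_000_000
COST_PER_MILLION_USD = 2.0  # figure from this article

spot_total = PAGES / 1_000_000 * COST_PER_MILLION_USD

# The 8-12x on-demand multiplier, applied to the spot total.
on_demand_low = spot_total * 8
on_demand_high = spot_total * 12

print(f"spot: ${spot_total:,.0f}")
print(f"on-demand: ${on_demand_low:,.0f} to ${on_demand_high:,.0f}")
```

Under that reading, the same crawl on on-demand instances would land somewhere between $6,400 and $9,600.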


One real downside: no Docker

There’s one tradeoff worth knowing upfront.

Preemptible VMs, or more precisely fleets of them managed through Managed Instance Groups, work best with a gold VM image, not with Docker containers launched via systemd or equivalent.

Why? A Managed Instance Group creates new instances from a template. If that template is a VM image with the full crawler environment already baked in, a new machine starts in seconds and immediately begins processing. It doesn’t need to pull a Docker image, doesn’t depend on an external registry being reachable, and has no cold start from downloading dependencies.

In practice this means:

  • You build a gold image once (with Packer or manually), baking in Scrapy, Playwright, and the agent config
  • An Instance Template in GCP points to that image
  • The Managed Instance Group launches N instances from the template, each ready to work from the first second of boot

This is a slightly different workflow than Docker Compose or Kubernetes, but for this class of task it’s simpler and more reliable. No registries, no layers, no container-level health checks. The VM boots, and the crawler script starts with it.

Key design principles:

  1. Each worker pulls a batch of 100 URLs from the queue broker, fetches the pages, processes them, writes the results to the database, marks the tasks as done, then goes back to the broker for the next batch. That’s the entire worker loop; no other inter-machine communication is needed.
  2. The queue broker runs on a single small on-demand instance. This one machine is never preemptible; it’s the only part of the system that must stay up continuously. Everything else can be stopped and replaced at any time.
  3. The number of parallel workers is entirely up to you. Want to index faster? Spin up more instances. Want to cut costs? Shrink the fleet. The MIG can do this automatically based on queue depth, but you can also manage it manually.
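
The principles above can be sketched in a few dozen lines of Python. This is a simplified illustration, not our actual worker. InMemoryBroker is a hypothetical stand-in for the real queue broker, fetch and store are placeholders you would supply, and desired_workers is an assumed sizing rule for illustration, not how the MIG autoscaler is configured.

```python
import math
from collections import deque

BATCH_SIZE = 100  # batch size mentioned in the article


class InMemoryBroker:
    """Hypothetical stand-in for the real queue broker, for illustration only."""

    def __init__(self, urls):
        self.pending = deque(urls)
        self.done = set()

    def pull(self, n):
        return [self.pending.popleft() for _ in range(min(n, len(self.pending)))]

    def mark_done(self, urls):
        self.done.update(urls)

    def depth(self):
        return len(self.pending)


def run_worker(broker, fetch, store, batch_size=BATCH_SIZE):
    """Pull a batch, fetch and store each page, mark the batch done, repeat."""
    processed = 0
    while True:
        batch = broker.pull(batch_size)
        if not batch:
            return processed  # queue drained
        for url in batch:
            store(url, fetch(url))
        broker.mark_done(batch)
        processed += len(batch)


def desired_workers(queue_depth, urls_per_worker, min_workers=0, max_workers=50):
    """Fleet size needed to drain the queue, clamped to a configured range."""
    needed = math.ceil(queue_depth / urls_per_worker)
    return max(min_workers, min(max_workers, needed))
```

The worker never talks to another worker, only to the broker. That is what makes any single instance disposable: killing one mid-loop affects at most the batch it was holding.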

When this approach makes sense and when it doesn’t

Well suited for:

  • Large one-shot crawls (one-time indexing runs)
  • Recurring indexing with acceptable latency
  • Any workload where data flows through a queue rather than direct machine-to-machine connections

Less suited when:

  • The crawl must finish within a strict, short time window with no tolerance for stalls
  • You’re using Playwright with long-running browser sessions (preemption mid-session requires careful state management)
  • Your sources rate-limit per IP and instance restarts rotate IPs at inconvenient moments

Takeaway

Google Cloud Preemptible / Spot VMs are one of the most underrated tools in a crawler’s toolbox. The crawling workload is almost perfectly suited to the spot model: atomic tasks, fault-tolerant by design, trivially parallelizable.

The only real tradeoff is moving away from Docker as the deployment mechanism in favor of a baked VM image. In practice, this is actually a simpler and faster approach for managing a fleet of machines doing this kind of work.

The numbers speak for themselves: 400 million pages, $2 per million.


Planning a similar project, whether building your own crawl infrastructure or outsourcing the indexing entirely? Get in touch.