Documentation

This page provides documentation resources for Apache StormCrawler.

User Documentation

Additional documentation, including configuration guides and usage examples, is available on the versioned documentation pages.

Javadoc

The full API reference for Apache StormCrawler is published on javadoc.io.

FAQ

Q: Topologies? Spouts? Bolts? I’m confused!

A: If you’re new to these concepts, it’s worth starting with Apache Storm®. The tutorial and concept pages provide a good introduction. In addition, you can have a look at our own documentation, which provides a quick overview.
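
For orientation, here is a minimal sketch of how the pieces fit together. The class names come from the StormCrawler core module, but the package prefix varies by version (older releases use com.digitalpebble.stormcrawler instead of org.apache.stormcrawler), and the seed URL is just a placeholder:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.stormcrawler.bolt.FetcherBolt;
import org.apache.stormcrawler.bolt.JSoupParserBolt;
import org.apache.stormcrawler.spout.MemorySpout;

public class MinimalTopology {
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        // A spout emits tuples (here: URLs) into the stream...
        builder.setSpout("spout", new MemorySpout("https://stormcrawler.apache.org/"));
        // ...and bolts consume them: one fetches pages, the next parses them.
        builder.setBolt("fetch", new FetcherBolt()).shuffleGrouping("spout");
        builder.setBolt("parse", new JSoupParserBolt()).localOrShuffleGrouping("fetch");
        return builder;
    }
}
```

A topology is simply this wiring diagram: spouts produce tuples, bolts transform them, and the groupings decide how tuples are routed between components.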

Q: Do I need an Apache Storm® cluster to run StormCrawler?

A: Not necessarily. StormCrawler can run in local mode, using Storm libraries as dependencies. However, installing Storm in pseudo-distributed mode is useful if you want to use its UI to monitor topologies.
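
As a minimal sketch of local mode, assuming Storm 2.x and the MinimalTopology wiring from the previous answer (the topology name and agent name are placeholders):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;

public class LocalModeExample {
    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        // The fetcher expects a user agent to be configured.
        conf.put("http.agent.name", "test-crawler");
        // LocalCluster simulates a Storm cluster inside the current JVM,
        // so no separate Storm installation is needed.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("crawl", conf,
                    MinimalTopology.build().createTopology());
            Thread.sleep(60_000); // let the crawl run for a minute, then shut down
        }
    }
}
```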

Q: Why use Apache Storm®?

A: Apache Storm® is a robust, fault-tolerant framework for distributed stream processing. It guarantees data processing, is simple to understand, is actively maintained, and is licensed under the Apache License 2.0.

Q: How fast is StormCrawler?

A: Speed depends on the diversity of hostnames, your politeness settings, and your execution environment. For example, if you have 1 million URLs on a single host and enforce a 1-second delay between requests, you can fetch at most ~86,400 pages per day. Actual performance will vary depending on network speed, document size, parsing, and indexing overhead. This is true for any crawler.
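
The arithmetic behind those numbers, as a quick sketch:

```java
public class CrawlRateEstimate {
    public static void main(String[] args) {
        int secondsPerDay = 24 * 60 * 60;                       // 86,400
        double delaySeconds = 1.0;                              // politeness delay per request
        double maxPagesPerDay = secondsPerDay / delaySeconds;   // ~86,400 pages/day
        double daysForMillionUrls = 1_000_000 / maxPagesPerDay; // ~11.6 days for 1M URLs
        System.out.printf("max pages/day: %.0f, days for 1M URLs: %.1f%n",
                maxPagesPerDay, daysForMillionUrls);
    }
}
```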

Q: Why choose StormCrawler over Apache Nutch?

A: StormCrawler processes URLs as a continuous stream, indexing them as they are fetched. Nutch operates in batch steps, which can slow down as the crawl grows and can leave resources unevenly used. StormCrawler handles streaming URLs and low-latency use cases efficiently. It is also more modern, modular, and actively maintained, though Nutch excels in advanced scoring and deduplication.

Tutorials comparing the two are available (Nutch & SC with CloudSearch), as well as a benchmark study (Crawlers benchmark).

Q: Do I need external storage? Which type?

A: Yes, StormCrawler needs external storage to keep track of URLs and their fetch status. Which backend fits best depends on the scale and latency requirements of your crawl; external modules are available for backends such as OpenSearch, Elasticsearch, Apache Solr, SQL databases, and URLFrontier.

The modularity of StormCrawler lets you plug in almost any storage backend.
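
As a hedged illustration of that modularity: the persistence layer is just another bolt. StdOutStatusUpdater ships with the core module (package prefix varies by version, as noted above), while the OpenSearch, Solr, and SQL modules provide their own StatusUpdaterBolt implementations; swapping backends mostly means wiring a different bolt into the status stream:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.stormcrawler.Constants;
import org.apache.stormcrawler.bolt.FetcherBolt;
import org.apache.stormcrawler.persistence.StdOutStatusUpdater;
import org.apache.stormcrawler.spout.MemorySpout;

public class StatusWiring {
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MemorySpout("https://stormcrawler.apache.org/"));
        builder.setBolt("fetch", new FetcherBolt()).shuffleGrouping("spout");
        // Status updates (discovered, fetched, errors) flow on a dedicated
        // "status" stream. Replace StdOutStatusUpdater with the updater bolt
        // from your storage module; the rest of the topology stays the same.
        builder.setBolt("status", new StdOutStatusUpdater())
               .localOrShuffleGrouping("fetch", Constants.StatusStreamName);
        return builder;
    }
}
```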

Q: Is StormCrawler polite?

A: Yes. It respects the robots.txt protocol and can be configured with a politeness delay between requests to the same host or domain.
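
As a sketch, the relevant settings look like this. The key names are taken from the default crawler-conf.yaml and should be checked against your StormCrawler version; the agent name and delay value are placeholders:

```java
import org.apache.storm.Config;

public class PolitenessSettings {
    public static Config politeConf() {
        Config conf = new Config();
        conf.put("http.agent.name", "my-polite-crawler"); // identify your crawler to servers
        conf.put("fetcher.server.delay", 2.0);            // seconds between requests to the same queue
        conf.put("fetcher.queue.mode", "byHost");         // one fetch queue per host
        return conf;
    }
}
```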

Q: How do I know when a crawl is finished?

A: Storm topologies run continuously by design, so there is no automatic “finished” state. You need to monitor progress and stop the crawl manually or implement a custom termination mechanism.