Configuration

User Agent Configuration

Crawlers should always act responsibly and ethically when accessing websites. A key aspect of this is properly identifying themselves through the User-Agent header. A clear and accurate user agent string lets webmasters understand who is visiting their site and why, and apply rules in robots.txt accordingly. Respecting these rules, avoiding excessive request rates, and honoring content restrictions not only supports legal compliance but also maintains a healthy relationship with the web community. Transparent identification is a fundamental part of ethical web crawling.

The configuration of the user agent in StormCrawler has two purposes:

  1. Identification of the crawler for webmasters

  2. Selection of rules from robots.txt

Crawler Identification

The politeness of a web crawler is not limited to how frequently it fetches pages from a site; it also extends to how it identifies itself to the sites it crawls. This is done by setting the HTTP header User-Agent, just like your web browser does.

The full user agent string is built from the concatenation of the configuration elements:

  • http.agent.name: name of your crawler

  • http.agent.version: version of your crawler

  • http.agent.description: description of what it does

  • http.agent.url: URL where webmasters can learn about it

  • http.agent.email: an email address webmasters can use to get in touch with you

StormCrawler used to provide default values for these elements. This is no longer the case since version 2.11, and you must now provide the values yourself.

You can specify the user agent verbatim with the config http.agent, but you will still need to provide an http.agent.name for parsing robots.txt files.
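As a rough illustration of how the elements listed above might be combined, the sketch below joins them in a "name/version (description; url; email)" layout. Both the layout and the UserAgentSketch class are assumptions made for this example; the exact string StormCrawler sends may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of StormCrawler: joins the http.agent.* values
// into one User-Agent string using an assumed "name/version (details)" layout.
class UserAgentSketch {

    static String build(String name, String version, String description, String url, String email) {
        if (name == null || name.isBlank()) {
            throw new IllegalArgumentException("http.agent.name must be set");
        }
        StringBuilder sb = new StringBuilder(name.trim());
        if (version != null && !version.isBlank()) {
            sb.append('/').append(version.trim());
        }
        // Collect the optional elements that were actually provided.
        List<String> details = new ArrayList<>();
        for (String part : new String[] {description, url, email}) {
            if (part != null && !part.isBlank()) {
                details.add(part.trim());
            }
        }
        if (!details.isEmpty()) {
            sb.append(" (").append(String.join("; ", details)).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("ExampleBot", "1.0", "research crawler",
                "https://example.org/bot", "bot@example.org"));
    }
}
```

With the placeholder values used in main, this prints ExampleBot/1.0 (research crawler; https://example.org/bot; bot@example.org).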

Robots Exclusion Protocol

This is also known as the robots.txt protocol and is formalised in RFC 9309. Part of what the robots directives do is define rules that specify which parts of a website (if any) are allowed to be crawled. The rules are organised by User-Agent, with a * to match any agent not otherwise specified explicitly, e.g.:

User-Agent: *
Disallow: *.gif$
Disallow: /example/
Allow: /publications/

In the example above, the rules allow access to URLs with the /publications/ path prefix and restrict access to URLs with the /example/ path prefix and to all URLs with a .gif suffix. The "*" character matches any sequence of characters, including the otherwise-required forward slash, and a trailing "$" marks the end of the pattern.
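To make the wildcard semantics more concrete, the sketch below translates a robots.txt path pattern into a regular expression and tests it against a path. It is only an illustration: RobotsPatternSketch is a hypothetical class, it is not how StormCrawler parses robots.txt, and it deliberately leaves out how precedence between competing Allow and Disallow rules is resolved.

```java
import java.util.regex.Pattern;

// Illustrative only: models the "*" and "$" wildcards of robots.txt path patterns.
class RobotsPatternSketch {

    // Translates a robots.txt path pattern into an equivalent regular expression.
    static Pattern toRegex(String rule) {
        boolean anchored = rule.endsWith("$");
        String body = anchored ? rule.substring(0, rule.length() - 1) : rule;
        String[] literals = body.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*"); // "*" stands for any run of characters, "/" included
            }
            regex.append(Pattern.quote(literals[i]));
        }
        if (!anchored) {
            regex.append(".*"); // without "$" the pattern behaves as a prefix match
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    static boolean matches(String rule, String path) {
        return toRegex(rule).matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("*.gif$", "/images/logo.gif"));
        System.out.println(matches("/example/", "/example/page.html"));
        System.out.println(matches("/publications/", "/example/page.html"));
    }
}
```

Here "*.gif$" matches /images/logo.gif but not /images/logo.gif?size=2, because the "$" anchors the pattern to the end of the path.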

The value of http.agent.name is what StormCrawler looks for in the robots.txt. It MUST contain only uppercase and lowercase letters ("a-z" and "A-Z"), underscores ("_"), and hyphens ("-").
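The character restriction above is easy to check before a crawl starts. The following is a minimal sketch of such a check; AgentNameCheck is a hypothetical helper and not something StormCrawler ships.

```java
import java.util.regex.Pattern;

// Hypothetical pre-flight check for the http.agent.name value.
class AgentNameCheck {

    // Letters, underscores and hyphens only; at least one character.
    private static final Pattern ALLOWED = Pattern.compile("[A-Za-z_-]+");

    static boolean isValid(String agentName) {
        return agentName != null && ALLOWED.matcher(agentName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("my-crawler_bot")); // only allowed characters
        System.out.println(isValid("MyBot/1.0"));      // contains "/", "." and digits
    }
}
```

A name such as my-crawler_bot passes the check, whereas MyBot/1.0 does not, since the slash, the dot and the digits fall outside the allowed set.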

Unless you are running a well known web crawler, it is unlikely that its agent name will be listed explicitly in the robots.txt (if it is, well, congratulations!). While you want the agent name value to reflect who your crawler is, you might want to follow rules set for better known crawlers. For instance, if you were a responsible AI company crawling the web to build a dataset to train LLMs, you would want to follow the rules set for Google-Extended (see list of Google crawlers) if any were found.

This is what the configuration http.robots.agents allows you to do. It is a comma-separated string but can also take a list of values. By setting it alongside http.agent.name (which should also be the first value it contains), you are able to broaden the match rules based on the identity as well as the purpose of your crawler.

Proxy

StormCrawler’s proxy system is built on top of the SCProxy class and the ProxyManager interface. Every proxy used in the system is represented as an SCProxy. The ProxyManager implementations handle the management and delegation of their internal proxies. Whenever Protocol#getProtocolOutput() is called, ProxyManager.getProxy() is invoked to retrieve a proxy for that individual request.

The ProxyManager interface can be implemented in a custom class to create custom logic for proxy management and load balancing. The default ProxyManager implementation is SingleProxyManager. This ensures backwards compatibility with prior StormCrawler releases. To use MultiProxyManager or a custom implementation, pass the fully qualified class name via the config parameter http.proxy.manager:

http.proxy.manager: "org.apache.stormcrawler.proxy.MultiProxyManager"

StormCrawler implements two ProxyManager classes by default:

  • SingleProxyManager: manages a single proxy passed via the backwards-compatible proxy fields in the configuration:

    http.proxy.host
    http.proxy.port
    http.proxy.type
    http.proxy.user (optional)
    http.proxy.pass (optional)

  • MultiProxyManager: manages multiple proxies passed through a TXT file. The file should contain connection strings for all proxies, including the protocol and authentication (if needed). Comment lines (starting with // or #) and empty lines are supported. The file path is passed via the config field below, and the TXT file must be available to all nodes participating in the topology:

    http.proxy.file
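To illustrate the kind of file MultiProxyManager expects, the sketch below skips comment and empty lines and parses the remaining lines as URIs. The scheme://user:pass@host:port layout of the sample lines is an assumption, ProxyFileSketch is a hypothetical class, and the actual parsing in StormCrawler may be stricter or accept other forms.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for a proxy list; not StormCrawler's own parser.
class ProxyFileSketch {

    // Takes the lines of the file (e.g. from Files.readAllLines) and keeps
    // only those that hold a connection string.
    static List<URI> parse(List<String> lines) {
        List<URI> proxies = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("//")) {
                continue; // comment or empty line
            }
            proxies.add(URI.create(line));
        }
        return proxies;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "# datacenter proxies",
                "",
                "// assumed format: scheme://user:pass@host:port",
                "http://user:pass@proxy.example.org:8080",
                "socks5://10.0.0.1:1080");
        for (URI proxy : parse(lines)) {
            System.out.println(proxy.getScheme() + " " + proxy.getHost() + " " + proxy.getPort());
        }
    }
}
```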

The MultiProxyManager load balances across proxies using one of the following schemes. The load balancing scheme can be passed via the config using http.proxy.rotation; the default value is ROUND_ROBIN:

  • ROUND_ROBIN: evenly distributes load across all proxies.

  • RANDOM: randomly selects proxies using the native Java random number generator. The generator is seeded with the current time in nanoseconds at instantiation.

  • LEAST_USED: selects the proxy with the lowest usage count. This is performed lazily for speed and therefore does not account for changes to usage counts during the selection process. Unless a custom implementation alters the usage counts, this should in theory behave the same as ROUND_ROBIN.

The SCProxy class contains all of the information associated with a proxy connection. In addition, it tracks the total usage of the proxy and optionally tracks the location of the proxy IP. Usage information is used for the LEAST_USED load balancing scheme. The location information is currently unused but is kept so that custom implementations can select proxies by location.
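The relationship between ROUND_ROBIN and LEAST_USED can be seen in a small model. The sketch below is not the MultiProxyManager code: RotationSketch and its nested Proxy class are hypothetical stand-ins that keep only a label and a usage counter, and the code is not thread-safe.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, single-threaded model of two rotation schemes.
class RotationSketch {

    // Stand-in for SCProxy: a label plus a usage counter.
    static final class Proxy {
        final String label;
        int usage;

        Proxy(String label) {
            this.label = label;
        }
    }

    private final List<Proxy> proxies = new ArrayList<>();
    private int next = 0;

    RotationSketch(List<String> labels) {
        for (String label : labels) {
            proxies.add(new Proxy(label));
        }
    }

    // Hands out proxies in list order, wrapping around at the end.
    Proxy roundRobin() {
        Proxy chosen = proxies.get(next);
        next = (next + 1) % proxies.size();
        chosen.usage++;
        return chosen;
    }

    // Hands out the proxy with the smallest usage count (first one wins ties).
    Proxy leastUsed() {
        Proxy chosen = proxies.get(0);
        for (Proxy candidate : proxies) {
            if (candidate.usage < chosen.usage) {
                chosen = candidate;
            }
        }
        chosen.usage++;
        return chosen;
    }

    public static void main(String[] args) {
        RotationSketch rotation = new RotationSketch(List.of("a", "b", "c"));
        for (int i = 0; i < 4; i++) {
            System.out.println(rotation.leastUsed().label);
        }
    }
}
```

When nothing else changes the usage counters, leastUsed in this model visits the proxies in the same a, b, c, a order as roundRobin, which matches the note on LEAST_USED above.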

Metadata

Registering Metadata for Kryo Serialization

If your Apache StormCrawler topology doesn’t extend org.apache.stormcrawler.ConfigurableTopology, you will need to manually register StormCrawler’s Metadata class for serialization in Storm. For more information on Kryo serialization in Apache Storm, you can refer to the documentation.

To register Metadata for serialization, you’ll need to import org.apache.storm.Config and org.apache.stormcrawler.Metadata. Then, in your topology class, register the class with:

Config.registerSerialization(conf, Metadata.class);

where conf is your Storm configuration for the topology.

Alternatively, you can specify in the configuration file:

topology.kryo.register:
  - org.apache.stormcrawler.Metadata

MetadataTransfer

The class MetadataTransfer is an important part of the framework and is used in key parts of a pipeline:

  • Fetching

  • Parsing

  • Updating bolts

An instance (or extension) of MetadataTransfer gets created and configured with the method:

public static MetadataTransfer getInstance(Map<String, Object> conf)

which takes the standard Storm configuration as its parameter.

A MetadataTransfer instance has mainly two methods, both returning Metadata objects:

  • getMetaForOutlink(String targetURL, String sourceURL, Metadata parentMD)

  • filter(Metadata metadata)

The former is used when creating Outlinks, i.e., in the parsing bolts but also for handling redirections in the FetcherBolt(s).

The latter is used by extensions of the AbstractStatusUpdaterBolt class to determine which Metadata should be persisted.

The behavior of the default MetadataTransfer class is driven by configuration only. It has the following options:

  • metadata.transfer: list of metadata key values to filter or transfer to the outlinks. Please see the corresponding comments in crawler-default.yaml.

  • metadata.persist: list of metadata key values to persist in the status storage. Please see the corresponding comments in crawler-default.yaml.

  • metadata.track.path: whether to track the URL path or not. Boolean value, true by default.

  • metadata.track.depth: whether to track the depth from seed. Boolean value, true by default.

Note that the method getMetaForOutlink calls filter to determine which key values to keep.
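The effect of these options on an outlink can be pictured with a simplified model. The sketch below does not use the StormCrawler classes: OutlinkMetadataSketch is hypothetical, metadata is reduced to a single-valued map, and the key names "url.path" and "depth" as well as the space-separated path format are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Simplified model of what an outlink inherits from its parent page.
class OutlinkMetadataSketch {

    static Map<String, String> forOutlink(String sourceUrl, Map<String, String> parent,
            Set<String> transferKeys, boolean trackPath, boolean trackDepth) {
        Map<String, String> outlink = new HashMap<>();
        // Only the keys listed for transfer are copied from the parent.
        for (String key : transferKeys) {
            if (parent.containsKey(key)) {
                outlink.put(key, parent.get(key));
            }
        }
        if (trackPath) {
            // Append the source URL to the path followed so far (assumed key name).
            String previous = parent.get("url.path");
            outlink.put("url.path", previous == null ? sourceUrl : previous + " " + sourceUrl);
        }
        if (trackDepth) {
            // A seed without a depth value is treated as depth 0 (assumed key name).
            int depth = Integer.parseInt(parent.getOrDefault("depth", "0"));
            outlink.put("depth", Integer.toString(depth + 1));
        }
        return outlink;
    }

    public static void main(String[] args) {
        Map<String, String> parent = new HashMap<>();
        parent.put("lang", "en");
        parent.put("internal.note", "not transferred");
        parent.put("depth", "1");
        System.out.println(forOutlink("https://example.org/", parent, Set.of("lang"), true, true));
    }
}
```

In this model the outlink keeps lang, gains a depth of 2 and a path containing the source URL, and drops internal.note because that key is not listed for transfer.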

Configuration Options

The following tables describe all available configuration options and their default values. If one of the keys is not present in your YAML file, the default value will be taken.

Note: Some configuration options may not be applicable depending on the specific components and features you are using in your Apache StormCrawler topology. Some external modules might define additional options not listed here.

Fetching and Partitioning

key | default value | description
fetcher.max.crawl.delay | 30 | The maximum delay, in seconds, that will be accepted from Crawl-delay directives in robots.txt files. If the crawl-delay exceeds this value, the behavior depends on the value of fetcher.max.crawl.delay.force.
fetcher.max.crawl.delay.force | false | Configures the behavior of the fetcher if the robots.txt crawl-delay exceeds fetcher.max.crawl.delay. If false: the tuple is emitted to the StatusStream as an ERROR. If true: the queue delay is set to fetcher.max.crawl.delay.
fetcher.max.queue.size | -1 | The maximum length of the queue used to store items to be fetched by the FetcherBolt. A setting of -1 sets the length to Integer.MAX_VALUE.
fetcher.max.throttle.sleep | -1 | The maximum amount of time to wait between fetches; if exceeded, the item is sent to the back of the queue. Used in SimpleFetcherBolt. -1 disables it.
fetcher.max.urls.in.queues | -1 | Limits the number of URLs that can be stored in a fetch queue. -1 disables the limit.
fetcher.maxThreads.host/domain/ip | fetcher.threads.per.queue | Overwrites fetcher.threads.per.queue. Useful for crawling some domains/hosts/IPs more intensively.
fetcher.metrics.time.bucket.secs | 10 | Metrics events are emitted to the system stream every value seconds.
fetcher.queue.mode | byHost | Possible values: byHost, byDomain, byIP. Determines queue grouping.
fetcher.server.delay | 1 | Delay between crawls in the same queue if no Crawl-delay is defined.
fetcher.server.delay.force | false | Defines fetcher behavior when the robots.txt crawl-delay is smaller than fetcher.server.delay.
fetcher.server.min.delay | 0 | Delay between crawls for queues with >1 thread. Ignores robots.txt.
fetcher.threads.number | 10 | Total concurrent threads fetching pages. Adjust carefully based on system capacity.
fetcher.threads.per.queue | 1 | Default number of threads per queue. Can be overridden.
fetcher.timeout.queue | -1 | Maximum wait time (seconds) for items in the queue. -1 disables timeout.
fetcherbolt.queue.debug.filepath | "" | Path to a debug log (e.g. /tmp/fetcher-dump-{port}).
http.agent.description | - | Description for the User-Agent header.
http.agent.email | - | Email address in User-Agent header.
http.agent.name | - | Name in User-Agent header.
http.agent.url | - | URL in User-Agent header.
http.agent.version | - | Version in User-Agent header.
http.basicauth.password | - | Password for http.basicauth.user.
http.basicauth.user | - | Username for Basic Authentication.
http.content.limit | -1 | Maximum HTTP response body size (bytes). Default: no limit.
http.protocol.implementation | org.apache.stormcrawler.protocol.httpclient.HttpProtocol | HTTP Protocol implementation.
http.proxy.host | - | HTTP proxy host.
http.proxy.pass | - | Proxy password.
http.proxy.port | 8080 | Proxy port.
http.proxy.user | - | Proxy username.
http.robots.403.allow | true | Defines behavior when robots.txt returns HTTP 403.
http.robots.agents | '' | Additional user-agent strings for interpreting robots.txt.
http.robots.file.skip | false | Ignore robots.txt rules (1.17+).
http.skip.robots | false | Deprecated (replaced by http.robots.file.skip).
http.store.headers | false | Whether to store response headers.
http.store.responsetime | true | Store response time in Metadata (not yet implemented).
http.timeout | 10000 | Connection timeout (ms).
http.use.cookies | false | Use cookies in subsequent requests.
https.protocol.implementation | org.apache.stormcrawler.protocol.httpclient.HttpProtocol | HTTPS Protocol implementation.
partition.url.mode | byHost | Defines how URLs are partitioned: byHost, byDomain, or byIP.
protocols | http,https | Supported protocols.
redirections.allowed | true | Allow URL redirects.
sitemap.discovery | false | Enable automatic sitemap discovery.
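
The interplay between fetcher.max.crawl.delay and fetcher.max.crawl.delay.force described above can be summarised in a few lines. This is a sketch of the documented behavior only, not the FetcherBolt code; CrawlDelaySketch is a hypothetical class and the use of -1 as an "emit ERROR" signal is a convention chosen for this example.

```java
// Models the documented decision for a Crawl-delay found in robots.txt.
class CrawlDelaySketch {

    // Signals that the URL would be emitted to the status stream as an ERROR.
    static final long REJECT = -1;

    // Returns the delay in seconds to apply to the queue, or REJECT.
    static long resolve(long robotsDelaySecs, long maxCrawlDelaySecs, boolean force) {
        if (robotsDelaySecs <= maxCrawlDelaySecs) {
            return robotsDelaySecs; // within the accepted range, used as is
        }
        // Too long: either cap it at the maximum or give up on the URL.
        return force ? maxCrawlDelaySecs : REJECT;
    }

    public static void main(String[] args) {
        System.out.println(resolve(10, 30, false)); // accepted as is
        System.out.println(resolve(60, 30, false)); // rejected
        System.out.println(resolve(60, 30, true));  // capped at the maximum
    }
}
```

With the default maximum of 30 seconds, a Crawl-delay of 60 is rejected when force is false and capped at 30 when force is true.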

Protocol

key | default value | description
cacheConfigParamName | maximumSize=10000,expireAfterWrite=6h | CacheBuilder configuration for robots cache.
errorcacheConfigParamName | maximumSize=10000,expireAfterWrite=1h | CacheBuilder configuration for error cache.
file.encoding | UTF-8 | Encoding for FileProtocol.
http.custom.headers | - | Custom HTTP headers.
http.accept | - | HTTP Accept header.
http.accept.language | - | HTTP Accept-Language header.
http.content.partial.as.trimmed | false | Accepts partially fetched content in OKHTTP.
http.trust.everything | true | If true, trust all SSL/TLS connections.
navigationfilters.config.file | - | JSON config for NavigationFilter. See blog post.
selenium.addresses | - | WebDriver server addresses.
selenium.capabilities | - | Desired WebDriver capabilities.
selenium.delegated.protocol | - | Delegated protocol for selective Selenium usage.
selenium.implicitlyWait | 0 | WebDriver element search timeout.
selenium.instances.num | 1 | Number of instances per WebDriver connection.
selenium.pageLoadTimeout | 0 | WebDriver page load timeout.
selenium.setScriptTimeout | 0 | WebDriver script execution timeout.
topology.message.timeout.secs | -1 | OKHTTP message timeout.

Indexing

The values below are used by sub-classes of AbstractIndexerBolt.

key | default value | description
indexer.md.filter | - | YAML list of key=value filters for metadata-based indexing.
indexer.md.mapping | - | YAML mapping from metadata fields to persistence layer fields.
indexer.text.fieldname | - | Field name for indexed HTML body text.
indexer.url.fieldname | - | Field name for indexed URL.

Status Persistence

This refers to persisting the status of a URL (e.g. ERROR, DISCOVERED etc.) along with something like a nextFetchDate that is calculated by a Scheduler.

key | default value | description
fetchInterval.default | 1440 | Default revisit interval (minutes). Used by DefaultScheduler.
fetchInterval.error | 44640 | Revisit interval for error pages (minutes).
fetchInterval.fetch.error | 120 | Revisit interval for fetch errors (minutes).
status.updater.cache.spec | maximumSize=10000, expireAfterAccess=1h | Cache specification.
status.updater.use.cache | true | Whether to use cache to avoid re-persisting URLs.

Parsing

Configures parsing of fetched text and the handling of discovered URIs.

key | default value | description
collections.file | collections.json | Config for CollectionTagger.
collections.key | collections | Key under which tags are stored in metadata.
feed.filter.hours.since.published | -1 | Discard feeds older than value hours.
feed.sniffContent | false | Try to detect feeds automatically.
parsefilters.config.file | parsefilters.json | Path to JSON config defining ParseFilters. See example.
parser.emitOutlinks | true | Emit discovered links as DISCOVERED tuples.
parser.emitOutlinks.max.per.page | -1 | Limit number of emitted links per page.
textextractor.exclude.tags | "" | HTML tags ignored by TextExtractor.
textextractor.include.pattern | "" | Regex patterns to include for TextExtractor.
textextractor.no.text | false | Disable text extraction entirely.
track.anchors | true | Add anchor text to outlink metadata.
urlfilters.config.file | urlfilters.json | JSON file defining URL filters. See default.

Metadata

Options on how StormCrawler should handle metadata tracking as well as minimising metadata clashes.

key | default value | description
metadata.persist | - | Metadata to persist but not transfer to outlinks.
metadata.track.depth | true | Track crawl depth of URLs.
metadata.track.path | true | Track URL path history in metadata.
metadata.transfer | - | Metadata to transfer to outlinks.
metadata.transfer.class | org.apache.stormcrawler.util.MetadataTransfer | Class handling metadata transfer.
protocol.md.prefix | - | Prefix for remote metadata keys to avoid collisions.