Apache StormCrawler (Incubating) Migration Guide

Introduction

This guide provides step-by-step instructions for migrating your project from pre-Apache versions of StormCrawler to Apache StormCrawler. The key changes are the new Maven group and artifact IDs and the removal of the Elasticsearch module.

Group ID and Artifact ID Changes

Group ID

The group ID has changed from com.digitalpebble.stormcrawler to org.apache.stormcrawler. This change reflects the project's transition to the Apache Software Foundation.

Artifact ID

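The artifact ID has changed from storm-crawler to stormcrawler. The same change applies to module artifact IDs that begin with this prefix; for example, storm-crawler-core becomes stormcrawler-core.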

Maven Configuration

Update your pom.xml to reflect these changes. Below is an example of the updated dependency configuration:

Old Configuration:

<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler</artifactId>
    <version>OLD_VERSION</version>
</dependency>

New Configuration:

<dependency>
    <groupId>org.apache.stormcrawler</groupId>
    <artifactId>stormcrawler</artifactId>
    <version>NEW_VERSION</version>
</dependency>

Replace OLD_VERSION with the version you are currently using and NEW_VERSION with the latest version of Apache StormCrawler.
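
For projects with many POM files, the coordinate change can be scripted. The following Python sketch is one possible approach, not an official tool: the function name migrate_pom is hypothetical, it works on the text of a pom.xml rather than parsing the XML, and it assumes that every StormCrawler artifact ID begins with storm-crawler.

```python
import re

OLD_GROUP = "com.digitalpebble.stormcrawler"
NEW_GROUP = "org.apache.stormcrawler"


def migrate_pom(text, new_version):
    """Return pom.xml text with StormCrawler dependencies moved to the Apache coordinates."""

    def rewrite(match):
        block = match.group(0)
        if f"<groupId>{OLD_GROUP}</groupId>" not in block:
            return block  # not a StormCrawler dependency, leave it unchanged
        block = block.replace(OLD_GROUP, NEW_GROUP)
        # Assumption: module artifact IDs share the storm-crawler prefix.
        block = block.replace("<artifactId>storm-crawler", "<artifactId>stormcrawler")
        # Replace literal versions only; ${property} references are kept as they are.
        block = re.sub(
            r"<version>(?!\$\{)[^<]*</version>",
            lambda _: f"<version>{new_version}</version>",
            block,
        )
        return block

    # Visit each <dependency> element in turn (non-greedy, spanning lines).
    return re.sub(r"<dependency>.*?</dependency>", rewrite, text, flags=re.DOTALL)


sample = """<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler</artifactId>
    <version>OLD_VERSION</version>
</dependency>"""

print(migrate_pom(sample, "NEW_VERSION"))
```

To apply it to a real project, read each pom.xml, pass its contents to migrate_pom, write the result back, and review the diff before committing. A dependency on the Elasticsearch module is renamed like any other, so it still needs to be handled as described in the section on the removal of the Elasticsearch module.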

Removal of Elasticsearch Module

The Elasticsearch module has been removed in the latest version of StormCrawler. You have two options to handle this change:

  1. Fork the Elasticsearch Module: You can fork the Elasticsearch module from the older version of StormCrawler and maintain it independently.
  2. Migrate to OpenSearch Module: Alternatively, you can migrate your code to use the OpenSearch module provided by Apache StormCrawler.

Forking the Elasticsearch Module

  1. Clone the repository containing the last version of StormCrawler that includes the Elasticsearch module.
  2. Copy the Elasticsearch module into your project's repository.
  3. Update your project's dependencies to include this local version of the Elasticsearch module.

Migrating to OpenSearch Module

If you choose to migrate to the OpenSearch module, follow these steps:

  1. Add the OpenSearch dependency to your pom.xml:
<dependency>
    <groupId>org.apache.stormcrawler</groupId>
    <artifactId>stormcrawler-opensearch</artifactId>
    <version>NEW_VERSION</version>
</dependency>
  2. Update your code to replace any references to the Elasticsearch module with the corresponding OpenSearch module classes and methods. The API for OpenSearch is similar to Elasticsearch, so the changes should be straightforward.
  3. Test your application thoroughly to ensure that the migration does not introduce any issues.
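
Before replacing references by hand, it can help to list where they occur. The following Python sketch is a simple heuristic, not part of StormCrawler: the function name find_elasticsearch_references and the set of file suffixes are assumptions, and it flags any line that contains the word "elasticsearch" in any letter case.

```python
from pathlib import Path

# Assumption: these suffixes cover the sources and configuration files
# that may mention the Elasticsearch module in a typical project.
SUFFIXES = {".java", ".xml", ".yaml", ".yml", ".flux", ".properties"}


def find_elasticsearch_references(root):
    """Return (path, line number, line) for each line that mentions Elasticsearch."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if "elasticsearch" in line.lower():
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling find_elasticsearch_references(".") from the project root returns the list of matches. Each one marks a class, import, or configuration key to compare against the OpenSearch module.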

Summary

By following the steps outlined in this guide, you should be able to migrate your project to the latest version of Apache StormCrawler. Update your Maven dependencies to the new group and artifact IDs, and handle the removal of the Elasticsearch module by either forking it or migrating to the OpenSearch module.

For any further assistance, reach out to the community via mailing lists or GitHub discussions.