
BIGTOP-284 Integrate Apache Nutch into the Apache Bigtop ecosystem#1380

Draft
lewismc wants to merge 25 commits into apache:master from lewismc:BIGTOP-284

Conversation

lewismc (Member) commented Feb 28, 2026

Description of PR

BIGTOP-284 seeks to introduce Apache Nutch smoke tests into the Bigtop ecosystem. I commented on the original ticket back in 2011 and never followed up; this PR addresses that.
Nutch is a highly extensible, highly scalable, mature, production-ready web crawler that enables fine-grained configuration and accommodates a wide variety of data acquisition tasks. Because Nutch is built on Apache Hadoop data structures, it excels at batch processing of large data volumes via MapReduce jobs, but it can also be tailored to smaller jobs.

How was this patch tested?

Testing is ongoing. The goal is for the Nutch community to test this patch and hopefully update this thread with feedback. More details to follow.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'BIGTOP-3638. Your PR title ...')?
  • Make sure that newly added files do not have any licensing issues. When in doubt refer to https://www.apache.org/licenses/

lewismc (Member, Author) commented Feb 28, 2026

Testing the Nutch integration

This guidance is intended for peer reviewers interested in the Apache Nutch integration in Bigtop. Nutch is built from source with Ant (ant runtime), packaged using runtime/deploy for Hadoop cluster execution, and all smoke tests run against a Hadoop cluster using HDFS.

Prerequisites

  • Around 20 GB of free disk space (to be safe)
sudo apt update && sudo apt upgrade && sudo apt install zip openjdk-11-jdk
# Add Docker's official GPG key:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
EOF

sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-compose ruby
sudo usermod -aG docker $USER

Log out and back in (or restart the session) so the docker group membership takes effect, then:

git clone https://github.com/lewismc/bigtop.git && cd bigtop && git checkout -b BIGTOP-284 && git pull origin BIGTOP-284
  • Hadoop cluster – Smoke tests require a running cluster (HDFS and YARN). They use HADOOP_CONF_DIR and will not run without it.
  • x86_64 Linux – Building Nutch packages via nutch-pkg-ind uses the Bigtop Docker slave image, which is only published for x86_64. On Apple Silicon (arm64), the build script uses --platform linux/amd64; running amd64 containers under emulation can fail with "exec format error", so building packages is most reliable on native x86_64 Linux.
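Before kicking off the build, a quick pre-flight check can save a failed multi-gigabyte build. A minimal sketch, assuming a POSIX shell with df and awk available; the 20 GB figure and the x86_64 requirement come from the guidance above, and the list of checked commands is an assumption based on the install steps:

```shell
# Pre-flight sanity check before building Nutch packages (sketch; adjust the
# command list to your environment).
ARCH=$(uname -m)
[ "$ARCH" = "x86_64" ] || echo "warning: $ARCH is not x86_64; expect emulation issues"

# Commands the build and provisioning steps above rely on
for cmd in java git docker zip; do
  command -v "$cmd" >/dev/null 2>&1 || echo "missing prerequisite: $cmd"
done

# Roughly 20 GB free on the current filesystem, measured in 1K blocks
FREE_KB=$(df -Pk . | awk 'NR==2 {print $4}')
[ "$FREE_KB" -ge $((20 * 1024 * 1024)) ] || echo "warning: less than 20 GB free"
```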

1. Build the Nutch package

From the Bigtop repo root:

./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged"

To build Nutch and its dependencies (e.g. Hadoop) in Docker:

./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged" -Dbuildwithdeps=true

Output appears under build/nutch/ and output/nutch/. The installed Nutch uses runtime/deploy (uber jar and scripts that run via hadoop jar on the cluster).
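A quick way to confirm the build actually produced artifacts is to list the package output directory. A minimal sketch; the exact package file names are an assumption, since they come from the Bigtop packaging templates:

```shell
# List built Nutch packages, if any (the directory follows the build output
# locations mentioned above; file patterns are assumptions).
OUT_DIR=output/nutch
if [ -d "$OUT_DIR" ]; then
  find "$OUT_DIR" -name '*.deb' -o -name '*.rpm'
else
  echo "$OUT_DIR not found; run the nutch-pkg-ind build first"
fi
```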

2. Run the Nutch smoke tests

Smoke tests require a Hadoop cluster: they use HDFS for seed URLs, crawldb, and segments, and they expect HADOOP_CONF_DIR to be set.

On a host where Nutch and Hadoop are already installed

Set the environment and run the Nutch smoke tests:

export JAVA_HOME=/path/to/jdk
export HADOOP_CONF_DIR=/etc/hadoop/conf   # or your cluster's conf dir
./gradlew bigtop-tests:smoke-tests:nutch:test -Psmoke.tests

Or from the smoke-tests directory:

cd bigtop-tests/smoke-tests
../../gradlew nutch:test -Psmoke.tests

Tests run in order: usage, inject subcommand, inject + readdb on HDFS, then generate on HDFS. Cleanup removes /user/root/nutch-smoke from HDFS.
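For reviewers who want to reproduce the same sequence by hand on a cluster client node, a hedged sketch follows; the seed URL is illustrative and the presence of nutch and hdfs on PATH is an assumption, while the HDFS paths match the ones the tests use:

```shell
# Hand-run sketch of the flow the smoke tests automate (assumptions: nutch
# and hdfs launchers on PATH; seed URL is illustrative).
BASE=/user/root/nutch-smoke
if command -v nutch >/dev/null 2>&1 && command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p "$BASE/urls"
  echo "https://nutch.apache.org/" | hdfs dfs -put - "$BASE/urls/seed.txt"
  nutch inject   "$BASE/crawldb" "$BASE/urls"      # seed the crawldb
  nutch readdb   "$BASE/crawldb" -stats            # expect TOTAL urls >= 1
  nutch generate "$BASE/crawldb" "$BASE/segments"  # creates a timestamped segment
  hdfs dfs -rm -r "$BASE"                          # same cleanup the tests do
else
  echo "nutch/hdfs not on PATH; run this on a cluster client node"
fi
```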

Via Docker provisioner (full stack + smoke)

  1. Build packages (with deps if needed) and enable the local repo in provisioner/docker/config.yaml:

    • enable_local_repo: true
    • nutch is already in components and smoke_test_components.
  2. From provisioner/docker/:

    ./docker-hadoop.sh --create 3 --smoke-tests

    This provisions a cluster (including Nutch), then runs all smoke tests (including Nutch). Ensure the provisioner has enough resources and that the Nutch packages are present in the local repo (e.g. under output/apt or equivalent).
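As a sketch, the relevant config.yaml fragment would look like the following; only the keys mentioned above are shown, and the component list is illustrative:

```yaml
# Illustrative fragment of provisioner/docker/config.yaml; other keys keep
# their defaults.
enable_local_repo: true
components: [hdfs, yarn, mapreduce, nutch]
smoke_test_components: [nutch]
```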

3. Deploy Nutch with Puppet

To deploy Nutch on a Bigtop-managed cluster, include nutch in the cluster components (e.g. in Hiera or site.yaml):

hadoop_cluster_node::cluster_components:
  - hdfs
  - yarn
  - mapreduce
  - nutch

Nodes that receive the nutch-client role will have the Nutch package installed and /etc/default/nutch configured with NUTCH_HOME, NUTCH_CONF_DIR, and HADOOP_CONF_DIR. Run crawl commands (e.g. nutch inject, nutch generate) from a gateway/client node against HDFS paths.
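For reference, /etc/default/nutch would end up looking something like the sketch below; the concrete paths are assumptions based on common Bigtop packaging conventions, not confirmed from this PR:

```shell
# Hypothetical /etc/default/nutch as laid down by the nutch-client role
# (paths are assumptions following Bigtop's /usr/lib + /etc/<name>/conf layout)
export NUTCH_HOME=/usr/lib/nutch
export NUTCH_CONF_DIR=/etc/nutch/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
```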

4. Quick sanity checks (no cluster)

Without a cluster you can still confirm that the test project loads and compiles:

./gradlew bigtop-tests:smoke-tests:nutch:tasks
./gradlew bigtop-tests:smoke-tests:nutch:compileTestGroovy

The full test suite will not pass without HADOOP_CONF_DIR and a running cluster.

5. What the smoke tests do

  • testNutchUsage: runs nutch with no arguments; expects exit 0 and usage output.
  • testNutchInjectSubcommand: runs nutch inject with no arguments; expects a non-zero exit and a usage/error message.
  • testNutchInjectAndReaddb: creates /user/root/nutch-smoke/urls/seed.txt on HDFS, runs nutch inject and nutch readdb -stats against HDFS paths, and asserts on the stats output.
  • testNutchGenerate: runs nutch generate with HDFS crawldb and segments paths, then verifies at least one segment exists under the segments directory.

All tests use the deploy runtime (cluster mode) and HDFS only; there are no local-mode or /tmp-based crawl directories.

lewismc (Member, Author) commented Mar 1, 2026

Testing based on the above guidance.

BUILD SUCCESSFUL in 10m 25s
1 actionable task: 1 executed

OS information

Distributor ID:	Ubuntu
Description:	Ubuntu 24.04.4 LTS
Release:	24.04
Codename:	noble

@lewismc lewismc marked this pull request as draft March 1, 2026 20:10
