
BIGTOP-284 Integrate Apache Nutch into the Apache Bigtop ecosystem#1380

Draft
lewismc wants to merge 25 commits into apache:master from lewismc:BIGTOP-284

Conversation

lewismc (Member) commented Feb 28, 2026

Description of PR

BIGTOP-284 seeks to introduce Apache Nutch smoke tests into the Bigtop ecosystem. I commented on the original ticket back in 2011 and never followed up; this PR addresses that.
Nutch is a highly extensible, highly scalable, mature, production-ready web crawler that enables fine-grained configuration and accommodates a wide variety of data acquisition tasks. Because Nutch is built on Apache Hadoop data structures, it excels at batch processing of large data volumes via MapReduce jobs, but it can also be tailored to smaller jobs.

How was this patch tested?

Testing is ongoing. The goal is for the Nutch community to test this patch and hopefully update this thread with feedback. More details to follow.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'BIGTOP-3638. Your PR title ...')?
  • Make sure that newly added files do not have any licensing issues. When in doubt refer to https://www.apache.org/licenses/

lewismc (Member, Author) commented Feb 28, 2026

Testing the Nutch integration

This guidance is intended for peer reviewers interested in the Apache Nutch integration in Bigtop. Nutch is built from source with Ant (ant runtime), packaged using runtime/deploy for Hadoop cluster execution, and all smoke tests run against a Hadoop cluster using HDFS.

Prerequisites

  • Around 20 GB of free disk space (to be safe)
sudo apt update && sudo apt upgrade && sudo apt install zip openjdk-11-jdk
# Add Docker's official GPG key:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
EOF

sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-compose ruby
sudo usermod -aG docker $USER

Log out and back in (or restart the session) so the docker group membership takes effect, then:

git clone https://github.com/lewismc/bigtop.git && cd bigtop && git checkout -b BIGTOP-284 && git pull origin BIGTOP-284
  • Hadoop cluster – Smoke tests require a running cluster (HDFS and YARN). They use HADOOP_CONF_DIR and will not run without it.
  • x86_64 Linux – Building Nutch packages via nutch-pkg-ind uses the Bigtop Docker slave image, which is only published for x86_64. On Apple Silicon (arm64), the build script uses --platform linux/amd64; running amd64 containers under emulation can fail with "exec format error", so building packages is most reliable on native x86_64 Linux.
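Before kicking off the build, a quick pre-flight check can save a failed multi-gigabyte build. A minimal sketch, assuming a POSIX shell with df and awk available; the 20 GB figure and the x86_64 requirement come from the guidance above, and the list of checked commands is an assumption based on the install steps:

```shell
# Pre-flight sanity check before building Nutch packages (sketch; adjust the
# command list to your environment).
ARCH=$(uname -m)
[ "$ARCH" = "x86_64" ] || echo "warning: $ARCH is not x86_64; expect emulation issues"

# Commands the build and provisioning steps above rely on
for cmd in java git docker zip; do
  command -v "$cmd" >/dev/null 2>&1 || echo "missing prerequisite: $cmd"
done

# Roughly 20 GB free on the current filesystem, measured in 1K blocks
FREE_KB=$(df -Pk . | awk 'NR==2 {print $4}')
[ "$FREE_KB" -ge $((20 * 1024 * 1024)) ] || echo "warning: less than 20 GB free"
```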

1. Build the Nutch package

From the Bigtop repo root:

./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged"

To build Nutch and its dependencies (e.g. Hadoop) in Docker:

./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged" -Dbuildwithdeps=true

Output appears under build/nutch/ and output/nutch/. The installed Nutch uses runtime/deploy (uber jar and scripts that run via hadoop jar on the cluster).
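A quick way to confirm the build actually produced artifacts is to list the package output directory. A minimal sketch; the exact package file names are an assumption, since they come from the Bigtop packaging templates:

```shell
# List built Nutch packages, if any (the directory follows the build output
# locations mentioned above; file patterns are assumptions).
OUT_DIR=output/nutch
if [ -d "$OUT_DIR" ]; then
  find "$OUT_DIR" -name '*.deb' -o -name '*.rpm'
else
  echo "$OUT_DIR not found; run the nutch-pkg-ind build first"
fi
```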

2. Run the Nutch smoke tests

Smoke tests require a Hadoop cluster: they use HDFS for seed URLs, crawldb, and segments, and they expect HADOOP_CONF_DIR to be set.

On a host where Nutch and Hadoop are already installed

Set the environment and run the Nutch smoke tests:

export JAVA_HOME=/path/to/jdk
export HADOOP_CONF_DIR=/etc/hadoop/conf   # or your cluster's conf dir
./gradlew bigtop-tests:smoke-tests:nutch:test -Psmoke.tests

Or from the smoke-tests directory:

cd bigtop-tests/smoke-tests
../../gradlew nutch:test -Psmoke.tests

Tests run in order: usage, inject subcommand, inject + readdb on HDFS, then generate on HDFS. Cleanup removes /user/root/nutch-smoke from HDFS.
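For reviewers who want to reproduce the same sequence by hand on a cluster client node, a hedged sketch follows; the seed URL is illustrative and the presence of nutch and hdfs on PATH is an assumption, while the HDFS paths match the ones the tests use:

```shell
# Hand-run sketch of the flow the smoke tests automate (assumptions: nutch
# and hdfs launchers on PATH; seed URL is illustrative).
BASE=/user/root/nutch-smoke
if command -v nutch >/dev/null 2>&1 && command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p "$BASE/urls"
  echo "https://nutch.apache.org/" | hdfs dfs -put - "$BASE/urls/seed.txt"
  nutch inject   "$BASE/crawldb" "$BASE/urls"      # seed the crawldb
  nutch readdb   "$BASE/crawldb" -stats            # expect TOTAL urls >= 1
  nutch generate "$BASE/crawldb" "$BASE/segments"  # creates a timestamped segment
  hdfs dfs -rm -r "$BASE"                          # same cleanup the tests do
else
  echo "nutch/hdfs not on PATH; run this on a cluster client node"
fi
```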

Via Docker provisioner (full stack + smoke)

  1. Build packages (with deps if needed) and enable the local repo in provisioner/docker/config.yaml:

    • enable_local_repo: true
    • nutch is already in components and smoke_test_components.
  2. From provisioner/docker/:

    ./docker-hadoop.sh --create 3 --smoke-tests

    This provisions a cluster (including Nutch), then runs all smoke tests (including Nutch). Ensure the provisioner has enough resources and that the Nutch packages are present in the local repo (e.g. under output/apt or equivalent).
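As a sketch, the relevant config.yaml fragment would look like the following; only the keys mentioned above are shown, and the component list is illustrative:

```yaml
# Illustrative fragment of provisioner/docker/config.yaml; other keys keep
# their defaults.
enable_local_repo: true
components: [hdfs, yarn, mapreduce, nutch]
smoke_test_components: [nutch]
```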

3. Deploy Nutch with Puppet

To deploy Nutch on a Bigtop-managed cluster, include nutch in the cluster components (e.g. in Hiera or site.yaml):

hadoop_cluster_node::cluster_components:
  - hdfs
  - yarn
  - mapreduce
  - nutch

Nodes that receive the nutch-client role will have the Nutch package installed and /etc/default/nutch configured with NUTCH_HOME, NUTCH_CONF_DIR, and HADOOP_CONF_DIR. Run crawl commands (e.g. nutch inject, nutch generate) from a gateway/client node against HDFS paths.
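For reference, /etc/default/nutch would end up looking something like the sketch below; the concrete paths are assumptions based on common Bigtop packaging conventions, not confirmed from this PR:

```shell
# Hypothetical /etc/default/nutch as laid down by the nutch-client role
# (paths are assumptions following Bigtop's /usr/lib + /etc/<name>/conf layout)
export NUTCH_HOME=/usr/lib/nutch
export NUTCH_CONF_DIR=/etc/nutch/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
```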

4. Quick sanity checks (no cluster)

Without a cluster you can still confirm that the test project loads and compiles:

./gradlew bigtop-tests:smoke-tests:nutch:tasks
./gradlew bigtop-tests:smoke-tests:nutch:compileTestGroovy

The full test suite will not pass without HADOOP_CONF_DIR and a running cluster.

5. What the smoke tests do

  • testNutchUsage: runs nutch with no arguments; expects exit 0 and usage output.
  • testNutchInjectSubcommand: runs nutch inject with no arguments; expects a non-zero exit and a usage/error message.
  • testNutchInjectAndReaddb: creates /user/root/nutch-smoke/urls/seed.txt on HDFS, runs nutch inject and nutch readdb -stats against HDFS paths, and asserts on the stats output.
  • testNutchGenerate: runs nutch generate with HDFS crawldb and segments paths, then verifies at least one segment exists under the segments directory.

All tests use the deploy runtime (cluster mode) and HDFS only; there are no local-mode or /tmp-based crawl directories.

lewismc (Member, Author) commented Mar 1, 2026

Testing based on the above guidance.

BUILD SUCCESSFUL in 10m 25s
1 actionable task: 1 executed

OS information

Distributor ID:	Ubuntu
Description:	Ubuntu 24.04.4 LTS
Release:	24.04
Codename:	noble

@lewismc lewismc marked this pull request as draft March 1, 2026 20:10
