BIGTOP-284 Integrate Apache Nutch into the Apache Bigtop ecosystem#1380
lewismc wants to merge 25 commits into apache:master
Testing the Nutch integration
This guidance is intended for peer reviewers interested in the Apache Nutch integration in Bigtop. Nutch is built from source with Ant (

Prerequisites
restart/exit session

1. Build the Nutch package
From the Bigtop repo root:
./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged"
To build Nutch and its dependencies (e.g. Hadoop) in Docker:
./gradlew nutch-pkg-ind -POS=ubuntu-22.04 -Pdocker-run-option="--privileged" -Dbuildwithdeps=true
Output appears under

2. Run the Nutch smoke tests
Smoke tests require a Hadoop cluster: they use HDFS for seed URLs, crawldb, and segments, and they expect

On a host where Nutch and Hadoop are already installed
Set the environment and run the Nutch smoke tests:
export JAVA_HOME=/path/to/jdk
export HADOOP_CONF_DIR=/etc/hadoop/conf # or your cluster's conf dir
./gradlew bigtop-tests:smoke-tests:nutch:test -Psmoke.tests
Or from the smoke-tests directory:
cd bigtop-tests/smoke-tests
../../gradlew nutch:test -Psmoke.tests
Tests run in order: usage, inject subcommand, inject + readdb on HDFS, then generate on HDFS. Cleanup removes

Via Docker provisioner (full stack + smoke)

3. Deploy Nutch with Puppet
To deploy Nutch on a Bigtop-managed cluster, include
hadoop_cluster_node::cluster_components:
- hdfs
- yarn
- mapreduce
- nutch
Nodes that receive the

4. Quick sanity checks (no cluster)
Without a cluster you can still confirm that the test project loads and compiles:
./gradlew bigtop-tests:smoke-tests:nutch:tasks
./gradlew bigtop-tests:smoke-tests:nutch:compileTestGroovy
The full test suite will not pass without

5. What the smoke tests do
All tests use the deploy runtime (cluster mode) and HDFS only; there are no local-mode or
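
The order described above (usage, inject subcommand, inject + readdb on HDFS, generate on HDFS, then cleanup) can be approximated by hand with a short script. This is a rough sketch, not the actual test code: the HDFS working directory, the seed file name, the environment variable names, and the assumption that a `nutch` launcher is on the PATH are all hypothetical. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Manual approximation of the smoke-test order described in this PR.
# All paths and variable names below are hypothetical placeholders;
# the real tests may use different locations and invocations.
set -eu

NUTCH_BIN="${NUTCH_BIN:-nutch}"          # assumed: nutch launcher on PATH
BASE="${SMOKE_BASE:-/tmp/nutch-smoke}"   # hypothetical HDFS working directory
SEED_FILE="${SEED_FILE:-seed.txt}"       # hypothetical local file, one URL per line
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands only, 0 = execute

# Print a command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Report any environment variable from section 2 that is unset or empty.
preflight() {
  missing=0
  [ -n "${JAVA_HOME:-}" ] || { echo "missing: JAVA_HOME" >&2; missing=1; }
  [ -n "${HADOOP_CONF_DIR:-}" ] || { echo "missing: HADOOP_CONF_DIR" >&2; missing=1; }
  return "$missing"
}

smoke_sequence() {
  # 1-2. Usage checks; a bare invocation may exit non-zero, so tolerate it.
  run "$NUTCH_BIN" || true
  run "$NUTCH_BIN" inject || true
  # 3. Put seed URLs on HDFS, inject them, then read crawldb statistics.
  run hdfs dfs -mkdir -p "$BASE/urls"
  run hdfs dfs -put -f "$SEED_FILE" "$BASE/urls/"
  run "$NUTCH_BIN" inject "$BASE/crawldb" "$BASE/urls"
  run "$NUTCH_BIN" readdb "$BASE/crawldb" -stats
  # 4. Generate a fetch list (a new segment) on HDFS.
  run "$NUTCH_BIN" generate "$BASE/crawldb" "$BASE/segments"
  # Cleanup: remove the hypothetical working directory.
  run hdfs dfs -rm -r -f "$BASE"
}

# Check the environment only when commands will really run.
if [ "$DRY_RUN" = "0" ]; then
  preflight
fi
smoke_sequence
```

Running the script as-is prints the eight commands in order. Setting `DRY_RUN=0` on a host with Nutch and Hadoop installed would execute them, assuming `JAVA_HOME` and `HADOOP_CONF_DIR` are exported and the seed file exists.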
Testing based on the above guidance. OS information
Description of PR
BIGTOP-284 seeks to introduce Apache Nutch smoke tests into the Bigtop ecosystem. I commented on the original ticket way back in 2011 and never did anything about it. This PR seeks to address that.
Nutch is a highly extensible, highly scalable, mature, production-ready web crawler that enables fine-grained configuration and accommodates a wide variety of data acquisition tasks. Because Nutch relies on Apache Hadoop data structures, it is well suited to batch processing large data volumes via MapReduce jobs, but it can also be tailored to smaller jobs.
How was this patch tested?
Testing is ongoing. The goal is for the Nutch community to test this patch and hopefully update this thread with feedback. More details to follow.
For code changes: