Install Rust nightly using
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
and then
rustup install nightly
Then, from the root of this repository, install the two binaries:
cargo install --force --path danny --locked
cargo install --force --path danny-utilities --locked
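If nightly is not your default toolchain, rustup lets you select it per invocation with the +nightly prefix; for example (the same commands as above, shown only to illustrate the toolchain override):
cargo +nightly install --force --path danny --locked
cargo +nightly install --force --path danny-utilities --locked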
Move to the datasets subdirectory and define the following two environment variables:
DANNY_DATA_DIR=/path/to/directory/to/store/datasets
DANNY_MINIONS=comma,separated,list,of,hostnames # the hosts that execute the experiments; use localhost if you are running locally
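For instance, for a local run you could set them in your shell like this (the data directory is just a placeholder, pick any path you like):
export DANNY_DATA_DIR=$HOME/danny-datasets
export DANNY_MINIONS=localhost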
The following command will download and preprocess all datasets, if they are not already on your machine. Note that it takes a long time.
./datasets/prepare.py --list
If you are interested in just one dataset, edit prepare.py and remove the entries you don't need from the DATASETS dictionary.
Also, find and comment out the following loop:
for d in derived_datasets:
    DATASETS[d.name] = d

If running locally, you might find it more convenient to work with a small dataset.
You can use the sampledata binary that was installed alongside the other utilities.
An example usage is the following, which takes 5000 points from the Livejournal dataset:
sampledata --size 5000 $DANNY_DATA_DIR/Livejournal.bin $DANNY_DATA_DIR/Livejournal-5000.bin
If you are sampling from a dataset which uses the cosine distance, use --measure cosine.
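For example, assuming a hypothetical cosine-distance dataset stored as Glove.bin (the file name is just a placeholder), the invocation would look like this:
sampledata --measure cosine --size 5000 $DANNY_DATA_DIR/Glove.bin $DANNY_DATA_DIR/Glove-5000.bin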
You can define several environment variables to control the behavior of danny, which are described in danny --help.
Example invocation of the one round, fixed parameter LSH algorithm:
danny --algorithm local-lsh -k 8 --range 0.5 $DANNY_DATA_DIR/Livejournal-5000.bin $DANNY_DATA_DIR/Livejournal-5000.bin
For a list of all available options and algorithms, please consult danny --help.
To run the code on a cluster of machines, you invoke danny on one of the machines and provide a file listing all the hosts to use: the executable
takes care of copying itself to all the workers with rsync and of spawning worker processes on the listed machines via ssh. It is therefore best to have
passwordless ssh configured on your cluster.
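If passwordless ssh is not already set up, the standard OpenSSH tools are usually enough; a minimal sketch, with host1 standing in for each of your workers:
ssh-keygen -t ed25519     # generate a key pair, if you do not have one already
ssh-copy-id host1         # install the public key on each worker host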
The file listing hosts should contain host:port pairs, like the following (any port number will do):
host1:2001
host2:2001
host3:2001
host4:2001
host5:2001
Let the above file be ~/hosts.txt. Then you can invoke danny as follows:
danny --hosts ~/hosts.txt --threads 8 --threshold 0.7 --algorithm local-lsh --recall 0.8 --k 4 $PATH_TO_DATA
which will run danny using 8 threads on each of the 5 listed hosts,
using the local-lsh algorithm with k=4 and required recall 0.8, at similarity threshold 0.7.
There are four available algorithms:
local-lsh
one-level-lsh
two-level-lsh, which takes an additional parameter --k2 for the number of hash functions to use locally
cartesian
Additionally, you can specify the number of sketch bits with the --sketch-bits argument, which
accepts the values 0, 64, 128, 256, and 512.
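For example, a local run of the two-level-lsh algorithm with 256-bit sketches might look like the following; the specific parameter values are only illustrative, not a recommended setting:
danny --threshold 0.7 --recall 0.8 --algorithm two-level-lsh -k 4 --k2 2 --sketch-bits 256 $DANNY_DATA_DIR/Livejournal-5000.bin $DANNY_DATA_DIR/Livejournal-5000.bin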
If you are changing the code, you can run the modified version without reinstalling
everything by using, instead of danny, the command
cargo run --release --bin danny --
The trailing -- is very important: the arguments after it go to the program, while those before it go to cargo.
Compilation takes a long time. To reduce it, during development you can compile only some parts of the code using Cargo feature gates. In particular, the following features are available:
one-round-lsh
two-round-lsh
hu-et-al
all-2-all
seq-all-2-all
sketching
By default they are all enabled. To disable them, pass the --no-default-features flag to cargo, along with the --features flag to select the ones that are needed.
For instance, while modifying and testing locally the code for two-round-lsh, one can use the command:
cargo run --no-default-features --features two-round-lsh --release --bin danny -- -m jaccard --algorithm two-round-lsh -k 4 -l 5 --range 0.75 --sketch-bits 0 $DANNY_DATA_DIR/Livejournal-5000.bin $DANNY_DATA_DIR/Livejournal-5000.bin
This command must be run inside the danny subdirectory, otherwise all features are built anyway (I don't know why).