I am unable to run incremental decoding with the C++ interface without errors. I tried meta-llama/Llama-2-7b-hf, other LLaMA models, and an OPT model; all of them fail. I get the error below (with the backtrace captured in gdb):
[10]29889
No small speculative model registered, using incremental decoding.
[0 - 7ffff4921000] 1.042109 {3}{RequestManager}: [1000358]New request tokens: 1 14350 263 26228 21256 1048 7535 17770 363 596 10462 29889
optimal_views.size = 294
views.size() = 294
###PEFT DEBUGGING### Operators reconstructed from optimized graph.
###PEFT DEBUGGING### Starting inplace optimizations.
###PEFT DEBUGGING### Mapping output tensors.
ndim(1) dims[1 0 0 0]
###PEFT DEBUGGING### Setting up NCCL communications.
###PEFT DEBUGGING### compile_inference completed successfully.
Loading weight file embed_tokens.weight
Loading weight file layers.0.input_layernorm.weight
Loading weight file layers.0.self_attn.q_proj.weight
incr_decoding: /home/ubuntu/FlexFlow/deps/legion/runtime/legion/runtime.cc:4991: void Legion::Internal::PhysicalRegionImpl::wait_until_valid(bool, const char*, bool, const char*): Assertion `implicit_context == context' failed.
Thread 10 "incr_decoding" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffefcf4000 (LWP 18011)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737216724992) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737216724992) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737216724992) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737216724992, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007fffece42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007fffece287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007fffece2871b in __assert_fail_base (fmt=0x7fffecfdd130 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7ffff3c00b7b "implicit_context == context",
    file=0x7ffff3bfd360 "/home/ubuntu/FlexFlow/deps/legion/runtime/legion/runtime.cc", line=4991, function=<optimized out>) at ./assert/assert.c:92
#6  0x00007fffece39e96 in __GI___assert_fail (assertion=0x7ffff3c00b7b "implicit_context == context", file=0x7ffff3bfd360 "/home/ubuntu/FlexFlow/deps/legion/runtime/legion/runtime.cc", line=4991,
    function=0x7ffff3c016f8 "void Legion::Internal::PhysicalRegionImpl::wait_until_valid(bool, const char*, bool, const char*)") at ./assert/assert.c:101
#7  0x00007ffff2ad68b8 in Legion::Internal::PhysicalRegionImpl::wait_until_valid (this=0x7ff754203e70, silence_warnings=false, warning_string=0x0, warn=false, source=0x0)
    at /home/ubuntu/FlexFlow/deps/legion/runtime/legion/runtime.cc:4991
#8  0x00007ffff2624402 in Legion::PhysicalRegion::wait_until_valid (this=0x7ff7542005d8, silence_warnings=false, warning_string=0x0) at /home/ubuntu/FlexFlow/deps/legion/runtime/legion/legion.cc:2772
#9  0x00007ffff65530eb in FlexFlow::ParallelTensorBase::set_tensor<__half> (this=0x7ff762541c30, ff=0x7ff7641e7db0, dim_sizes=std::vector of length 1, capacity 1 = {...}, data=0x7ff754201320)
    at /home/ubuntu/FlexFlow/src/runtime/parallel_tensor.cc:680
#10 0x00007ffff62dcb4a in FileDataLoader::load_single_weight_tensor<__half> (this=0x7ff765508840, ff=0x7ff7641e7db0, l=0x7ff76498f5e0, weight_idx=0) at /home/ubuntu/FlexFlow/src/runtime/file_loader.cc:849
#11 0x00007ffff62dad8c in FileDataLoader::load_weight_task (task=0x7ff724a142e0, regions=std::vector of length 0, capacity 0, ctx=0x7ff76c2effe0, runtime=0x555556b68000)
    at /home/ubuntu/FlexFlow/src/runtime/file_loader.cc:864
#12 0x00007ffff64cc78a in Legion::LegionTaskWrapper::legion_task_wrapper<&FileDataLoader::load_weight_task> (args=0x7ff724a24990, arglen=8, userdata=0x0, userlen=0, p=...)
    at /home/ubuntu/FlexFlow/deps/legion/runtime/legion/legion.inl:21215
#13 0x00007fffedee02cc in Realm::LocalTaskProcessor::execute_task (this=0x555556ad9bf0, func_id=19, task_args=...) at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/proc_impl.cc:1176
#14 0x00007fffedf5fc4b in Realm::Task::execute_on_processor (this=0x7ff724a24810, p=...) at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/tasks.cc:326
#15 0x00007fffedf650fc in Realm::UserThreadTaskScheduler::execute_task (this=0x555556ad9f90, task=0x7ff724a24810) at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/tasks.cc:1687
#16 0x00007fffedf62deb in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x555556ad9f90) at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/tasks.cc:1160
#17 0x00007fffedf6b5be in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop> (obj=0x555556ad9f90)
    at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/threads.inl:97
#18 0x00007fffedf7ab2d in Realm::UserThread::uthread_entry () at /home/ubuntu/FlexFlow/deps/legion/runtime/realm/threads.cc:1428
#19 0x00007fffece5a130 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:90 from /lib/x86_64-linux-gnu/libc.so.6
#20 0x0000

Steps to reproduce
Create and SSH into a g4dn.8xlarge instance with the AMI "Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.5.1 (Ubuntu 22.04) 20241119", then run:
git clone --recursive https://github.com/flexflow/FlexFlow.git
export FF_GPU_BACKEND=cuda
export cuda_version=12.2 # aws instance has CUDA 12.4, but only 12.2 is supported by FF
cd FlexFlow
curl https://sh.rustup.rs -sSf | sh -s -- -y
source ~/.bashrc
vim config/config.linux # change build type to Debug
mkdir build
cd build
../config/config.linux
make
cd ..
pip install .
huggingface-cli login
python3 ./inference/utils/download_hf_model.py meta-llama/Llama-2-7b-hf
cd build
wget -O chatgpt.json https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json
gdb --args ./inference/incr_decoding/incr_decoding -ll:gpu 1 -ll:cpu 4 -ll:fsize 7000 -ll:zsize 32000 -llm-model meta-llama/Llama-2-7b-hf -prompt chatgpt.json -tensor-parallelism-degree 1
r
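
For convenience, the steps above can be collected into a single script. This is only a restatement of the report's commands (same AMI assumptions, CUDA version pin, model name, prompt URL, and incr_decoding flags), with minor shell hygiene added: `set -euo pipefail`, sourcing cargo's env file instead of `~/.bashrc`, and a parallel `make`.

```shell
#!/usr/bin/env bash
# Reproduction script assembled from the steps above.
# Assumes a g4dn.8xlarge instance with the Deep Learning OSS AMI already provisioned.
set -euo pipefail

git clone --recursive https://github.com/flexflow/FlexFlow.git
cd FlexFlow

export FF_GPU_BACKEND=cuda
export cuda_version=12.2   # instance ships CUDA 12.4, but only 12.2 is supported by FF

curl https://sh.rustup.rs -sSf | sh -s -- -y
source "$HOME/.cargo/env"  # make cargo visible in this shell

# Manual step: edit config/config.linux to set the build type to Debug before continuing.
mkdir -p build && cd build
../config/config.linux
make -j "$(nproc)"
cd ..
pip install .

huggingface-cli login
python3 ./inference/utils/download_hf_model.py meta-llama/Llama-2-7b-hf

cd build
wget -O chatgpt.json https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json
gdb --args ./inference/incr_decoding/incr_decoding \
  -ll:gpu 1 -ll:cpu 4 -ll:fsize 7000 -ll:zsize 32000 \
  -llm-model meta-llama/Llama-2-7b-hf -prompt chatgpt.json \
  -tensor-parallelism-degree 1
# then type `r` at the (gdb) prompt, and `bt` once it aborts
```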