[BUG] EXL2 concurrent requests slow performance significantly #369

### OS

Linux

### GPU Library

CUDA 12.x

### Python version

3.11

### Describe the bug

When concurrent requests are enabled with an EXL2 model, throughput drops significantly.

In some cases, generation slows to as little as 1 word per second.

This happens with 2-3 concurrent requests while using the Q8 quantized cache.

Are there any solutions to this?

### Reproduction steps

_No response_

### Expected behavior

_No response_

### Logs

_No response_

### Additional context

_No response_

### Acknowledgements

- [x] I have looked for similar issues before submitting this one.
- [x] I have read the disclaimer, and this issue is related to a code bug. If I have a question, I will use the Discord server.
- [x] I understand that the developers have lives and my issue will be answered when possible.
- [x] I understand the developers of this program are human, and I will ask my questions politely.
