@awni
Member

@awni awni commented Jan 11, 2026

  • Adds a basic qmv kernel for fp quants for CUDA.
  • Adds a simple quantize-dequantize kernel for CUDA, Metal, and CPU.
  • Routes qqmv to quantize-dequantize + qmv on all backends.
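A minimal sketch of the routing described in the last bullet, assuming qmv means a quantized-weight matrix-vector product and qqmv means the same product with the input vector also quantized. This is not the kernel code from this PR: the function names, the group size, and the unrounded per-group scale are illustrative assumptions. Only the FP4 (E2M1) value grid is standard.

```python
import math

# Magnitudes representable in FP4 (E2M1).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize(x, group_size=16):
    """Return (codes, scales): one signed grid value per element and one
    scale per group. Assumption: the scale is kept as a plain float; real
    formats also round the scale to a narrow type."""
    codes, scales = [], []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
        scales.append(scale)
        for v in group:
            # Nearest grid magnitude to the scaled value, sign restored after.
            mag = min(FP4_GRID, key=lambda q: abs(abs(v) / scale - q))
            codes.append(math.copysign(mag, v))
    return codes, scales


def dequantize(codes, scales, group_size=16):
    """Map grid values back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def quantize_dequantize(x, group_size=16):
    """Round x onto the scaled grid and return ordinary floats."""
    return dequantize(*quantize(x, group_size), group_size)


def qmv(w_rows, x, group_size=16):
    """Quantized weights (list of (codes, scales) rows) times a float vector."""
    return [
        sum(w * xi for w, xi in zip(dequantize(c, s, group_size), x))
        for c, s in w_rows
    ]


def qqmv(w_rows, x, group_size=16):
    """Hypothetical routing: round the input, then reuse the qmv path."""
    return qmv(w_rows, quantize_dequantize(x, group_size), group_size)
```

For example, with `group_size=4`, `quantize_dequantize([6.0, 2.4, -1.4, 0.0], 4)` returns `[6.0, 2.0, -1.5, 0.0]`, so `qqmv` sees the same rounding error a fully quantized input would carry while needing only the one qmv path.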

@awni awni force-pushed the fp_qmv branch 3 times, most recently from 0c5b9de to 2357ccd Compare January 22, 2026 17:47
@awni awni force-pushed the fp_qmv branch 3 times, most recently from 586737c to 458262b Compare January 22, 2026 18:16
@awni
Member Author

awni commented Jan 22, 2026

Moving out of draft.

There is a nice speedup over qqmm with cuBLAS for the qmv case. On a Spark:

| quant | GB/s pre | GB/s post |
|-------|----------|-----------|
| nvfp4 | 164.163  | 232.178   |
| mxfp8 | 178.034  | 221.105   |
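For a quick read of the table above, the relative speedup is the post figure divided by the pre figure. The snippet below is only that arithmetic on the reported numbers; the variable names are illustrative.

```python
# Reported throughput in GB/s: (pre, post) per quantization mode.
results = {
    "nvfp4": (164.163, 232.178),
    "mxfp8": (178.034, 221.105),
}

# Speedup is post / pre, rounded to two decimals.
speedups = {mode: round(post / pre, 2) for mode, (pre, post) in results.items()}

for mode, s in speedups.items():
    print(f"{mode}: {s:.2f}x")
# prints:
# nvfp4: 1.41x
# mxfp8: 1.24x
```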

@awni awni marked this pull request as ready for review January 22, 2026 18:28
@awni
Member Author

awni commented Jan 22, 2026

I think we can optimize the fp_qmv a bit more, but it's a good start, so it's probably worth landing and hill-climbing.

@awni awni requested review from angeloskath and zcbenz January 22, 2026 18:51
Member

@angeloskath angeloskath left a comment

It looks great and the perf seems already great... I guess the hill will be small :-)
