@tingboliao

Starting from the scalar implementation of rotm, we optimized it using RVV 1.0 intrinsics.
We then wrote test cases to verify functionality and performance on the K230 and K1.
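
For readers unfamiliar with the routine, the following is a minimal scalar sketch of what rotm computes (the modified Givens rotation from the reference BLAS), not the code in this PR. The name `rotm_ref` is hypothetical, and the sketch assumes positive strides only; the `param` layout (`flag, h11, h21, h12, h22`) follows the reference BLAS convention.

```c
#include <stddef.h>

/* Hypothetical reference helper, not part of OpenBLAS.
 * Applies the modified Givens rotation H to each pair (x[i], y[i]).
 * param[0] is the flag, param[1..4] hold h11, h21, h12, h22.
 * Assumption: incx and incy are positive. */
void rotm_ref(size_t n, double *x, size_t incx,
              double *y, size_t incy, const double param[5])
{
    double flag = param[0];
    double h11, h12, h21, h22;

    if (flag == -2.0) {
        return;                 /* H is the identity: nothing to do */
    }
    if (flag < 0.0) {           /* flag == -1: full matrix */
        h11 = param[1]; h21 = param[2];
        h12 = param[3]; h22 = param[4];
    } else if (flag == 0.0) {   /* unit diagonal */
        h11 = 1.0;      h21 = param[2];
        h12 = param[3]; h22 = 1.0;
    } else {                    /* flag == 1: fixed off-diagonal */
        h11 = param[1]; h21 = -1.0;
        h12 = 1.0;      h22 = param[4];
    }

    for (size_t i = 0; i < n; i++) {
        double w = x[i * incx];
        double z = y[i * incy];
        x[i * incx] = h11 * w + h12 * z;
        y[i * incy] = h21 * w + h22 * z;
    }
}
```

The loop body is independent across `i`, which is why the routine lends itself to vectorization.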

The performance data are shown below:

Parameter setting: OPENBLAS_LOOPS = 10000.

  1. K230 [C908, vlen = 128] @ 1.6 GHz:

    | Cases | Scalar (MFlops) | Optimized RVV (MFlops) |
    | --- | --- | --- |
    | srotm.goto | 875.57 | 1536.78 |
    | drotm.goto | 799.77 | 1408.70 |

  2. K1 [C908, vlen = 256] @ 1.6 GHz:

    | Cases | Scalar (MFlops) | Optimized RVV (MFlops) |
    | --- | --- | --- |
    | srotm.goto | 880.02 | 1490.44 |
    | drotm.goto | 811.13 | 1541.92 |

In the data above, higher values indicate better performance.

tingbo.liao added 2 commits December 31, 2024 10:32
Signed-off-by: tingbo.liao <tingbo.liao@starfivetech.com>
Signed-off-by: tingbo.liao <tingbo.liao@starfivetech.com>
@martin-frbg
Collaborator

Thanks - the numbers are very compelling, but I'm not entirely sure having that much architecture-specific code at the interface level is a good idea. At least I don't think we've done this before, and if every architecture ifdef'd their specific intrinsics implementation into it, the file would get unwieldy rather quickly. (Need some time to think about alternatives though - not sure if it's easy to add a kernel mapping for just riscv64 either...)
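
To illustrate the concern, here is a rough sketch of the alternative being hinted at: the interface routine stays architecture-neutral and calls through a kernel pointer that is resolved elsewhere. Every name here (`rotm_kernel_fn`, `rotm_k_generic`, `rotm_kernel`, `demo_srotm`) is hypothetical and does not reflect OpenBLAS's actual kernel tables or build system; unit strides are assumed.

```c
#include <stddef.h>

/* Hypothetical kernel signature, for illustration only. */
typedef void (*rotm_kernel_fn)(size_t n, float *x, float *y,
                               const float *param);

/* ---- kernel layer: one file per architecture in a real tree ---- */

/* Generic scalar kernel; param = {flag, h11, h21, h12, h22}. */
static void rotm_k_generic(size_t n, float *x, float *y,
                           const float *param)
{
    float flag = param[0];
    float h11 = param[1], h21 = param[2];
    float h12 = param[3], h22 = param[4];

    if (flag == 0.0f) {        /* unit diagonal */
        h11 = 1.0f; h22 = 1.0f;
    } else if (flag > 0.0f) {  /* flag == 1: fixed off-diagonal */
        h12 = 1.0f; h21 = -1.0f;
    }
    for (size_t i = 0; i < n; i++) {
        float w = x[i], z = y[i];
        x[i] = h11 * w + h12 * z;
        y[i] = h21 * w + h22 * z;
    }
}

/* Assumption: a vector-enabled build would point this at its own
 * kernel instead, without touching the interface code below. */
static const rotm_kernel_fn rotm_kernel = rotm_k_generic;

/* ---- interface layer: no architecture-specific ifdefs ---- */

void demo_srotm(size_t n, float *x, float *y, const float *param)
{
    if (n == 0 || param[0] == -2.0f) {
        return;  /* empty input or identity rotation */
    }
    rotm_kernel(n, x, y, param);
}
```

With this split, adding an intrinsics implementation means adding a kernel file and changing one mapping, rather than growing the interface file.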

@tingboliao
Author

Thanks. We will look into alternatives and, if possible, submit a new pull request (PR) later.
