Since this is a minor feature request, I don't believe it requires an RFC. If the team thinks a formal RFC would be best, just let me know and I'll do the "pull request" dance.
Summary
Add "DS_Consume" and "DS_Append" intrinsics to HCC.
Motivation
DS_Consume and DS_Append can be used to implement highly efficient, compact queues in LDS within a wavefront. Support for these instructions seems to exist as early as GCN 1.0.
Detailed design
According to the GCN ISA, DS_Append increments an LDS variable by the popcount of the execution mask. For example, if 40 threads are active, DS_Append would increment the location by 40. DS_Consume is the inverse: it decrements the location by the population count of the execution mask.
HCC already implements a number of intrinsics, such as __amdgcn_ds_bpermute. Following the convention, the functions would look something like this:
int __amdgcn_ds_append(tile_static int& val);
int __amdgcn_ds_consume(tile_static int& val);
The return value is the pre-operation value, as per the ISA.
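To make the intended semantics concrete, here is a rough host-side model in plain C++. It is only a sketch of the behaviour described above, not the intrinsics themselves: the `model_*` names are made up for illustration, the execution mask is passed in as an explicit 64-bit argument (the real instructions read it implicitly), and the counter is an ordinary `int` rather than a `tile_static` variable.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical host-side model, NOT the proposed intrinsics.
// `exec` stands in for the 64-bit execution mask of one wavefront.

// Adds popcount(exec) to the counter and returns the pre-operation value.
int model_ds_append(int& counter, std::uint64_t exec) {
    int old = counter;
    counter += static_cast<int>(std::bitset<64>(exec).count());
    return old;
}

// Subtracts popcount(exec) from the counter and returns the pre-operation value.
int model_ds_consume(int& counter, std::uint64_t exec) {
    int old = counter;
    counter -= static_cast<int>(std::bitset<64>(exec).count());
    return old;
}

// Walks through the "40 active threads" example.
void model_demo() {
    const std::uint64_t exec = (1ULL << 40) - 1;  // lanes 0..39 active
    int counter = 0;
    assert(model_ds_append(counter, exec) == 0);   // pre-op value
    assert(counter == 40);                         // bumped by popcount
    assert(model_ds_consume(counter, exec) == 40); // pre-op value
    assert(counter == 0);                          // back where it started
}
```

With 40 lanes active, one append moves the counter from 0 to 40 and hands back 0, so the wavefront effectively reserves 40 queue slots with a single operation. How each lane picks its own slot within that range is outside what this model covers.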
Drawbacks
DS_Consume and DS_Append are somewhat obscure hardware features. I'm not sure many people would know how to use them.
Alternatives
The functions could take a pointer instead, like this:
int __amdgcn_ds_append(tile_static int* val);
The pointer version is more C-like, while the reference version is more C++-like.
Unresolved questions
These functions also can be used with GDS memory, but I don't know how GDS memory works.