Summary
Please add audio input as a content type in the Messages API (similar to existing image support). Currently the API accepts text and images (JPEG, PNG, GIF, WebP) but not raw audio.
Why it matters
- Tone and intent: With only speech-to-text (STT), applications send transcribed text to the model. All prosody, tone, and speaker identity are lost. The same phrase can mean different things depending on how it was said; models cannot disambiguate without access to the actual sound.
- Accessibility and UX: Users who prefer voice input (or rely on it) would benefit from the model receiving the full signal, not just a transcript. This is especially important for assistive use cases.
- Music and non-speech audio: Use cases like music understanding, sound design, or analysis of any non-speech audio are impossible without native audio input.
- Format: Uncompressed formats like WAV (PCM) would be a natural first choice—no codec to decode, minimal complexity on the client side.
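To illustrate how low-friction WAV (PCM) is on the client side, here is a minimal sketch that builds a valid one-second WAV entirely with the Python standard library; the tone frequency and sample rate are arbitrary example values:

```python
import io
import math
import struct
import wave

def make_wav_bytes(freq_hz=440.0, sample_rate=16000, seconds=1.0):
    """Build a one-second 16-bit mono PCM WAV in memory -- no codec needed."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)         # mono
        w.setsampwidth(2)         # 16-bit samples
        w.setframerate(sample_rate)
        n = int(sample_rate * seconds)
        frames = b"".join(
            struct.pack("<h", int(32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
            for i in range(n)
        )
        w.writeframes(frames)
    return buf.getvalue()

wav = make_wav_bytes()
print(wav[:4])  # WAV files start with the RIFF magic
```

No external dependencies, no decoder on either end: this is why uncompressed PCM is a natural first format to support.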
Suggested API shape
Analogous to `ImageBlockParam`, add an audio content block, e.g.:
- `type`: `"audio"`
- `source`: base64 or URL
- `media_type`: e.g. `"audio/wav"`, `"audio/mpeg"` (if multiple formats are supported)

So the model receives audio as a first-class modality (like images today), not as opaque bytes in text.
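A sketch of what a request with the proposed block could look like. To be clear, the `"audio"` block type is hypothetical (it does not exist in the Messages API today) and the model name is a placeholder; the shape simply mirrors the existing `"image"` block with a base64 source:

```python
import base64
import json

def build_audio_message(wav_bytes, prompt):
    """Assemble a request payload using the PROPOSED (hypothetical) audio block."""
    return {
        "model": "claude-sonnet-4",  # placeholder model name
        "max_tokens": 1024,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",  # proposed, analogous to "image"
                        "source": {
                            "type": "base64",
                            "media_type": "audio/wav",
                            "data": base64.b64encode(wav_bytes).decode("ascii"),
                        },
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }

payload = build_audio_message(b"RIFF....WAVEfmt ", "What emotion do you hear?")
print(json.dumps(payload, indent=2)[:200])
```

Keeping the `source` object identical to the image case (`type`, `media_type`, `data`) would let existing SDK plumbing and validation be reused almost unchanged.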
Current workaround
None. Passing base64/hex of WAV in a text block only gives the model a string of data; it has no audio modality to perceive tone, timbre, or meaning from the waveform. STT is a workaround for speech-only flows but discards the very information we are asking for.
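The lossy speech-only workaround looks roughly like this. `transcribe()` is a stand-in for any real STT engine, not a real API; the point is what survives the round trip:

```python
def transcribe(wav_bytes):
    """Stand-in for a real STT engine: returns words only -- pitch,
    pacing, emphasis, and speaker identity are all discarded."""
    return "fine. do whatever you want."

def build_text_only_message(wav_bytes, prompt):
    """Today's workaround: transcribe client-side, send plain text."""
    transcript = transcribe(wav_bytes)
    return {
        "role": "user",
        "content": [
            {"type": "text",
             "text": f'Transcript: "{transcript}"\n\n{prompt}'},
        ],
    }

msg = build_text_only_message(b"...", "Is the speaker upset?")
# The model sees only the words; whether the line was said warmly or
# through gritted teeth is gone before the request is ever sent.
```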
Thank you for considering this. It would unlock a lot of applications that today are limited by text-only or image+text input.