Deepgram’s Text-to-Speech API lets you control the encoding, container, and sample rate of every voice response. That flexibility is powerful, but the docs don’t spell out when to use MP3, Opus, AAC, or uncompressed WAV. Here’s the breakdown I rely on when choosing the right format for a product launch, real-time chatbot, or post-production pipeline.
For the full list of available parameters, see the Deepgram Text-to-Speech API Reference Docs.
Quick comparison table
| Encoding | Container | Sample rate | Bitrate | Quality | File size | Latency | Ideal use case | API params |
|---|---|---|---|---|---|---|---|---|
| mp3 | none | 22050 Hz | 48 kbps (default) | Good | Moderate | Fast | General TTS playback | ?model=voiceModel&encoding=mp3 |
| opus | ogg | 48000 Hz (fixed) | 4 – 650 kbps (configurable) | Very good | Smallest | Fastest | Real-time chat, streaming | ?model=voiceModel&encoding=opus&container=ogg |
| aac | none | Device-dependent (typically 48 kHz) | 4 – 192 kbps | Better than MP3 | Small | Medium | Mobile apps, quality-focused web | ?model=voiceModel&encoding=aac |
| linear16 | wav | 16000 – 48000 Hz | Uncompressed PCM | Highest | Largest | Slowest | Audio analysis, telephony, signal processing | ?model=voiceModel&encoding=linear16&container=wav&sample_rate=48000 |
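The "API params" column can also be assembled in code rather than by hand. This is a minimal sketch, assuming the query strings in the table are accurate; `FORMAT_PRESETS` and `speak_query` are hypothetical names for illustration, and `voiceModel` is the same placeholder the table uses.

```python
from urllib.parse import urlencode

# Hypothetical presets mirroring the "API params" column of the table.
FORMAT_PRESETS = {
    "mp3": {"encoding": "mp3"},
    "opus": {"encoding": "opus", "container": "ogg"},
    "aac": {"encoding": "aac"},
    "linear16": {"encoding": "linear16", "container": "wav", "sample_rate": 48000},
}


def speak_query(model: str, preset: str) -> str:
    """Build the query string for one of the presets above."""
    params = {"model": model, **FORMAT_PRESETS[preset]}
    return "?" + urlencode(params)


print(speak_query("voiceModel", "opus"))
# ?model=voiceModel&encoding=opus&container=ogg
```

Keeping the presets in one dictionary means a format change is a one-word switch at the call site.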
Format breakdown
MP3 (default)
- When to use it: Standard playback, podcasts, browser compatibility.
- Why it works: Balanced quality at 48 kbps, plays on virtually every device.
- Trade-off: Larger than Opus and not as efficient as AAC, but the widest compatibility wins when you don’t control the client.
Opus in OGG (my fastest pick)
- When to use it: Real-time agents, streaming experiences, low-bandwidth scenarios.
- Why it works: Opus was built for voice. The audio is crisp, file sizes are tiny, and latency stays low during encoding and playback.
- Gotcha: Sample rate is locked at 48 kHz. Don’t include `sample_rate` in your query; it triggers a 400 error.
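One way to avoid the 400 is to strip the parameter before the request is built. The sketch below is only an illustration of that guard; `clean_params` is a hypothetical helper, not part of any Deepgram SDK, and it assumes the parameters are held in a plain dictionary.

```python
def clean_params(params: dict) -> dict:
    """Return a copy of params without sample_rate when the encoding is opus.

    Assumption taken from the text above: Opus output is fixed at 48 kHz
    and the API rejects an explicit sample_rate for it.
    """
    cleaned = dict(params)  # copy so the caller's dict is not mutated
    if cleaned.get("encoding") == "opus":
        cleaned.pop("sample_rate", None)
    return cleaned


print(clean_params({"encoding": "opus", "container": "ogg", "sample_rate": 48000}))
# {'encoding': 'opus', 'container': 'ogg'}
```

Other encodings pass through unchanged, so the guard is safe to apply to every request.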
AAC
- When to use it: Native mobile apps or web players that support AAC and need higher quality than MP3 at similar bitrates.
- Why it works: Efficient compression without the metallic artifacts you sometimes hear in MP3.
- Note: Sample rate is managed internally by Deepgram; no need to override it.
Linear16 WAV
- When to use it: Signal processing, telephony integration, or any workflow that requires raw PCM.
- Why it works: Zero compression. You get every detail for post-processing, noise analysis, or on-prem speech analytics.
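- Trade-off: Huge files that take longer to transfer and store. The slowdown comes from payload size rather than codec work, since raw PCM needs almost none. Overkill for casual playback.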
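To put "huge" in numbers, here is a rough size estimate. It assumes mono audio, a constant bitrate for the compressed stream, and ignores container and header overhead; `pcm_bytes` and `compressed_bytes` are hypothetical helper names.

```python
def pcm_bytes(seconds: float, sample_rate: int, channels: int = 1) -> int:
    """Size of raw 16-bit PCM audio, excluding the WAV header."""
    bytes_per_sample = 2  # linear16 means 16 bits per sample
    return int(seconds * sample_rate * channels * bytes_per_sample)


def compressed_bytes(seconds: float, bitrate_bps: int) -> int:
    """Approximate size of a constant-bitrate stream, ignoring container overhead."""
    return int(seconds * bitrate_bps / 8)


wav = pcm_bytes(10, 48000)         # 10 s of linear16 at 48 kHz, mono (assumed)
mp3 = compressed_bytes(10, 48000)  # 10 s of MP3 at the 48 kbps default
print(wav, mp3, wav // mp3)
# 960000 60000 16
```

Under these assumptions, ten seconds of 48 kHz linear16 is about 16 times the size of the same clip as 48 kbps MP3.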
Decision guide
- Simple playback with broad compatibility? Stick to `mp3`.
- Need low-latency streaming or tiny files? Use `opus` with an `ogg` container.
- Want better compression without losing polish? `aac` hits the balance for mobile/web apps.
- Performing downstream audio analysis? Go `linear16` + `wav`.
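The same guide can be written as a small selection function. This is a sketch only: `pick_format` and its flags are hypothetical, and the priority order (analysis first, then real-time, then mobile) is my assumption for cases where more than one need applies.

```python
def pick_format(*, analysis: bool = False, realtime: bool = False, mobile: bool = False) -> dict:
    """Map product needs to encoding parameters, following the decision guide."""
    if analysis:
        # Downstream audio analysis wants raw PCM.
        return {"encoding": "linear16", "container": "wav"}
    if realtime:
        # Low-latency streaming or tiny files.
        return {"encoding": "opus", "container": "ogg"}
    if mobile:
        # Better compression for mobile/web apps.
        return {"encoding": "aac"}
    # Broad compatibility is the fallback.
    return {"encoding": "mp3"}


print(pick_format(realtime=True))
# {'encoding': 'opus', 'container': 'ogg'}
```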
Example API snippets
# Default MP3 (well-supported general output)
POST /v1/speak?model={{ voiceModel }}&encoding=mp3
# Opus + OGG (low latency, minimal file size)
POST /v1/speak?model={{ voiceModel }}&encoding=opus&container=ogg
# AAC (higher quality vs MP3 at similar size)
POST /v1/speak?model={{ voiceModel }}&encoding=aac
# Linear16 WAV (raw audio for processing)
POST /v1/speak?model={{ voiceModel }}&encoding=linear16&container=wav&sample_rate=48000
Final thoughts
Deepgram’s defaults already deliver solid TTS, but choosing the right encoding can shave seconds off response time or preserve detail for machine listening. Use Opus when latency matters, MP3 when compatibility matters, AAC when quality-to-size matters, and WAV when signal fidelity is non-negotiable. The right parameter tweak turns a generic voice into a production-ready asset.