Audio
Three audio surfaces, all backed by ElevenLabs. The HTTP routes mirror OpenAI's audio.speech and audio.transcriptions APIs — the OpenAI SDKs work as-is with a swapped baseURL. The voice-to-voice route is a WebSocket because the call is bidirectional and long-lived; there is no OpenAI SDK shape for it, so callers either use the Quota SDK or speak the protocol directly. TTS bills per input character; STT bills per second of returned transcript; V2V bills per second of input audio.
Text-to-speech
Request
curl https://api.usequota.ai/v1/audio/speech \
-H "Authorization: Bearer $QUOTA_API_KEY" \
-H "Content-Type: application/json" \
--output speech.mp3 \
-d '{
"model": "elevenlabs/eleven_multilingual_v2",
"input": "The quick brown fox jumps over the lazy dog.",
"voice": "rachel",
"response_format": "mp3"
}'
Request body
| Field | Type | Description |
|---|---|---|
| model | string, required | Must be prefixed with elevenlabs/. See TTS models below. |
| input | string, required | Text to synthesize. Hard cap of 5,000 characters per request; split longer text yourself. ElevenLabs bills per character, so input.length directly determines cost. |
| voice | string | Voice name (e.g. rachel, adam, bella) or a 20-character ElevenLabs voice ID. Defaults to rachel. See Voices. |
| response_format | string | One of mp3, opus, aac, flac, wav, pcm. Defaults to mp3. ElevenLabs HTTP TTS only emits MP3 and PCM natively: opus and aac fall back to MP3, flac and wav fall back to PCM. |
| speed | number | Playback speed multiplier, between 0.5 and 2 (ElevenLabs supports roughly 0.7–1.2). Forwarded as voice_settings.speed. |
| stream | boolean | When true, audio bytes stream back chunked from ElevenLabs's streaming endpoint instead of buffering server-side, so first-byte latency drops noticeably for long inputs. Billing is identical: TTS cost is known up-front from input.length, so the deduction is the same either way. Defaults to false. |
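Because the 5,000-character cap leaves splitting to the caller, a client-side helper along these lines may be useful. This is a minimal sketch: the name splitForTts and the break-point heuristic (sentence end, then whitespace, then a hard cut) are illustrative choices, not part of the API. It measures length with String.length, matching the input.length wording above, and a hard cut could in principle land inside a surrogate pair.

```typescript
// Sketch: split text into pieces that each fit the per-request character cap.
// splitForTts is a hypothetical helper name, not part of any Quota SDK.
const TTS_MAX_CHARS = 5000;

function splitForTts(text: string, limit: number = TTS_MAX_CHARS): string[] {
  const chunks: string[] = [];
  let rest = text.trim();
  while (rest.length > limit) {
    const window = rest.slice(0, limit);
    // Prefer the last sentence terminator inside the window; +1 keeps the
    // punctuation mark with the left-hand chunk.
    let cut =
      Math.max(
        window.lastIndexOf(". "),
        window.lastIndexOf("! "),
        window.lastIndexOf("? "),
      ) + 1;
    // Otherwise fall back to the last space, then to a hard cut at the limit.
    if (cut <= 0) cut = window.lastIndexOf(" ");
    if (cut <= 0) cut = limit;
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

Each returned chunk can then be sent as its own request; since billing is per character, splitting should not change the total base cost beyond the whitespace trimmed at the boundaries.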
Streaming
Set stream: true to receive audio bytes chunk-by-chunk as ElevenLabs renders them, instead of waiting for the full file. First-byte latency drops noticeably for long inputs — useful when you're piping audio straight to a player or speaker. Billing is identical to one-shot (TTS cost is known up-front from input.length), so the only thing that changes is when you start getting bytes.
# Pipe chunks straight to ffplay (or any stdin-reading player)
curl -N https://api.usequota.ai/v1/audio/speech \
-H "Authorization: Bearer $QUOTA_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "elevenlabs/eleven_turbo_v2_5",
"input": "Streaming is great for long passages — first audio byte arrives in under a second.",
"voice": "rachel",
"stream": true
}' | ffplay -autoexit -nodisp -
Response headers (X-Quota-Credits-Used, X-Quota-Balance, etc.) are sent before the first audio chunk, so they're readable as soon as fetch resolves; there is no need to wait for response.body to drain.
Response
Binary. The response body is raw audio bytes. The Content-Type header reflects the requested format: audio/mpeg for mp3/opus/aac, audio/wav for wav/flac, audio/pcm for pcm. With stream: true the body uses chunked transfer encoding; otherwise it's a single buffered payload.
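The format fallbacks from the request table and the Content-Type rules above can be combined into one lookup. The sketch below does that; the names describeDelivery and DeliveredAudio are illustrative. This page does not say whether an audio/wav body carries a WAV header or raw PCM samples, so treat the encoding field as an assumption to verify against a real response.

```typescript
// Sketch: what a caller can expect back for each requested response_format.
// Type and function names are hypothetical, introduced for illustration.
type ResponseFormat = "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm";

interface DeliveredAudio {
  encoding: "mp3" | "pcm"; // what ElevenLabs natively emits for this request
  contentType: string; // Content-Type header on the response
}

function describeDelivery(format: ResponseFormat = "mp3"): DeliveredAudio {
  switch (format) {
    case "mp3":
    case "opus":
    case "aac":
      // opus and aac fall back to MP3
      return { encoding: "mp3", contentType: "audio/mpeg" };
    case "wav":
    case "flac":
      // flac and wav fall back to PCM
      return { encoding: "pcm", contentType: "audio/wav" };
    case "pcm":
      return { encoding: "pcm", contentType: "audio/pcm" };
  }
}
```

A player that genuinely needs Opus or FLAC would have to transcode on the client side, since requesting those formats still yields MP3 or PCM data.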
Billing metadata comes back in response headers:
| Header | Description |
|---|---|
| X-Quota-Credits-Used | Total credits debited (cost + markup). |
| X-Quota-Balance | Wallet balance after the deduction. |
| X-Quota-Characters | Character count actually billed (matches input length). |
| X-Quota-Markup | Markup credits added on top of base cost (only when configured). |
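Reading these headers off a fetch response might look like the sketch below. It assumes the header values are plain numeric strings, which this page does not spell out. HeaderReader is a minimal interface that the standard Headers object satisfies, and readTtsBilling is a hypothetical helper name.

```typescript
// Sketch: collect TTS billing metadata from response headers.
// Assumes numeric header values; missing or non-numeric values become null.
interface HeaderReader {
  get(name: string): string | null;
}

interface TtsBilling {
  creditsUsed: number | null;
  balance: number | null;
  characters: number | null;
  markup: number | null; // null when no markup is configured
}

function readNumber(headers: HeaderReader, name: string): number | null {
  const raw = headers.get(name);
  if (raw === null || raw.trim() === "") return null;
  const value = Number(raw);
  return Number.isFinite(value) ? value : null;
}

function readTtsBilling(headers: HeaderReader): TtsBilling {
  return {
    creditsUsed: readNumber(headers, "X-Quota-Credits-Used"),
    balance: readNumber(headers, "X-Quota-Balance"),
    characters: readNumber(headers, "X-Quota-Characters"),
    markup: readNumber(headers, "X-Quota-Markup"),
  };
}
```

With fetch, readTtsBilling(response.headers) can be called as soon as the promise resolves, before the audio body has been consumed.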
Pricing
Billed per character of input. Output audio length does not factor in.
| Model | Price | Description |
|---|---|---|
| elevenlabs/eleven_multilingual_v2 | $0.18 / 1k chars | High-quality multilingual model. Best fit for long-form, multi-language, or expressive content. |
| elevenlabs/eleven_turbo_v2_5 | $0.10 / 1k chars | Faster, lower-latency Turbo model. Good default for interactive applications. |
| elevenlabs/eleven_flash_v2_5 | $0.10 / 1k chars | Lowest-latency Flash model. Pick this for real-time agent voices. |
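As a worked example of the per-character pricing, a 2,500-character input on elevenlabs/eleven_multilingual_v2 comes to 2.5 × $0.18 = $0.45 before any markup. The sketch below encodes the table as a lookup. The function name is illustrative, the result is a base-cost estimate in US dollars, and the conversion from dollars to credits is not covered on this page, so it is left out.

```typescript
// Sketch: estimate base TTS cost in USD from the price table above.
// Excludes markup; the USD-to-credits conversion is not modelled here.
const TTS_USD_PER_1K_CHARS: Record<string, number> = {
  "elevenlabs/eleven_multilingual_v2": 0.18,
  "elevenlabs/eleven_turbo_v2_5": 0.1,
  "elevenlabs/eleven_flash_v2_5": 0.1,
};

function estimateTtsCostUsd(model: string, input: string): number {
  const rate = TTS_USD_PER_1K_CHARS[model];
  if (rate === undefined) {
    throw new Error(`Unknown TTS model: ${model}`);
  }
  // Billing is per input character; output audio length does not matter.
  return (input.length / 1000) * rate;
}
```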
Voices
Pass a curated short name or a raw 20-character ElevenLabs voice ID. Unknown names fall back to rachel. Any mixed-case alphanumeric string of 20 or more characters is treated as a voice ID, which is useful for voices you've cloned in your ElevenLabs account.
| Voice | Kind | Description |
|---|---|---|
| rachel | default | Calm, warm female voice. The default. |
| domi | preset | Confident female voice. |
| bella | preset | Soft female voice. |
| elli | preset | Young female voice. |
| antoni | preset | Well-rounded male voice. |
| josh | preset | Deep male voice. |
| arnold | preset | Crisp male voice. |
| adam | preset | Mature male voice. |
| sam | preset | Raspy male voice. |
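The resolution rules above can be mirrored on the client to predict which voice a request will use. This is a sketch only: checking presets before voice IDs and treating preset names as case-sensitive are assumptions, and resolveVoice is a hypothetical helper, not an SDK function.

```typescript
// Sketch: client-side mirror of the voice resolution rules described above.
// Precedence (presets first) and case sensitivity are assumptions.
const PRESET_VOICES = [
  "rachel", "domi", "bella", "elli", "antoni", "josh", "arnold", "adam", "sam",
] as const;
type PresetVoice = (typeof PRESET_VOICES)[number];

type ResolvedVoice =
  | { kind: "preset"; name: PresetVoice }
  | { kind: "voice_id"; id: string };

function resolveVoice(voice?: string): ResolvedVoice {
  if (voice !== undefined) {
    const presets: readonly string[] = PRESET_VOICES;
    if (presets.indexOf(voice) !== -1) {
      return { kind: "preset", name: voice as PresetVoice };
    }
    // 20 or more mixed-case alphanumerics is treated as a raw voice ID.
    if (/^[A-Za-z0-9]{20,}$/.test(voice)) {
      return { kind: "voice_id", id: voice };
    }
  }
  // Omitted or unknown names fall back to the default voice.
  return { kind: "preset", name: "rachel" };
}
```

Because unknown names fall back silently, a typo such as "rachael" would still produce audio in the default voice, so a check like this can be handy in tests.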
TTS models
Always pass the full elevenlabs/-prefixed model ID. Bare ElevenLabs IDs (without prefix) return invalid_request.
| Model | Tier | Description |
|---|---|---|
| elevenlabs/eleven_multilingual_v2 | multilingual | Highest quality. 29 languages. |
| elevenlabs/eleven_turbo_v2_5 | turbo | Faster, lower latency. Good general-purpose default. |
| elevenlabs/eleven_flash_v2_5 | flash | Lowest latency. For real-time agents. |
Speech-to-text
Request
curl https://api.usequota.ai/v1/audio/transcriptions \
-H "Authorization: Bearer $QUOTA_API_KEY" \
-F file=@meeting.mp3 \
-F model=elevenlabs/scribe_v1 \
-F response_format=verbose_json
Form fields
| Field | Type | Description |
|---|---|---|
| file | file, required | Audio bytes. Max 25 MB. Supported formats: mp3, wav, m4a, mp4, webm, ogg, flac, aac. |
| model | string, required | Must be elevenlabs/scribe_v1. No other STT models are supported yet. |
| language | string | ISO 639-1 code (e.g. en, fr). Auto-detected when omitted. |
| response_format | string | One of json (default), text, verbose_json. verbose_json adds per-word timestamps and confidence. |
| diarize | boolean | Tag each word with a speaker_id for multi-speaker audio. Adds latency. Defaults to false. |
| tag_audio_events | boolean | Surface non-speech events (laughter, applause) as entries in words[]. Defaults to false. |
| timestamps_granularity | string | One of none, word, character. Defaults to word. |
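The same multipart request can be assembled in code. The sketch below builds the form with the standard FormData and Blob globals (browsers, Node 18+). It assumes booleans are sent as the strings "true" and "false" and applies the size limit as 25,000,000 bytes, which is the conservative reading of "25 MB". The helper name and option names are illustrative.

```typescript
// Sketch: build the multipart body for /v1/audio/transcriptions.
// Boolean encoding and the exact byte limit are assumptions.
interface TranscriptionOptions {
  language?: string;
  responseFormat?: "json" | "text" | "verbose_json";
  diarize?: boolean;
  tagAudioEvents?: boolean;
  timestampsGranularity?: "none" | "word" | "character";
}

const STT_MAX_BYTES = 25 * 1000 * 1000;

function buildTranscriptionForm(
  file: Blob,
  filename: string,
  opts: TranscriptionOptions = {},
): FormData {
  if (file.size > STT_MAX_BYTES) {
    throw new RangeError(`File is ${file.size} bytes; the limit is 25 MB`);
  }
  const form = new FormData();
  form.append("file", file, filename);
  form.append("model", "elevenlabs/scribe_v1"); // the only supported STT model
  // Optional fields are only sent when set, so server defaults apply otherwise.
  if (opts.language !== undefined) form.append("language", opts.language);
  if (opts.responseFormat !== undefined) form.append("response_format", opts.responseFormat);
  if (opts.diarize !== undefined) form.append("diarize", String(opts.diarize));
  if (opts.tagAudioEvents !== undefined) form.append("tag_audio_events", String(opts.tagAudioEvents));
  if (opts.timestampsGranularity !== undefined) {
    form.append("timestamps_granularity", opts.timestampsGranularity);
  }
  return form;
}
```

When passing the result as a fetch body, leave Content-Type unset so the runtime generates the multipart boundary itself; only the Authorization header needs to be added.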
Response
{ "text": "Hello from quota." }
Billing metadata comes back in response headers:
| Header | Description |
|---|---|
| X-Quota-Credits-Used | Total credits debited after reconciliation against the exact audio duration. |
| X-Quota-Balance | Wallet balance after the deduction. |
| X-Quota-Seconds | Billed audio seconds: the ceiling of the last word's end time, or the upload's nominal duration when no words were detected. |
| X-Quota-Markup | Markup credits added on top of base cost (only when configured). |
Pricing
Billed per second of audio. Quota pre-reserves a generous upper bound derived from the upload's nominal duration, then reconciles against the exact transcript duration once Scribe returns. Refunded in full on upstream error.
| Model | Price | Description |
|---|---|---|
| elevenlabs/scribe_v1 | $0.40 / hour | High-accuracy multilingual transcription. Word-level timestamps, optional diarization, optional non-speech event tagging. |
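The reconciliation rule can be sketched as two small functions. The shape of the entries in words[] is not documented on this page, so the end field (in seconds) is a hypothetical name. The cost is the base rate in US dollars, before any markup.

```typescript
// Sketch: billed seconds and base cost for a transcription.
// TimedWord.end is an assumed field name for a word's end time in seconds.
interface TimedWord {
  end: number;
}

const SCRIBE_USD_PER_HOUR = 0.4;

function billedSeconds(words: TimedWord[], nominalDurationSeconds: number): number {
  const last = words[words.length - 1];
  // No words detected: fall back to the upload's nominal duration.
  if (last === undefined) return nominalDurationSeconds;
  // Otherwise bill the ceiling of the last word's end time.
  return Math.ceil(last.end);
}

function sttCostUsd(seconds: number): number {
  return (seconds / 3600) * SCRIBE_USD_PER_HOUR;
}
```

For instance, a transcript whose last word ends at 61.3 s bills as 62 seconds, and one hour of billed audio comes to $0.40 at the base rate.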
Voice-to-voice (V2V)
Stream audio in, receive audio re-rendered in a target voice out. Billed per second of input audio. Sessions are capped at 30 minutes; callers can declare a shorter cap at handshake.
Lifecycle
- Client opens the WebSocket. No Authorization header is required on the upgrade: browsers can't set one on new WebSocket(), so auth is the first text frame.
- Client sends a JSON auth frame (see below). Quota validates the token, looks up pricing, atomically reserves the max-session cost, opens an upstream WebSocket to ElevenLabs, and writes an audio_sessions audit row.
- Quota replies with a JSON auth_ack frame containing the session ID. From here, binary frames are bidirectional: client PCM in, ElevenLabs-converted audio out.
- Every 30 seconds Quota does an in-memory insolvency check against your declared max duration. Heartbeat pings run on the same cadence; two missed pongs (~65 s) close the session.
- On any close (client, upstream, timeout, balance exhaustion, format violation), Quota refunds reservation − actual_used and updates the audit row. The session ID is returned in the close frame so you can correlate to your own telemetry.
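From the client's side, the lifecycle above reduces to a small state machine. The sketch below models it as a pure function so it can be exercised without a socket. The state and event names are illustrative and are not taken from the Quota SDK, and wiring the events to a real WebSocket is left to the caller.

```typescript
// Sketch: client-side view of the V2V lifecycle as a pure state machine.
// State and event names are hypothetical, chosen for illustration.
type V2VState = "connecting" | "awaiting_ack" | "streaming" | "closed";

type V2VEvent =
  | { kind: "open" } // WebSocket upgrade completed
  | { kind: "text"; data: string } // JSON text frame from Quota
  | { kind: "binary"; bytes: number } // converted audio arriving
  | { kind: "close"; code: number }; // socket closed by either side

interface V2VStep {
  state: V2VState;
  send?: string; // text frame the client should send next, if any
  sessionId?: string; // set when auth_ack arrives
}

function v2vStep(state: V2VState, event: V2VEvent, authFrameJson: string): V2VStep {
  if (event.kind === "close") return { state: "closed" };
  if (state === "connecting" && event.kind === "open") {
    // Auth travels as the first text frame, not as an upgrade header.
    return { state: "awaiting_ack", send: authFrameJson };
  }
  if (state === "awaiting_ack" && event.kind === "text") {
    let msg: unknown;
    try {
      msg = JSON.parse(event.data);
    } catch {
      return { state }; // ignore frames that are not valid JSON
    }
    if (typeof msg === "object" && msg !== null) {
      const ack = msg as { type?: unknown; session_id?: unknown };
      if (ack.type === "auth_ack" && typeof ack.session_id === "string") {
        return { state: "streaming", sessionId: ack.session_id };
      }
    }
  }
  return { state }; // everything else leaves the state unchanged
}
```

A caller would feed this from the socket's open, message, and close handlers and start sending PCM only once the state reaches streaming.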
Auth frame (client → server)
{
"type": "auth",
"token": "sk-quota-...",
"model": "elevenlabs/voice-conversion-v1",
"voice": "rachel",
"format": "pcm_16le_16k_mono",
"max_duration_seconds": 600
}
| Field | Type | Description |
|---|---|---|
| type | string, required | Must be "auth". |
| token | string, required | Quota API key (sk-quota-…) or an OAuth access token (quota_token_…) for user-scoped billing. |
| model | string, required | Must be elevenlabs/voice-conversion-v1. |
| voice | string, required | ElevenLabs voice ID, or one of the named presets supported by TTS (rachel, adam, etc.). |
| format | string, required | Input audio format, for example pcm_16le_16k_mono. Input that exceeds the declared format's byte-rate ceiling closes the session with 4005 FORMAT_VIOLATION. |
| max_duration_seconds | integer | Optional. Caps the session length. Hard server-side ceiling is 1800 (30 minutes). The pre-reservation is sized from this value, so smaller values lower the up-front credit hold. |
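A typed builder keeps the auth frame consistent with the table above. In this sketch the client clamps max_duration_seconds to the 1,800-second ceiling itself; whether the server clamps or rejects larger values is not stated here, so that behaviour is an assumption. The function and interface names are illustrative.

```typescript
// Sketch: build the first text frame for a V2V session.
// Client-side clamping to 1800 s is an assumption about what is safe to send.
interface V2VAuthFrame {
  type: "auth";
  token: string;
  model: "elevenlabs/voice-conversion-v1";
  voice: string;
  format: string;
  max_duration_seconds?: number;
}

const V2V_MAX_SESSION_SECONDS = 1800;

function buildAuthFrame(
  token: string,
  voice: string,
  format: string,
  maxDurationSeconds?: number,
): V2VAuthFrame {
  const frame: V2VAuthFrame = {
    type: "auth",
    token,
    model: "elevenlabs/voice-conversion-v1",
    voice,
    format,
  };
  if (maxDurationSeconds !== undefined) {
    if (!Number.isInteger(maxDurationSeconds) || maxDurationSeconds <= 0) {
      throw new RangeError("max_duration_seconds must be a positive integer");
    }
    // A smaller cap means a smaller up-front credit hold.
    frame.max_duration_seconds = Math.min(maxDurationSeconds, V2V_MAX_SESSION_SECONDS);
  }
  return frame;
}
```

The frame is sent with socket.send(JSON.stringify(frame)) as soon as the connection opens.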
Auth ack (server → client)
{
"type": "auth_ack",
"session_id": "9f4c…",
"upstream_ready": true,
"max_duration_seconds": 600
}
After auth_ack, both directions switch to binary frames. Send raw PCM bytes; receive ElevenLabs-encoded MP3 (44.1 kHz, 128 kbps).
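When sending raw PCM, it helps to know the byte rate and to cut the input into fixed-duration frames. The sketch below assumes pcm_16le_16k_mono means 16-bit little-endian samples at 16 kHz, mono. That reading is inferred from the identifier and is not confirmed on this page. Under that assumption, real-time audio is 32,000 bytes per second and a 100 ms frame is 3,200 bytes. The helper names and the 100 ms default are illustrative.

```typescript
// Sketch: byte-rate arithmetic and framing for pcm_16le_16k_mono input.
// The sample layout below is an assumption inferred from the format name.
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2; // 16-bit samples
const CHANNELS = 1; // mono
const PCM_BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS;

function frameBytes(frameMs: number): number {
  // Round down to whole samples so a frame never splits a sample.
  const samples = Math.floor((SAMPLE_RATE_HZ * frameMs) / 1000);
  return samples * BYTES_PER_SAMPLE * CHANNELS;
}

function chunkPcm(pcm: Uint8Array, frameMs: number = 100): Uint8Array[] {
  const size = frameBytes(frameMs);
  if (size <= 0) throw new RangeError("frameMs is too small for one sample");
  const frames: Uint8Array[] = [];
  for (let offset = 0; offset < pcm.byteLength; offset += size) {
    // subarray creates a view; no audio data is copied.
    frames.push(pcm.subarray(offset, Math.min(offset + size, pcm.byteLength)));
  }
  return frames;
}
```

Sending one frame per frame interval keeps the input near real time. Pushing a whole recording at once could exceed the byte-rate ceiling that triggers 4005 FORMAT_VIOLATION.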
Close codes
Quota uses application-defined close codes (RFC 6455 §7.4.2, ≥ 4000). The Quota SDKs map these to typed errors; native callers should read event.code.
| Code | Name | Meaning |
|---|---|---|
| 4002 | BALANCE_EXHAUSTED | Session exceeded its max_duration_seconds. Bill the user for what they used; the reservation diff is refunded. |
| 4003 | UPSTREAM_DISCONNECTED | ElevenLabs closed the upstream socket. Reconnect with a new session if needed. |
| 4004 | SESSION_TIMEOUT | Session hit the 30-minute hard cap. |
| 4005 | FORMAT_VIOLATION | Input bytes/second exceeded 1.5× the declared format ceiling. Indicates a format misdeclaration or compressed-as-PCM attack. |
| 4006 | INVALID_AUTH | First frame missing, malformed, or token invalid. No upstream connection was opened. |
| 4007 | HEARTBEAT_LOST | Two server pings went unanswered (~65 s). Common cause is a dropped network, so clients should reconnect. |
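Native callers reading event.code might map the codes to names and a retry hint as sketched below. The reconnect flags for 4003 and 4007 follow the table; treating the other codes as non-retryable is a policy assumption of this sketch, not something the table prescribes. describeClose is a hypothetical helper.

```typescript
// Sketch: map Quota V2V close codes to names and a suggested retry policy.
// Only 4003 and 4007 are described as reconnect cases in the table above;
// the remaining flags are this sketch's own assumption.
interface CloseInfo {
  name: string;
  reconnect: boolean;
}

const V2V_CLOSE_CODES: Record<number, CloseInfo> = {
  4002: { name: "BALANCE_EXHAUSTED", reconnect: false },
  4003: { name: "UPSTREAM_DISCONNECTED", reconnect: true },
  4004: { name: "SESSION_TIMEOUT", reconnect: false },
  4005: { name: "FORMAT_VIOLATION", reconnect: false },
  4006: { name: "INVALID_AUTH", reconnect: false },
  4007: { name: "HEARTBEAT_LOST", reconnect: true },
};

function describeClose(code: number): CloseInfo {
  const info = V2V_CLOSE_CODES[code];
  // Codes outside the table (including standard 1000-range codes) are
  // reported as unknown rather than guessed at.
  return info !== undefined ? info : { name: "UNKNOWN", reconnect: false };
}
```

In a browser this would be called from the socket's close handler, for example describeClose(event.code).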
Pricing
elevenlabs/voice-conversion-v1 bills per second of input audio at $9.00 per hour. The reservation at handshake is max_duration_seconds × price; the diff is refunded on close. Output audio length is ≈ 1:1 with input and is not billed separately.
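As a worked example, $9.00 per hour is $0.0025 per second, so a 600-second cap reserves 600 × $0.0025 = $1.50. A session that ends after 240 seconds is billed $0.60 and refunded $0.90. The sketch below encodes that arithmetic. It excludes markup, and it assumes that omitting max_duration_seconds sizes the reservation from the 1,800-second ceiling. Function names are illustrative.

```typescript
// Sketch: V2V reservation and refund arithmetic in USD, excluding markup.
// The 1800 s default for an omitted cap is an assumption.
const V2V_USD_PER_HOUR = 9.0;
const V2V_USD_PER_SECOND = V2V_USD_PER_HOUR / 3600;
const V2V_HARD_CAP_SECONDS = 1800;

function reservationUsd(maxDurationSeconds: number = V2V_HARD_CAP_SECONDS): number {
  const capped = Math.min(maxDurationSeconds, V2V_HARD_CAP_SECONDS);
  return capped * V2V_USD_PER_SECOND;
}

function refundUsd(maxDurationSeconds: number, usedSeconds: number): number {
  const capped = Math.min(maxDurationSeconds, V2V_HARD_CAP_SECONDS);
  // Usage can never exceed the cap or go below zero.
  const used = Math.min(Math.max(usedSeconds, 0), capped);
  // refund = reservation − actual_used
  return reservationUsd(capped) - used * V2V_USD_PER_SECOND;
}
```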
Browser quickstart
import { useVoiceConversion } from "@usequota/nextjs";
export function VoiceClone() {
const { start, stop, state, error } = useVoiceConversion({
model: "elevenlabs/voice-conversion-v1",
voice: "rachel",
format: "pcm_16le_16k_mono",
maxDurationSeconds: 600,
});
return (
<button onClick={state === "open" ? stop : start}>
{state === "open" ? "Stop" : "Speak"}
{error && <span> — {error.message}</span>}
</button>
);
}
User-scoped billing
For end-user-pays flows, pass the user's OAuth access token (quota_token_…) as the bearer instead of your API key. Quota debits the user's wallet rather than yours. Works identically for /v1/audio/speech, /v1/audio/transcriptions, and the V2V WebSocket (pass the OAuth token in the auth frame's token field). See Sign in with Quota or Connect Quota Wallet for the end-to-end OAuth flow.
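Since the two credential types differ only by prefix, a client can tell them apart before making a call, which is useful for logging whose wallet will be debited. This is a small sketch. The prefixes come from the text above, while the helper names and the "unknown" category are illustrative.

```typescript
// Sketch: classify a credential by prefix and build the HTTP auth header.
// tokenKind and bearerHeaders are hypothetical helper names.
type TokenKind = "api_key" | "oauth_user_token" | "unknown";

function tokenKind(token: string): TokenKind {
  if (token.startsWith("sk-quota-")) return "api_key"; // debits your wallet
  if (token.startsWith("quota_token_")) return "oauth_user_token"; // debits the user's wallet
  return "unknown";
}

function bearerHeaders(token: string): Record<string, string> {
  // Used for the HTTP routes; the V2V WebSocket takes the same token
  // in the auth frame's token field instead.
  return { Authorization: `Bearer ${token}` };
}
```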
Errors
Errors use the standard Quota envelope:
{
"error": {
"code": "insufficient_credits",
"message": "Insufficient credits. Balance: 0, Required: 90",
"hint": "..."
}
}
invalid_request covers input over 5,000 characters (TTS), malformed multipart (STT), an unsupported response_format, or a model not prefixed with elevenlabs/.
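The error envelope can be narrowed safely before use. The sketch below validates an already-parsed JSON body and returns null for anything that does not match the shape shown above. It treats hint as optional, which is an assumption, and the function name is illustrative.

```typescript
// Sketch: narrow an unknown JSON body to the Quota error envelope.
// Treating hint as optional is an assumption.
interface QuotaError {
  code: string;
  message: string;
  hint?: string;
}

function parseQuotaError(body: unknown): QuotaError | null {
  if (typeof body !== "object" || body === null) return null;
  const err = (body as { error?: unknown }).error;
  if (typeof err !== "object" || err === null) return null;
  const { code, message, hint } = err as Record<string, unknown>;
  if (typeof code !== "string" || typeof message !== "string") return null;
  return typeof hint === "string" ? { code, message, hint } : { code, message };
}
```

A caller would typically run this on the parsed body of any non-2xx response and branch on the code field, for example prompting a top-up on insufficient_credits.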