Audio

Three audio surfaces, all backed by ElevenLabs. The HTTP routes mirror OpenAI's audio.speech and audio.transcriptions APIs, so the OpenAI SDKs work as-is with a swapped baseURL. The voice-to-voice route is a WebSocket because the call is bidirectional and long-lived; there is no OpenAI SDK shape for it, so callers either use the Quota SDK or speak the protocol directly. TTS bills per input character; STT bills per second of transcribed audio; V2V bills per second of input audio.

Text-to-speech

POST https://api.usequota.ai/v1/audio/speech
Drop-in for the OpenAI SDK
The OpenAI Node and Python SDKs both support custom baseURL on audio.speech.create(). Point them at Quota and pass an elevenlabs/-prefixed model — no other changes needed.

Request

curl https://api.usequota.ai/v1/audio/speech \
  -H "Authorization: Bearer $QUOTA_API_KEY" \
  -H "Content-Type: application/json" \
  --output speech.mp3 \
  -d '{
    "model": "elevenlabs/eleven_multilingual_v2",
    "input": "The quick brown fox jumps over the lazy dog.",
    "voice": "rachel",
    "response_format": "mp3"
  }'

Request body

model (string, required): Must be prefixed with elevenlabs/. See TTS models below.
input (string, required): Text to synthesize. Hard cap of 5,000 characters per request; split longer text yourself. ElevenLabs bills per character, so input.length directly determines cost.
voice (string): Voice name (e.g. rachel, adam, bella) or a 20-character ElevenLabs voice ID. Defaults to rachel. See Voices.
response_format (string): One of mp3, opus, aac, flac, wav, pcm. Defaults to mp3. ElevenLabs HTTP TTS only emits MP3 and PCM natively: opus/aac fall back to MP3, flac/wav fall back to PCM.
speed (number): Playback speed multiplier, between 0.5 and 2 (ElevenLabs supports roughly 0.7–1.2). Forwarded as voice_settings.speed.
stream (boolean): When true, audio bytes stream back chunked from ElevenLabs's streaming endpoint instead of buffering server-side, so first-byte latency drops noticeably for long inputs. Billing is identical: TTS cost is known up-front from input.length, so the deduction is the same either way. Defaults to false.
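
Because the 5,000-character cap applies per request, longer text has to be chunked on the client. The following is a minimal sketch, assuming that breaking after sentence-ending punctuation (and falling back to word breaks) suits your content. splitForTts is a hypothetical helper, not part of any Quota or OpenAI SDK.

```typescript
// Hard cap documented for /v1/audio/speech.
const MAX_TTS_CHARS = 5000;

// Hypothetical helper: split text into chunks of at most `limit` characters,
// preferring to break after sentence-ending punctuation, then at a space.
function splitForTts(text: string, limit: number = MAX_TTS_CHARS): string[] {
  const chunks: string[] = [];
  let rest = text.trim();
  while (rest.length > limit) {
    const head = rest.slice(0, limit);
    // Find the last ".", "!" or "?" that is followed by whitespace inside `head`.
    const sentenceEnd = /[.!?]\s/g;
    let cut = -1;
    let match: RegExpExecArray | null;
    while ((match = sentenceEnd.exec(head)) !== null) {
      cut = match.index + 1; // keep the punctuation in the current chunk
    }
    if (cut <= 0) cut = head.lastIndexOf(" "); // fall back to a word break
    if (cut <= 0) cut = limit; // no break point at all: hard split
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

Each returned chunk can then be sent as the input of its own request. Since billing is per character, splitting costs about the same as one oversized request would have; only whitespace trimmed at the cut points is dropped.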

Streaming

Set stream: true to receive audio bytes chunk-by-chunk as ElevenLabs renders them, instead of waiting for the full file. First-byte latency drops noticeably for long inputs — useful when you're piping audio straight to a player or speaker. Billing is identical to one-shot (TTS cost is known up-front from input.length), so the only thing that changes is when you start getting bytes.

Not exposed on openai.audio.speech.create()
The OpenAI SDKs don't type a stream field on audio.speech.create(), so you either (a) pass it via the SDK's extra-body escape hatch, or (b) call fetch directly and read response.body as a stream. The latter is simpler and works in every runtime.

# Pipe chunks straight to ffplay (or any stdin-reading player)
curl -N https://api.usequota.ai/v1/audio/speech \
  -H "Authorization: Bearer $QUOTA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "elevenlabs/eleven_turbo_v2_5",
    "input": "Streaming is great for long passages — first audio byte arrives in under a second.",
    "voice": "rachel",
    "stream": true
  }' | ffplay -autoexit -nodisp -

Response headers (X-Quota-Credits-Used, X-Quota-Balance, etc.) are sent before the first audio chunk, so they're readable as soon as fetch resolves — no need to wait for response.body to drain.
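
Option (b) can be sketched as follows, assuming a runtime with a global fetch (current browsers, Node 18+). The function names are hypothetical, and the billing headers are only logged because this page does not specify their value format.

```typescript
const SPEECH_URL = "https://api.usequota.ai/v1/audio/speech";

// Build the fetch options for a streaming TTS request.
function buildStreamingSpeechInit(apiKey: string, input: string) {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "elevenlabs/eleven_turbo_v2_5",
      input,
      voice: "rachel",
      stream: true,
    }),
  };
}

// Hand each audio chunk to `onChunk` as it arrives; resolves with total bytes.
async function streamSpeech(
  apiKey: string,
  input: string,
  onChunk: (chunk: Uint8Array) => void,
): Promise<number> {
  const res = await fetch(SPEECH_URL, buildStreamingSpeechInit(apiKey, input));
  if (!res.ok || res.body === null) {
    throw new Error(`TTS request failed with HTTP ${res.status}`);
  }
  // Billing headers are available before any audio has been read.
  console.log("credits used:", res.headers.get("X-Quota-Credits-Used"));
  console.log("balance:", res.headers.get("X-Quota-Balance"));

  const reader = res.body.getReader();
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    if (value) {
      onChunk(value);
      total += value.byteLength;
    }
  }
  return total;
}
```

A caller would pass a callback that feeds a player or appends to a file, for example streamSpeech(key, text, (chunk) => sink.write(chunk)), where sink is whatever writable target your runtime provides.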

Response

Binary. The response body is raw audio bytes. The Content-Type header reflects the requested format: audio/mpeg for mp3/opus/aac, audio/wav for wav/flac, audio/pcm for pcm. With stream: true the body uses chunked transfer encoding; otherwise it's a single buffered payload.

Billing metadata comes back in response headers:

X-Quota-Credits-Used (header): Total credits debited (cost + markup).
X-Quota-Balance (header): Wallet balance after the deduction.
X-Quota-Characters (header): Character count actually billed (matches input length).
X-Quota-Markup (header): Markup credits added on top of base cost (only when configured).

Pricing

Billed per character of input. Output audio length does not factor in.

elevenlabs/eleven_multilingual_v2: $0.18 / 1k chars. High-quality multilingual model. Best fit for long-form, multi-language, or expressive content.
elevenlabs/eleven_turbo_v2_5: $0.10 / 1k chars. Faster, lower-latency Turbo model. Good default for interactive applications.
elevenlabs/eleven_flash_v2_5: $0.10 / 1k chars. Lowest-latency Flash model. Pick this for real-time agent voices.

How character billing maps onto credits
Quota stores all balances in credits (1,000,000 credits = $1.00). For TTS, credits = (input.length / 1000) × cost_per_1k, with any per-app markup added on top. The full charge is deducted up front, then refunded if ElevenLabs returns an error.
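
As a worked example of the formula, the sketch below estimates the charge for a request before sending it. It assumes cost_per_1k is the published dollar price converted to credits, and that fractional credits round up; this page does not state the rounding rule, so treat the result as an estimate. estimateTtsCredits is a hypothetical helper.

```typescript
const CREDITS_PER_DOLLAR = 1_000_000;

// Per-1k-character prices in USD, copied from the pricing table above.
const TTS_USD_PER_1K: Record<string, number> = {
  "elevenlabs/eleven_multilingual_v2": 0.18,
  "elevenlabs/eleven_turbo_v2_5": 0.1,
  "elevenlabs/eleven_flash_v2_5": 0.1,
};

// Hypothetical helper: estimated credits for one TTS request.
function estimateTtsCredits(model: string, input: string, markupCredits = 0): number {
  const usdPer1k = TTS_USD_PER_1K[model];
  if (usdPer1k === undefined) {
    throw new Error(`unknown TTS model: ${model}`);
  }
  // Convert dollars to whole credits first (0.18 USD becomes 180000 credits).
  const creditsPer1k = Math.round(usdPer1k * CREDITS_PER_DOLLAR);
  // Multiply before dividing so the arithmetic stays in integers.
  // ASSUMPTION: fractional credits round up.
  return Math.ceil((input.length * creditsPer1k) / 1000) + markupCredits;
}
```

For the 44-character sentence used in the request example, eleven_multilingual_v2 works out to 44 × 180,000 / 1,000 = 7,920 credits, a little under one cent.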

Voices

Pass a curated short name or a raw ElevenLabs voice ID. Unknown names fall back to rachel. ElevenLabs voice IDs are typically 20 characters, but any string of 20 or more mixed-case alphanumeric characters is accepted as a voice ID, which is useful for voices you've cloned in your ElevenLabs account.

rachel (default): Calm, warm female voice. The default.
domi (preset): Confident female voice.
bella (preset): Soft female voice.
elli (preset): Young female voice.
antoni (preset): Well-rounded male voice.
josh (preset): Deep male voice.
arnold (preset): Crisp male voice.
adam (preset): Mature male voice.
sam (preset): Raspy male voice.
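
The resolution rules can be mirrored on the client to warn users before a typo silently falls back to rachel. This is a sketch of the documented behavior, not the server's implementation; it assumes preset names are matched exactly in lowercase, and resolveVoice is a hypothetical name.

```typescript
const PRESET_VOICES: readonly string[] = [
  "rachel", "domi", "bella", "elli", "antoni", "josh", "arnold", "adam", "sam",
];
const DEFAULT_VOICE = "rachel";

type ResolvedVoice = {
  kind: "preset" | "voice_id" | "fallback";
  value: string;
};

// Hypothetical helper: predict how a `voice` value will be interpreted.
function resolveVoice(voice?: string): ResolvedVoice {
  if (voice === undefined || voice === "") {
    return { kind: "fallback", value: DEFAULT_VOICE };
  }
  // ASSUMPTION: preset names are case-sensitive and lowercase.
  if (PRESET_VOICES.indexOf(voice) !== -1) {
    return { kind: "preset", value: voice };
  }
  // 20 or more mixed-case alphanumerics are treated as a raw voice ID.
  if (/^[A-Za-z0-9]{20,}$/.test(voice)) {
    return { kind: "voice_id", value: voice };
  }
  // Anything else is an unknown name and falls back to the default voice.
  return { kind: "fallback", value: DEFAULT_VOICE };
}
```

A UI could surface a warning whenever kind is "fallback" for a non-empty input, since the request would still succeed but with a different voice than the user asked for.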

TTS models

Always pass the full elevenlabs/-prefixed model ID. Bare ElevenLabs IDs (without prefix) return invalid_request.

elevenlabs/eleven_multilingual_v2 (multilingual): Highest quality. 29 languages.
elevenlabs/eleven_turbo_v2_5 (turbo): Faster, lower latency. Good general-purpose default.
elevenlabs/eleven_flash_v2_5 (flash): Lowest latency. For real-time agents.

Speech-to-text

POST https://api.usequota.ai/v1/audio/transcriptions
Drop-in for the OpenAI SDK
Send multipart/form-data with a file part and model: "elevenlabs/scribe_v1". The OpenAI SDK's audio.transcriptions.create() produces this shape natively — point its baseURL at Quota.

Request

curl https://api.usequota.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $QUOTA_API_KEY" \
  -F file=@meeting.mp3 \
  -F model=elevenlabs/scribe_v1 \
  -F response_format=verbose_json

Form fields

file (file, required): Audio bytes. Max 25 MB. Supported formats: mp3, wav, m4a, mp4, webm, ogg, flac, aac.
model (string, required): Must be elevenlabs/scribe_v1. No other STT models are supported yet.
language (string): ISO 639-1 code (e.g. en, fr). Auto-detected when omitted.
response_format (string): One of json (default), text, verbose_json. verbose_json adds per-word timestamps and confidence.
diarize (boolean): Tag each word with a speaker_id for multi-speaker audio. Adds latency. Defaults to false.
tag_audio_events (boolean): Surface non-speech events (laughter, applause) as entries in words[]. Defaults to false.
timestamps_granularity (string): One of none, word, character. Defaults to word.
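
For callers not using the OpenAI SDK, the same multipart request can be assembled with FormData. This is a sketch, assuming a runtime with global fetch, FormData, and Blob (browsers, Node 18+), and assuming boolean fields are sent as the strings "true" and "false", which this page does not spell out. The helper names are hypothetical.

```typescript
const TRANSCRIBE_URL = "https://api.usequota.ai/v1/audio/transcriptions";

interface TranscriptionOptions {
  language?: string; // ISO 639-1, auto-detected when omitted
  diarize?: boolean;
  responseFormat?: "json" | "text" | "verbose_json";
}

// Build the multipart body from the documented form fields.
function buildTranscriptionForm(
  audio: Blob,
  filename: string,
  opts: TranscriptionOptions = {},
): FormData {
  const form = new FormData();
  form.append("file", audio, filename);
  form.append("model", "elevenlabs/scribe_v1");
  form.append("response_format", opts.responseFormat ?? "json");
  if (opts.language) form.append("language", opts.language);
  // ASSUMPTION: booleans are serialized as the string "true".
  if (opts.diarize) form.append("diarize", "true");
  return form;
}

// Send the upload and return the transcript text (json or verbose_json).
async function transcribe(apiKey: string, audio: Blob, filename: string): Promise<string> {
  const res = await fetch(TRANSCRIBE_URL, {
    method: "POST",
    // Do not set Content-Type by hand: fetch adds the multipart boundary.
    headers: { Authorization: `Bearer ${apiKey}` },
    body: buildTranscriptionForm(audio, filename),
  });
  if (!res.ok) {
    throw new Error(`STT request failed with HTTP ${res.status}`);
  }
  const data = (await res.json()) as { text: string };
  return data.text;
}
```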

Response

{ "text": "Hello from quota." }

Billing metadata comes back in response headers:

X-Quota-Credits-Used (header): Total credits debited after reconciliation against the exact audio duration.
X-Quota-Balance (header): Wallet balance after the deduction.
X-Quota-Seconds (header): Billed audio seconds: the ceiling of the last word's end time, or the upload's nominal duration when no words were detected.
X-Quota-Markup (header): Markup credits added on top of base cost (only when configured).

Pricing

Billed per second of audio. Quota pre-reserves a generous upper bound derived from the upload's nominal duration, then reconciles against the exact transcript duration once Scribe returns. Refunded in full on upstream error.

elevenlabs/scribe_v1: $0.40 / hour. High-accuracy multilingual transcription. Word-level timestamps, optional diarization, optional non-speech event tagging.
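
To make the reconciliation concrete, the sketch below derives billed seconds the way X-Quota-Seconds is described and converts them to credits at $0.40 per hour (1,000,000 credits = $1.00). Rounding fractional credits up and applying a ceiling to the nominal duration are assumptions, and both function names are hypothetical.

```typescript
// $0.40 per hour, expressed in credits.
const STT_CREDITS_PER_HOUR = 400_000;

// Billed seconds: ceiling of the last word's end time, or the upload's
// nominal duration when the transcript contains no words.
function billedSeconds(wordEndTimes: number[], nominalSeconds: number): number {
  const last = wordEndTimes[wordEndTimes.length - 1];
  // ASSUMPTION: the nominal duration is also rounded up to whole seconds.
  return last === undefined ? Math.ceil(nominalSeconds) : Math.ceil(last);
}

// Credits for a given number of billed seconds, before any markup.
function sttCredits(seconds: number): number {
  // ASSUMPTION: fractional credits round up.
  return Math.ceil((seconds * STT_CREDITS_PER_HOUR) / 3600);
}
```

For a two-minute upload whose last word ends at 89.3 s, billedSeconds gives 90 and sttCredits(90) gives 10,000 credits, one cent. The larger up-front reservation for the nominal 120 s would be reconciled down to that figure.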

Voice-to-voice (V2V)

WS wss://api.usequota.ai/v1/audio/voice-conversion
Quota-only surface — not OpenAI-compatible
OpenAI's audio API has no voice-conversion shape, so this route is a Quota-native WebSocket. Use the @usequota/core connectVoiceConversion() helper or the @usequota/nextjs useVoiceConversion() React hook for browser clients (PCM AudioWorklet encoder included). Native callers can speak the protocol below directly.

Stream audio in, receive audio re-rendered in a target voice out. Billed per second of input audio. Sessions are capped at 30 minutes; callers can declare a shorter cap at handshake.

Lifecycle

  1. Client opens the WebSocket. No Authorization header is required on the upgrade — browsers can't set one on new WebSocket(), so auth is the first text frame.
  2. Client sends a JSON auth frame (see below). Quota validates the token, looks up pricing, atomically reserves the max-session cost, opens an upstream WebSocket to ElevenLabs, and writes an audio_sessions audit row.
  3. Quota replies with a JSON auth_ack frame containing the session ID. From here, binary frames are bidirectional: client PCM in, ElevenLabs-converted audio out.
  4. Every 30 seconds Quota does an in-memory insolvency check against your declared max duration. Heartbeat pings run on the same cadence; two missed pongs (~65 s) close the session.
  5. On any close — client, upstream, timeout, balance exhaustion, format violation — Quota refunds reservation − actual_used and updates the audit row. The session ID is returned in the close frame so you can correlate to your own telemetry.

Auth frame (client → server)

{
  "type": "auth",
  "token": "sk-quota-...",
  "model": "elevenlabs/voice-conversion-v1",
  "voice": "rachel",
  "format": "pcm_16le_16k_mono",
  "max_duration_seconds": 600
}

type (string, required): Must be "auth".
token (string, required): Quota API key (sk-quota-…) or an OAuth access token (quota_token_…) for user-scoped billing.
model (string, required): Must be elevenlabs/voice-conversion-v1.
voice (string, required): ElevenLabs voice ID, or one of the named presets supported by TTS (rachel, adam, etc.).
format (string, required): Input audio format. One of:
  • pcm_16le_16k_mono: 16-bit LE PCM, 16 kHz, mono (32 kB/s)
  • pcm_16le_24k_mono: 16-bit LE PCM, 24 kHz, mono (48 kB/s)
  • pcm_16le_16k_stereo: 16-bit LE PCM, 16 kHz, stereo (64 kB/s)
  Quota enforces the format's declared byte-rate with a sliding 1-second window at a 1.5× ceiling; sending compressed audio while claiming PCM closes the session with 4005 FORMAT_VIOLATION.
max_duration_seconds (integer): Optional. Caps the session length. Hard server-side ceiling is 1800 (30 minutes). The pre-reservation is sized from this value, so smaller values lower the up-front credit hold.
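
The byte rates listed for each format follow from sample rate × channels × 2 bytes per sample. The sketch below reproduces that arithmetic so a native client can size its chunks and stay under the 1.5× ceiling. The helper names are hypothetical; the format names and rates come from the list above.

```typescript
type PcmFormat = "pcm_16le_16k_mono" | "pcm_16le_24k_mono" | "pcm_16le_16k_stereo";

const FORMAT_SPECS: Record<PcmFormat, { sampleRate: number; channels: number }> = {
  pcm_16le_16k_mono: { sampleRate: 16000, channels: 1 },
  pcm_16le_24k_mono: { sampleRate: 24000, channels: 1 },
  pcm_16le_16k_stereo: { sampleRate: 16000, channels: 2 },
};

const BYTES_PER_SAMPLE = 2; // 16-bit PCM

// Nominal bytes per second for a declared format.
function bytesPerSecond(format: PcmFormat): number {
  const { sampleRate, channels } = FORMAT_SPECS[format];
  return sampleRate * channels * BYTES_PER_SAMPLE;
}

// Bytes per second above which a session would be closed with 4005.
function violationCeiling(format: PcmFormat): number {
  return bytesPerSecond(format) * 1.5;
}

// Size of a chunk covering `ms` milliseconds, aligned to whole sample frames.
function chunkBytes(format: PcmFormat, ms: number): number {
  const { sampleRate, channels } = FORMAT_SPECS[format];
  const frames = Math.floor((sampleRate * ms) / 1000);
  return frames * channels * BYTES_PER_SAMPLE;
}
```

Sending chunkBytes(format, 20) bytes every 20 ms keeps a real-time capture at the nominal rate. A client that replays buffered audio faster than real time is the case most likely to cross the ceiling.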

Auth ack (server → client)

{
  "type": "auth_ack",
  "session_id": "9f4c…",
  "upstream_ready": true,
  "max_duration_seconds": 600
}

After auth_ack, both directions switch to binary frames. Send raw PCM bytes; receive ElevenLabs-encoded MP3 (44.1 kHz, 128 kbps).
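
For native callers, the handshake reduces to a small state machine: send the auth frame on open, wait for auth_ack, then exchange binary frames. The sketch below keeps that logic independent of any particular WebSocket library by depending only on a send method. SocketLike and createVoiceSession are hypothetical names, not the @usequota/core API.

```typescript
// Minimal surface needed from a WebSocket implementation.
interface SocketLike {
  send(data: string | Uint8Array): void;
}

interface AuthFrame {
  type: "auth";
  token: string;
  model: string;
  voice: string;
  format: string;
  max_duration_seconds?: number;
}

type SessionState = "connecting" | "authenticating" | "streaming";

function createVoiceSession(
  socket: SocketLike,
  auth: AuthFrame,
  onAudio: (mp3: Uint8Array) => void,
) {
  let state: SessionState = "connecting";
  let sessionId: string | null = null;
  return {
    getState: (): SessionState => state,
    getSessionId: (): string | null => sessionId,
    // Call when the socket opens: auth is the first text frame.
    handleOpen(): void {
      socket.send(JSON.stringify(auth));
      state = "authenticating";
    },
    // Call for every text frame received from the server.
    handleText(raw: string): void {
      const msg = JSON.parse(raw) as { type?: string; session_id?: string };
      if (state === "authenticating" && msg.type === "auth_ack") {
        sessionId = msg.session_id ?? null;
        state = "streaming";
      }
    },
    // Call for every binary frame received: converted audio from upstream.
    handleBinary(data: Uint8Array): void {
      if (state === "streaming") onAudio(data);
    },
    // Send raw PCM; returns false if the handshake has not completed.
    sendPcm(chunk: Uint8Array): boolean {
      if (state !== "streaming") return false;
      socket.send(chunk);
      return true;
    },
  };
}
```

With a browser WebSocket, set binaryType to "arraybuffer", call handleOpen from the open event, and route each message to handleText when typeof event.data is "string" or to handleBinary (wrapped in a Uint8Array) otherwise.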

Close codes

Quota uses application-defined close codes (RFC 6455 §7.4.2, ≥ 4000). The Quota SDKs map these to typed errors; native callers should read event.code.

Code | Name | Meaning
4002 | BALANCE_EXHAUSTED | Session exceeded its max_duration_seconds, which exhausts the credits reserved at handshake. Bill the user for what they used; the reservation diff is refunded.
4003 | UPSTREAM_DISCONNECTED | ElevenLabs closed the upstream socket. Reconnect with a new session if needed.
4004 | SESSION_TIMEOUT | Session hit the 30-minute hard cap.
4005 | FORMAT_VIOLATION | Input bytes/second exceeded 1.5× the declared format ceiling. Indicates a format misdeclaration or compressed-as-PCM attack.
4006 | INVALID_AUTH | First frame missing, malformed, or token invalid. No upstream connection was opened.
4007 | HEARTBEAT_LOST | Two server pings went unanswered (~65 s). A common cause is a dropped network; clients should reconnect.
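
Native callers reading event.code can map it back to these names with a small lookup. The sketch below also flags the two codes for which the table suggests reconnecting; treating every other code as non-retryable is an assumption of this example, and describeClose is a hypothetical helper rather than the Quota SDK's typed-error mapping.

```typescript
const CLOSE_CODE_NAMES: Record<number, string> = {
  4002: "BALANCE_EXHAUSTED",
  4003: "UPSTREAM_DISCONNECTED",
  4004: "SESSION_TIMEOUT",
  4005: "FORMAT_VIOLATION",
  4006: "INVALID_AUTH",
  4007: "HEARTBEAT_LOST",
};

// Codes whose description suggests opening a new session.
// ASSUMPTION: all other codes need a fix or a user decision first.
const RECONNECT_CODES: readonly number[] = [4003, 4007];

function describeClose(code: number): { name: string; reconnect: boolean } {
  const name = CLOSE_CODE_NAMES[code];
  return {
    name: name === undefined ? "UNKNOWN" : name,
    reconnect: RECONNECT_CODES.indexOf(code) !== -1,
  };
}
```

In a close handler this becomes const { name, reconnect } = describeClose(event.code), with a new session opened only when reconnect is true.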

Pricing

elevenlabs/voice-conversion-v1 bills per second of input audio at $9.00 per hour. The reservation at handshake is max_duration_seconds × price; the diff is refunded on close. Output audio length is ≈ 1:1 with input and is not billed separately.
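
As a worked example of the reservation arithmetic: $9.00 per hour is 9,000,000 credits per 3,600 seconds, or 2,500 credits per second. The sketch below assumes that an omitted max_duration_seconds reserves the full 1,800-second cap and that usage is rounded up to whole seconds; neither is stated on this page. The function names are hypothetical.

```typescript
const V2V_CREDITS_PER_SECOND = 9_000_000 / 3600; // 2500 credits per second
const MAX_SESSION_SECONDS = 1800; // 30-minute hard cap

// Credits held at handshake for a declared session length.
// ASSUMPTION: omitting the value reserves the full 30-minute cap.
function reservationCredits(maxDurationSeconds: number = MAX_SESSION_SECONDS): number {
  const capped = Math.min(maxDurationSeconds, MAX_SESSION_SECONDS);
  return capped * V2V_CREDITS_PER_SECOND;
}

// Credits returned on close: reservation minus what was actually used.
function refundOnClose(maxDurationSeconds: number, usedSeconds: number): number {
  const reserved = reservationCredits(maxDurationSeconds);
  // ASSUMPTION: usage is billed in whole seconds, rounded up.
  const used = Math.ceil(usedSeconds) * V2V_CREDITS_PER_SECOND;
  return Math.max(0, reserved - used);
}
```

Declaring max_duration_seconds: 600 holds 1,500,000 credits ($1.50) instead of 4,500,000 ($4.50) for the full cap. If the caller hangs up after 125 seconds, 312,500 credits are kept and 1,187,500 are refunded.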

Browser quickstart

import { useVoiceConversion } from "@usequota/nextjs";

export function VoiceClone() {
  const { start, stop, state, error } = useVoiceConversion({
    model: "elevenlabs/voice-conversion-v1",
    voice: "rachel",
    format: "pcm_16le_16k_mono",
    maxDurationSeconds: 600,
  });

  return (
    <button onClick={state === "open" ? stop : start}>
      {state === "open" ? "Stop" : "Speak"}
      {error && <span> — {error.message}</span>}
    </button>
  );
}

User-scoped billing

For end-user-pays flows, pass the user's OAuth access token (quota_token_…) as the bearer instead of your API key. Quota debits the user's wallet rather than yours. Works identically for /v1/audio/speech, /v1/audio/transcriptions, and the V2V WebSocket (pass the OAuth token in the auth frame's token field). See Sign in with Quota or Connect Quota Wallet for the end-to-end OAuth flow.

Errors

Errors use the standard Quota envelope:

{
  "error": {
    "code": "insufficient_credits",
    "message": "Insufficient credits. Balance: 0, Required: 90",
    "hint": "..."
  }
}

400 invalid_request: Missing required fields; input over 5,000 chars (TTS) or malformed multipart (STT); unsupported response_format; or model not prefixed with elevenlabs/.
401 invalid_api_key: Token missing, revoked, or for the wrong environment.
402 insufficient_credits: Wallet balance below the pre-reservation amount.
404 model_not_found: Model is not in the pricing table.
404 user_not_found: External user ID does not exist on this app.
413 file_too_large: STT only. Upload exceeds 25 MB; split the audio yourself before sending.
502 upstream_error: ElevenLabs returned an error. Pre-deducted credits are refunded automatically.
503 provider_unavailable: ELEVENLABS_API_KEY is missing or was rejected at the provider.
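
A client can narrow a failed response body to this envelope and branch on error.code. The sketch below is one way to do that; the grouping into actions is an illustrative policy chosen for this example, not something Quota prescribes, and hint is treated as optional by assumption.

```typescript
interface QuotaErrorEnvelope {
  error: {
    code: string;
    message: string;
    hint?: string; // ASSUMPTION: may be absent on some errors
  };
}

// Type guard: does an unknown JSON value have the documented envelope shape?
function isQuotaError(value: unknown): value is QuotaErrorEnvelope {
  if (typeof value !== "object" || value === null) return false;
  const err = (value as { error?: unknown }).error;
  if (typeof err !== "object" || err === null) return false;
  const fields = err as { code?: unknown; message?: unknown };
  return typeof fields.code === "string" && typeof fields.message === "string";
}

type ErrorAction = "fix_request" | "reauthenticate" | "top_up" | "retry_later" | "unknown";

// Illustrative policy: map each documented code to a next step.
function classifyError(body: unknown): ErrorAction {
  if (!isQuotaError(body)) return "unknown";
  switch (body.error.code) {
    case "invalid_request":
    case "model_not_found":
    case "user_not_found":
    case "file_too_large":
      return "fix_request";
    case "invalid_api_key":
      return "reauthenticate";
    case "insufficient_credits":
      return "top_up";
    case "upstream_error":
    case "provider_unavailable":
      return "retry_later";
    default:
      return "unknown";
  }
}
```

Typical use is classifyError(await res.json()) inside the non-2xx branch of a request helper, with the message and hint shown to the developer or logged.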