Audio

Three audio surfaces, all backed by ElevenLabs. The HTTP routes mirror OpenAI's audio.speech and audio.transcriptions APIs, so the OpenAI SDKs work as-is with a swapped baseURL. The voice-to-voice route is a WebSocket because the call is bidirectional and long-lived; there is no OpenAI SDK shape for it, so callers either use the Quota SDK or speak the protocol directly. TTS bills per input character; STT bills per second of transcribed audio; V2V bills per second of input audio.

Text-to-speech

POST https://api.usequota.ai/v1/audio/speech

Drop-in for the OpenAI SDK

The OpenAI Node and Python SDKs both support custom baseURL on audio.speech.create(). Point them at Quota and pass an elevenlabs/-prefixed model — no other changes needed.

Request

curl https://api.usequota.ai/v1/audio/speech \
  -H "Authorization: Bearer $QUOTA_API_KEY" \
  -H "Content-Type: application/json" \
  --output speech.mp3 \
  -d '{
    "model": "elevenlabs/eleven_multilingual_v2",
    "input": "The quick brown fox jumps over the lazy dog.",
    "voice": "rachel",
    "response_format": "mp3"
  }'
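
The same request through the OpenAI Python SDK might look like the sketch below. It assumes the `openai` package (v1 or later) is installed and that `QUOTA_API_KEY` is set in the environment. The helper name `synthesize_to_file` is illustrative, and the base URL is inferred from the route above.

```python
import os

# The SDK appends /audio/speech to this base URL.
QUOTA_BASE_URL = "https://api.usequota.ai/v1"


def synthesize_to_file(text: str, path: str = "speech.mp3") -> str:
    """Send one TTS request through the OpenAI SDK and save the audio bytes."""
    # Imported lazily so this module still loads where the SDK is not installed.
    from openai import OpenAI

    client = OpenAI(base_url=QUOTA_BASE_URL, api_key=os.environ["QUOTA_API_KEY"])
    response = client.audio.speech.create(
        model="elevenlabs/eleven_multilingual_v2",
        voice="rachel",
        input=text,
        response_format="mp3",
    )
    response.write_to_file(path)  # body is raw audio, written as-is
    return path
```

Calling `synthesize_to_file("The quick brown fox jumps over the lazy dog.")` should leave an MP3 at `speech.mp3`, mirroring the curl call.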

Request body

Field	Type	Required	Description
model	string	yes	Must be prefixed with `elevenlabs/`. See TTS models below.
input	string	yes	Text to synthesize. Hard cap of 5,000 characters per request; split longer text yourself. ElevenLabs bills per character, so `input.length` directly determines cost.
voice	string	no	Voice name (e.g. `rachel`, `adam`, `bella`) or a 20-character ElevenLabs voice ID. Defaults to `rachel`. See Voices.
response_format	string	no	One of `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm`. Defaults to `mp3`. ElevenLabs HTTP TTS only emits MP3 and PCM natively: `opus`/`aac` fall back to MP3, `flac`/`wav` fall back to PCM.
speed	number	no	Playback speed multiplier, between 0.5 and 2 (ElevenLabs supports roughly 0.7–1.2). Forwarded as `voice_settings.speed`.
stream	boolean	no	When `true`, audio bytes stream back chunked from ElevenLabs's streaming endpoint instead of buffering server-side, so first-byte latency drops noticeably for long inputs. Billing is identical: TTS cost is known up-front from `input.length`, so the deduction is the same either way. Defaults to `false`.
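
Because `input` is capped at 5,000 characters, longer text has to be split client-side. One possible approach, sketched below, packs whole sentences into chunks and only cuts mid-sentence when a single sentence exceeds the cap. The helper name `split_for_tts` is hypothetical, and Python's `len()` counts code points, which may differ from the server's count for some non-BMP characters.

```python
import re

MAX_TTS_CHARS = 5000  # per-request cap from the request body table


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list:
    """Split text into chunks of at most `limit` characters.

    Sentence boundaries are preferred; a sentence longer than `limit`
    is cut hard at the limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        # A sentence that cannot fit in one request is cut into limit-sized pieces.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request. Since billing is per character, splitting should not change the total base cost beyond the single spaces dropped at chunk boundaries.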

Streaming

Set stream: true to receive audio bytes chunk-by-chunk as ElevenLabs renders them, instead of waiting for the full file. First-byte latency drops noticeably for long inputs — useful when you're piping audio straight to a player or speaker. Billing is identical to one-shot (TTS cost is known up-front from input.length), so the only thing that changes is when you start getting bytes.

Not exposed on openai.audio.speech.create()

The OpenAI SDKs don't type a stream field on audio.speech.create(), so you either (a) pass it via the SDK's extra-body escape hatch, or (b) call fetch directly and read response.body as a stream. The latter is simpler and works in every runtime.

# Pipe chunks straight to ffplay (or any stdin-reading player)
curl -N https://api.usequota.ai/v1/audio/speech \
  -H "Authorization: Bearer $QUOTA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "elevenlabs/eleven_turbo_v2_5",
    "input": "Streaming is great for long passages — first audio byte arrives in under a second.",
    "voice": "rachel",
    "stream": true
  }' | ffplay -autoexit -nodisp -

Response headers (X-Quota-Credits-Used, X-Quota-Balance, etc.) are sent before the first audio chunk, so they're readable as soon as fetch resolves — no need to wait for response.body to drain.
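
A standard-library Python sketch of option (b), reading the billing headers before draining the body. The helper names are illustrative; only the URL, body fields, and header names come from this page.

```python
import json
import urllib.request

SPEECH_URL = "https://api.usequota.ai/v1/audio/speech"


def build_stream_request(text: str, api_key: str) -> urllib.request.Request:
    """Build a POST request with stream: true in the JSON body."""
    body = {
        "model": "elevenlabs/eleven_turbo_v2_5",
        "input": text,
        "voice": "rachel",
        "stream": True,
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def stream_to_file(text: str, path: str, api_key: str, chunk_size: int = 4096) -> dict:
    """Write audio to `path` in fixed-size reads and return the billing headers."""
    with urllib.request.urlopen(build_stream_request(text, api_key)) as resp:
        # Headers are available as soon as the response opens, before any read().
        billing = {
            "credits_used": resp.headers.get("X-Quota-Credits-Used"),
            "balance": resp.headers.get("X-Quota-Balance"),
        }
        with open(path, "wb") as out:
            while chunk := resp.read(chunk_size):
                out.write(chunk)
    return billing
```

To feed a player instead of a file, the `out.write(chunk)` call could be replaced with a write to the player's stdin.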

Response

Binary. The response body is raw audio bytes. The Content-Type header reflects the requested format: audio/mpeg for mp3/opus/aac, audio/wav for wav/flac, audio/pcm for pcm. With stream: true the body uses chunked transfer encoding; otherwise it's a single buffered payload.

Billing metadata comes back in response headers:

Header	Description
X-Quota-Credits-Used	Total credits debited (cost + markup).
X-Quota-Balance	Wallet balance after the deduction.
X-Quota-Characters	Character count actually billed (matches input length).
X-Quota-Markup	Markup credits added on top of base cost (only when configured).

Pricing

Billed per character of input. Output audio length does not factor in.

Model	Price	Notes
elevenlabs/eleven_multilingual_v2	$0.18 / 1k chars	High-quality multilingual model. Best fit for long-form, multi-language, or expressive content.
elevenlabs/eleven_turbo_v2_5	$0.10 / 1k chars	Faster, lower-latency Turbo model. Good default for interactive applications.
elevenlabs/eleven_flash_v2_5	$0.10 / 1k chars	Lowest-latency Flash model. Pick this for real-time agent voices.

How character billing maps onto credits

Quota stores all balances in credits (1,000,000 credits = $1.00). For TTS, credits = (input.length / 1000) × cost_per_1k, with any per-app markup added on top. The full charge is deducted up front, then refunded if ElevenLabs returns an error.
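
As a worked example of the formula, the sketch below converts the listed dollar prices to credits and computes the base charge for a given input length. It uses exact fractions because the rounding rule is not stated here, and it leaves out markup, whose form depends on per-app configuration. The function name is illustrative.

```python
from fractions import Fraction

CREDITS_PER_DOLLAR = 1_000_000

# Dollar prices per 1,000 input characters, from the pricing table.
TTS_PRICE_PER_1K = {
    "elevenlabs/eleven_multilingual_v2": Fraction("0.18"),
    "elevenlabs/eleven_turbo_v2_5": Fraction("0.10"),
    "elevenlabs/eleven_flash_v2_5": Fraction("0.10"),
}


def tts_base_credits(model: str, char_count: int) -> Fraction:
    """Base credits before markup: (chars / 1000) x cost_per_1k in credits."""
    cost_per_1k = TTS_PRICE_PER_1K[model] * CREDITS_PER_DOLLAR
    return Fraction(char_count, 1000) * cost_per_1k
```

Under these assumptions, the 44-character sample sentence on `eleven_multilingual_v2` comes to 7,920 credits, and a full 5,000-character request on `eleven_turbo_v2_5` comes to 500,000 credits ($0.50).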

Voices

Pass a curated short name or a raw ElevenLabs voice ID. Unknown names fall back to rachel. Any string of 20 or more mixed-case alphanumeric characters is accepted as a voice ID, which is useful for voices you've cloned in your ElevenLabs account.

Voice	Kind	Description
rachel	default	Calm, warm female voice. The default.
domi	preset	Confident female voice.
bella	preset	Soft female voice.
elli	preset	Young female voice.
antoni	preset	Well-rounded male voice.
josh	preset	Deep male voice.
arnold	preset	Crisp male voice.
adam	preset	Mature male voice.
sam	preset	Raspy male voice.
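
The resolution rule described above can be mirrored client-side to catch typos before a request silently falls back to rachel. This is a sketch of the rule as documented, not the server's actual code. `classify_voice` is a hypothetical helper, and exact-case matching of preset names is an assumption.

```python
import re
from typing import Optional, Tuple

PRESET_VOICES = {
    "rachel", "domi", "bella", "elli", "antoni", "josh", "arnold", "adam", "sam",
}
DEFAULT_VOICE = "rachel"
# 20 or more mixed-case alphanumerics is treated as a raw voice ID.
VOICE_ID_PATTERN = re.compile(r"[A-Za-z0-9]{20,}")


def classify_voice(voice: Optional[str]) -> Tuple[str, str]:
    """Return (kind, effective_voice) for a requested voice value."""
    if voice is None:
        return ("default", DEFAULT_VOICE)
    if voice in PRESET_VOICES:
        return ("preset", voice)
    if VOICE_ID_PATTERN.fullmatch(voice):
        return ("voice_id", voice)
    # Unknown names are not rejected; they resolve to the default voice.
    return ("fallback", DEFAULT_VOICE)
```

A caller could log or reject the `"fallback"` case instead of letting a misspelled name produce audio in an unintended voice.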

TTS models

Always pass the full elevenlabs/-prefixed model ID. Bare ElevenLabs IDs (without prefix) return invalid_request.

Model	Tier	Notes
elevenlabs/eleven_multilingual_v2	multilingual	Highest quality. 29 languages.
elevenlabs/eleven_turbo_v2_5	turbo	Faster, lower latency. Good general-purpose default.
elevenlabs/eleven_flash_v2_5	flash	Lowest latency. For real-time agents.

Speech-to-text

POST https://api.usequota.ai/v1/audio/transcriptions

Drop-in for the OpenAI SDK

Send multipart/form-data with a file part and model: "elevenlabs/scribe_v1". The OpenAI SDK's audio.transcriptions.create() produces this shape natively — point its baseURL at Quota.

Request

curl https://api.usequota.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $QUOTA_API_KEY" \
  -F file=@meeting.mp3 \
  -F model=elevenlabs/scribe_v1 \
  -F response_format=verbose_json

Form fields

Field	Type	Required	Description
file	file	yes	Audio bytes. Max 25 MB. Supported MIME types: mp3, wav, m4a, mp4, webm, ogg, flac, aac.
model	string	yes	Must be `elevenlabs/scribe_v1`. No other STT models are supported yet.
language	string	no	ISO 639-1 code (e.g. `en`, `fr`). Auto-detected when omitted.
response_format	string	no	One of `json` (default), `text`, `verbose_json`. `verbose_json` adds per-word timestamps and confidence.
diarize	boolean	no	Tag each word with a `speaker_id` for multi-speaker audio. Adds latency. Defaults to `false`.
tag_audio_events	boolean	no	Surface non-speech events (laughter, applause) as entries in `words[]`. Defaults to `false`.
timestamps_granularity	string	no	One of `none`, `word`, `character`. Defaults to `word`.
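
For callers not using an SDK, the multipart body can be assembled by hand. The sketch below uses only the Python standard library; the helper names are illustrative, and reading "25 MB" as 25,000,000 bytes is a conservative assumption.

```python
import json
import mimetypes
import os
import urllib.request
import uuid

TRANSCRIPTIONS_URL = "https://api.usequota.ai/v1/audio/transcriptions"
MAX_UPLOAD_BYTES = 25 * 1000 * 1000  # assumption: "25 MB" read as decimal megabytes


def build_transcription_request(audio: bytes, filename: str, api_key: str,
                                **fields: str) -> urllib.request.Request:
    """Encode the file plus form fields as multipart/form-data."""
    if len(audio) > MAX_UPLOAD_BYTES:
        raise ValueError("upload exceeds the 25 MB limit; split the audio first")
    form = {"model": "elevenlabs/scribe_v1", **fields}
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in form.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(header.encode() + str(value).encode() + b"\r\n")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    )
    parts.append(file_header.encode() + audio + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        TRANSCRIPTIONS_URL,
        data=b"".join(parts),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


def transcribe(path: str, api_key: str, **fields: str) -> dict:
    """Upload a file and parse the JSON reply (json or verbose_json formats)."""
    with open(path, "rb") as f:
        audio = f.read()
    req = build_transcription_request(audio, os.path.basename(path), api_key, **fields)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Boolean fields travel as form strings, for example `transcribe("meeting.mp3", key, diarize="true", response_format="verbose_json")`.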

Response

{ "text": "Hello from quota." }

Billing metadata comes back in response headers:

Header	Description
X-Quota-Credits-Used	Total credits debited after reconciliation against the exact audio duration.
X-Quota-Balance	Wallet balance after the deduction.
X-Quota-Seconds	Billed audio seconds: the ceiling of the last word's end time, or the upload's nominal duration when no words were detected.
X-Quota-Markup	Markup credits added on top of base cost (only when configured).

Pricing

Billed per second of audio. Quota pre-reserves a generous upper bound derived from the upload's nominal duration, then reconciles against the exact transcript duration once Scribe returns. Refunded in full on upstream error.

Model	Price	Notes
elevenlabs/scribe_v1	$0.40 / hour	High-accuracy multilingual transcription. Word-level timestamps, optional diarization, optional non-speech event tagging.
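
A worked sketch of the reconciliation arithmetic, assuming the 1,000,000 credits per dollar rate used elsewhere in Quota and no markup. The function names are illustrative, and rounding the nominal duration up is an assumption; the exact rule may differ.

```python
import math
from fractions import Fraction

CREDITS_PER_DOLLAR = 1_000_000
SCRIBE_DOLLARS_PER_HOUR = Fraction("0.40")


def billed_seconds(word_end_times: list, nominal_duration: float) -> int:
    """Ceiling of the last word's end time, or the nominal duration if no words."""
    if word_end_times:
        return math.ceil(word_end_times[-1])
    return math.ceil(nominal_duration)  # assumption: nominal duration rounds up too


def stt_base_credits(seconds: int) -> Fraction:
    """Base credits before markup at $0.40 per hour of audio."""
    credits_per_second = SCRIBE_DOLLARS_PER_HOUR * CREDITS_PER_DOLLAR / 3600
    return seconds * credits_per_second
```

With these numbers, a clip whose last word ends at 89.3 s bills as 90 seconds, which works out to 10,000 credits ($0.01). A full hour is 400,000 credits.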

Voice-to-voice (V2V)

WS wss://api.usequota.ai/v1/audio/voice-conversion

Quota-only surface — not OpenAI-compatible

OpenAI's audio API has no voice-conversion shape, so this route is a Quota-native WebSocket. Use the @usequota/core connectVoiceConversion() helper or the @usequota/nextjs useVoiceConversion() React hook for browser clients (PCM AudioWorklet encoder included). Native callers can speak the protocol below directly.

Stream audio in, receive audio re-rendered in a target voice out. Billed per second of input audio. Sessions are capped at 30 minutes; callers can declare a shorter cap at handshake.

Lifecycle

Client opens the WebSocket. No Authorization header is required on the upgrade — browsers can't set one on new WebSocket(), so auth is the first text frame.
Client sends a JSON auth frame (see below). Quota validates the token, looks up pricing, atomically reserves the max-session cost, opens an upstream WebSocket to ElevenLabs, and writes an audio_sessions audit row.
Quota replies with a JSON auth_ack frame containing the session ID. From here, binary frames are bidirectional: client PCM in, ElevenLabs-converted audio out.
Every 30 seconds Quota does an in-memory insolvency check against your declared max duration. Heartbeat pings run on the same cadence; two missed pongs (~65 s) close the session.
On any close — client, upstream, timeout, balance exhaustion, format violation — Quota refunds reservation − actual_used and updates the audit row. The session ID is returned in the close frame so you can correlate to your own telemetry.

Auth frame (client → server)

{
  "type": "auth",
  "token": "sk-quota-...",
  "model": "elevenlabs/voice-conversion-v1",
  "voice": "rachel",
  "format": "pcm_16le_16k_mono",
  "max_duration_seconds": 600
}

Field	Type	Required	Description
type	string	yes	Must be `"auth"`.
token	string	yes	Quota API key (`sk-quota-…`) or an OAuth access token (`quota_token_…`) for user-scoped billing.
model	string	yes	Must be `elevenlabs/voice-conversion-v1`.
voice	string	yes	ElevenLabs voice ID, or one of the named presets supported by TTS (`rachel`, `adam`, etc.).
format	string	yes	Input audio format. One of `pcm_16le_16k_mono` (16-bit LE PCM, 16 kHz, mono, 32 kB/s), `pcm_16le_24k_mono` (16-bit LE PCM, 24 kHz, mono, 48 kB/s), or `pcm_16le_16k_stereo` (16-bit LE PCM, 16 kHz, stereo, 64 kB/s). Quota enforces the format's declared byte-rate with a sliding 1-second window at a 1.5× ceiling, so sending compressed audio while claiming PCM closes the session with `4005 FORMAT_VIOLATION`.
max_duration_seconds	integer	no	Caps the session length. Hard server-side ceiling is `1800` (30 minutes). The pre-reservation is sized from this value, so smaller values lower the up-front credit hold.
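
For native callers, the auth frame and PCM chunk sizes can be derived from the fields above. The sketch below is one way to do that; the helper names are hypothetical, the 20 ms default chunk length is an arbitrary choice, and the lower bound of 1 second is an assumption.

```python
import json
from typing import Optional

# Bytes per second for each declared input format (sample rate x 2 bytes x channels).
FORMAT_BYTE_RATES = {
    "pcm_16le_16k_mono": 32_000,
    "pcm_16le_24k_mono": 48_000,
    "pcm_16le_16k_stereo": 64_000,
}
MAX_SESSION_SECONDS = 1800  # hard server-side ceiling


def build_auth_frame(token: str, voice: str, audio_format: str,
                     max_duration_seconds: Optional[int] = None) -> str:
    """Serialize the first text frame sent after the WebSocket opens."""
    if audio_format not in FORMAT_BYTE_RATES:
        raise ValueError(f"unsupported format: {audio_format}")
    frame = {
        "type": "auth",
        "token": token,
        "model": "elevenlabs/voice-conversion-v1",
        "voice": voice,
        "format": audio_format,
    }
    if max_duration_seconds is not None:
        if not 1 <= max_duration_seconds <= MAX_SESSION_SECONDS:
            raise ValueError("max_duration_seconds must be between 1 and 1800")
        frame["max_duration_seconds"] = max_duration_seconds
    return json.dumps(frame)


def chunk_bytes(audio_format: str, chunk_ms: int = 20) -> int:
    """Size of one binary frame holding `chunk_ms` milliseconds of PCM."""
    return FORMAT_BYTE_RATES[audio_format] * chunk_ms // 1000
```

Sending `chunk_bytes(...)`-sized frames at real-time pace keeps the input at the declared byte rate, which sits under the 1.5× enforcement ceiling.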

Auth ack (server → client)

{
  "type": "auth_ack",
  "session_id": "9f4c…",
  "upstream_ready": true,
  "max_duration_seconds": 600
}

After auth_ack, both directions switch to binary frames. Send raw PCM bytes; receive ElevenLabs-encoded MP3 (44.1 kHz, 128 kbps).

Close codes

Quota uses application-defined close codes (RFC 6455 §7.4.2, ≥ 4000). The Quota SDKs map these to typed errors; native callers should read event.code.

Code	Name	Meaning
`4002`	`BALANCE_EXHAUSTED`	Session exceeded its `max_duration_seconds`. Bill the user for what they used; the reservation diff is refunded.
`4003`	`UPSTREAM_DISCONNECTED`	ElevenLabs closed the upstream socket. Reconnect with a new session if needed.
`4004`	`SESSION_TIMEOUT`	Session hit the 30-minute hard cap.
`4005`	`FORMAT_VIOLATION`	Input bytes/second exceeded 1.5× the declared format ceiling. Indicates a format misdeclaration or compressed-as-PCM attack.
`4006`	`INVALID_AUTH`	First frame missing, malformed, or token invalid. No upstream connection was opened.
`4007`	`HEARTBEAT_LOST`	Two server pings went unanswered (~65 s). Common cause is a dropped network — clients should reconnect.
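
Native callers might map these codes to names and a reconnect decision along the following lines. The enum mirrors the table. Treating only `4003` and `4007` as reconnectable follows the wording of the table, but it is an interpretation rather than documented SDK behavior.

```python
from enum import IntEnum
from typing import Tuple


class QuotaCloseCode(IntEnum):
    BALANCE_EXHAUSTED = 4002
    UPSTREAM_DISCONNECTED = 4003
    SESSION_TIMEOUT = 4004
    FORMAT_VIOLATION = 4005
    INVALID_AUTH = 4006
    HEARTBEAT_LOST = 4007


# Codes where the table suggests opening a new session.
RECONNECTABLE = {
    QuotaCloseCode.UPSTREAM_DISCONNECTED,
    QuotaCloseCode.HEARTBEAT_LOST,
}


def describe_close(code: int) -> Tuple[str, bool]:
    """Return (name, should_reconnect) for a WebSocket close code."""
    try:
        known = QuotaCloseCode(code)
    except ValueError:
        # Standard or unlisted codes are passed through without a retry hint.
        return (f"UNKNOWN_{code}", False)
    return (known.name, known in RECONNECTABLE)
```

A reconnect always starts a fresh session with a new auth frame and a new reservation.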

Pricing

elevenlabs/voice-conversion-v1 bills per second of input audio at $9.00 per hour. The reservation at handshake is max_duration_seconds × price; the diff is refunded on close. Output audio length is ≈ 1:1 with input and is not billed separately.
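
The reservation and refund can be estimated ahead of time. The sketch below assumes 1,000,000 credits per dollar, whole-second billing, and no markup. It also assumes the 1,800-second ceiling is what gets reserved when `max_duration_seconds` is omitted, which is not stated here.

```python
CREDITS_PER_DOLLAR = 1_000_000
V2V_DOLLARS_PER_HOUR = 9
# $9.00 per hour works out to exactly 2,500 credits per second.
V2V_CREDITS_PER_SECOND = V2V_DOLLARS_PER_HOUR * CREDITS_PER_DOLLAR // 3600


def reservation(max_duration_seconds: int = 1800) -> int:
    """Credits held at handshake: max_duration_seconds x price per second."""
    return max_duration_seconds * V2V_CREDITS_PER_SECOND


def refund_on_close(max_duration_seconds: int, seconds_used: int) -> int:
    """Credits returned on close: reservation minus what was actually used."""
    used = min(seconds_used, max_duration_seconds)
    return reservation(max_duration_seconds) - used * V2V_CREDITS_PER_SECOND
```

Under these assumptions a 600-second cap holds 1,500,000 credits ($1.50), and closing after 125 seconds returns 1,187,500 of them.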

Browser quickstart

import { useVoiceConversion } from "@usequota/nextjs";

export function VoiceClone() {
  const { start, stop, state, error } = useVoiceConversion({
    model: "elevenlabs/voice-conversion-v1",
    voice: "rachel",
    format: "pcm_16le_16k_mono",
    maxDurationSeconds: 600,
  });

  return (
    <button onClick={state === "open" ? stop : start}>
      {state === "open" ? "Stop" : "Speak"}
      {error && <span> — {error.message}</span>}
    </button>
  );
}

User-scoped billing

For end-user-pays flows, pass the user's OAuth access token (quota_token_…) as the bearer instead of your API key. Quota debits the user's wallet rather than yours. Works identically for /v1/audio/speech, /v1/audio/transcriptions, and the V2V WebSocket (pass the OAuth token in the auth frame's token field). See Sign in with Quota or Connect Quota Wallet for the end-to-end OAuth flow.
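
Since the two token kinds differ only by prefix, a client can tell which wallet a call will debit before sending it. A small sketch; the helper names are hypothetical, and the prefixes are the ones shown on this page.

```python
def billing_scope(token: str) -> str:
    """Return which wallet a bearer token debits, judged by its prefix."""
    if token.startswith("quota_token_"):
        return "user"  # OAuth access token: the end user's wallet
    if token.startswith("sk-quota-"):
        return "app"  # API key: the developer's own wallet
    raise ValueError("unrecognized token prefix")


def auth_headers(token: str) -> dict:
    """Bearer header for the HTTP routes; the V2V auth frame takes the raw token."""
    billing_scope(token)  # rejects tokens with neither known prefix
    return {"Authorization": f"Bearer {token}"}
```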

Errors

Errors use the standard Quota envelope:

{
  "error": {
    "code": "insufficient_credits",
    "message": "Insufficient credits. Balance: 0, Required: 90",
    "hint": "..."
  }
}

Status	Code	Meaning
400	invalid_request	Missing required fields; input over 5,000 chars (TTS) or malformed multipart (STT); unsupported response_format; or model not prefixed with elevenlabs/.
401	invalid_api_key	Token missing, revoked, or for the wrong environment.
402	insufficient_credits	Wallet balance below the pre-reservation amount.
404	model_not_found	Model is not in the pricing table.
404	user_not_found	External user ID does not exist on this app.
413	file_too_large	STT only. Upload exceeds 25 MB; split the audio yourself before sending.
502	upstream_error	ElevenLabs returned an error. Pre-deducted credits are refunded automatically.
503	provider_unavailable	ELEVENLABS_API_KEY is missing or was rejected at the provider.
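
On the client, the envelope can be unpacked into a typed exception. This is a sketch with hypothetical names (`QuotaAPIError`, `parse_error`); only the envelope shape and the fact that `upstream_error` triggers an automatic refund come from this page.

```python
import json
from typing import Optional


class QuotaAPIError(Exception):
    """Carries the HTTP status and the fields of the Quota error envelope."""

    def __init__(self, status: int, code: str, message: str,
                 hint: Optional[str] = None):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message
        self.hint = hint

    @property
    def refunded(self) -> bool:
        # Pre-deducted credits are refunded when the upstream provider fails.
        return self.code == "upstream_error"


def parse_error(status: int, body: bytes) -> QuotaAPIError:
    """Build a QuotaAPIError from a non-2xx response body."""
    try:
        err = json.loads(body)["error"]
        return QuotaAPIError(status, err["code"], err["message"], err.get("hint"))
    except (ValueError, KeyError, TypeError):
        # Not the standard envelope; keep the raw text for debugging.
        return QuotaAPIError(status, "unknown", body.decode("utf-8", "replace"))
```

With `urllib`, the body of a failed call is available from `HTTPError.read()`, so `parse_error(e.code, e.read())` would cover both TTS and STT routes.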
