OpenAI Format - Transcriptions
- Generic Transcriptions API reference for all OpenAI-compatible speech-to-text models
- Convert audio files to text
- Supported models: whisper-1 (recommended), gpt-4o-transcribe, gpt-4o-mini-transcribe
- Supports a language hint, a prompt for style guidance, and multiple response formats
- Model-specific fields (timestamp granularities, streaming, diarization, etc.) are documented in the “Model-Specific Parameters” section below
Model-Specific Parameters
The OpenAI transcription endpoint exposes different fields depending on the model. The request body documents only the fields common to all models. The following sections describe model-specific or model-restricted fields.
Supported Models
| Model ID | Description |
|---|---|
| whisper-1 | Classic Whisper V2 model. Supports the broadest set of output formats and timestamp granularities |
| gpt-4o-transcribe | High-accuracy transcription. Only json output. Streamable |
| gpt-4o-mini-transcribe | Lightweight high-accuracy transcription. Only json output. Streamable |
| gpt-4o-mini-transcribe-2025-12-15 | Versioned snapshot of gpt-4o-mini-transcribe |
| gpt-4o-transcribe-diarize | Transcription with speaker diarization. Use diarized_json to receive per-segment speaker labels |
response_format Compatibility Matrix
| Model | Supported formats |
|---|---|
| whisper-1 | json / text / srt / verbose_json / vtt |
| gpt-4o-transcribe, gpt-4o-mini-transcribe(-2025-12-15) | json only |
| gpt-4o-transcribe-diarize | json / text / diarized_json (use diarized_json to receive speaker annotations) |
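
The matrix above can be encoded as a lookup table so that an unsupported combination is caught before the request is sent. This is a minimal sketch: the names `SUPPORTED_FORMATS` and `check_response_format` are illustrative and not part of the API, and the table simply mirrors the rows listed here.

```python
# Formats per model, copied from the compatibility matrix above.
# SUPPORTED_FORMATS and check_response_format are illustrative names.
SUPPORTED_FORMATS = {
    "whisper-1": {"json", "text", "srt", "verbose_json", "vtt"},
    "gpt-4o-transcribe": {"json"},
    "gpt-4o-mini-transcribe": {"json"},
    "gpt-4o-mini-transcribe-2025-12-15": {"json"},
    "gpt-4o-transcribe-diarize": {"json", "text", "diarized_json"},
}


def check_response_format(model: str, response_format: str) -> None:
    """Raise ValueError if the matrix does not list the format for the model."""
    try:
        allowed = SUPPORTED_FORMATS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None
    if response_format not in allowed:
        raise ValueError(
            f"{model} does not support response_format={response_format!r}; "
            f"allowed: {sorted(allowed)}"
        )
```

Calling `check_response_format("gpt-4o-transcribe", "verbose_json")` raises `ValueError`, while `check_response_format("whisper-1", "srt")` returns silently.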
whisper-1-Only Features
- timestamp_granularities[] (array, allowed values: word / segment, default [segment])
  - Word / segment-level timestamp granularity
  - Takes effect only when response_format=verbose_json
  - Sent as the repeated form field timestamp_granularities[]
  - gpt-4o-* models cannot use this in practice (they only support json); gpt-4o-transcribe-diarize explicitly disallows it
- Streaming not supported: stream=true is silently ignored on whisper-1.
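
Because timestamp_granularities[] is sent as a repeated form field, the form data is easier to model as a list of (name, value) pairs than as a dict, which cannot hold the same key twice. The sketch below builds such a list; `whisper_timestamp_fields` is a hypothetical helper name, and the audio file part is left out.

```python
from typing import List, Tuple

ALLOWED_GRANULARITIES = ("word", "segment")


def whisper_timestamp_fields(granularities: List[str]) -> List[Tuple[str, str]]:
    """Return form fields requesting the given timestamp granularities."""
    for g in granularities:
        if g not in ALLOWED_GRANULARITIES:
            raise ValueError(f"unsupported granularity: {g!r}")
    fields = [
        ("model", "whisper-1"),
        # Timestamps only take effect with verbose_json.
        ("response_format", "verbose_json"),
    ]
    # One form field per value, all sharing the same field name.
    fields += [("timestamp_granularities[]", g) for g in granularities]
    return fields


fields = whisper_timestamp_fields(["word", "segment"])
```

HTTP clients that accept a list of pairs as form data (the `data=` argument of `requests.post` is one) send each pair as its own field, which produces the repeated field described above.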
gpt-4o-* Series Parameters
Applies to gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-4o-mini-transcribe-2025-12-15.
- include[] (array, allowed value: logprobs)
  - Returns the log probabilities of each token, useful for assessing model confidence
  - Only effective when response_format=json
  - Not available on whisper-1 or gpt-4o-transcribe-diarize
- stream (boolean, default false)
  - Streams transcription results via SSE (Server-Sent Events)
  - Ignored on whisper-1
- chunking_strategy ("auto" string or server_vad object)
  - Controls how the audio is split into chunks. If unset, the audio is transcribed as a single block
  - When "auto": the server normalizes loudness and then uses VAD to choose chunk boundaries
  - When a server_vad object (manual VAD tuning):

| Field | Type | Default | Description |
|---|---|---|---|
| type | string | - | Required, must be "server_vad" |
| prefix_padding_ms | integer | 300 | Audio (ms) included before VAD-detected speech |
| silence_duration_ms | integer | 200 | Silence (ms) used to detect end of speech. Shorter values respond faster but may cut on short pauses |
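
The two accepted shapes of chunking_strategy can be sketched as follows. The helper names `server_vad` and `is_valid_chunking_strategy` are illustrative; the defaults come from the table above. How the object form is serialized inside the multipart request is not specified in this reference, so the sketch stops at building the Python value.

```python
from typing import Any, Dict, Union

ChunkingStrategy = Union[str, Dict[str, Any]]


def server_vad(
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 200,
) -> Dict[str, Any]:
    """Build a server_vad object using the defaults from the table above."""
    return {
        "type": "server_vad",  # required, fixed value
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }


def is_valid_chunking_strategy(value: ChunkingStrategy) -> bool:
    """Accept the string "auto" or a dict whose type is "server_vad"."""
    if isinstance(value, str):
        return value == "auto"
    return isinstance(value, dict) and value.get("type") == "server_vad"
```

For example, `server_vad(silence_duration_ms=500)` waits for a longer pause before closing a chunk, which trades responsiveness for fewer cuts on short pauses.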
gpt-4o-transcribe-diarize-Only Parameters
Applies only to gpt-4o-transcribe-diarize (speaker-diarization model).
- chunking_strategy: required for inputs longer than 30 seconds (recommended: "auto")
- known_speaker_names[] (array, max 4)
  - Identifier list for known speakers (e.g. customer, agent)
  - Maps 1-to-1 with known_speaker_references[]
- known_speaker_references[] (array, max 4)
  - Reference audio for each speaker, in data URL format
  - Each sample must be 2-10 seconds
  - Same audio formats as the file field
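
As an illustration of the data URL format and the 1-to-1 pairing, the sketch below encodes reference clips as base64 data URLs and emits names and references in the same order. `to_data_url` and `known_speaker_fields` are hypothetical helpers, and the `audio/wav` MIME type is an assumption about the clips being passed in. The sketch does not check the 2-10 second duration limit.

```python
import base64
from typing import Dict, List, Tuple

MAX_KNOWN_SPEAKERS = 4


def to_data_url(audio: bytes, mime_type: str = "audio/wav") -> str:
    """Encode raw audio bytes as a base64 data URL."""
    encoded = base64.b64encode(audio).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def known_speaker_fields(samples: Dict[str, bytes]) -> List[Tuple[str, str]]:
    """Pair each speaker name with its reference clip, in matching order."""
    if len(samples) > MAX_KNOWN_SPEAKERS:
        raise ValueError("at most 4 known speakers are allowed")
    fields: List[Tuple[str, str]] = []
    # Dicts preserve insertion order, so names and references line up.
    for name in samples:
        fields.append(("known_speaker_names[]", name))
    for audio in samples.values():
        fields.append(("known_speaker_references[]", to_data_url(audio)))
    return fields
```

Passing `{"customer": customer_clip, "agent": agent_clip}` yields two name fields followed by two reference fields, with the first reference belonging to `customer`.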
Fields Not Supported by gpt-4o-transcribe-diarize
The following fields are not available on gpt-4o-transcribe-diarize:
| Field | Note |
|---|---|
| prompt | Style/continuation prompt not supported |
| timestamp_granularities[] | Word / segment timestamp granularity not configurable |
| include[] | Additional returns like logprobs not supported |
| stream | Streaming output not supported |
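
A client that switches between models may want to detect these fields before sending a diarization request. The following is a small sketch with an illustrative helper name, `unsupported_for_diarize`; it only reports the offending field names and leaves the decision to drop them or fail to the caller.

```python
from typing import Iterable, List

# Fields the table above lists as unavailable on gpt-4o-transcribe-diarize.
DIARIZE_UNSUPPORTED = ("prompt", "timestamp_granularities[]", "include[]", "stream")


def unsupported_for_diarize(field_names: Iterable[str]) -> List[str]:
    """Return the request field names that the diarize model does not accept."""
    present = set(field_names)
    return [name for name in DIARIZE_UNSUPPORTED if name in present]
```

For a request carrying `model`, `prompt`, and `stream`, the function returns `["prompt", "stream"]`.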
Authorizations
All APIs require Bearer Token authentication
Add to request header:
Authorization: Bearer YOUR_API_KEY
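
A minimal sketch of attaching the header with the Python standard library. The URL is a placeholder (both the host and the path are assumptions; use the endpoint given by your provider), and the request is only constructed here, not sent.

```python
import urllib.request

# Hypothetical endpoint; replace with your provider's actual URL.
URL = "https://api.example.com/v1/audio/transcriptions"


def authorized_request(api_key: str, url: str = URL) -> urllib.request.Request:
    """Create a POST request carrying the Bearer token header (not sent here)."""
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}"},
    )


req = authorized_request("YOUR_API_KEY")
```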
Body
file: Audio file to transcribe
Notes:
- Uploaded via multipart/form-data
- Supported formats: flac / mp3 / mp4 / mpeg / mpga / m4a / ogg / wav / webm
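
To show what a multipart/form-data upload looks like on the wire, the sketch below assembles a request body by hand using only the standard library. `encode_multipart` is an illustrative helper, the `application/octet-stream` part type is an assumption, and in practice an HTTP client library usually builds this body for you. The file part is named `file`, matching the field referenced elsewhere in this document.

```python
import uuid
from typing import List, Tuple


def encode_multipart(
    fields: List[Tuple[str, str]],
    filename: str,
    audio: bytes,
) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data upload."""
    boundary = uuid.uuid4().hex
    lines: List[bytes] = []
    # Plain text fields such as model or language.
    for name, value in fields:
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode("utf-8"),
        ]
    # The audio itself goes in the part named "file".
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        audio,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"
```

The returned content type must be sent as the `Content-Type` request header so the server can find the boundary that separates the parts.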
model: Speech-to-text model ID. Allowed values: whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe
"whisper-1"
language: ISO-639-1 language code of the input audio (e.g. en, zh, ja). Supplying this improves accuracy and latency.
"en"
prompt: Optional text to guide the model's style or to continue from a previous audio segment. The prompt should match the audio language.
response_format: Format of the transcription output
Allowed values: json, text, srt, verbose_json, vtt
temperature: Sampling temperature between 0 and 1. Higher values produce more random output; 0 lets the model auto-tune.
Range: 0 <= x <= 1
Response
Transcription response
Transcribed text
"The weather is nice today, let's go for a walk in the park."