Audio Understanding

Authorizations

Authorization

string

header

required

All endpoints require Bearer Token authentication. Add to the request header:

Authorization: Bearer YOUR_API_KEY

YOUR_API_KEY is the API Token (sk-... format).
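
As a minimal sketch of the header described above, the helper below builds the request headers in Python. The function name `build_headers` is illustrative and not part of the API; the `Content-Type` value follows the `application/json` body type documented below.

```python
def build_headers(api_key: str) -> dict[str, str]:
    """Return the HTTP headers for an authenticated JSON request."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    return {
        # Bearer Token authentication, as required by all endpoints.
        "Authorization": f"Bearer {api_key}",
        # The request body is sent as JSON.
        "Content-Type": "application/json",
    }


headers = build_headers("YOUR_API_KEY")
```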

Body

application/json

model

string

required

Model name. See Get Model List for the available models.

Example:

"gemini-2.5-pro"

audio_url

string

required

Audio source. Accepts one of the following two forms:

Publicly reachable HTTP/HTTPS URL
data:audio/<type>;base64,<payload> data URI (base64 inline)

Supported audio formats by model family (which models are actually available depends on channel configuration):

Gemini family (e.g. gemini-*): wav/mp3/aiff/aac/ogg/flac/m4a; total request body (prompt + system + inline files) ≤ 20 MB

Base64 data is not size-validated; an oversized payload may be rejected with HTTP 422.

Minimum string length: 1
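
For the inline form, a data URI can be assembled from raw audio bytes roughly as follows. This is a sketch only: the helper names are hypothetical, and the caller is assumed to pass a subtype (for example `wav` or `flac`) that matches the actual file. Because base64 encodes every 3 bytes as 4 characters, the second helper estimates how much of the request body the payload will occupy.

```python
import base64


def to_audio_data_uri(audio_bytes: bytes, subtype: str) -> str:
    """Encode raw audio bytes as data:audio/<type>;base64,<payload>."""
    payload = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:audio/{subtype};base64,{payload}"


def base64_length(raw_size: int) -> int:
    """Length in characters of the padded base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


uri = to_audio_data_uri(b"abc", "wav")
```

Checking `base64_length(len(audio_bytes))` before sending is one way to stay under a family's request-body limit, since the server does not validate the size up front.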

Example:

"https://storage.googleapis.com/cloud-samples-tests/speech/brooklyn.flac"

prompt

string | null

User prompt. When omitted, it defaults to 'Please transcribe this audio file', which suits the transcription use case.

Maximum string length: 100000

Example:

"Identify the speakers and emotion in this audio."

sync

boolean

default:false

Synchronous mode. When true, the endpoint blocks until the upstream call completes and returns the full response (or an SSE stream if stream=true is also set). When false, the endpoint returns the task ID immediately, and results are fetched via GET /v1/tasks/{task_id} or the SSE endpoint.

Example:

false
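
The interaction between sync and stream can be summarized as a small decision table. The sketch below encodes the behavior described for these two flags; the returned labels are illustrative names chosen for this example, not values the API returns.

```python
def response_mode(sync: bool = False, stream: bool = False) -> str:
    """Describe what the submit call returns for a sync/stream combination."""
    if sync and stream:
        # Blocking call that answers with an SSE stream.
        return "sse-stream"
    if sync:
        # Blocking call that answers with the full response.
        return "full-response"
    if stream:
        # Task ID returned immediately; stream.url points to the SSE path.
        return "task-id-with-stream-url"
    # Task ID returned immediately; fetch results via GET /v1/tasks/{task_id}.
    return "task-id"
```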

stream

boolean

default:false

Whether to stream the output. When true, the submit response includes stream.url, which points to the SSE subscription path. Streamed chunks use the OpenAI chat.completion.chunk format.

Example:

false
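
Since streamed chunks follow the OpenAI chat.completion.chunk format, the text can be reassembled by concatenating each chunk's `delta.content`. The sketch below works on already-received SSE lines so it stays independent of any HTTP client. It assumes the usual OpenAI conventions of a `data: ` prefix per event and a final `data: [DONE]` sentinel; this page does not confirm that the sentinel is sent, so the code also handles a stream that simply ends.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta.content from SSE lines carrying chat.completion.chunk JSON."""
    parts: list[str] = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel (OpenAI convention)
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```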

max_tokens

integer | null

Generation token limit. Optional.

Required range: x >= 1

Example:

256

temperature

number | null

Sampling temperature, range [0, 2]. Optional.

Required range: 0 <= x <= 2

system_prompt

string | null

System instruction. Optional.

Maximum string length: 10000

reasoning

boolean | null

Whether to include reasoning tokens. Some thinking models require this to be set to true.
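
The body fields and their documented limits can be checked on the client before submitting. The following is a sketch under that assumption: `build_body` is a hypothetical helper, not part of any SDK, and it only mirrors the constraints listed above (string lengths, `max_tokens >= 1`, `0 <= temperature <= 2`). Optional fields left as `None` are omitted from the body.

```python
from typing import Any, Optional


def build_body(
    model: str,
    audio_url: str,
    prompt: Optional[str] = None,
    sync: bool = False,
    stream: bool = False,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    system_prompt: Optional[str] = None,
    reasoning: Optional[bool] = None,
) -> dict[str, Any]:
    """Validate fields against the documented limits and build the JSON body."""
    if not model:
        raise ValueError("model is required")
    if len(audio_url) < 1:
        raise ValueError("audio_url must contain at least 1 character")
    if prompt is not None and len(prompt) > 100_000:
        raise ValueError("prompt exceeds 100000 characters")
    if system_prompt is not None and len(system_prompt) > 10_000:
        raise ValueError("system_prompt exceeds 10000 characters")
    if max_tokens is not None and max_tokens < 1:
        raise ValueError("max_tokens must be >= 1")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be within [0, 2]")

    body: dict[str, Any] = {
        "model": model,
        "audio_url": audio_url,
        "sync": sync,
        "stream": stream,
    }
    optional = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "system_prompt": system_prompt,
        "reasoning": reasoning,
    }
    # Send only the optional fields the caller actually set.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```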

Response

Task created

Submit response, which follows the unified task schema. results and error are always null at submit time; they are returned via GET /v1/tasks/{task_id} after the task completes or fails.

string

required

Task ID, formatted as task-llm-{timestamp}-{8random}.
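
A client that wants to sanity-check or log task IDs could parse the format with a regular expression. This is a sketch only: the pattern assumes the timestamp is a run of decimal digits and the random suffix is 8 lowercase letters or digits, which matches the example below but is not stated by the format description itself.

```python
import re
from typing import Optional

# Assumed character classes: digits for the timestamp, [a-z0-9] for the suffix.
TASK_ID_RE = re.compile(r"task-llm-(\d+)-([a-z0-9]{8})")


def parse_task_id(task_id: str) -> Optional[tuple[int, str]]:
    """Return (timestamp, random_suffix), or None if the ID does not match."""
    match = TASK_ID_RE.fullmatch(task_id)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```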

Example:

"task-llm-1776874565-yq3szvcu"

object

enum<string>

required

Available options:

llm.generation.task

Example:

"llm.generation.task"

type

enum<string>

required

Available options:

llm

Example:

"llm"

model

string

required

The model name submitted by the client (echoed verbatim)

Example:

"gemini-2.5-pro"

status

enum<string>

required

Available options:

pending

Example:

"pending"

progress

integer

required

Example:

0

created

integer

required

Example:

1776874565

stream

object

Returns {url: ...} when stream=true; null when stream=false.

results

object[] | null

Always null at submit time. After the task completes, GET /v1/tasks/{task_id} returns results, where results[0] is the full OpenAI ChatCompletion response (the audio transcription or understanding output is in message.content).

Example:

null
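
Once a task has completed, the text can be pulled out of results[0]. The sketch below assumes the standard OpenAI ChatCompletion layout, in which the message sits under `choices[0]`; `extract_text` is an illustrative helper name.

```python
from typing import Any, Optional


def extract_text(task: dict[str, Any]) -> Optional[str]:
    """Return message.content from results[0], or None if results is not set yet."""
    results = task.get("results")
    if not results:
        return None  # still null: the task has not completed
    completion = results[0]  # full OpenAI ChatCompletion response
    return completion["choices"][0]["message"]["content"]
```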

error

object

Always null at submit time; returned via GET /v1/tasks/{task_id} when the task fails.

Example:

null
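
Because results and error stay null until the task finishes, an asynchronous client can poll GET /v1/tasks/{task_id} until one of them is set. The sketch below keeps the HTTP call out of the loop: `fetch_task` is a caller-supplied function (hypothetical here) that performs the GET and returns the decoded JSON. The polling interval and attempt limit are arbitrary choices, not documented values.

```python
import time
from typing import Any, Callable


def wait_for_task(
    fetch_task: Callable[[str], dict[str, Any]],
    task_id: str,
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll until the task reports results or an error, then return the task."""
    for _ in range(max_polls):
        task = fetch_task(task_id)
        if task.get("error") is not None:
            raise RuntimeError(f"task {task_id} failed: {task['error']}")
        if task.get("results") is not None:
            return task
        sleep(interval)  # both fields are still null: wait before retrying
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

Passing `fetch_task` and `sleep` as parameters keeps the loop easy to test and lets the caller plug in whichever HTTP client it already uses.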