session_id, which must be obtained first by calling the face identification endpoint.
Workflow Overview
Input Modes
Text-Driven — Built-in TTS
Provide text, voice_id, and voice_language. The platform synthesizes speech from the text using the specified voice, then uses that speech to drive the lip movement.
Audio-Driven — Using an Existing Audio File
Provide audio_url to drive lip movement directly with an audio file.
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| input.session_id | string | ✅ | Session ID returned from the face identification step |
| input.face_image_url | string | No | Face reference image URL for improved character consistency |
| input.text | string | Required (text mode) | The text the character should speak |
| input.voice_id | string | Required (text mode) | TTS voice ID. See the voice ID reference to preview and choose a voice. |
| input.voice_language | string | Required (text mode) | Language code: zh (Chinese) or en (English) |
| input.audio_url | string | Required (audio mode) | Public URL of the audio file |
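The two input modes can be sketched as a small Python helper that assembles the request body and checks the mode-specific required fields before anything is sent. This is a minimal sketch under stated assumptions: the function name build_lip_sync_input is hypothetical, the nesting under an "input" key is inferred from the dotted parameter names in the table, and treating the two modes as mutually exclusive is an assumption that the table does not state explicitly.

```python
from typing import Optional


def build_lip_sync_input(
    session_id: str,
    *,
    text: Optional[str] = None,
    voice_id: Optional[str] = None,
    voice_language: Optional[str] = None,
    audio_url: Optional[str] = None,
    face_image_url: Optional[str] = None,
) -> dict:
    """Assemble a request body for text mode or audio mode (hypothetical helper)."""
    if not session_id:
        raise ValueError("session_id is required; obtain it from face identification")

    text_fields = {"text": text, "voice_id": voice_id, "voice_language": voice_language}
    has_text = any(value is not None for value in text_fields.values())
    has_audio = audio_url is not None

    # Assumption: exactly one mode is used per request.
    if has_text == has_audio:
        raise ValueError("provide either the text-mode fields or audio_url")

    payload = {"session_id": session_id}
    if has_text:
        missing = [name for name, value in text_fields.items() if value is None]
        if missing:
            raise ValueError(f"text mode is missing: {', '.join(missing)}")
        if voice_language not in ("zh", "en"):
            raise ValueError("voice_language must be 'zh' or 'en'")
        payload.update(text_fields)
    else:
        payload["audio_url"] = audio_url

    # Optional in both modes.
    if face_image_url is not None:
        payload["face_image_url"] = face_image_url

    # Assumption: the dotted names (input.session_id, ...) mean nesting under "input".
    return {"input": payload}
```

For example, build_lip_sync_input("sess-1", audio_url="https://example.com/a.mp3") yields an audio-mode body, while passing text, voice_id, and voice_language together yields a text-mode body. The session ID, voice ID, and URL values shown here are placeholders.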
Polling Results
After creating a task, use GET /kling/v1/videos/advanced-lip-sync/{task_id} to query its status. Refer to the task query documentation for details. Status progression: queued → processing → succeeded or failed.
On success, the video download URL is at data.data.task_result.videos[0].url.
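The polling flow can be sketched in Python as a loop that stops on a terminal status and then reads the video URL from the path given above. This is a sketch under assumptions, not the platform's client: poll_task and extract_video_url are hypothetical names, the HTTP call is injected as a fetch callable so that no particular client library or authentication scheme is implied, and the location of the status field is left to a caller-supplied get_status function because this page does not specify it.

```python
import time
from typing import Callable

# Terminal statuses from the documented progression.
TERMINAL_STATUSES = {"succeeded", "failed"}


def poll_task(
    fetch: Callable[[], dict],
    get_status: Callable[[dict], str],
    interval: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the task reaches a terminal status, then return the response.

    fetch is expected to perform the GET request for the task and return parsed JSON.
    get_status extracts the status string; its field path is an assumption the caller supplies.
    """
    for _ in range(max_attempts):
        response = fetch()
        if get_status(response) in TERMINAL_STATUSES:
            return response
        sleep(interval)
    raise TimeoutError(f"task did not finish after {max_attempts} attempts")


def extract_video_url(response: dict) -> str:
    """Read the download URL at data.data.task_result.videos[0].url.

    Assumption: the path is read literally from the top-level object passed in.
    """
    return response["data"]["data"]["task_result"]["videos"][0]["url"]
```

A caller would check the final status before calling extract_video_url, since a failed task is not expected to carry a video entry. The default interval and attempt limit are illustrative values only.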
Prerequisite: Face Identification
You must call this endpoint first to obtain the session_id.
Voice ID Reference
Preview all available voices online and choose a value for the voice_id parameter.
API Reference
View the interactive API documentation for Kling Lip Sync Generation.

