
BFF API Reference — `/api/v2/*`

Specification for every endpoint the vibi mobile client calls. The Swagger UI (http://localhost:8080/swagger) is the authoritative source of truth; this markdown is an offline lookup / citation copy.

  • All responses are JSON. Error shape: error-contract.md.
  • Auth: some endpoints require Authorization: Bearer <jwt> (BFF-issued JWT). Issuance: POST /api/v2/auth/google.
  • Multipart upload form-field limits are noted separately (must be called out when over 50MB — see vibi-bff/CLAUDE.md "Known BFF bug patterns").
  • Full response model definitions: vibi-bff/src/main/kotlin/com/vibi/bff/model/BffModels.kt.


Authentication

POST /api/v2/auth/google

Exchanges a Google ID Token for a BFF JWT.

Request — JSON { "idToken": "<google_id_token>" }. The aud must match the BFF's GOOGLE_OAUTH_CLIENT_IDS whitelist.

Response — AuthResponseDto (JWT body + expiration time).

Errors — Google tokeninfo 4xx → 401.
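The auth decision above can be sketched as a small status mapping (illustrative only — names are hypothetical and the real check lives in the Kotlin BFF):

```python
def auth_status(tokeninfo_http_status: int, aud: str, allowed_client_ids: set) -> int:
    """Map the Google tokeninfo result to this endpoint's response code."""
    if tokeninfo_http_status >= 400:
        return 401  # Google tokeninfo 4xx -> 401, per the spec above
    if aud not in allowed_client_ids:
        return 401  # aud is not on the GOOGLE_OAUTH_CLIENT_IDS whitelist
    return 200      # token accepted; the BFF issues its own JWT
```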


Language list

GET /api/v2/languages

Proxies the list of subtitle/dubbing target languages supported by Perso, unchanged. Also useful as a post-boot health check — it verifies that the Perso key and spaceSeq are live.

Response — LanguageListResponse (array of code + display name).


Chat (Gemini function calling)

POST /api/v2/chat

Natural-language editing assistant. The user message is sent to Gemini, which returns either plain text or a pre-registered function call (proposal.steps). On the mobile side, ChatToolDispatcher maps function names to TimelineViewModel.onXxx.

Request — JSON ChatRequestDto:

```jsonc
{
  "messages": [...],
  "projectContext": {
    "segments": [...], "subtitleClips": [...], "dubClips": [...], "bgmClips": [...], "stems": [...],
    "separationDirectives": [...],   // already-separated ranges, with rangeStartMs/EndMs + numberOfSpeakers
    "currentPlayheadMs": 1234,
    "selectedClipId": "...", "selectedSegmentId": "...",
    "videoDurationMs": 60000,
    // Range selection — when the user has dragged a window in the timeline UI
    "isRangeSelecting": true,
    "pendingRangeStartMs": 5000,
    "pendingRangeEndMs": 12000
  },
  "locale": "ko"
}
```

isRangeSelecting + pendingRange* lets Gemini resolve deictic phrases like "이 구간 / this range" without guessing — the model is instructed to pass those values straight through as startMs/endMs. separationDirectives is what powers the overlap-avoidance policy (see below).
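That pass-through rule can be sketched as a helper (hypothetical — the real behavior is a prompt instruction to Gemini, not client code):

```python
def resolve_deictic_range(project_context: dict):
    """Return (startMs, endMs) for phrases like "this range" only when the
    user has an active timeline selection; otherwise the model must ask."""
    if project_context.get("isRangeSelecting"):
        return (project_context["pendingRangeStartMs"],
                project_context["pendingRangeEndMs"])
    return None
```

With the example payload above, this yields (5000, 12000), which the model passes through verbatim as startMs/endMs.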

Response — ChatResponseDto with two kinds:

  • {"kind":"text","text":"..."} — read / explanatory response, or a follow-up question when refs can't be resolved.
  • {"kind":"proposal","steps":[{"name":"...","args":{...}}]} — up to 5 edit proposals. Applied after user confirmation in the mobile UI (ChatToolDispatcher).

Registered tools (vibi-bff/.../service/ChatToolDefs.kt, 15 total):

| Group | Tools |
| --- | --- |
| Segment edit (range) | delete_segment_range, duplicate_segment_range, update_segment_volume, update_segment_speed |
| Audio separation | separate_audio_range, update_stem_volume |
| Subtitle — two-stage | transcribe_for_subtitles → apply_subtitles_with_script (preferred) |
| Subtitle — direct | update_subtitle_text, generate_subtitles (deprecated — kept for "skip review" requests) |
| Dub | generate_dub |
| BGM | move_bgm_clip, update_bgm_volume, generate_subtitles_for_bgm, generate_dub_for_bgm |

The full tool spec is in vibi-bff/src/main/resources/chat-tools.md — that file is loaded verbatim as systemInstruction, so it is the single source of truth for tool args, the cost-disclosure rule (slow tools get "(예상 ~N분)" — "estimated ~N min" — in the rationale), and the policies below.

Edit-invalidation policy — When the user requests structural segment edits (delete_segment_range, duplicate_segment_range, update_segment_speed) AND the project already has separation / subtitles / dubs / BGM placements, the model must append an explicit warning to the proposal rationale: "⚠ 이 편집을 적용하면 기존 음성분리/자막/더빙/음원 배치가 초기화됩니다. 진행할까요?" ("⚠ Applying this edit will reset the existing voice separation / subtitles / dubbing / BGM placements. Proceed?"). update_segment_volume is exempt (mix change, not structural).

Separation overlap policy — separate_audio_range cannot be applied to a range that overlaps an existing separationDirectives[] entry (the mobile side rejects it). On overlap the model must respond with kind=text and present three priced options: (A) replace existing, (B) split off the non-overlapping portion, (C) shorter multi-piece separation — cost-optimized in the order B > C > A.
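The overlap test itself is plain interval intersection; a sketch (helper name is illustrative):

```python
def overlaps_existing(start_ms: int, end_ms: int, separation_directives: list) -> bool:
    """True when [start_ms, end_ms) intersects any already-separated range,
    in which case separate_audio_range must not be proposed."""
    return any(start_ms < d["rangeEndMs"] and end_ms > d["rangeStartMs"]
               for d in separation_directives)
```

Note that ranges which merely touch end-to-end do not count as overlapping.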

Tool-code fallback — Gemini 2.5 Flash occasionally ignores functionDeclarations and replies with a tool_code markdown block instead. GeminiClient.recoverToolCallsFromText reconstructs the ToolCalls so the client still sees a proposal. Two formats are recognized: print(default_api.<name>(kwargs)) with Python literal args (str/int/float/bool/list), and YAML-style pseudo-tool-code with a separate rationale field.
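A simplified sketch of recovering the first format (the real parser is GeminiClient.recoverToolCallsFromText in Kotlin; this toy regex would be defeated by nested parentheses inside args):

```python
import ast
import re

def recover_tool_calls(text: str) -> list:
    """Rebuild {"name", "args"} tool calls from a tool_code block that uses
    print(default_api.<name>(...)) with Python-literal keyword args."""
    calls = []
    for m in re.finditer(r"print\((default_api\.\w+\(.*?\))\)", text, re.S):
        call = ast.parse(m.group(1), mode="eval").body      # an ast.Call node
        calls.append({
            "name": call.func.attr,                          # tool name
            "args": {kw.arg: ast.literal_eval(kw.value)      # str/int/float/bool/list
                     for kw in call.keywords},
        })
    return calls
```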

System instruction constraints:

  • respond in the user's last-turn language (locale is a fallback only)
  • never auto-translate subtitle text
  • never invent IDs or timestamps not present in projectContext
  • one kind per response
  • required args must never be guessed (ask instead)


Auto subtitles

Perso STT → Gemini translation → signed SRT URL.

POST /api/v2/subtitles

Request — multipart:

| Field | Description |
| --- | --- |
| file (optional) | Source video/audio. Omit when reusing a prior render via spec.editedRenderJobId. |
| spec | JSON SubtitleSpec (form-item, not a file). Optional editedRenderJobId reuses an existing /api/v2/render output as the source — avoids re-uploading large edited videos. |

Response — SubtitleJobResponse { "jobId": "sub-..." }.

GET /api/v2/subtitles/{jobId}

Status poll. When READY, the response includes signed originalSrtUrl and translatedSrtUrlsByLang. Token lifetime: SEPARATION_URL_TTL_SEC.

GET /api/v2/subtitles/{jobId}/srt?token=…&lang=…

Signed SRT stream. lang omitted (or lang=original) returns the original-language SRT; an ISO code returns the translated SRT for that language. 401 if no token, 403 if expired.
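Every signed streaming endpoint in this document applies the same token gate; a sketch of the status mapping (illustrative, not the BFF's actual code):

```python
def signed_stream_status(token: str, issued_at_sec: float, ttl_sec: int, now_sec: float) -> int:
    """401 with no token, 403 once the TTL window (e.g. SEPARATION_URL_TTL_SEC)
    has passed, otherwise 200 (stream the bytes)."""
    if not token:
        return 401
    if now_sec > issued_at_sec + ttl_sec:
        return 403
    return 200
```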

POST /api/v2/subtitles/regenerate

Regenerates subtitles in other languages using a user-edited SRT as the source (only calls Gemini; no video/audio re-upload).

Request — multipart srtFile + spec.


Auto dubbing

Perso dubbing job (a single submitTranslate — Perso handles translation and voice synthesis as a black box).

POST /api/v2/autodub

Request — multipart:

| Field | Description |
| --- | --- |
| file (optional) | Source video/audio. Omit when reusing a prior render via spec.editedRenderJobId. |
| spec | JSON AutoDubSpec — targetLanguageCodes, voice options, trimming, optional editedRenderJobId. |

Response — AutoDubJobResponse { "jobId": "dub-..." }.

GET /api/v2/autodub/{jobId}

Status + signed dubbedAudioUrl / dubbedVideoUrl on READY.

GET /api/v2/autodub/{jobId}/video?token=…

Signed mp4 stream of the dubbed video (VIDEO projects). 401 if no token, 403 if expired.

GET /api/v2/autodub/{jobId}/audio?token=…

Signed mp3 stream of the dubbed audio. For VIDEO projects this is the audio track extracted via ffmpeg -vn.


Audio separation + remix

Perso voice separation exposes per-speaker stems (speaker_0..N) + voice_all + background; only the stems the user selects are remixed back together via amix.

POST /api/v2/separate

Request — multipart:

| Field | Description |
| --- | --- |
| file (optional) | mp4 or mp3 (≤ MAX_UPLOAD_FILE_SIZE_MB, default 500). Omit when reusing a prior render via spec.editedRenderJobId. |
| spec | JSON SeparationSpec (optional editedRenderJobId for render-output reuse) |

```jsonc
// SeparationSpec
{
  "mediaType": "VIDEO",          // "VIDEO" or "AUDIO"
  "numberOfSpeakers": 2,         // 1..10 — optional hint; Perso auto-detects when omitted
  "sourceLanguageCode": "auto",  // "auto", or an ISO code like "ko"
  "trimStartMs": 2000,           // optional trim — both bounds required together
  "trimEndMs": 8500
}
```

Trim validation (spec.trim*):

| Condition | Response |
| --- | --- |
| Only one side present | 400 partial_trim_range |
| trimStartMs < 0 | 400 trim_start_negative |
| trimEndMs <= trimStartMs | 400 trim_range_invalid |
| Window < 500ms | 400 trim_range_too_short |
| trimEndMs exceeds measured duration | 400 trim_end_exceeds_duration (detail contains measured duration) |
| ffprobe / ffmpeg failure | 500 ffmpeg_error |
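The trim-validation rules are equivalent to the following check order (a sketch; the real validation is in the Kotlin BFF):

```python
def validate_trim(spec: dict, measured_duration_ms: int):
    """Return (http_status, error_code); error_code is None when the trim is valid."""
    start, end = spec.get("trimStartMs"), spec.get("trimEndMs")
    if start is None and end is None:
        return 200, None                          # no trim requested
    if start is None or end is None:
        return 400, "partial_trim_range"          # both bounds required together
    if start < 0:
        return 400, "trim_start_negative"
    if end <= start:
        return 400, "trim_range_invalid"
    if end - start < 500:
        return 400, "trim_range_too_short"
    if end > measured_duration_ms:
        return 400, "trim_end_exceeds_duration"   # detail carries the measured duration
    return 200, None
```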

Response — SeparationJobResponse { "jobId": "sep-..." }.

GET /api/v2/separate/{jobId}

Status + signed URL per stem on READY.

| stemId | Label | Source |
| --- | --- | --- |
| voice_all | All speakers | Perso originalVoiceAudio |
| speaker_0, speaker_1, … | Speaker 1, 2, … | Perso originalVoiceSpeakers (ZIP unzip) |
| background | Background audio | Legacy separation flow only — the new audio-separation API does not provide background |

GET /api/v2/separate/{jobId}/stem/{stemId}?token=…

Signed stem audio stream. 401 if no token, 403 if expired.

POST /api/v2/separate/{jobId}/mix

Request — JSON MixRequest:

```json
{ "stems": [ { "stemId": "speaker_0", "volume": 1.0 }, { "stemId": "voice_all", "volume": 0.4 } ] }
```

Response — MixJobResponse { "mixJobId": "mix-..." }.

Side effect: on success, the source separation's stems are disposed. Subsequent /mix calls with the same jobId return 409 Conflict. Download any stems you need to keep before mixing.
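The remix applies each stem's volume and then combines the results with amix. A sketch of the kind of ffmpeg filtergraph involved (illustrative — input i is assumed to be the file for stems[i]; the actual command is built inside the BFF):

```python
def build_amix_filtergraph(stems: list) -> str:
    """Per-stem volume filters feeding one amix; normalize=0 preserves the
    user-chosen volumes instead of ffmpeg's default input averaging."""
    scaled = [f"[{i}:a]volume={s['volume']}[a{i}]" for i, s in enumerate(stems)]
    labels = "".join(f"[a{i}]" for i in range(len(stems)))
    return ";".join(scaled + [f"{labels}amix=inputs={len(stems)}:normalize=0[out]"])
```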

GET /api/v2/separate/mix/{mixJobId}

Status + signed downloadUrl on COMPLETED.

GET /api/v2/separate/mix/{mixJobId}/download?token=…

mp3 stream. 401 if no token.


Render

Local ffmpeg pipeline: multi-segment concat → subtitle burn-in → sticker overlay → dub mix.

POST /api/v2/render/inputs

For multi-variant export, this uploads the video/audio files once and caches them. The returned inputId is passed to submitRenderJob to avoid repeated multipart upload cost.

Request — multipart video + audio_0, audio_1, …

Response — RenderInputCacheResponse { "inputId": "..." }.

POST /api/v2/render

Request — multipart:

| Field | Description |
| --- | --- |
| inputId | (optional) Cache ID returned by /render/inputs. If present, video/audio re-upload is skipped. |
| video (legacy) / video_0, video_1, … | Source video. |
| segment_image_0, … | Still image for IMAGE-type segments. |
| image_0, image_1, … | Sticker overlay images (referenced from imageClips[]). |
| audio_0, audio_1, … | Dub audio tracks (referenced from dubClips[]). |
| bgm_0, … | BGM tracks. |
| audio_override | (optional) Replaces the source video's original audio entirely. |
| subtitles | (optional) ASS file — burned in as the final step. |
| config | JSON RenderConfig (form-item). |

RenderConfig — two modes:

```jsonc
// Legacy single-video
{
  "videoDurationMs": 60000,
  "outputKind": "video",   // "video" (default, mp4) | "audio" (m4a AAC 192k, video stages skipped)
  "dubClips":  [ { "audioFileKey": "audio_0", "startMs": 1000, "durationMs": 4000, "volume": 1.0 } ],
  "imageClips":[ { "imageFileKey": "image_0", "startMs": 2000, "endMs": 5000,
                   "xPct": 50, "yPct": 30, "widthPct": 25, "heightPct": 25 } ]
}
```

```jsonc
// Multi-segment
{
  "segments": [
    { "sourceFileKey": "video_0", "type": "VIDEO", "order": 0,
      "durationMs": 30000, "trimStartMs": 0, "trimEndMs": 30000,
      "width": 1920, "height": 1080 },
    { "sourceFileKey": "segment_image_0", "type": "IMAGE", "order": 1,
      "durationMs": 5000, "imageWidthPct": 80, "imageHeightPct": 80 }
  ],
  "imageClips": [ /* sticker overlays on the global timeline */ ],
  "dubClips":   [ /* dub clips on the global timeline */ ]
}
```

Output resolution: segments[0].width/height in multi-segment mode, ffprobe extraction in legacy mode.

outputKind="audio" skips video concat / subtitle burn-in / sticker overlay and emits audio-only m4a — used as the source for downstream /subtitles or /separate calls via editedRenderJobId. Cuts those job latencies by 5–10x compared to re-rendering video and stripping audio.

Response — RenderJobResponse { "jobId": "rnd-..." }.

GET /api/v2/render/{jobId}/status

PENDING / PROCESSING / COMPLETED / FAILED + progress.
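A client-side polling sketch (transport-agnostic: fetch_status is any callable returning the status string, e.g. a wrapper around this endpoint):

```python
import time

def poll_until_settled(fetch_status, interval_sec: float = 1.0, max_polls: int = 300) -> str:
    """Poll while PENDING/PROCESSING; return the terminal status."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(interval_sec)
    raise TimeoutError("render job did not settle in time")
```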

GET /api/v2/render/{jobId}/download

mp4 byte stream. Not statically mounted — always go through this endpoint.


Code references

  • Route definitions: vibi-bff/src/main/kotlin/com/vibi/bff/routes/{Auth,AutoDub,Chat,Language,Render,Separation,Subtitle}Routes.kt
  • DTOs: vibi-bff/.../model/BffModels.kt, ChatModels.kt, PersoModels.kt, AuthModels.kt
  • Client-side callers: vibi-mobile/shared/src/commonMain/kotlin/com/vibi/shared/data/remote/api/BffApi.kt
  • OpenAPI: vibi-bff/src/main/resources/openapi/vibi-bff.yaml