June 22, 2026 · 6 min read

From a 60-Minute Recording to a Folder of Highlight Clips, in One Job

Transcribe a video, and whenever someone says a keyword, cut a 30-second clip there and drop the clips plus the transcript in a folder. It sounds like a script you'd babysit for an afternoon. With Botverse it's two tool calls and a poll loop — and it shows exactly where an agent's job ends and a workflow engine's begins.

Here's a request we hear in a dozen variations: "Take this recording, find every time someone mentions [the product / the budget / a competitor], cut a 30-second clip at each spot, and put the clips and the transcript in a folder."

It's the kind of thing that sounds like a long afternoon of glue code — ffmpeg incantations, a transcript parser, a loop, error handling for the clip that runs past the end of the file. With Botverse it's a short, self-running job. But building it well means understanding one thing about workflows, and it's the most useful mental model we can give you.

A workflow is a static graph. That's a feature.

Botverse workflows are declarative: you describe every step up front, submit once, and poll until done. The engine handles parallelism, dependencies, retries, and partial failure. What it deliberately does not do is make content-based decisions at runtime. A step can branch on another step's status — "run only if transcription succeeded" — but it can't branch on what the transcript says.

That matters here, because the number of clips is data-dependent. You don't know how many times the keyword is spoken, or where, until the transcript exists. Deciding where to cut is a reasoning task. Cutting — in parallel, with retries, without stranding files — is an execution task. Botverse's whole design rests on keeping those two separate: agents orchestrate, services execute.

So the job runs in two phases.

Phase 1 — transcribe, then find the moments

Transcribe the video to JSON. That gives the agent structured segments — start time, end time, speaker, text — to search.

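// Kick off transcription; JSON output gives structured segments to search.
const { job_id } = await callTool("transcribe_from_url", {
  source_url: recordingUrl,
  output_format: "json",
  options: { attendees }
});
// pollAndFetchJson is a helper (not shown) that waits for the job and returns its parsed JSON output.
const transcript = await pollAndFetchJson(job_id);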

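// Keyword to search for, and clip length in seconds.
const KEYWORD = "budget", CLIP = 30;
// One hit per matching segment, starting 2 s early and never before 0.
const hits = transcript
  .filter(seg => seg.text.toLowerCase().includes(KEYWORD))
  .map(seg => ({ start: Math.max(0, Math.floor(seg.start) - 2) }));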

This is the part a static graph can't do, and the part an agent does well: read the transcript, apply judgment (a keyword, a sentiment, a topic, "anywhere the CFO is speaking"), and produce a list of timestamps. The two-second lead-in keeps each clip from starting mid-word.
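
The filter above emits one hit per matching segment, so two mentions a few seconds apart produce two nearly identical clips, and a mention near the end of the recording asks for a clip that runs past the end of the file. A minimal sketch of one way to tidy the list before building the workflow follows. The function name planClips and its options are hypothetical, and it assumes each segment carries start, end, and text, with the end of the last segment standing in for the recording length.

```javascript
// Sketch only: planClips and its option names are illustrative, not part of Botverse.
// Assumes segments look like { start, end, text } with times in seconds.
function planClips(segments, keyword, { clip = 30, leadIn = 2 } = {}) {
  const needle = keyword.toLowerCase();
  // Treat the end of the last segment as the recording length (an approximation).
  const recordingEnd = segments.reduce((max, seg) => Math.max(max, seg.end), 0);
  // Latest start that still leaves room for a full-length clip.
  const latestStart = Math.max(0, Math.floor(recordingEnd) - clip);

  const candidates = segments
    .filter(seg => seg.text.toLowerCase().includes(needle))
    .map(seg => ({
      spokenAt: seg.start,
      // Apply the lead-in, then clamp into [0, latestStart].
      start: Math.min(Math.max(0, Math.floor(seg.start) - leadIn), latestStart),
    }))
    .sort((a, b) => a.spokenAt - b.spokenAt);

  const hits = [];
  let coveredUntil = -Infinity;
  for (const c of candidates) {
    // Skip a mention whose segment starts inside the clip we just planned.
    if (c.spokenAt < coveredUntil) continue;
    hits.push({ start: c.start });
    coveredUntil = c.start + clip;
  }
  return hits;
}
```

Under those assumptions, planClips(transcript, KEYWORD, { clip: CLIP }) could stand in for the filter and map chain. Whether near-duplicate clips should be merged at all is a judgment call, which is why it lives in the agent and not in the workflow.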

Phase 2 — one workflow, every clip in parallel

Now the agent generates a single workflow with one clip step per hit, plus a step that produces the transcript as a document, and submits it once. Each clip is a transcode that trims the source with start_time and duration. The steps all run in parallel, and each clip step sets failure_mode to "CONTINUE" so one bad cut never sinks the batch.

const definition = {
  workflow_id: "highlights-" + Date.now(),
  params: { recording_url: recordingUrl },
  steps: [
    ...hits.map((h, i) => ({
      id: "clip_" + i,
      tool: "transcode_from_url",
      failure_mode: "CONTINUE",
      inputs: {
        source_url: "$.params.recording_url",
        output_format: "mp3",
        options: { start_time: String(h.start), duration: CLIP }
      }
    })),
    {
      id: "transcript",
      tool: "transcribe_from_url",
      inputs: {
        source_url: "$.params.recording_url",
        output_format: "docx",
        options: { attendees }
      }
    }
  ]
};

const { workflow_id } = await callTool("submit_workflow", { definition });

// Poll every 8 seconds until the workflow reaches a terminal status.
let result;
do {
  await new Promise(r => setTimeout(r, 8000));
  result = await callTool("get_workflow_status", { workflow_id });
} while (!["COMPLETED","FAILED","PARTIALLY_FAILED","CANCELLED"].includes(result.status));

// Download whatever each step produced; steps without an output_url are skipped.
for (const step of result.steps) {
  if (step.output_url) await download(step.output_url, "./highlights/");
}

Ten keyword hits become ten audio clips cut at the same time, not one after another. The agent submitted once and walked away; it never held the video, the audio, or the transcript in its context. When the poll loop ends, the folder is full.
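
The loop in the listing keeps polling for as long as the workflow stays in a non-terminal state, and a clip that failed under CONTINUE is simply absent from the folder. A minimal sketch of a bounded poll plus a summary of what is missing might look like this. The names pollUntilDone and partitionSteps are hypothetical, getStatus stands in for a wrapper around the get_workflow_status call, and the id field on each returned step is an assumption carried over from the step definitions.

```javascript
// Sketch only: helper names are illustrative, not part of Botverse.
// Terminal statuses are the ones checked by the poll loop in the listing.
const TERMINAL = ["COMPLETED", "FAILED", "PARTIALLY_FAILED", "CANCELLED"];

// getStatus is any async function that resolves to { status, steps }.
async function pollUntilDone(getStatus, { intervalMs = 8000, timeoutMs = 30 * 60 * 1000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const result = await getStatus();
    if (TERMINAL.includes(result.status)) return result;
    // Give up if the next wait would carry us past the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Workflow still ${result.status} after ${timeoutMs} ms`);
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}

// Separate steps that produced an output from those that did not, so
// missing clips are reported instead of silently left out of the folder.
// Assumes each returned step has an id (an assumption, see above).
function partitionSteps(steps) {
  const ready = steps.filter(step => step.output_url);
  const missing = steps.filter(step => !step.output_url).map(step => step.id);
  return { ready, missing };
}
```

In the listing above, getStatus would be () => callTool("get_workflow_status", { workflow_id }), and the agent could log the missing ids or plan a retry for just those clips.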

Why not push the keyword search into the engine too?

We could add a "find keyword" primitive and a dynamic fan-out construct to BWDL. We've chosen not to, and the reason is the same reason the model is clean: the moment a workflow engine starts interpreting content and spawning steps based on what it finds, it stops being a deterministic executor and becomes a second, worse agent. Keyword today; sentiment, intent, and "the important bits" tomorrow — that's reasoning, and reasoning belongs in the agent. The engine stays simple, predictable, and fast, and the agent stays in charge of the judgment calls. Each does the thing it's good at.

The transcribe tools — transcribe_from_url and transcribe_media — are now first-class workflow steps alongside transcode and convert, so phases like this compose cleanly. The full BWDL reference, including both worked examples, is at botverse.cloud/docs/workflows.

Ready to connect your agent to Botverse? Set up in five minutes. No contracts, no minimums.
