June 18, 2026

Botverse Transcribe: One Call, a Named Transcript, Done

Speech-to-text is a commodity. The hard part is knowing who said what and getting a usable document out the other end. Botverse Transcribe is one call: point it at a video or audio file, hand it the attendee names, and get back a clean, speaker-labelled transcript — txt, json, captions, or a formatted Word/PDF. We tested it against a commercial notetaker on a real 60-minute, 17-person call. It named 10 of 11 speakers correctly.

Transcribing a meeting is no longer the hard part. AWS, Google, OpenAI, and a dozen notetaker apps will turn audio into text well enough. What makes a transcript useful is two things the commodity tools do badly: knowing who said what, and handing you a document you can send to someone.

Botverse Transcribe does both, in a single call.

What it is

Point it at a video or audio file — any common format, MP4, MOV, WAV, M4A, MP3, and the rest. Hand it the list of people on the call. Get back a clean, speaker-labelled transcript. That's the whole interaction. One MCP tool call for your agent, or one line on the command line:

botverse transcribe all-hands.mp4 --to docx \
  --attendees "Sarah Chen, Mike Torres, Priya Patel"

Behind that one call, Botverse runs the entire pipeline so you don't have to: it converts the video to audio, runs speech-to-text with speaker diarization, uses AI to turn the anonymous "Speaker 1 / Speaker 2" labels into the real names from your attendee list, assembles a header with the date, duration, and participants, and renders the output in whatever format you asked for. Set it and forget it — submit the job and get on with other work while it runs.
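
To make the first stage concrete, here is a minimal sketch of the video-to-audio step using ffmpeg. It assumes ffmpeg is installed and that 16 kHz mono WAV is a reasonable input for the speech-to-text stage. Botverse's internal pipeline is not documented here, so the function names and settings below are illustrative assumptions, not its actual implementation.

```python
import subprocess
from pathlib import Path


def build_extract_command(source: str, target: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that drops the video stream and writes mono PCM WAV."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the target if it already exists
        "-i", source,              # input video or audio file
        "-vn",                     # discard the video stream
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample to the requested rate
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM
        target,
    ]


def extract_audio(source: str, sample_rate: int = 16000) -> Path:
    """Run ffmpeg and return the path of the extracted audio (hypothetical helper)."""
    src = Path(source)
    # Use a distinct name so a .wav input is never overwritten in place.
    target = src.with_name(src.stem + ".mono16k.wav")
    subprocess.run(build_extract_command(source, str(target), sample_rate), check=True)
    return target
```

Calling extract_audio("all-hands.mp4") would write all-hands.mono16k.wav next to the source file, ready for the transcription stage.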

What you get back

Pick the output that fits the job — or ask for several in one call:

txt — a clean, readable transcript with speaker names and timestamps
json — structured segments with start/end times, speaker label, resolved name, and a confidence score — for when an agent needs to work with the data
srt / vtt — caption files, ready to drop onto the video
docx / pdf — a properly formatted document with a titled header and bold speaker attribution, ready to attach to an email
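
As a rough illustration of how the json output could feed the caption formats, the sketch below turns a list of segments into SRT cues. The field names (start, end, speaker, name, confidence, text) and the sample values are assumptions made for this example; the real schema may differ.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp that SRT uses."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues, one cue per segment."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        # Prefer the resolved name; fall back to the anonymous diarization label.
        speaker = seg.get("name") or seg["speaker"]
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{speaker}: {seg['text']}\n"
        )
    return "\n".join(cues)


# Made-up sample data in the assumed shape.
SEGMENTS = [
    {"start": 0.0, "end": 3.2, "speaker": "Speaker 0", "name": "Sarah Chen",
     "confidence": 0.93, "text": "Welcome, everyone."},
    {"start": 3.2, "end": 7.5, "speaker": "Speaker 1", "name": None,
     "confidence": 0.41, "text": "Thanks for having us."},
]

print(segments_to_srt(SEGMENTS))
```

The fallback to the anonymous label means an unresolved speaker still gets a usable caption.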

The part nobody else gets right: who said what

Standard transcription engines discard the picture and work on the audio. They can tell different voices apart (diarization), but they have no idea whose voice is whose — you get "Speaker 0, Speaker 1, Speaker 2." Naming those speakers is exactly where a transcript goes from "a wall of text" to "the minutes of the meeting."

Botverse closes that gap with an AI naming pass. Given the attendee list, it reads the transcript the way a person would — picking up self-introductions ("I'm Mike, chairman of the advisory board"), how people address each other ("thanks, Devin"), and role and topic cues — and assigns each anonymous speaker a real name, with a confidence score and an honest "best-effort estimate" label. It never invents content; it only assigns identity.
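
To show the kind of cue the naming pass relies on, here is a deliberately simple rule-based stand-in that handles only self-introductions. Botverse uses an AI model for this step; the regex, the function name, and the vote-share "confidence" below are assumptions for illustration and not its method.

```python
import re
from collections import Counter, defaultdict

# Matches phrases such as "I'm Mike" or "This is Priya" and captures the first name.
INTRO = re.compile(r"\b(?i:I'm|I am|this is|my name is)\s+([A-Z][a-z]+)")


def name_speakers(turns: list[tuple[str, str]], attendees: list[str]) -> dict:
    """Map anonymous speaker labels to roster names using self-introductions only."""
    by_first_name = {full.split()[0]: full for full in attendees}
    votes = defaultdict(Counter)
    for label, text in turns:
        for first in INTRO.findall(text):
            # Only accept names that appear on the attendee list.
            if first in by_first_name:
                votes[label][by_first_name[first]] += 1
    resolved = {}
    for label, counter in votes.items():
        name, count = counter.most_common(1)[0]
        # Toy confidence: share of this speaker's votes that went to the winner.
        resolved[label] = (name, count / sum(counter.values()))
    return resolved


turns = [
    ("Speaker 0", "I'm Mike, chairman of the advisory board."),
    ("Speaker 1", "Thanks, Mike. This is Priya."),
]
print(name_speakers(turns, ["Sarah Chen", "Mike Torres", "Priya Patel"]))
```

Note that "Thanks, Mike" in the second turn is an address cue, not an introduction, so this sketch ignores it. A fuller approach would use it as evidence about the previous speaker.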

We benchmarked it against a commercial notetaker

We ran it on a real industry call: 60 minutes, 17 participants, the usual cross-talk and overlapping introductions. We had a commercial AI notetaker's transcript of the same call to compare against. The results:

10 of 11 distinct speakers correctly named. Every participant who actually spoke and introduced themselves was identified — the host, the chairman, and presenters from each organization on the call.
Cleaner transcription. The commercial notetaker fragmented overlapping speech and mis-attributed echoes, splitting one person's sentence across two speaker labels. Our output grouped speech into coherent, correctly attributed turns.
The attendee list fixed what raw ASR got wrong. Speech-to-text transcribed several names phonetically, producing the wrong spelling. The naming pass mapped them back to the correct spellings from the roster. That's the difference between a transcript that looks sloppy and one you can send to a client.
Better attribution. When the host read a roll call of names, the notetaker created separate speaker entries for people who hadn't actually spoken. Our version correctly kept it as the host reading names.
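
The roster-correction idea can be approximated with plain string similarity. The sketch below uses Python's standard difflib as a stand-in; it compares spellings, not sounds, and it is not the AI naming pass described above. The function name, cutoff, and the misspelled sample name are assumptions for illustration.

```python
from difflib import get_close_matches


def correct_to_roster(heard: str, roster: list[str], cutoff: float = 0.6) -> str:
    """Return the closest roster spelling, or the original text if nothing is close."""
    matches = get_close_matches(heard, roster, n=1, cutoff=cutoff)
    return matches[0] if matches else heard


roster = ["Sarah Chen", "Mike Torres", "Priya Patel"]
print(correct_to_roster("Pria Patell", roster))  # maps to the roster spelling
print(correct_to_roster("Zzz", roster))          # no close match, left unchanged
```

Leaving unmatched names untouched matters: forcing every heard name onto the roster would mislabel guests who are not on the attendee list.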

Where it fell short: during a chaotic, overlapping round of introductions, two quiet voices were merged into one cluster. Diarization is acoustic, and very brief or heavily overlapped speakers are hard to separate. We surface a confidence score so you know which labels to trust.
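
One way to act on the confidence score is to flag speakers whose labels fall below a threshold before sending the transcript out. This is a minimal sketch; the field names, the 0.7 threshold, and the sample values are assumptions, not documented Botverse behavior.

```python
def speakers_to_review(segments: list[dict], threshold: float = 0.7) -> dict[str, float]:
    """Return each speaker label with a low-confidence segment and its lowest score."""
    flagged: dict[str, float] = {}
    for seg in segments:
        if seg["confidence"] < threshold:
            label = seg["speaker"]
            flagged[label] = min(seg["confidence"], flagged.get(label, 1.0))
    return flagged


# Made-up sample data in the assumed shape.
sample = [
    {"speaker": "Speaker 0", "confidence": 0.93},
    {"speaker": "Speaker 3", "confidence": 0.55},
    {"speaker": "Speaker 3", "confidence": 0.41},
]
print(speakers_to_review(sample))
```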

Why we built it

The same reason we built the rest of Botverse: it's infrastructure an AI agent should be able to reach for in one call, not reimplement badly inline. An assistant asked to "transcribe yesterday's call and send me the notes" shouldn't be stitching together an ASR API, a diarization library, a naming heuristic, and a document renderer. It should make one call and get a named, formatted transcript. So should you, from the command line, when you just need the minutes of a meeting.

Pricing

About $5 for a 60-minute meeting, with diarization and AI speaker-naming included and no add-ons to think about. Pricing is per audio-minute, billed on completion from the same prepaid wallet as the rest of Botverse. Credits never expire; no subscription, no minimums.
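
For a rough budget, the "about $5 for 60 minutes" figure can be scaled linearly. This is a back-of-the-envelope sketch: the per-minute rate is derived from that approximate figure and is not a published price, and it assumes no rounding or minimum-charge rules.

```python
APPROX_PRICE_PER_60_MIN = 5.00  # assumption: taken from the "about $5" figure


def estimate_cost(audio_minutes: float) -> float:
    """Estimate the charge in dollars, assuming simple linear per-minute billing."""
    return round(audio_minutes * APPROX_PRICE_PER_60_MIN / 60, 2)


for minutes in (30, 45, 60, 90):
    print(f"{minutes} min: about ${estimate_cost(minutes):.2f}")
```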

One call in, a named transcript out. Connect your agent or grab the CLI and get on with the rest of your day.
