How to Transcribe Audio with Scribe v2 — Speech to Text API
ElevenLabs' latest speech-to-text model supports key-term prompting to recognize specific words, entity detection for extracting data such as phone numbers, and speaker diarization for up to 48 speakers, along with dynamic audio tagging and other features.
ElevenLabs’ Scribe V2 nails speech-to-text with key-term prompting, multi-speaker diarization, and solid API examples you can drop into any app.
Summary
ElevenLabs unveils Scribe V2 as part of its best-in-class speech-to-text suite, showing off features like key-term prompting, entity detection, and speaker diarization for up to 48 speakers. The video walks through practical setup steps: obtaining an API key, storing it in an environment variable, and installing the elevenlabs library (a TypeScript example using PNPM). The host demonstrates turning an MP3 with a soft American accent into a transcription, then dives into the transcription payload: language code, word-level timestamps, and per-word probability scores. They explain how to extract clean text and how to enable detectors for entities (personal information, offensive language) and other data types. The demo touches on front-end versus server-side concerns, emphasizing keeping API keys on the server, and ends with a concrete Astro-based project example where the transcription results populate a UI with entities highlighted. If you want a drop-in workflow for building transcripts with precise entity detection, this video lays out the concrete steps and code structure you'll reuse.
- Narrated guidance from ElevenLabs showcases practical integration in Node.js/TypeScript and a deployment mindset for production apps.
Key Takeaways
- Obtain an API key from the ElevenLabs developer panel and enable at least the Speech-to-Text permission to start using Scribe V2.
- Install the elevenlabs library via PNPM and manage environment variables with a .env file for securely storing the API key.
- Convert an MP3 into a Blob and pass that binary data to the Scribe V2 API to receive a detailed transcription object with word-level timestamps.
- Use key-term prompting to recognize brand names (e.g., ElevenLabs) and entity detection to capture data such as phone numbers within transcriptions.
- Leverage diarization to identify up to 48 speakers and entity detection to extract personal information or other specified entities.
- Extract clean text from the transcription by accessing the text property, separate from the full payload.
- Run the server-side Astro example to keep API keys on the server and expose only safe results to the client.
Who Is This For?
Developers and product builders who want to add robust speech-to-text with speaker diarization and entity detection into their apps, especially those using Node.js/TypeScript and Astro for server-side rendering.
Notable Quotes
"ElevenLabs just released the best speech to text model in the world. But instead of talking about how good it is, let me just show you."
—Opening claim emphasizing hands-on demonstration of Scribe V2 features.
"It also supports speaker diarization for up to 48 speakers, which means you can associate and identify 48 different people."
—Highlighting multi-speaker support.
"So at minimum, you need the speech to text here. But if you're just learning, I just do all of them."
—Advising API permission setup for quick start.
"This is just a regular MP3 file of someone talking with a soft and whispery American accent."
—Describing the sample input used for transcription.
"You can extract just the text by calling the property and this way you'll get just a clean text output just like this."
—Demonstrating text extraction from the transcription payload.
Questions This Video Answers
- How do you set up an ElevenLabs Scribe V2 transcription workflow in a Node.js project?
- What is speaker diarization and how does it work with Scribe V2 (up to 48 speakers)?
- How can I enable key-term prompting and entity detection in the ElevenLabs speech-to-text API?
- What are best practices for securely handling API keys when building a transcription feature?
- Can I run ElevenLabs Scribe V2 with Astro or other SSR frameworks and keep keys server-side?
Tags: ElevenLabs, Scribe V2, Speech to Text, Key-term Prompting, Identity Detection, Speaker Diarization, Entity Detection, PNPM, TypeScript, Node.js server-side apps
Full Transcript
ElevenLabs just released the best speech to text model in the world. But instead [music] of talking about how good it is, let me just show you. It supports key term prompting, so you can configure the model to recognize specific words, for example brand names like ElevenLabs, and entity detection, so you can extract contact information like phone numbers. 7738881989. It also supports speaker diarization for up to 48 speakers, which means you can associate and identify 48 different people. It has dynamic audio tagging and much more. And all this can be built into your products using the ElevenLabs libraries.
In this video, I'll show you how you can start transcribing audio with Scribe V2. So, it all starts with the API keys. In order for you to connect to ElevenLabs, you need to go to your ElevenLabs account, click this little developers icon over here, click create an API key, and give access to all the things you need. So at minimum, you need the speech to text permission here. But if you're just learning, I just do all of them. Then you can create your key and start working with it. Once you have it, copy that value and add it to your environment variables.
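The environment-variable step just described could look like this in TypeScript. The variable name `ELEVENLABS_API_KEY` is the conventional one used in the video; adjust it to whatever you put in your .env file.

```typescript
// Read the API key from the environment and fail fast if it is missing.
// ELEVENLABS_API_KEY is an assumed naming convention -- match your .env file.
function getApiKey(
  env: Record<string, string | undefined> = process.env
): string {
  const key = env.ELEVENLABS_API_KEY;
  if (!key) {
    throw new Error("Missing ELEVENLABS_API_KEY -- add it to your .env file");
  }
  return key;
}
```

Failing fast on a missing key turns a confusing 401 from the API into an obvious local configuration error.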
So in your .env file, it should look something like this. Make sure you call it ELEVENLABS_API_KEY. Then you need to install the elevenlabs library using your favorite package installer, depending on which language you're using. Here we're going to use TypeScript, so I'm using PNPM to install elevenlabs. Here it shows I already had it, but if you don't have it, make sure you do pnpm install dotenv. This will make it easier for you to manage your environment variables. The example we're going to create here is just a simple backend function using Node.js that will take an audio file and convert it to a transcription, and it's going to use Scribe V2. I'll explain all the settings, and I want to give a really simple example, because once you understand it, you can apply it to whatever project you want to. To get it all started, we need to create an ElevenLabs client. Now, since we named our API key ELEVENLABS_API_KEY, the ElevenLabs client will automatically just bring it in using the env config. Now, to start a transcription, we need to have a file. There's a public file, which you can find in the docs, of Nicole talking. This is just a regular MP3 file of someone talking with a soft and whispery American accent.
And once we have that file, the MP3, we want to turn it into a blob, which is a binary large object. [music] This is basically the MIME type of what the file is, which is audio/mp3, and it takes the buffer of just bits and bytes and creates a binary large object, which is a file. Now, to get the transcription from this file, we use the ElevenLabs client we created, we use the speech to text API, and we convert this file using our Scribe V2 model, and we can output this directly to our console.
So let's see how this looks. It's loading the file, it's calling the API with the file that we have, and here's the whole transcription object that we get. Notice there are multiple things you get. First, you get the language code, which is English. Here it says the probability: it's pretty certain that this is English, and the transcription is correct: "With a soft and whispery American accent, I'm the ideal choice for creating ASMR content, meditative guides, or adding an intimate feel to your narrative projects." This is very cool. But you also get all the words broken down for you in a nice array.
So here it tells you the start of each word. We have "with", then a space, then "a", then a space, and "soft". It tells you exactly the timestamp when each word started and when it finished, the type of the word, and the text. Now, while I was recording, I was curious what logprob is, and it's the log of the probability with which the word is predicted. So basically, how confident ElevenLabs believes this word was predicted, and zero is the highest. If we look at our data, it looks like all of these were predicted with a very high probability.
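The logprob field just discussed is the natural log of the per-word probability, so `Math.exp(logprob)` recovers the plain probability, and a logprob of 0 means probability 1 (full confidence). A small sketch; the word-object shape here is illustrative, not the exact API schema:

```typescript
// Convert a word's logprob back to a plain probability: p = e^logprob.
// e^0 = 1, so a logprob of 0 means the model is fully confident.
function toProbability(logprob: number): number {
  return Math.exp(logprob);
}

// Illustrative word shape (the real payload has more fields, e.g. start/end).
interface Word {
  text: string;
  logprob: number;
}

// Flag words the model was less sure about, e.g. for manual review.
function lowConfidenceWords(words: Word[], threshold = 0.9): Word[] {
  return words.filter((w) => toProbability(w.logprob) < threshold);
}
```

A review queue built this way lets you surface only the uncertain words instead of proofreading the whole transcript.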
You can extract just the text by calling the text property, and this way [music] you'll get just a clean text output, like this. I talked about a handful of features that ElevenLabs has. For example, you have the key terms, you have the entity detection, multi-channel is somewhere up here, and diarize is just a true or false, so by assigning this flag you get diarization for up to 48 speakers. Entity detection can take either a string, so we can just do PII for people's personal information, or you can pass an array of the different types of entities you want to detect, for example offensive language, personal information, and other types.
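The flags mentioned above might be assembled into a request-options object like this. Every property name here (`diarize`, `entityDetection`, `keyterms`) and the `scribe_v2` model id are assumptions to confirm against the current ElevenLabs speech-to-text reference; only the shape of the idea comes from the video.

```typescript
// Hypothetical Scribe V2 request options -- parameter names are assumptions,
// not the confirmed API schema. Check the ElevenLabs docs before use.
interface ScribeOptions {
  modelId: string;
  diarize?: boolean; // enable speaker diarization (up to 48 speakers)
  entityDetection?: string | string[]; // e.g. "pii" or a list of entity types
  keyterms?: string[]; // bias recognition toward specific words/brands
}

function buildScribeOptions(): ScribeOptions {
  return {
    modelId: "scribe_v2",
    diarize: true,
    entityDetection: ["pii", "offensive_language"],
    keyterms: ["ElevenLabs"],
  };
}
```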
And that's the basics of Scribe V2. It's a very simple API, and really powerful once you put it into your applications, like I did in the app that I showcased in the introduction. That was an Astro application, which is my favorite, I guess, web framework. This application was server-side rendered, so I made sure my API key is on the server. Make sure that's not exposed to the client, because you don't want your API keys in the client. And this is the speech to text. You can see it does the exact same thing here: we create a client and we convert.
We detect the personal information, like phone numbers. We used the Scribe V2 model. It's the exact same code we have; all I did here is add a bunch of checks to make sure that everything works nicely. Now, what I did here on the front end is I tap to record. But remember, we're taking in a file. So I take that recording that we just created, I create a blob, a binary large object file, from it, and I pass this to the speech to text back end, and it gives me the transcription back. Then what I did is display it nicely in this list, and whenever you get the entities, you get a list of entities to extract.
So let me show you that as well. If you are trying to identify entities, I have this console log with the transcription of the entities, and these are the detected entities, which was the phone number. On the front end, I take this, I convert it to numbers, and you can have a cool-looking application like this. The ideas I have for what you can do with this are infinite. So if you're looking to get started with transcription, I recommend you go to elevenlabs.io, create an account, get an API key, and start building with Scribe V2.