I Built a Tool That Makes Supercuts in Seconds
An overview of the super cut concept and how the tool stitches together clips from transcripts.
Wes’s Super Cut Studio and a Grok-powered AI pipeline turn transcripts into fast, automated video super cuts with FFmpeg, showcasing a clever blend of AI, CLI, and front-end polish.
Summary
Syntax host Wes demonstrates a toolchain that automatically creates “super cuts” by matching phrases in AI-generated, word-level transcripts and stitching the corresponding video clips together. The system is Wes’s Super Cut Studio, which analyzes transcripts to find, cut, and assemble clips, using xAI’s Grok speech-to-text API for transcription and FFmpeg for the video work. The idea started when Peter Cooper tested Grok’s speech-to-text API by finding every instance of the word CSS in a Syntax episode; Wes then downloaded 132 Syntax videos, transcribed them, and built a pipeline that searches the transcripts by phrase or regular expression (regex). Wes cites Grok’s accuracy, speed, and speaker diarization as advantages over Whisper, and he highlights cost: roughly 10 cents per hour of video. The workflow emphasizes upfront processing (videos are downloaded and transcribed first), so subsequent super cuts can be generated quickly, often in under a minute. Wes built most of the logic in code and ran it from the command line, then added a polished interface at the end, with features like tile mode (a grid of similar-length clips inserted every four to six clips), sentence and songify for quick phrase-based outputs, and AI-driven regex generation from plain English. Along the way, Wes shares practical tips about FFmpeg usage, such as stitching clips together with a single generated command that carries each clip’s start and stop times, rather than first cutting each segment separately. The episode blends engineering decisions with creative experimentation, including live-demo prompts that reveal what makes this approach feel fast, flexible, and fun to use.
Key Takeaways
- Grok’s speech-to-text API provides word-level transcripts and speaker diarization, enabling accurate labeling of who said what across a set of videos.
Who Is This For?
Essential viewing for developers and video editors curious about building AI-assisted editing pipelines; it shows pragmatic trade-offs between accuracy, cost, and speed when turning long-form content into bite-sized cuts.
Notable Quotes
"This thing is wild. And we're going to be asking Wes all about how he built it."
—Opening tease about Wes's tool and the build focus.
"The pricing on it is 10 cents per hour of video."
—Cost efficiency of Grok-based transcripts vs alternatives.
"FFmpeg has the ability to stitch things together given inputs and you can use the start and stop..."
—Core FFmpeg technique enabling fast, concatenated cuts.
"Tile mode... every four to six and it will try to figure out what are the closest versions to sort of line up with them."
—Explains the tile mode feature for assembling clips.
"Built in SvelteKit and FFmpeg. Pretty fun project to build, especially when you make those hilarious cuts of everything we've said."
—Summary of the tech stack and the humorous creative payoff.
Questions This Video Answers
- How does Grok's transcription API improve accuracy over Whisper for video transcripts?
- What is tile mode in a super cut workflow and how does it choose clip timing?
- Can you build a fast AI-assisted editor using FFmpeg from transcripts to video cutouts?
- What are regexes and how do you write flexible search patterns in transcript-driven editing?
- Why would you use a CLI-first approach before a UI when building a tool like a super cut studio?
FFmpeg, Super Cut Studio, Grok, xAI, Deepgram, Whisper vs Grok, Speaker diarization, Transcript-driven editing, Tile mode, Regex practice on transcripts
Full Transcript
made this really cool Svelte super cut videos together with AI transcripts in the FFmpeg. That's great. That clip was just made with Wes's Super Cut Studio app that he made to analyze transcripts and videos to pull out and cut and then assemble them into a super cut. This thing is wild. And we're going to be asking Wes all about how he built it. Yes. So, this actually started right before episode 1000 and Peter Cooper, who runs JavaScript Weekly, was testing out the new Grok speech-to-text API. It's basically transcription, and he used one of our Syntax episodes to basically find every single time that we said the word CSS.
And I was like, that is a fantastic idea. So, what I did is I downloaded 132 of our last videos and with xAI, they give you word-level transcripts. This is kind of what happens after you've downloaded all of them. I used I actually can't say what I used because I got a strike on my own personal YouTube channel for talking about it once, but I found a way to procure our own episodes and it's extremely accurate. Like, watch this. We started with 16 web developers and the accuracy is super high and it is extremely cheap to do.
And you did this via you said via xAI. How how did you get access to that API? Was it through something like OpenRouter? No, you just sign up, get a Okay. So, you didn't use OpenRouter for this? Put a couple bucks in the machine. No, I didn't didn't use OpenRouter. I just uh straight up through right through the API. Do you know how much it costs like per video or how much your total bill was for all these transcripts? Yeah. So, the pricing on it is 10 cents per hour of video. Like we use Deepgram for our podcast transcripts and we find that to be extremely affordable.
Um, and this is like three times cheaper and it is so good. And I know a lot of people are like, why not just use Whisper or whatever? Whisper's great, but it's well, it's not great. It's It's not nearly as accurate as this and it's not nearly as fast as this because if you're trying to run it locally. Um, and the other thing that this does is diarization, meaning that it will actually tell you who the speakers are. So, if you wanted to be able to like label each of your speakers, you could do that with it.
So, I was like, this is actually kind of nifty and not a lot of people are talking about this new API. What are what are some of your favorite super cuts you've made so far? This is crazy. That was crazy. That sounds dumb. This is the stupidest stupid nuts. This is the stupidest. This is stupid. It was slow. It is ridiculous. You are a This is so stupid. It is so stupid. This is a dumb. That is crazy. I mean, I love that though because that now my developer brain is thinking and it's like that's not just like text.
That's regular expressions. Like this is some word. This and it's it's finding those and putting those together. That's really cool. Incredible. Incredible. Incredible. Incredible. Incredible. Incredible. Yeah. And so sometimes we see the the grid. Is that m That's multiple clips. Yeah. So the way that it works is like it obviously transcribes it and you can go to the transcribe tab and just click on transcribe and that will go off and upload it to Grok. It also has ability to fall back to MP3 because Grok limits at 500 megs. So it just converts it to MP3 if that's the case.
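The MP3 fallback described above could look roughly like the following Python sketch. It is an illustration and not code from Super Cut Studio: the function names, the reading of "500 megs" as 500 MiB, and the 64k bitrate are all assumptions. The FFmpeg flags used (`-vn`, `-c:a libmp3lame`, `-b:a`) are standard options.

```python
# Assumed reading of the "500 megs" upload limit mentioned in the video.
MAX_UPLOAD_BYTES = 500 * 1024 * 1024


def needs_audio_fallback(size_bytes: int, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when a file is too large to upload as-is."""
    return size_bytes > limit


def mp3_fallback_command(src: str, dst: str, bitrate: str = "64k") -> list[str]:
    """Build an FFmpeg command that drops the video track and encodes MP3 audio."""
    return [
        "ffmpeg", "-y",          # overwrite the output if it exists
        "-i", src,
        "-vn",                   # no video stream in the output
        "-c:a", "libmp3lame",    # MP3 encoder
        "-b:a", bitrate,         # hypothetical bitrate, chosen for speech
        dst,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` when the size check says the original file is too large.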
But then on the super cut page here, you can give it a phrase, a regex, or you can the one thing I had problems with was actually writing the regexes that would be flexible enough because you can search for that is incredible, but then you also want that's incredible or you're incredible or that's so incredible, you know? So being able to write all these regexes. So what I did is I added an AI plain English one that simply just is a tool call and returns a regex that is a little bit more flexible and it it works really well.
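To make the flexibility problem concrete, here is a small Python sketch of one pattern that covers the variants mentioned above. The pattern itself is hypothetical (it is not the one the tool or the AI generated); it only shows the idea of optional subjects, contractions, and intensifiers.

```python
import re

# Hypothetical pattern: a subject (with or without contraction),
# an optional intensifier, then the target word.
INCREDIBLE = re.compile(
    r"\b(?:that(?:['’]s| is)|you(?:['’]re| are)|it(?:['’]s| is)|this is)"
    r"\s+(?:(?:so|really|pretty)\s+)?incredible\b",
    re.IGNORECASE,
)


def matches(text: str) -> bool:
    """Return True if the text contains one of the accepted variants."""
    return INCREDIBLE.search(text) is not None
```

A literal search for "that is incredible" would miss three of the four variants listed in the transcript, while this pattern accepts all of them and still rejects a bare "incredible".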
So, the tile mode that you're talking about, it will every between, in my case, every four and six um Incredibles or every four and six clips, it will try and insert like a comedic the them saying the same words and it will try to figure out what are the closest versions to sort of line up with them because sometimes you would say like that's stupid and then sometimes you may say that's stupid really fast. So, it'll try to find the ones that are similar length. Um, like there's here tile minimum min duration, max duration, and it'll just kind of I had to turn these knobs quite a bit to actually make it sound cool.
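The "similar length" part of tile mode can be sketched in Python as follows. This is an assumption-heavy illustration: the `Clip` type, the `pick_tile` function, and the default values are invented here, and only the idea of minimum and maximum duration knobs comes from the video. It does not cover the "every four to six clips" interleaving.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    source: str
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def pick_tile(clips: list[Clip], size: int = 4,
              min_duration: float = 0.3, max_duration: float = 1.5) -> list[Clip]:
    """Return `size` clips whose durations are closest together, or [] if too few fit."""
    candidates = sorted(
        (c for c in clips if min_duration <= c.duration <= max_duration),
        key=lambda c: c.duration,
    )
    if len(candidates) < size:
        return []
    # Slide a window over the sorted clips and keep the one with the smallest spread.
    best = min(
        range(len(candidates) - size + 1),
        key=lambda i: candidates[i + size - 1].duration - candidates[i].duration,
    )
    return candidates[best:best + size]
```

Sorting first means the tightest group is always a contiguous window, so one pass over the sorted list is enough.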
But once I did get them turned in the right direction, then it started to like actually feel really good. Nice. This is so cool. I got to ask you something. Uh, uh, none none about the transcriptions or the generation part of this. How did you design this thing? cuz it looks cool as hell, man. This was mostly mostly AI designed. Um, but I did I prompt it in in a specific way. I don't I don't know if I did anything special here. I just said like bold letters. I went and picked the font. Um, I gave it a couple examples of like underline.
Um, and I went in and just changed a few things and then I said, "All right, apply this to the rest of the website." Wow. You know, the font is like probably 80 90% of this cuz like I really like the e the cut in on the e. Yeah, I think that's what it is. Yeah, that's good. Here, let's try this thing. Tile mode, run super cut. So, what it does is it will go through every single transcript. You see it's parsing through all of the different transcripts and then it's finding anything that matches the phrase or regex and then in this case it found 100 and then it's tiling them every four to six and it goes and clips them out with FFmpeg.
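Matching a phrase against a word-level transcript and recovering start and stop times might work like this Python sketch. The transcript shape (a list of words with `text`, `start`, and `end` keys) is an assumption, not the actual format returned by the API, and `find_phrase` is a hypothetical helper.

```python
import re
from bisect import bisect_right


def find_phrase(words: list[dict], pattern: str) -> list[tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, matched_text) for each regex hit.

    `words` is assumed to look like [{"text": "hi", "start": 0.0, "end": 0.2}, ...].
    """
    # Record the character offset at which each word begins in the joined text.
    offsets, pos = [], 0
    for w in words:
        offsets.append(pos)
        pos += len(w["text"]) + 1  # +1 for the joining space
    text = " ".join(w["text"] for w in words)

    hits = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        if m.end() == m.start():
            continue  # skip empty matches
        first = bisect_right(offsets, m.start()) - 1   # word containing the first char
        last = bisect_right(offsets, m.end() - 1) - 1  # word containing the last char
        hits.append((words[first]["start"], words[last]["end"], m.group(0)))
    return hits
```

Each hit carries the start time of its first word and the end time of its last word, which is the information a clipping step needs.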
So, it's kicking off a little, this whole thing is built in Svelte, but it kicks off a child process um on the CLI to FFmpeg and it basically cuts out of the WebM videos that I downloaded. It cuts the start and stop value plus some there's ability to make it longer or shorter um in the code, not in the UI. I built most of this in the code. Um and I just ran it from the CLI CLI. Do people even say CLI? I ran it from the command line and then at at the very end I just built like a little I said, "All right, I'm this is working now.
Build out the the interface to it." And I find building stuff like that is a lot better than trying to mix and mash like UI with actual functionality. Let's see what we got here. thing. This thing. There are some really funny ones where I I search like uh ugg um we should show how long it took to generate that. Like don't cut it because that was really fast. And I think it also comes down to to just like the the engineering decisions made to do this cuz like if you're designing this you may not think oh I need to get all of the videos first, get all the transcripts first and then create the super cut.
That's what makes it so fast is you you do that upfront processing and now it's kind of just ready to go. We can prompt this for anything and it literally creates the clip in like under a minute which is really cool. The crazy thing is that it's working on file system so it has to load hundreds of JSON files first and then parse through them. That's probably the slowest part. If this was in like a SQLite database it would be even faster. Yeah. Um um um um um um uh can we run Prime through this?
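The SQLite idea mentioned above could be sketched like this in Python. Nothing here comes from the project itself: the table layout, function names, and the assumed transcript shape (words with `text`, `start`, and `end`) are illustrative only.

```python
import sqlite3


def build_word_index(conn: sqlite3.Connection, episodes: dict[str, list[dict]]) -> None:
    """Load every transcribed word into one indexed table.

    `episodes` maps an episode id to its word list (assumed shape).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "episode TEXT, idx INTEGER, word TEXT, start_s REAL, end_s REAL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS words_word ON words (word)")
    rows = [
        (episode, i, w["text"].lower(), w["start"], w["end"])
        for episode, words in episodes.items()
        for i, w in enumerate(words)
    ]
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def find_word(conn: sqlite3.Connection, word: str) -> list[tuple]:
    """Return (episode, start_s, end_s) for every occurrence of a single word."""
    cur = conn.execute(
        "SELECT episode, start_s, end_s FROM words WHERE word = ? ORDER BY episode, idx",
        (word.lower(),),
    )
    return cur.fetchall()
```

With an index on the word column, a single-word lookup no longer requires loading and parsing hundreds of JSON files on each search. Multi-word phrases and regexes would need extra work on top of this.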
I want to see him say a bunch. All right, so we'll go to Prime and simply just search for Vim. Yeah, it'll tell us. Here we go. It says there's 40 hits of him saying Vim um in estimated 8 and 1/2 seconds. So, let's let's go ahead and run that here. While that's running, let me do another one with the AI as well. It's not as fast. Super cut AI. Okay. So, like what's something funny that is dumb, stupid, silly, etc., right? And we'll do generate regex. Again, I'm going to speak to the engineering of this.
You're using background threads because so like you're able to ask for a new one while the other one is still generating. It doesn't freeze up the UI. That's another thing to think about. Yeah. Um, I'm using Grok high reasoning here, which is probably not a good idea cuz like this is taking forever to just generate a regex. Like this would be a a quick call. Um, you don't need a a high model to make a regex. Yeah, for sure. There we go. Look at this. Look at this regex that it has here. Okay, good. I've reviewed it.
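The point about starting a second job while the first is still running can be illustrated with a small Python sketch. The video does not show how the app schedules its work, so this is only one possible approach, using the standard-library thread pool; `render_supercut` and `start_job` are placeholder names.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def render_supercut(query: str) -> str:
    """Placeholder for the slow work (searching transcripts, running FFmpeg)."""
    return f"supercut for {query!r}"


# A small pool lets several jobs run without blocking the caller.
executor = ThreadPoolExecutor(max_workers=2)


def start_job(query: str) -> Future:
    """Queue a job and return immediately with a handle to its result."""
    return executor.submit(render_supercut, query)


first = start_job("vim")
second = start_job("crazy")  # accepted while the first job may still be running
```

Calling `first.result()` blocks only when the caller actually needs the output, so a UI can keep accepting new requests in the meantime.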
Let's see. Dry run. Nine hits. Okay. Uh, let's go ahead and run it for prime. Yeah. Okay. Let's go back to the other one to see. There we go. You guys ready? Yes. Amazing. We have We have to We have to share this with him. This is so good. Like that's amazing. Oh, that's great. Let's see what how Oh, this one's done already. That's kind of crazy. That's crazy. This looks crazy. That's an insane. That's kind of an insane. That's how stupid. That's insane. That's crazy. That's kind of a crazy. It's good. Insane. Another feature we have here is sentence and songify.
So sentence will allow you to simply just take like I love to listen to the syntax podcast and Wes is my favorite. I don't know if he's ever said Wes. Build sentence. So, it found I love to listen to the syntax podcast. No clip for podcast. Dang. Got to get more more clips in there. Yeah. I also built a songify which is not very good. But essentially what it does is it finds the word and then it will find the clips and try to match the tone. Does it do any pitch shifting or it's just using an existing pitch?
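The sentence feature, including the "no clip for podcast" case, can be sketched in Python like this. The index shape (a word mapped to a list of `(source, start, end)` clips) and the `build_sentence` helper are assumptions made for illustration.

```python
def build_sentence(sentence: str, index: dict[str, list[tuple[str, float, float]]]):
    """Pick one clip per word and report the words that have no clip.

    `index` maps a lowercase word to the clips where it was spoken (assumed shape).
    """
    picked, missing = [], []
    for raw in sentence.lower().split():
        word = raw.strip(".,!?")
        clips = index.get(word)
        if clips:
            picked.append((word, clips[0]))  # naive choice: the first known clip
        else:
            missing.append(word)
    return picked, missing
```

Returning the missing words separately makes it easy to show a message like the one in the demo instead of failing the whole sentence.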
It does both. It it tries this one's trying to detect the actual pitch. Oh, it's actually pretty good. Can you copy paste that melody multiple times so we get like a uh Yes, but it'll re it'll reuse the So like what we essentially need is a melody of like Twinkle, Twinkle, Little Star. Vim voom. All right, here here we go. Here's Twinkle, Twinkle, Little Star of Prime saying Vim. I couldn't hear it at all. I couldn't hear. You couldn't hear it at like I could hear I I couldn't hear the melody like it wasn't okay.
Yeah. Can you talk a bit about the FFmpeg flow? So are you cutting clips, writing them to disk and then concatenating or is there a way to do it man like basically pass in a list of things you need and it FFmpeg handles it? Yeah, FFmpeg has the ability to stitch things together given inputs and you can you can use the start and stop and that's why it's so fast is it doesn't need to like first clip it clip it. It literally just say grab these second durations. So it generates an FFmpeg command that's like really long of like every single clip with start end in it.
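One way to generate a single long FFmpeg command with every clip's start and end in it is sketched below in Python. The exact command the tool generates is not shown in the video, so this is an assumption: each clip becomes its own input with `-ss` and `-to`, and FFmpeg's `concat` filter joins them. It also assumes every source has one video and one audio stream with matching resolution, which the `concat` filter requires. The `pad` parameter is a hypothetical stand-in for the "make it longer or shorter" setting mentioned earlier.

```python
def build_concat_command(clips: list[tuple[str, float, float]],
                         output: str, pad: float = 0.0) -> list[str]:
    """Build one FFmpeg command that reads each clip range and concatenates them.

    `clips` is a list of (source_path, start_seconds, end_seconds).
    """
    if not clips:
        raise ValueError("need at least one clip")
    args = ["ffmpeg", "-y"]
    for source, start, end in clips:
        # -ss/-to placed before -i limit how much of that input is read.
        args += ["-ss", f"{max(0.0, start - pad):.3f}",
                 "-to", f"{end + pad:.3f}",
                 "-i", source]
    # Feed every input's video and audio stream into the concat filter in order.
    streams = "".join(f"[{i}:v][{i}:a]" for i in range(len(clips)))
    graph = f"{streams}concat=n={len(clips)}:v=1:a=1[v][a]"
    args += ["-filter_complex", graph, "-map", "[v]", "-map", "[a]", output]
    return args
```

Passing the result to `subprocess.run(cmd, check=True)` runs the whole super cut in one FFmpeg process, with no intermediate clip files written to disk.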
Oh wow. Exactly. And I have made so many FFmpeg projects in the past and I've used every single FFmpeg adapter and nothing beats just straight up FFmpeg commands that get pasted into the console and run. Especially AI is so good at knowing all the presets for them. Yeah. Say goodbye. One thing I want to hook it up to is the Internet Archive has the ability to search old like news clips and you can search for like something funny like uh murderer or I did one one video once versus JavaScript. Yeah, probably not murderer, but your your example of something hilarious is murderer, but the the transcripts that the Internet Archive gives you are only like sentence-based, right?
So, it would be cool to like download those clips and then clip right to the exact word that you're looking for. Um, it's it's a meme I think on like local news shows they'll be like, "Can't believe it's May already." And they just super cut like every single news channel. Like I can't believe it's May already. I can't believe it's May. So that that could be All right, you guys ready for the Let's do 2x playback speed. By the way, by the way, by way by way by the way, this is the O YouTuber. By the way, by the way, by the way, I'm still upset about doing a live stream.
By the way, by the way, by the way, by the way, by the way, by the way, by way. By the way, by the way, by way, you somehow stumbled on one of the things he says the most. That's awesome. Yeah, that's great. Yeah. Nifty. Nifty. Nifty. Nifty. Nifty. Nifty. Nifty. Nifty. Nifty. CJ, you said nifty. I said it once. Yeah. I don't think I've ever said nifty. CJ's saying nifty when he's talking about a a math modulus function. It's pretty nifty. I want to see just like a one or two second clip with a grid of like 100.
So like everyone just saying the same thing at once like a grid. Can Can we change the code? Tile tolerance. Here we go. 89. And that's only number four of 13. So it's going to find a few probably a few more. Oh, you're right. I shouldn't have limited it to 100 segments. Solo, solo, solo. I think the rest will be solo. Oh, 88. Yes. Go. Are you guys ready? Be doing a live stream. By the way, by the way, by the way, this is the O dev YouTuber. Why? By the way, by the way, I'm still upset about reconnecting websockets.
Okay. It's been by the way, by the way. By the way, by the way, by the way, by the way, that's it. I'll post the code up on GitHub if you want to take a peep at it. It's built in SvelteKit and FFmpeg. Pretty fun project to build, especially when you make those hilarious cuts of everything we've said. And now I'm second-guessing every word I'm saying.









