Last night, I started watching a recent show that includes dialogue in multiple languages, so naturally, I wondered if I could use OpenAI’s Whisper model to transcribe and translate the audio into subtitles in real time. I hadn’t used it before, so it took some initial research and fiddling around until it worked. Let’s check it out!

What’s Whisper, anyway?

Released in September 2022, Whisper is a model trained by OpenAI to recognize, transcribe, and translate speech in multiple languages. Interestingly, translation is not an afterthought but is embedded within the model, so you can either run a simple transcription or automatically translate the detected speech into English. The any-to-English speech translation case was exactly what I needed.

Ideally, I would pass in the audio stream in real time and have Whisper transcribe and translate the content to English, detecting the language without any hints. To get started, I had to capture my Mac’s computer audio to pass it into Whisper. Unfortunately, Whisper isn’t designed for handling streams; instead, it accepts audio files and processes them in a sliding 30-second window. That’s why I had to record and save audio, then load it into Whisper. You might already be thinking that this fact alone would hurt the latency, and you’d be right. We’ll check that out at the end.

Let’s not get ahead of ourselves, though, and start by recording our computer audio. macOS has some quirks, including the fact that you can’t easily record computer audio (i.e. the combined output of all applications and the system) without installing kernel extensions or performing some magic that effectively does the same. I installed Audio Hijack and Loopback to create a virtual loopback audio device, which I could then consume with PyAudio. The heart of record_audio() is a loop that collects chunks of audio:

    frames = []
    for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
        frames.append(stream.read(CHUNK))  # stream: the opened PyAudio input stream

You might already be wondering about the parameters here, and we’ll get to them in a bit, I promise! Another important detail you have to double-check is input_device_index=2, which specifies the device to capture audio from. To get the device index for your loopback device, you can open a Python REPL and list the audio devices PyAudio can see, for example with get_device_count() and get_device_info_by_index().

Passing task="translate" to the transcribe call asks Whisper to translate instead of just transcribing:

    print(whisper.transcribe(model, audio, task="translate"))

Running our program again should now return the English translation for audio in any other language.

We’ve got all our building blocks prepared, so let’s assemble everything in one final step. To come up with an acceptable solution, we have to think about what we want. As a viewer, I wanted to get subtitles displayed in (almost) real time. This is a really hard task for our model to accomplish, as it receives a very short slice of computer audio with no other context attached and has to detect the language, transcribe the content, and then translate the result.

The closer we want to get to real-time transcription, the shorter our captured clips would ideally be. But if we cut off too much, we make it harder for the model. Ideally, Whisper expects the full 30 seconds of audio, including complete sentences. We’re passing in clips and hoping it works, so take this with a grain of salt.

A naive approach to real-time transcription is setting the RECORD_SECONDS parameter of record_audio() as low as possible while still receiving acceptable results from Whisper, then recording and transcribing in turn:

    record_audio()
    # read the WAV file into Whisper
    audio = whisper.load_audio("output.wav")
    print(whisper.transcribe(model, audio, task="translate", fp16=False))

This approach makes it easy to spot the limitations of our real-time transcription idea. While processing, we are not capturing the audio played in the meantime, so we’re dropping subtitles. This could be improved by continuously recording and transcribing, making sure there’s never a time slice where only the transcription task is running.
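The idea of recording and transcribing continuously, so that capture never pauses while Whisper is busy, could look roughly like the sketch below. It is not code from this setup, only one possible shape: a recorder thread pushes clips into a queue while a second thread works through them. The names run_pipeline, record_chunk, transcribe, and num_chunks are made up for illustration; in practice record_chunk would wrap the PyAudio recording step and transcribe would wrap the Whisper call.

```python
import queue
import threading


def run_pipeline(record_chunk, transcribe, num_chunks):
    """Record and transcribe concurrently so no audio is skipped.

    record_chunk() -> one recorded clip (hypothetical callable)
    transcribe(clip) -> text for that clip (hypothetical callable)
    """
    clips = queue.Queue()
    results = []

    def recorder():
        # Producer: captures clips back to back and never waits for
        # the transcription of an earlier clip to finish.
        for _ in range(num_chunks):
            clips.put(record_chunk())
        clips.put(None)  # sentinel: no more clips are coming

    def transcriber():
        # Consumer: processes clips in the order they were recorded.
        while True:
            clip = clips.get()
            if clip is None:
                break
            results.append(transcribe(clip))

    threads = [
        threading.Thread(target=recorder),
        threading.Thread(target=transcriber),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Stand-ins for real recording and Whisper calls, for illustration only.
    counter = iter(range(3))
    print(run_pipeline(lambda: f"clip-{next(counter)}", str.upper, 3))
```

One caveat with this design: if transcribing a clip takes longer than recording one, the queue keeps growing and the subtitles fall further and further behind the audio, so the clip length still has to be tuned against how fast the model runs.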