Using Youtube-DL and CMU Sphinx Speech Recognition to transcribe youtube audio files.

Posted by Renato Recio on May 14, 2016

Audio Transcription:
Use youtube-dl and sphinx to turn your audio files into transcribed text output.

This tutorial goes over the steps required to download and transcribe audio files from YouTube (or anywhere else).

The idea came to me when I realized I had to dig through 45 minutes of YouTube video to find a specific keyword, and jumping around on the slider wasn't getting me very far. Unfortunately, this transcription isn't an end-all solution: the speech recognition still isn't fully there yet, and the actual processing of the audio takes time. But overall, you'll find it's not that bad, and it's also interesting. In fact, Google uses its own speech recognition to provide closed captions (CC) on YouTube videos.

So let's go ahead and get the tools we need for the job. For Ubuntu 14.04, these are the steps required:

sudo apt-get install -y libav-tools
sudo apt-get install swig
sudo pip install SpeechRecognition
sudo pip install pocketsphinx
sudo pip install youtube-dl
sudo apt-get install python-pyaudio
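
Before moving on, it can help to confirm that everything is actually reachable. The following is a minimal sketch and not part of the original script: missing_tools is a hypothetical helper name, it assumes that libav-tools provides the avconv binary and that pip put youtube-dl on your PATH, and it needs Python 3.3 or later for shutil.which.

```python
import importlib
import shutil


def missing_tools(commands, modules):
    """Return the commands not found on PATH plus the modules that fail to import."""
    missing = [c for c in commands if shutil.which(c) is None]
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed names: avconv ships with libav-tools; the module names match
    # the packages installed with pip above.
    print(missing_tools(["youtube-dl", "avconv"],
                        ["speech_recognition", "pocketsphinx"]))
```

An empty list means nothing was reported missing; any name that shows up points at the install step to repeat.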

Now that you have the proper packages installed, we can build out a simple application to utilize them.

import speech_recognition as sr
import argparse
import subprocess
import os

def download_video(url):
    print('Downloading youtube video...')
    FNULL = open(os.devnull, 'w')
    # Pass the arguments as a list so the url is never interpreted by a shell
    ydl = subprocess.Popen(['youtube-dl', url, '-o', 'youtube_audio.%(ext)s',
                            '--audio-format', 'wav', '--extract-audio'],
                           stdout=FNULL)
    # Wait for it to finish downloading
    ydl.wait()
    FNULL.close()
    print('Download complete!')
    return open("youtube_audio.wav", 'rb')

def read_video(file_name):
    print('Reading youtube wav file')
    try:
        # Load the whole wav file and run it through Sphinx
        r = sr.Recognizer()
        with sr.AudioFile(file_name) as source:
            audio = r.record(source)
        output = r.recognize_sphinx(audio)
    except IOError:
        output = 'Unable to find the audio file.'
    except sr.UnknownValueError:
        output = 'Error reading audio'
    return output

def main(args=None):

    # Download the file, or open a local wav file
    downloaded = bool(args.youtube_url)
    if downloaded:
        file_name = download_video(args.youtube_url)
    elif args.file_name:
        file_name = open(args.file_name, 'rb')
    else:
        raise SystemExit('Pass either --youtube-url or --file-name.')

    # Read the file
    transcription = read_video(file_name)

    # Print the transcription
    print('Finished reading wav file. \n\nTranscription:')
    print(transcription)

    # Delete the file (only the downloaded copy, never your own wav file)
    file_name.close()
    if downloaded:
        os.remove(file_name.name)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--youtube-url", help="Enter the full url of the youtube video to read from.")
    parser.add_argument("--file-name", help="Enter the file name of a wav file to read from.")
    args = parser.parse_args()
    main(args)

Now for a little explanation...

Basically, the program has two main steps.

One, it takes a command-line argument called --youtube-url. With this url, it uses youtube-dl to download the video and extract the audio as a wav file in your local directory. Alternatively, you can skip the download and pass in your own wav file with --file-name.
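
If you want argparse itself to enforce that exactly one of the two inputs is given, a mutually exclusive group is one way to do it. This is a minimal sketch under that assumption (the parser in the script above accepts both flags or neither); build_parser is a hypothetical helper name and speech.wav is a made-up file name.

```python
import argparse


def build_parser():
    """Build a parser that requires exactly one audio source."""
    parser = argparse.ArgumentParser(description="Transcribe audio with Sphinx.")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--youtube-url",
                        help="Full url of the youtube video to read from.")
    source.add_argument("--file-name",
                        help="File name of a wav file to read from.")
    return parser


if __name__ == "__main__":
    # Example argument list; a real script would call parse_args() with no
    # arguments so that sys.argv is used instead.
    args = build_parser().parse_args(["--file-name", "speech.wav"])
    print(args.youtube_url, args.file_name)
```

With this in place, running the script with no flags, or with both, makes argparse print a usage error and exit instead of failing later.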

Two, once the audio is available, we use Sphinx speech recognition to open the wav file and read it. The recognizer returns a string, which we print to stdout; the program then deletes the downloaded file and terminates.
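
Since my original goal was finding a keyword inside a long video, one possible extension is to transcribe the audio in fixed-length chunks so each piece of text carries a rough timestamp. This is a sketch under assumptions, not what the script above does: transcribe_chunks and find_keyword are hypothetical helpers, the 30-second chunk length is arbitrary, and it relies on Recognizer.record accepting a duration argument and picking up where the previous call stopped.

```python
def transcribe_chunks(wav_path, chunk_seconds=30):
    """Yield (start_second, text) pairs for consecutive chunks of a wav file."""
    # Imported here so find_keyword below can be used without the library.
    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        start = 0
        while True:
            # Each call reads the next chunk_seconds of audio from the file.
            audio = r.record(source, duration=chunk_seconds)
            if not audio.frame_data:
                break  # reached the end of the file
            try:
                text = r.recognize_sphinx(audio)
            except sr.UnknownValueError:
                text = ""  # nothing intelligible in this chunk
            yield start, text
            start += chunk_seconds


def find_keyword(chunks, keyword):
    """Return the start times (in seconds) of chunks whose text contains keyword."""
    keyword = keyword.lower()
    return [start for start, text in chunks if keyword in text.lower()]
```

Something like find_keyword(transcribe_chunks("youtube_audio.wav"), "sphinx") would then return the approximate offsets worth jumping to on the slider. A word that straddles a chunk boundary can be missed, so treat the result as a hint rather than an index.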

And there you have it! We now have a simple program that combines two very cool technologies and generates a transcription from audio! Keep in mind, the speech recognition is obviously not perfect. If you are interested, Google and IBM also offer speech recognition services, and the SpeechRecognition library (imported as sr in the script) has support for them built in. You can find more information on that here.
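
To give a feel for swapping engines, here is a hedged sketch that tries Sphinx first and falls back to another recognizer. The helper name transcribe_with_fallback is hypothetical; it assumes the recognizer object exposes methods named recognize_<engine> (the SpeechRecognition library provides recognize_sphinx and recognize_google, among others), and keep in mind that the online engines need network access and may need an API key.

```python
def transcribe_with_fallback(recognizer, audio, engines=("sphinx", "google")):
    """Try each engine in order; return (engine_name, text) from the first success."""
    errors = {}
    for name in engines:
        recognize = getattr(recognizer, "recognize_" + name)
        try:
            return name, recognize(audio)
        except Exception as exc:
            # With SpeechRecognition these are typically sr.UnknownValueError
            # (unintelligible audio) or sr.RequestError (engine unavailable).
            errors[name] = exc
    raise RuntimeError("All engines failed: {}".format(errors))
```

With r = sr.Recognizer() and audio = r.record(source) as in read_video, calling transcribe_with_fallback(r, audio) returns which engine produced the text along with the text itself.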

Thanks for reading :)