Python Speech Recognition and Audio Transcription

Speech recognition is the process of automatically recognizing human speech. Speech recognition systems have a wide range of applications, from human-computer interaction to automatic speech transcription. Personal assistants like Google’s Assistant, Amazon Alexa and Apple’s Siri use speech recognition systems to recognize human voices and respond accordingly.

In this tutorial, we’ll show you how to perform speech recognition with Python. You’ll see how to recognize human speech from live microphones as well as from pre-recorded audio files. Recognizing human speech from audio files using Python is particularly powerful since it lets you automatically transcribe recordings for free.

Installing Required Libraries

You need to install two libraries before you can perform speech recognition in Python. The first is the SpeechRecognition library and the second is the PyAudio library, which SpeechRecognition uses to read from microphones. Execute the following commands in your terminal to install these libraries:

pip install SpeechRecognition
pip install PyAudio

If you’re running Windows, you may need to install PyAudio directly from the wheel for your version of Python.
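
If you want to confirm that both installs worked before going further, a quick check along these lines may help. This is only a sketch: the helper name missing_modules is made up for this example, and it assumes the two packages are imported as speech_recognition and pyaudio.

```python
import importlib.util


def missing_modules(module_names):
    """Return the module names that Python cannot find in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# An empty list means both libraries can be imported.
print(missing_modules(["speech_recognition", "pyaudio"]))
```

If either name shows up in the printed list, rerun the matching pip command in the same Python environment you use to run your scripts.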

Speech Recognition from Microphones

In this section, we’ll show you how to recognize speech from live microphones. In the next section, we’ll demonstrate how to recognize speech from audio files, which is fantastic for transcribing recordings or interviews.

As a first step, execute the following script to import the SpeechRecognition library.

import speech_recognition as sr

Since you will be recognizing speech from a microphone, you can print the names of all the microphones attached to your system using the following script:

sr.Microphone.list_microphone_names()

Depending upon the microphones attached to your system, you may see a different output than the one below:

Output:

['Microsoft Sound Mapper - Input',
 'Microphone (USB PnP Audio Devic',
 'Microphone (Realtek High Defini',
 'Microsoft Sound Mapper - Output',
 'Speakers (Realtek High Definiti',
 'Microphone Array (Realtek HD Audio Mic input)',
 'Speakers (Realtek HD Audio output)',
 'Stereo Mix (Realtek HD Audio Stereo input)',
 'Headphones ()',
 'Microphone (USB PnP Audio Device)']
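
The position of a name in this list is also its device index, so you can pick a specific microphone instead of the system default. The sketch below shows one way to do that. The helper names find_microphone_index and open_microphone are made up for this example, and the keyword "usb" is just an assumption based on the sample output above.

```python
def find_microphone_index(names, keyword):
    """Return the index of the first device whose name contains keyword, else None."""
    keyword = keyword.lower()
    for index, name in enumerate(names):
        if keyword in name.lower():
            return index
    return None


def open_microphone(keyword):
    """Create a Microphone for the first device matching keyword."""
    # Imported inside the function so the helper above works on its own.
    import speech_recognition as sr

    names = sr.Microphone.list_microphone_names()
    index = find_microphone_index(names, keyword)
    # device_index=None falls back to the system default microphone.
    return sr.Microphone(device_index=index)
```

For example, open_microphone("usb") would select the USB microphone in the list above, assuming it is still plugged in.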

To recognize speech from a microphone, you first have to create an object of the Microphone class of the SpeechRecognition module as shown in the following script:

microphone = sr.Microphone()

Next, you have to create an object of the Recognizer class from the SpeechRecognition module.

recognizer = sr.Recognizer()

The Recognizer class contains methods that use different underlying APIs for speech recognition. To list the methods and attributes of the Recognizer class, you can pass the class to Python’s built-in dir() function as shown below:

dir(sr.Recognizer)

Output:

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'adjust_for_ambient_noise',
 'listen',
 'listen_in_background',
 'recognize_api',
 'recognize_bing',
 'recognize_google',
 'recognize_google_cloud',
 'recognize_houndify',
 'recognize_ibm',
 'recognize_sphinx',
 'recognize_wit',
 'record',
 'snowboy_wait_for_hot_word']
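
Most of that output is Python’s built-in dunder methods. If you only care about the recognition backends, you can filter the list by name prefix. A minimal sketch, where the helper name recognition_methods is made up for this example:

```python
def recognition_methods(recognizer_class):
    """Return the names of the recognize_* methods defined on a class."""
    return [name for name in dir(recognizer_class)
            if name.startswith("recognize_")]
```

Calling recognition_methods(sr.Recognizer) should return only the names that start with recognize_. The exact list depends on the version of the library you have installed.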

You can see that the Recognizer class contains methods such as recognize_bing(), which uses the Bing Speech API, recognize_google(), which uses the Google Speech API, recognize_ibm(), which uses the IBM speech API, and so on. In this article, we’ll be using recognize_google() since the library ships with a default API key for it, so you don’t need to supply your own key or login credentials.

Up to this point, we’ve created an object of the Microphone class and an object of the Recognizer class. Now we are ready to recognize speech from our microphone. Execute the following script to start speech recognition:

with microphone as micro_audio:
    # Calibrate to the background noise first (this listens for about one second)
    recognizer.adjust_for_ambient_noise(micro_audio)

    print("Start Speaking ...")
    # Record until the speaker pauses
    audio = recognizer.listen(micro_audio)

    print("Converting your speech to text...")
    print("Did you say: " + recognizer.recognize_google(audio) + "?")

In the above script, the opened microphone (micro_audio) is passed to the adjust_for_ambient_noise() method. This method doesn’t remove noise from the recording. Instead, it listens to the microphone for about one second and calibrates the recognizer’s energy threshold so that speech can be told apart from background noise. Next, the micro_audio object is passed to the listen() method, which records until it detects a pause and returns an AudioData object, audio. The audio object is passed to the recognize_google() method, which sends the recording to the Google Speech API and returns the recognized text. Execute the above script and say something into your microphone. Once you stop talking, the text of your speech will be printed on your screen.

Here is an example output. Naturally, yours will be different depending on what you spoke into your microphone.

Output:

Start Speaking ...
Converting your speech to text...
Did you say: hello how are you and what are you doing where have you been for the last 3 days?
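
The script above assumes everything goes well. In practice, recognize_google() raises an exception when the speech can’t be understood or the service can’t be reached, and listen() can wait indefinitely if nobody speaks. The following is a hedged sketch of how you might guard against those cases. The function name listen_and_transcribe and the timeout values are assumptions chosen for this example, not part of the library.

```python
def listen_and_transcribe(recognizer, source, timeout=5, phrase_time_limit=15):
    """Return the recognized text, or None if nothing usable was heard."""
    # Imported inside the function so it can be defined before the library loads.
    import speech_recognition as sr

    try:
        # Give up if no speech starts within `timeout` seconds and
        # stop recording after `phrase_time_limit` seconds of speech.
        audio = recognizer.listen(source, timeout=timeout,
                                  phrase_time_limit=phrase_time_limit)
        return recognizer.recognize_google(audio)
    except sr.WaitTimeoutError:
        print("No speech detected before the timeout.")
    except sr.UnknownValueError:
        print("Sorry, the speech could not be understood.")
    except sr.RequestError as error:
        print(f"Could not reach the recognition service: {error}")
    return None
```

You would call it inside the same with block as before, for example text = listen_and_transcribe(recognizer, micro_audio), and check whether the result is None before using it.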

Transcribing Text from Audio Files

Transcribing text from audio files is quite similar to speech recognition from a microphone. You can use the Recognizer class’s recognize_google() method to do so. However, in this case you need to create an object of the AudioFile class instead of the Microphone class. The path to the audio file that contains speech is passed to the constructor of the AudioFile class as shown below:

audio_data = sr.AudioFile('E:/audio_file.wav')

Next, you need to create an object of the Recognizer class, just like we did before.

recognizer = sr.Recognizer()

After that, the opened audio file, file_audio, is passed to the adjust_for_ambient_noise() method, which calibrates the recognizer to the noise level in the recording. Note that this call reads, and therefore consumes, roughly the first second of the file. Then, the file_audio object is passed to the record() method of the recognizer object, which reads the rest of the file and returns an AudioData object, audio. The audio object is passed to the recognize_google() method, which recognizes the speech from your audio file and returns the transcription, which we print to the screen:

with audio_data as file_audio:
    print("Start transcribing file ...")

    recognizer.adjust_for_ambient_noise(file_audio)
    
    audio = recognizer.record(file_audio)
    
    print(recognizer.recognize_google(audio))

Output:

Start transcribing file ...
every word and phrase he speaks is true he put his last cartridge into the gun and fired they took their kids from the Public School Drive the screw straight into the word keep the tight and the Watch constant several the twine with a quick snip of the knife paper will dry out when wet slide the catch back and open the desk help the week to preserve their strength it's Allen Smile gets few friends

It’s important to mention that the AudioFile class of the SpeechRecognition library only reads WAV, AIFF and FLAC files. You’ll get an error if you try to open an mp3 file. Convert your mp3 files to WAV format using any audio file converter, and you won’t have a problem.
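
Before handing a file to AudioFile, you can do a quick sanity check that it really is a WAV file and not, say, an mp3 with the wrong extension. The sketch below uses Python’s standard wave module. The helper name is_readable_wav is made up for this example, and since the wave module only understands uncompressed WAV files, a False result doesn’t rule out other formats such as FLAC.

```python
import wave


def is_readable_wav(path):
    """Return True if the standard wave module can open the file and it has audio frames."""
    try:
        with wave.open(path, "rb") as wav_file:
            return wav_file.getnframes() > 0
    except (wave.Error, EOFError, OSError):
        # Not a WAV file, an empty or truncated file, or a missing path.
        return False
```

For example, you could call is_readable_wav('E:/audio_file.wav') and only create the AudioFile object when it returns True.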

If you don’t want to begin transcribing your file at the very beginning, you can set an offset in seconds. Speech recognition will start after that many seconds of audio have been skipped. You can also set the duration, in seconds, for which you want to recognize speech from the audio file. Both values are passed to the record() method, via its offset and duration parameters, as shown in the following full script.

import speech_recognition as sr

audio_data = sr.AudioFile('E:/audio_file.wav')
recognizer = sr.Recognizer()

with audio_data as file_audio:
    print("Start transcribing file ...")

    recognizer.adjust_for_ambient_noise(file_audio)

    # Skip 5 seconds of audio, then read the next 10 seconds
    audio = recognizer.record(file_audio, offset=5, duration=10)

    print(recognizer.recognize_google(audio))

The above script skips 5 seconds of the audio file before it starts speech recognition and then recognizes speech for a total of 10 seconds.

Output:

Start transcribing file ...
he put his last cartridge into the gun and fired they took their kids from the Public School Drive
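
The same two parameters can be used to work through a long recording in smaller pieces. Here is a minimal sketch of that idea. The helper names chunk_windows and transcribe_in_chunks are made up for this example, the 30-second chunk size is an arbitrary assumption, and you have to supply the total length of the recording yourself.

```python
def chunk_windows(total_seconds, chunk_seconds):
    """Return (offset, duration) pairs that together cover total_seconds."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    windows = []
    offset = 0.0
    while offset < total_seconds:
        duration = min(chunk_seconds, total_seconds - offset)
        windows.append((offset, duration))
        offset += chunk_seconds
    return windows


def transcribe_in_chunks(path, total_seconds, chunk_seconds=30):
    """Transcribe a file one window at a time and join the results."""
    # Imported inside the function so chunk_windows works on its own.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    pieces = []
    for offset, duration in chunk_windows(total_seconds, chunk_seconds):
        # Reopen the file each time so the offset counts from the start.
        with sr.AudioFile(path) as source:
            audio = recognizer.record(source, offset=offset, duration=duration)
        pieces.append(recognizer.recognize_google(audio))
    return " ".join(pieces)
```

Keep in mind that each window is sent as a separate request, and a word that straddles the boundary between two windows may be cut in half, so treat this as a starting point rather than a finished solution.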

