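I used the Google Cloud Vision API to extract the text from the image. The Vision API handles only images, so the audio file has to be transcribed with a separate service, the Google Cloud Speech-to-Text API.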
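
For the image part, the following is a minimal sketch and not necessarily the exact code used here. It assumes the `google-cloud-vision` Python client library is installed and that application default credentials are configured. The helper names `full_text` and `extract_text_from_image` are hypothetical, chosen only for illustration. Audio transcription is not shown.

```python
def full_text(annotations):
    """Return the complete detected text from a list of text annotations.

    In a text_detection response, the first annotation carries the whole
    detected text; the remaining ones describe individual words.
    """
    return annotations[0].description if len(annotations) > 0 else ""


def extract_text_from_image(path):
    """Run Vision API text detection (OCR) on a local image file.

    Assumption: credentials are already configured in the environment.
    """
    # Imported lazily so that full_text() can be used without the library.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())

    response = client.text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)
    return full_text(response.text_annotations)
```

Calling `extract_text_from_image("scan.png")` (a placeholder file name) would return the detected text as one string, or an empty string if no text was found.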