iOS SFSpeechRecognizer – On Device Recognition

iOS 13 showcased Apple's advances in Machine Learning and Artificial Intelligence, and one feature that sheds light on those ambitions is Speech Recognition. On-Device Speech Recognition not only eliminates the round trip to a server but also helps protect user privacy. By allowing Siri as well as third-party developers to tap into offline voice recognition, Apple aims to give voice-based AI a major boost.

The upgraded Speech Recognition API lets you do a variety of things: live transcription, plus voice analysis through metrics such as voice quality, average pause duration, speaking rate, and confidence level. From providing automated feedback on recordings to comparing the speech patterns of different speakers, there's a lot you can do by tapping into this technology.

Of course, there are trade-offs to consider with on-device speech recognition. There is no continuous server-side learning, which can mean lower accuracy on the device, and language support is currently limited to just 10 languages.

Nonetheless, on-device support lets you run speech recognition for an unlimited duration, a big win over the one-minute-per-request limit of server-based recognition.

In the next sections, we'll build an on-device speech recognition and transcription iOS application. We'll skip the UI and aesthetics of the application; you can find them in the full source code available at the end of this article.

Let’s tap into the microphones!

Implementation

For starters, you need to include the privacy usage descriptions for the microphone and speech recognition in your Info.plist by adding the NSMicrophoneUsageDescription and NSSpeechRecognitionUsageDescription keys, each with a short string explaining why you need access. Not doing so will lead to a runtime crash.

Next, import Speech in your ViewController class to get started.
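
Here's a minimal sketch of the view controller we'll build on, assuming a UIKit setup; the transcribedText label is the one referenced in later snippets:

import UIKit
import Speech

class ViewController: UIViewController {
    // Label that displays the live transcription (UI layout omitted).
    @IBOutlet weak var transcribedText: UILabel!
}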

Requesting Permissions

We need to request authorization from the user in order to use Speech Recognition. The following code does that for you:

SFSpeechRecognizer.requestAuthorization { authStatus in
    // Hop back onto the main queue before touching any UI.
    OperationQueue.main.addOperation {
        switch authStatus {
        case .authorized:
            print("Speech recognition authorized")
        default:
            print("Speech recognition not available")
        }
    }
}
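
Note that this prompt covers speech recognition only; microphone access (the NSMicrophoneUsageDescription entry) is granted through a separate system prompt. A minimal sketch of requesting it via AVAudioSession, shown here purely for completeness:

import AVFoundation

AVAudioSession.sharedInstance().requestRecordPermission { granted in
    // 'granted' is true once the user allows microphone access.
    print("Microphone access granted:", granted)
}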

Next, the SFSpeechRecognizer is responsible for generating your transcriptions through an SFSpeechRecognitionTask. For this to happen, we must initialize our SFSpeechRecognizer as shown below:

var speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en_IN"))

In the above code, you pass the locale identifier for the language you want to transcribe. In my case, that's English (India), the language I chose during Apple ID setup.
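
If you want to check which locales the recognizer supports, and whether your chosen recognizer is ready to use, here's a quick sketch (the function name is just for illustration):

import Speech

func logRecognizerAvailability() {
    // All locales SFSpeechRecognizer can handle (not all of them support on-device mode).
    for locale in SFSpeechRecognizer.supportedLocales() {
        print("Supported:", locale.identifier)
    }
    let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en_IN"))
    print("en_IN currently available:", recognizer?.isAvailable ?? false)
}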

Setting Up Your Audio Engine

The AVAudioEngine is responsible for receiving the audio signals from the microphone.

let audioEngine = AVAudioEngine()
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

let inputNode = audioEngine.inputNode

inputNode.removeTap(onBus: 0)
let recordingFormat = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
    self.recognitionRequest?.append(buffer)
}

audioEngine.prepare()
try audioEngine.start()


The above code installs a tap on the input node and sets the buffer size for the audio it captures. Once a buffer fills up (with the audio signal as you speak or record), it is appended to the SFSpeechAudioBufferRecognitionRequest.

Now let’s see how the SFSpeechAudioBufferRecognitionRequest works with the SFSpeechRecognizer and SFSpeechRecognitionTask in order to transcribe speech to text.

Here's an illustration of how everything works together:

[Image: ios-speech-recognition-avaudioengine-flow]

Setting Up On-Device Speech Recognition

On-device recognition is available only on iOS 13 and later, and only for locales the recognizer supports:

if #available(iOS 13, *) {
    if speechRecognizer?.supportsOnDeviceRecognition ?? false {
        recognitionRequest.requiresOnDeviceRecognition = true
    }
}

Leaving requiresOnDeviceRecognition set to false (the default) sends the audio to Apple's servers for speech recognition.

Setting Up the Recognition Task

The recognition task delivers partial and final results (or an error) as the audio streams in:

recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { result, error in
    if let result = result {
        DispatchQueue.main.async {
            let transcribedString = result.bestTranscription.formattedString
            self.transcribedText.text = transcribedString
        }
    }

    if error != nil {
        // Stop recognizing speech if there is a problem.
        self.audioEngine.stop()
        inputNode.removeTap(onBus: 0)
        self.recognitionRequest = nil
        self.recognitionTask = nil
    }
}

result.bestTranscription returns the transcription with the highest confidence, and invoking its formattedString property gives us the text. The transcription also exposes other properties such as speakingRate, averagePauseDuration, and segments.
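
As a quick sketch, these extra metrics could be read like this on iOS 13 (this is meant to live inside the recognition task's result handler, where result is available):

if #available(iOS 13, *) {
    let transcription = result.bestTranscription
    // Rough delivery metrics computed by the recognizer.
    print("Speaking rate (words per minute):", transcription.speakingRate)
    print("Average pause duration (seconds):", transcription.averagePauseDuration)
    print("Number of segments:", transcription.segments.count)
}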

Segments power the voice analytics metrics. SFVoiceAnalytics is a newly introduced class that holds a collection of voice features to track, such as pitch, voicing, jitter, and shimmer, as shown in the snippet below:

for segment in result.bestTranscription.segments {
    guard let voiceAnalytics = segment.voiceAnalytics else { continue }

    // Each feature exposes its per-frame values via acousticFeatureValuePerFrame.
    let pitch = voiceAnalytics.pitch.acousticFeatureValuePerFrame
    let voicing = voiceAnalytics.voicing.acousticFeatureValuePerFrame
    let jitter = voiceAnalytics.jitter.acousticFeatureValuePerFrame
    let shimmer = voiceAnalytics.shimmer.acousticFeatureValuePerFrame
}

Start Recording And Transcribing


private let audioEngine = AVAudioEngine()
private var recognitionRequest: SFSpeechAudioBufferRecognitionRequest?
private var speechRecognizer = SFSpeechRecognizer(locale: Locale(identifier: "en_IN"))
private var recognitionTask: SFSpeechRecognitionTask?

func startRecording() throws {

    // Cancel any previous task before starting a new one.
    recognitionTask?.cancel()
    recognitionTask = nil

    // Configure the audio session for recording.
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
    try audioSession.setActive(true, options: .notifyOthersOnDeactivation)

    // Create and configure the recognition request before installing the tap,
    // so that no audio buffers are dropped.
    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object") }
    recognitionRequest.shouldReportPartialResults = true

    if #available(iOS 13, *) {
        if speechRecognizer?.supportsOnDeviceRecognition ?? false {
            recognitionRequest.requiresOnDeviceRecognition = true
        }
    }

    // Route microphone audio into the recognition request.
    let inputNode = audioEngine.inputNode
    inputNode.removeTap(onBus: 0)
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
        self.recognitionRequest?.append(buffer)
    }

    audioEngine.prepare()
    try audioEngine.start()

    // Start transcribing and push results to the UI.
    recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest) { result, error in
        if let result = result {
            DispatchQueue.main.async {
                let transcribedString = result.bestTranscription.formattedString
                self.transcribedText.text = transcribedString
            }
        }

        if error != nil {
            // Stop recognizing speech if there is a problem.
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            self.recognitionRequest = nil
            self.recognitionTask = nil
        }
    }
}
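
To kick off transcription, you'd call try? startRecording() from something like a record button's action. A minimal counterpart for stopping, sketched here using the same properties (the function name is just for illustration):

func stopRecording() {
    // Tear everything down once the user is done speaking.
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    recognitionRequest?.endAudio()
    recognitionRequest = nil
    recognitionTask?.cancel()
    recognitionTask = nil
}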

Here’s a screengrab from the application we just developed!

[Image: ios-speech-recognition-transcription-live]

That sums up this article on Speech Recognition and Transcription in iOS 13. The full source code of this project is available in the GitHub repository.
