iOS 13 Vision Text Recognition with Document Scanner

Previously, we used Vision and Core ML to scan and recognize text in an image.
With iOS 13, the Vision API is vastly improved. In addition, the new VisionKit framework lets us scan documents using the camera.

Vision and VisionKit

The Vision API came out with iOS 11. Until now, it could only detect where text was in an image; it could not return the actual content. Hence we had to bring in Core ML for the recognition part.

With the iOS 13 upgrade of the Vision API, VNRecognizedTextObservation returns the text, its confidence level, and the bounding box coordinates.

Furthermore, VisionKit allows us to access the system’s document camera to scan pages.

VNDocumentCameraViewController is the view controller and VNDocumentCameraViewControllerDelegate is used to handle the delegate callbacks.

Launching a Document Camera

The following code is used to present the Document Camera on the screen.

let scannerViewController = VNDocumentCameraViewController()
scannerViewController.delegate = self
present(scannerViewController, animated: true)
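
The document camera is not available on every device (the Simulator, for instance), so it can help to check VNDocumentCameraViewController.isSupported before presenting it. The following is a minimal sketch under that assumption; presentDocumentScanner is a hypothetical helper name, not part of VisionKit. The app also needs an NSCameraUsageDescription entry in Info.plist to access the camera.

```swift
import UIKit
import VisionKit

extension UIViewController {
    /// Hypothetical helper: presents the system document camera when it is supported.
    /// Returns false when scanning is unavailable on the current device.
    @discardableResult
    func presentDocumentScanner(delegate: VNDocumentCameraViewControllerDelegate) -> Bool {
        guard VNDocumentCameraViewController.isSupported else { return false }

        let scannerViewController = VNDocumentCameraViewController()
        scannerViewController.delegate = delegate
        present(scannerViewController, animated: true)
        return true
    }
}
```

A view controller that conforms to the delegate protocol could then call presentDocumentScanner(delegate: self) and show an alert when the result is false.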

Once the scans are done and you tap ‘Save’, the following delegate method is called.

documentCameraViewController(_ controller: VNDocumentCameraViewController, didFinishWith scan: VNDocumentCameraScan)

To get a particular scanned image among the multiple images, pass the index of the page in the method:
scan.imageOfPage(at: index).

We can then process that image and recognize its text using the Vision API.

To process multiple images, we can loop through the scans in the delegate method in the following way:

for i in 0 ..< scan.pageCount {
    let img = scan.imageOfPage(at: i)
    processImage(img)
}

Creating VNRecognizeTextRequest

let request = VNRecognizeTextRequest(completionHandler: nil)
request.recognitionLevel = .accurate
request.recognitionLanguages = ["en-US"]

recognitionLevel can also be set to .fast, at the cost of lower accuracy.
recognitionLanguages is an array of languages in priority order, from left to right.
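
The set of languages Vision accepts depends on the recognition level, so it can be worth querying it before filling in recognitionLanguages. The sketch below assumes the iOS 13 class method supportedRecognitionLanguages(for:revision:) (later SDKs deprecate it in favor of an instance method); supportedLanguages is a hypothetical wrapper name.

```swift
import Vision

/// Hypothetical wrapper: returns the language identifiers Vision can recognize
/// at the given level, or an empty array if the query fails.
func supportedLanguages(for level: VNRequestTextRecognitionLevel) -> [String] {
    do {
        return try VNRecognizeTextRequest.supportedRecognitionLanguages(
            for: level,
            revision: VNRecognizeTextRequestRevision1)
    } catch {
        print("Could not query supported languages: \(error)")
        return []
    }
}
```

Comparing supportedLanguages(for: .fast) with supportedLanguages(for: .accurate) shows whether a language is available at both levels.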

We can also pass custom words that are not part of the dictionary, so that Vision can recognize them.

request.customWords = ["IOC", "COS"]
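
Vision ignores custom words when language correction is turned off, so a request that relies on them should keep usesLanguageCorrection enabled. A minimal sketch that puts the settings from this section together; makeTextRequest is a hypothetical helper name.

```swift
import Vision

/// Hypothetical helper: builds a text recognition request that honors custom words.
func makeTextRequest(customWords: [String]) -> VNRecognizeTextRequest {
    let request = VNRecognizeTextRequest(completionHandler: nil)
    request.recognitionLevel = .accurate
    // customWords only take effect while language correction is enabled.
    request.usesLanguageCorrection = true
    request.customWords = customWords
    return request
}
```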

In the following section, let's create a simple Xcode project that recognizes text in captured images using a Vision request handler.

We're setting our deployment target to iOS 13.

Our Storyboard

[Image: storyboard of the iOS 13 Vision text scanner app]

Code

The code for the ViewController.swift file is given below:

import UIKit
import Vision
import VisionKit

class ViewController: UIViewController, VNDocumentCameraViewControllerDelegate {

    @IBOutlet weak var imageView: UIImageView!
    @IBOutlet weak var textView: UITextView!
    
    var textRecognitionRequest = VNRecognizeTextRequest(completionHandler: nil)
    private let textRecognitionWorkQueue = DispatchQueue(label: "MyVisionScannerQueue", qos: .userInitiated, attributes: [], autoreleaseFrequency: .workItem)
    
    override func viewDidLoad() {
        super.viewDidLoad()
        textView.isEditable = false
        setupVision()
    }

    @IBAction func btnTakePicture(_ sender: Any) {
        
        let scannerViewController = VNDocumentCameraViewController()
        scannerViewController.delegate = self
        present(scannerViewController, animated: true)
    }
    
    private func setupVision() {
        textRecognitionRequest = VNRecognizeTextRequest { [weak self] (request, error) in
            guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
            
            var detectedText = ""
            for observation in observations {
                // Skip an observation without candidates instead of discarding all results.
                guard let topCandidate = observation.topCandidates(1).first else { continue }
                print("text \(topCandidate.string) has confidence \(topCandidate.confidence)")
    
                detectedText += topCandidate.string
                detectedText += "\n"
            }
            
            // UI updates must happen on the main thread.
            DispatchQueue.main.async {
                self?.textView.text = detectedText
                self?.textView.flashScrollIndicators()
            }
        }

        textRecognitionRequest.recognitionLevel = .accurate
    }
    
    private func processImage(_ image: UIImage) {
        imageView.image = image
        recognizeTextInImage(image)
    }
    
    private func recognizeTextInImage(_ image: UIImage) {
        guard let cgImage = image.cgImage else { return }
        
        textView.text = ""
        textRecognitionWorkQueue.async {
            let requestHandler = VNImageRequestHandler(cgImage: cgImage, options: [:])
            do {
                try requestHandler.perform([self.textRecognitionRequest])
            } catch {
                print(error)
            }
        }
    }
    
    func documentCameraViewController(_ controller: VNDocumentCameraViewController, didFinishWith scan: VNDocumentCameraScan) {
        guard scan.pageCount >= 1 else {
            controller.dismiss(animated: true)
            return
        }
        
        let originalImage = scan.imageOfPage(at: 0)
        let newImage = compressedImage(originalImage)
        controller.dismiss(animated: true)
        
        processImage(newImage)
    }
    
    func documentCameraViewController(_ controller: VNDocumentCameraViewController, didFailWithError error: Error) {
        print(error)
        controller.dismiss(animated: true)
    }
    
    func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController) {
        controller.dismiss(animated: true)
    }

    func compressedImage(_ originalImage: UIImage) -> UIImage {
        guard let imageData = originalImage.jpegData(compressionQuality: 1),
            let reloadedImage = UIImage(data: imageData) else {
                return originalImage
        }
        return reloadedImage
    }
}

The textRecognitionWorkQueue is a DispatchQueue used to run the vision request handler outside the main thread.

In the recognizeTextInImage function, we pass the image to a request handler, which performs the text recognition.
The request's results contain one VNRecognizedTextObservation for each piece of text found.
From each VNRecognizedTextObservation we can request up to 10 candidates. Typically the top candidate gives us the most accurate result.

topCandidate.string returns the text and topCandidate.confidence returns the confidence of the recognized text.
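
The confidence value can be used to drop doubtful lines before showing them. The sketch below is plain Swift so that it runs without Vision: RecognizedLine is a hypothetical stand-in for the string and confidence of a VNRecognizedText, and the 0.5 threshold in the usage note is an arbitrary assumption.

```swift
import Foundation

/// Hypothetical stand-in for the two VNRecognizedText properties used here.
struct RecognizedLine {
    let text: String
    let confidence: Float
}

/// Joins the lines whose confidence is at least `minimumConfidence`, one per row.
func joinConfidentLines(_ lines: [RecognizedLine], minimumConfidence: Float) -> String {
    return lines
        .filter { $0.confidence >= minimumConfidence }
        .map { $0.text }
        .joined(separator: "\n")
}
```

In the completion handler, the top candidates could be mapped to RecognizedLine(text: candidate.string, confidence: candidate.confidence) and passed in with a threshold such as 0.5.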

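To get the bounding box for the string in the image, we can use the function

try topCandidate.boundingBox(for: topCandidate.string.startIndex..<topCandidate.string.endIndex)

This call can throw and returns an optional VNRectangleObservation. Its boundingBox property is a CGRect in normalized coordinates, which we can scale to the image size and draw over the image.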
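
Because boundingBox(for:) both throws and returns an optional, a small wrapper keeps the call site tidy. This is a sketch assuming Swift 5, where try? flattens the nested optional; normalizedBoundingBox is a hypothetical helper name.

```swift
import Vision

/// Hypothetical helper: returns the normalized bounding box of the whole
/// candidate string, or nil if Vision cannot provide one.
func normalizedBoundingBox(of candidate: VNRecognizedText) -> CGRect? {
    let fullRange = candidate.string.startIndex..<candidate.string.endIndex
    // boundingBox(for:) can throw and may return nil; try? covers both cases.
    guard let box = try? candidate.boundingBox(for: fullRange) else { return nil }
    // The rectangle is normalized (0...1) with the origin at the lower-left corner.
    return box.boundingBox
}
```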

Note: Vision uses a different coordinate space than UIKit. Its rectangles are normalized to the range 0 to 1 with the origin at the lower-left corner, so you need to scale them and flip the Y-axis when drawing the bounding boxes.
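
The flip itself is a small piece of arithmetic. Here is a minimal sketch, assuming the box is drawn in an area of known size with its origin at the top-left corner; convertToTopLeftRect is a hypothetical function name, and an image view using aspect-fit scaling would additionally need an offset for the letterboxed area.

```swift
import Foundation

/// Converts a normalized Vision rectangle (origin at the lower-left corner,
/// values in 0...1) into a rectangle in a top-left-origin space of `size`.
func convertToTopLeftRect(_ normalized: CGRect, in size: CGSize) -> CGRect {
    let width = normalized.width * size.width
    let height = normalized.height * size.height
    let x = normalized.minX * size.width
    // Vision measures y upward from the bottom edge; UIKit measures it downward from the top.
    let y = (1 - normalized.maxY) * size.height
    return CGRect(x: x, y: y, width: width, height: height)
}
```

For example, a normalized box of (0.25, 0.25, 0.5, 0.25) in a 200 x 400 area maps to x 50, y 200, width 100, height 100.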

Output

Let's look at the output of the application in action.

[Image: the document scanner app recognizing text from a scanned page]

So we just captured the cover of a bestselling novel and, guess what, we were able to recognize the text and display it in a text view on our screen.

That sums up Vision text recognition for iOS 13.
The full source code is available here.
