iOS Text Recognition Using Vision And Core ML

The Vision and Core ML frameworks were among the highlights of WWDC 2017. Vision is a powerful framework for implementing computer vision features without much prior knowledge of the underlying algorithms.

Things such as barcode, face, object and text detection can easily be done using Vision.
At the same time, Core ML allows us to integrate and run pre-trained models without digging too deep into Machine Learning.

Our goal for today is to build an iOS Application that identifies texts in a still image.

Before getting down to business, let’s quickly breeze through the things we’re going to cover.

Topics Covered

  • Capturing Image Using Camera or Gallery
  • Text Detection Using Vision
  • Text Recognition Using Core ML
  • Visually indicating selectively detected texts using a bounding box

The last point is pretty much the highlight of this article, since we’ll essentially highlight a few select words detected in the image.
Now, this is mighty powerful! It’s quite similar to a Find Text… feature, wherein the keyword gets highlighted throughout a page.

What we want to achieve

We wish to highlight some of the detected texts after recognizing them in an image captured from the camera or gallery, as shown below:

We’ll call this application FindMyText, inspired by Find My iPhone!

Without wasting any more time, let’s get started. Launch Xcode and create a Single View Application.

Getting Started

We won’t be focusing on the Storyboard since it’s pretty basic, as shown below.

[Image: FindMyText storyboard (ios-vision-coreml-text-recognition-storyboard)]

To refer to the storyboard, you can download the source code from GitHub.

It contains a button and an image view. On clicking the button, we show a picker view using the code below:

// Inside the button's tap handler: fall back to the photo library if no camera is available.
guard UIImagePickerController.isSourceTypeAvailable(.camera) else {
    presentPhotoPicker(sourceType: .photoLibrary)
    return
}

let photoSourcePicker = UIAlertController()
let takePhoto = UIAlertAction(title: "Camera", style: .default) { [unowned self] _ in
    self.presentPhotoPicker(sourceType: .camera)
}
let choosePhoto = UIAlertAction(title: "Photos Library", style: .default) { [unowned self] _ in
    self.presentPhotoPicker(sourceType: .photoLibrary)
}
photoSourcePicker.addAction(takePhoto)
photoSourcePicker.addAction(choosePhoto)
photoSourcePicker.addAction(UIAlertAction(title: "Cancel", style: .cancel, handler: nil))

present(photoSourcePicker, animated: true)

presentPhotoPicker is used to launch the appropriate picker. Once the image is captured or selected, we kick off the Vision request.

extension ViewController: UIImagePickerControllerDelegate, UINavigationControllerDelegate {
    
    func imagePickerController(_ picker: UIImagePickerController, didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
        picker.dismiss(animated: true)
        
        guard let uiImage = info[UIImagePickerController.InfoKey.originalImage] as? UIImage else {
            fatalError("Could not read the picked image")
        }
        imageView.image = uiImage
        createVisionRequest(image: uiImage)
    }
    
    private func presentPhotoPicker(sourceType: UIImagePickerController.SourceType) {
        let picker = UIImagePickerController()
        picker.delegate = self
        picker.sourceType = sourceType
        present(picker, animated: true)
    }
}

Having completed the first part, let’s jump onto Vision next.

Vision Framework

The Vision framework was introduced with iOS 11. It brings algorithms for image recognition and analysis which, as per Apple, are more accurate than the Core Image framework. A significant contributor to this is the underlying use of Machine Learning, Deep Learning, and Computer Vision.

Implementing the framework involves three important use cases/terms:

  • Request – Create a request describing what you want to detect. You can perform more than one request on the same image.
  • Request Handler – This is used to perform the request(s) on an image.
  • Observation – The results of a request are returned in the form of observations.

Some important classes which are a part of the Vision framework are listed below; the short sketch after the list shows how they fit together:

  • VNRequest – the abstract base class for image-analysis requests.
  • VNObservation – the class in which the results of a request are delivered.
  • VNImageRequestHandler – performs one or more VNRequests on a given image.
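
To make these terms concrete, here is a minimal, self-contained sketch (an illustration only, not the app’s code) that wires a single text-rectangles Request through a Request Handler and reads back the Observations:

import Vision

// Illustration only: one Request, one Request Handler, and the resulting Observations.
func detectTextRectangles(in cgImage: CGImage) {
    // Request: what we want Vision to look for.
    let request = VNDetectTextRectanglesRequest { request, error in
        // Observation: the results delivered back by the request.
        guard error == nil,
            let observations = request.results as? [VNTextObservation] else { return }
        print("Found \(observations.count) text regions")
    }
    // Request Handler: performs the request(s) on the given image.
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}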

The following snippet shows how the app creates its Vision request, or rather a VNImageRequestHandler, and performs it on a background queue.

func createVisionRequest(image: UIImage) {
    currentImage = image
    guard let cgImage = image.cgImage else {
        return
    }
    let requestHandler = VNImageRequestHandler(cgImage: cgImage, orientation: image.cgImageOrientation, options: [:])
    let vnRequests = [vnTextDetectionRequest]
    
    DispatchQueue.global(qos: .background).async {
        do {
            try requestHandler.perform(vnRequests)
        } catch let error as NSError {
            print("Error in performing Image request: \(error)")
        }
    }
}
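
A quick note on the snippet above: cgImageOrientation is not a built-in UIImage property. The project defines a small helper that maps UIImage.Orientation to the CGImagePropertyOrientation value Vision expects; a minimal sketch of such an extension (the real one lives in the linked source) could look like this:

import UIKit
import ImageIO

// Sketch of a helper mapping UIKit's image orientation to the Core Graphics
// orientation type that VNImageRequestHandler expects.
extension UIImage {
    var cgImageOrientation: CGImagePropertyOrientation {
        switch imageOrientation {
        case .up:            return .up
        case .down:          return .down
        case .left:          return .left
        case .right:         return .right
        case .upMirrored:    return .upMirrored
        case .downMirrored:  return .downMirrored
        case .leftMirrored:  return .leftMirrored
        case .rightMirrored: return .rightMirrored
        @unknown default:    return .up
        }
    }
}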

We could have passed multiple requests, but the goal of this article is text detection and recognition.

The vnTextDetectionRequest is defined in the below code:

var vnTextDetectionRequest: VNDetectTextRectanglesRequest {
        let request = VNDetectTextRectanglesRequest { (request, error) in
            if let error = error as NSError? {
                print("Error in detecting - \(error)")
                return
            }
            guard let observations = request.results as? [VNTextObservation] else {
                return
            }
            
            var numberOfWords = 0
            for textObservation in observations {
                var numberOfCharacters = 0
                // Each character box is a rectangle around one detected character.
                for rectangleObservation in textObservation.characterBoxes! {
                    let croppedImage = crop(image: self.currentImage, rectangle: rectangleObservation)
                    if let croppedImage = croppedImage {
                        let processedImage = preProcess(image: croppedImage)
                        self.imageClassifier(image: processedImage,
                                             wordNumber: numberOfWords,
                                             characterNumber: numberOfCharacters,
                                             currentObservation: textObservation)
                        numberOfCharacters += 1
                    }
                }
                numberOfWords += 1
            }
            
            // Give the asynchronous Core ML classifications a moment to finish
            // before drawing the bounding boxes.
            DispatchQueue.main.asyncAfter(deadline: .now() + 3, execute: {
                self.drawRectanglesOnObservations(observations: observations)
            })
        }
        
        request.reportCharacterBoxes = true
        return request
    }

There’s plenty of stuff going on in the above code snippet.
Let’s break it down.

  • The observations are the results returned by the request.
  • Our goal is to highlight the detected texts with bounding boxes, hence we cast the observations to VNTextObservation.
  • We crop the detected text regions out of the image. These cropped images act as micro-inputs for our ML model.
  • We feed these images to the Core ML model after preprocessing them to the required input size.

The code for the crop and preProcess helpers lives in the ImageUtils.swift file. You can view it in the source code linked at the end of this article.
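
Roughly, they do something like the following sketch: crop converts the normalized (bottom-left-origin) bounding box of a character observation into pixel coordinates and crops it out, and preProcess resizes the crop to the 28×28 input our model expects. Treat this as an approximation, not the exact implementation from the repository:

import UIKit
import Vision

// Sketch: crop the region described by a character box out of the source image.
func crop(image: UIImage?, rectangle: VNRectangleObservation) -> UIImage? {
    guard let cgImage = image?.cgImage else { return nil }
    let width = CGFloat(cgImage.width)
    let height = CGFloat(cgImage.height)
    // Vision's bounding boxes are normalized with the origin at the bottom-left,
    // so flip the y-axis while converting to pixel coordinates.
    let box = rectangle.boundingBox
    let rect = CGRect(x: box.minX * width,
                      y: (1 - box.maxY) * height,
                      width: box.width * width,
                      height: box.height * height)
    guard let cropped = cgImage.cropping(to: rect) else { return nil }
    return UIImage(cgImage: cropped)
}

// Sketch: resize a cropped character image to the 28x28 input the model expects.
func preProcess(image: UIImage) -> UIImage {
    let targetSize = CGSize(width: 28, height: 28)
    UIGraphicsBeginImageContextWithOptions(targetSize, false, 1)
    image.draw(in: CGRect(origin: .zero, size: targetSize))
    let resized = UIGraphicsGetImageFromCurrentImageContext() ?? image
    UIGraphicsEndImageContext()
    return resized
}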

drawRectanglesOnObservations is what essentially highlights some of the recognized texts.
The method implementation is given below:

func drawRectanglesOnObservations(observations: [VNDetectedObjectObservation]) {
        DispatchQueue.main.async {
            guard let image = self.imageView.image else {
                print("Failure in retrieving image")
                return
            }
            let imageSize = image.size
            // Vision bounding boxes are normalized with a bottom-left origin,
            // so flip the y-axis and scale up to the image's size.
            var imageTransform = CGAffineTransform.identity.scaledBy(x: 1, y: -1).translatedBy(x: 0, y: -imageSize.height)
            imageTransform = imageTransform.scaledBy(x: imageSize.width, y: imageSize.height)
            UIGraphicsBeginImageContextWithOptions(imageSize, true, 0)
            let graphicsContext = UIGraphicsGetCurrentContext()
            image.draw(in: CGRect(origin: .zero, size: imageSize))
            
            graphicsContext?.saveGState()
            graphicsContext?.setLineJoin(.round)
            graphicsContext?.setLineWidth(8.0)
            graphicsContext?.setFillColor(red: 0, green: 1, blue: 0, alpha: 0.3)
            graphicsContext?.setStrokeColor(UIColor.green.cgColor)
            
            var previousString = ""
            // Only observations whose recognized text contains these words get highlighted.
            let elements = ["VISION", "COREML"]
            
            observations.forEach { (observation) in
                // observationStringLookup holds the running transcript up to each observation,
                // so strip the previously seen prefix to isolate this observation's word.
                var string = observationStringLookup[observation as! VNTextObservation] ?? ""
                let tempString = string
                string = string.replacingOccurrences(of: previousString, with: "")
                string = string.trim() // trim() is a small String extension defined in the project
                previousString = tempString
                
                if elements.contains(where: string.contains) {
                    let observationBounds = observation.boundingBox.applying(imageTransform)
                    graphicsContext?.addRect(observationBounds)
                }
            }
            graphicsContext?.drawPath(using: CGPathDrawingMode.fillStroke)
            graphicsContext?.restoreGState()
            
            let drawnImage = UIGraphicsGetImageFromCurrentImageContext()
            UIGraphicsEndImageContext()
            self.imageView.image = drawnImage
        }
    }

But how is this possible?
The observation objects returned by Vision only tell us where text is; they don’t recognize what it says.
And what is observationStringLookup?

The answers are in the next section.

Core ML 2

Core ML is a framework that lets developers integrate and run trained ML models easily in their applications.
With the help of this framework, the input data can be processed to return the desired output.

In this project, we’re using an alphanum_28X28 Core ML model.
This model takes a 28×28 input image of a single character and returns the character it recognizes.

Resizing the images happens in the preProcess function we saw earlier.

Coming back to the questions from the previous section:
observationStringLookup is a dictionary that binds each observation to the text determined for it by the Core ML model.
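
For reference, the supporting state these snippets rely on looks roughly like this. The names follow the article; the exact declarations live in the project source, and the alphanum_28X28 class name assumes Xcode’s usual code generation for the bundled model:

import UIKit
import Vision
import CoreML

// Sketch of the ViewController properties assumed throughout this article.
class ViewController: UIViewController {
    @IBOutlet weak var imageView: UIImageView!

    // The image currently being analysed.
    var currentImage = UIImage()
    // Maps each text observation to the string recognized for it by the model.
    var observationStringLookup = [VNTextObservation: String]()
    // wordNumber -> (characterNumber -> classified character).
    var textMetadata = [Int: [Int: String]]()
    // Assumes Xcode generated an `alphanum_28X28` class for the bundled .mlmodel.
    lazy var model = try! VNCoreMLModel(for: alphanum_28X28().model)
}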

To determine that text, we have our own imageClassifier method, which is invoked after preprocessing the images:

func imageClassifier(image: UIImage, wordNumber: Int, characterNumber: Int, currentObservation: VNTextObservation) {
        // `model` is the VNCoreMLModel wrapping alphanum_28X28 (see the property sketch above).
        let request = VNCoreMLRequest(model: model) { [weak self] request, error in
            guard let results = request.results as? [VNClassificationObservation],
                let topResult = results.first else {
                    fatalError("Unexpected result type from VNCoreMLRequest")
            }
            let result = topResult.identifier
            let classificationInfo: [String: Any] = ["wordNumber" : wordNumber,
                                                     "characterNumber" : characterNumber,
                                                     "class" : result]
            self?.handleResult(classificationInfo, currentObservation: currentObservation)
        }
        guard let ciImage = CIImage(image: image) else {
            fatalError("Could not convert UIImage to CIImage :(")
        }
        let handler = VNImageRequestHandler(ciImage: ciImage)
        DispatchQueue.global(qos: .userInteractive).async {
            do {
                try handler.perform([request])
            }
            catch {
                print(error)
            }
        }
    }



func handleResult(_ result: [String: Any], currentObservation : VNTextObservation) {
        objc_sync_enter(self)
        guard let wordNumber = result["wordNumber"] as? Int else {
            return
        }
        guard let characterNumber = result["characterNumber"] as? Int else {
            return
        }
        guard let characterClass = result["class"] as? String else {
            return
        }
        if (textMetadata[wordNumber] == nil) {
            let tmp: [Int: String] = [characterNumber: characterClass]
            textMetadata[wordNumber] = tmp
        } else {
            var tmp = textMetadata[wordNumber]!
            tmp[characterNumber] = characterClass
            textMetadata[wordNumber] = tmp
        }
        objc_sync_exit(self)
        DispatchQueue.main.async {
            self.doTextDetection(currentObservation: currentObservation)
        }
    }
    
    func doTextDetection(currentObservation: VNTextObservation) {
        var result: String = ""
        if textMetadata.isEmpty {
            print("The image does not contain any text.")
            return
        }
        let sortedKeys = textMetadata.keys.sorted()
        for sortedKey in sortedKeys {
            result += word(fromDictionary: textMetadata[sortedKey]!) + " "
        }
        // Store the text recognized so far against this observation.
        observationStringLookup[currentObservation] = result
    }
    
    func word(fromDictionary dictionary: [Int : String]) -> String {
        let sortedKeys = dictionary.keys.sorted()
        var word: String = ""
        for sortedKey in sortedKeys {
            let char: String = dictionary[sortedKey]!
            word += char
        }
        return word
    }

textMetadata stores the recognized characters of every word.
Now that observationStringLookup is populated, we can highlight the selected observations (the words VISION and COREML were highlighted in the final output, as we saw at the start of this article).

Note: The Core ML model may not give correct results on text set in unfamiliar fonts.

Fun fact: At WWDC 2019, Apple updated the Vision framework so that the recognized text is stored in the observation itself (via VNRecognizeTextRequest on iOS 13), so we no longer need Core ML for text recognition.
We’ll be covering this in detail in another article.
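For the curious, here is a quick sketch of what that newer API looks like; we’ll dig into it properly in the follow-up:

import Vision

// iOS 13+: Vision recognizes the text itself; no separate Core ML model needed.
@available(iOS 13.0, *)
func recognizeText(in cgImage: CGImage) {
    let request = VNRecognizeTextRequest { request, _ in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        // Each observation carries its recognized string directly.
        let strings = observations.compactMap { $0.topCandidates(1).first?.string }
        print(strings)
    }
    request.recognitionLevel = .accurate
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
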
That’s it for now.
The full source code for the FindMyText app is available here.

Reference: https://martinmitrevski.com/2017/10/19/text-recognition-using-vision-and-coreml/
