In this guide we will train a Caffe model using DIGITS on an EC2 g2.2xlarge
instance, convert it into a CoreML model using Apple’s coremltools and integrate it into an iOS app. If some of what’s mentioned here is unfamiliar or sounds intimidating, fear not: I will try to cover the process in as much detail as possible. We will do all of this while building SeeFood, the app first introduced on HBO’s popular TV show Silicon Valley. The difference is that we will succeed where Erlich and Jian-Yang failed and go beyond Hot Dog/Not Hot Dog. By the end of this guide we will have trained our own model and integrated it into an iOS app that can detect 101 dishes locally on the device.
This guide is broken down into three parts: Train, Convert and Integrate.
Each part builds on the one before it; however, I will provide all the resources needed to pick up and continue from anywhere.
Just like most things in life, there are many ways to achieve the same goal. This guide covers one of many ways to train, convert and integrate an image classifier into an iOS app. If any of the decisions made in this guide do not suit you, I encourage you to explore and find what works best for you.
If you have suggestions or want to reach out to me directly, feel free to do so through Twitter: Reza Shirazian
There are a few prerequisites that you need before continuing with this guide:
- An AWS account with enough credit to run a g2.2xlarge instance for 6-7 hours
- Xcode 9 (Version 9.0 beta 5 (9M202q))
- An iOS device running iOS 11 (currently in beta)
- A basic understanding of programming
If you’re all set, we can begin.
Part 1: Train
This part can technically be split into two different parts: Setup and Training. First we’ll cover setting up an EC2 instance and then we’ll focus on the actual training. I will highlight the process for setting up the instance in more detail since it’s something that might be a bit more foreign to iOS developers. If you’re an expert with AWS, here is the summary of the setup:
We will launch a preconfigured Amazon Machine Image (AMI) that has Caffe and NVIDIA’s DIGITS already set up. This image was created by Satya Mallick and is named “bigvision-digits” under public images. (If you cannot find it, it’s most likely because this AMI is in the Oregon region. Make sure your current region is set as such.) I encourage you to check out his website and view his YouTube tutorial, which has a lot of overlap with what’s highlighted in this guide. We will use a g2.2xlarge
instance to take advantage of GPU computing for more efficient training. This instance type is not free and at the time of writing costs about $0.65/hour. For more up-to-date pricing check out Amazon instance pricing. Fortunately you won’t need to run the instance for more than 6-7 hours, but do make sure to stop it when you’re done. Running one of these for a whole month can cost close to $500.
Setup
Sign into your AWS console and click on EC2
Select AMIs on the left hand side
Select Public Images from the dropdown next to the search bar
Search for bigvision-digits
, select the image and click launch. Make sure you’re in the Oregon region. Your current region is displayed on the top right hand side, between your name and Support.
Select g2.2xlarge
from the list of instance types, click “Next: Configure Instance Details”
There is no need to change anything here. Click on “Add Storage”
Change size to 75 GiB and click “Add Tags”
Add Key “Name” and set its value to something descriptive. In this example I picked “Caffe DIGITS”. Click on “Configure Security Group”
Click Add Rule, select Custom TCP with protocol TCP, Port Range 80.
Set Source for SSH and Custom TCP to My IP.
Click “Review and Launch”.
Review everything and click “Launch”
Select “Create a new key pair” and use “digits” as key pair name. Click “Download Key Pair”
This will download a file named digits.pem
. Remember where this file is and make sure you don’t lose it. You will need it to access your instance later.
Click “Launch Instances”.
Wait a few minutes while the instance gets set up. Click on your instance and look at its description. Once the Instance State is set to running and you can see an IP value next to “IPv4 Public IP”, your EC2 instance is ready.
Copy and paste the public IP into your browser and you should see the following
DIGITS is up and running. Before we can begin using our instance and DIGITS to train our model, we need a dataset. We will be using the Food-101 dataset, which can be found here. To download the dataset onto our instance we need to SSH into it. Open your terminal and follow along.
Navigate to the folder where you downloaded digits.pem
and change the file’s access permission to 600 (for more info on what this means click here):
chmod 600 digits.pem
SSH into your instance by running the following command, replacing YOUR INSTANCE’S PUBLIC ADDRESS with your instance’s public address (the very same IP you pasted into your browser to view DIGITS):
ssh -i digits.pem ubuntu@[YOUR INSTANCE'S PUBLIC ADDRESS]
After a few seconds you should be connected to your instance. If you’re unfamiliar with the terminal, fear not; the next few steps are fairly straightforward.
View the folders available to you.
ls
Go to the data folder
cd data
In here you will see a folder named 17flowers
which is part of Satya Mallick’s tutorial. He uses this dataset to train a flower classifier. We however will use a different dataset to train a different model.
Create a new folder named 101food
mkdir 101food
Navigate into the folder.
cd 101food
Download the dataset:
wget "http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz"
This process will take a while; the dataset is more than 5GB in size. Once the file is downloaded, extract it using the following command
tar -xzf food-101.tar.gz
This too will take a few minutes. Once done there should be a food-101
folder. Navigate to its images folder and you should see a folder for each dish type we will classify.
Exit out of the terminal and go back to your browser.
Training
Navigate to your instance’s public IP. Under “Datasets” select “Images” and then “Classification”.
Change “Resize Transformation” to “Crop” and select the folder for the food-101 images as the “Training Images”. If you’ve followed this guide your folder will be /home/ubuntu/data/101food/food-101/images. Set “Dataset Name” to 101food and click “Create”.
DIGITS will begin to create a database from your dataset. This database will be used by Caffe during training.
When the process is complete you should be able to explore your database. With the database ready, we can now train our model. If you ever use your own dataset, it’s worth knowing that DIGITS doesn’t require any specific mapping or label file; it builds the database from the folder structure, as sketched below.
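For reference, this is roughly the layout DIGITS reads in our case: every sub-folder of images becomes a class label and every image inside it becomes a training example for that label (the folder names below come from the Food-101 download, trimmed for brevity):
/home/ubuntu/data/101food/food-101/images/
├── apple_pie/          <- every image in here is labeled "apple_pie"
├── baby_back_ribs/
├── caesar_salad/
├── ...
└── waffles/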
Go back home on your DIGITS instance. On the “Models” panel select the “Images” dropdown and pick “Classification”.
On the “New Image Classification Model” page under “Select Dataset” select “101food”. Under “Standard Networks” tab select “Caffe” and pick “AlexNet” as your model. Next select “Customize.”
Technically you could start training your model now, but to get better accuracy we are going to fine-tune the weights of the original pretrained AlexNet model with our dataset. To do this we need to point DIGITS at the original pretrained model, which is available with Caffe. Under “Pretrained model” type /home/ubuntu/models/bvlc_alexnet.caffemodel. (DIGITS will provide autocompletion, so finding the original model should be easy.)
Next we need to change the “Base Learning Rate”. Since this model is already trained, we want our dataset to adjust its weights at a smaller rate, so change “Base Learning Rate” from 0.01 to 0.001.
Next we need to change the name of the final layer of our network. The pretrained AlexNet model was trained to classify 1000 objects, while our dataset has 101 classes; Caffe copies pretrained weights by layer name, so renaming the last layer means it gets re-initialized with the right number of outputs instead of inheriting the 1000-class weights. This can be done by simply changing fc8 to something else. In this example we’ll change it to fc8_food. So under “Customize Networks” change all instances of fc8 to fc8_food.
Visualize the model to ensure that the last layer has successfully been renamed to “fc8_food”.
For a more in-depth explanation of this process check out Flickr’s fine-tuning of CaffeNet: Fine-tuning CaffeNet for Style Recognition on “Flickr Style” Data
Name your model and click create.
DIGITS will begin to train your model. This process can take some time. For me, with a g2.2xlarge
instance it took about four hours.
Once the process is complete, we’ll see that our model can detect various dishes with about a 65% success rate. This is OK but not perfect. To get better accuracy you would need a larger dataset, which is something you can take on, but for now we’re going to settle for the 65% and proceed.
At the bottom, upload an image or provide a URL to test the trained model. Do this multiple times with images that were not part of the training set. Check “Show visualizations and statistics” to see how each layer processed the image.
The model is trained and ready to be converted to CoreML. Click “Download Model”.
Click here to download the trained model
Part 2: Convert
To convert an existing model to CoreML’s .mlmodel
we need to install coremltools
. Check out https://pypi.python.org/pypi/coremltools to familiarize yourself with the library.
To install coremltools, run the following from the terminal
pip install -U coremltools
coremltools requires Python 2.7. If you get a “Could not find a version that satisfies the requirement coremltools (from versions: )” error, it’s likely that you’re not running Python 2.7. If you don’t get this error and everything works, then you’re luckier than I was and can skip the next few steps.
One way to solve this is to set up a virtual environment with Python 2.7. You can do this using Anaconda Navigator, which can be downloaded along with Anaconda here
Once you have Anaconda Navigator running click on “Environments” on the left hand side.
And select “Create” under the list of environments.
Type “coreml” for name, check Python and select “2.7” from the dropdown. Click create.
You should have the new “coreml” environment up and running after a few seconds. Click on the new environment and press play. A new command line should open up.
If you run python --version
you should see that your new environment is now running under python 2.7
Run pip install -U coremltools
from your new command line. If you see a no file or directory error for any of the packages run pip install -U coremltools --ignore-installed
Navigate to the folder where you downloaded the trained model. Create a new file named run.py
and write the following code.
import coremltools
# Convert a caffe model to a classifier in Core ML
coreml_model = coremltools.converters.caffe.convert(('snapshot_iter_24240.caffemodel',
                                                     'deploy.prototxt',
                                                     'mean.binaryproto'),
                                                    image_input_names='data',
                                                    class_labels='labels.txt',
                                                    is_bgr=True, image_scale=255.)
# Now save the model
coreml_model.author = "Reza Shirazian"
coreml_model.save('food.mlmodel')
The first line imports coremltools. Then we create a coreml_model and provide all the necessary input for coremltools to convert the Caffe model into a .mlmodel. The references to the files passed as parameters work as-is if run.py is in the same folder as the unpacked trained model downloaded earlier from DIGITS.
You can provide other metadata such as the author, a description and a license for the converted model. For more details I suggest going over the coremltools documentation here: https://pythonhosted.org/coremltools/
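For example, a few extra lines like these before coreml_model.save make the model more self-documenting when it shows up in Xcode (a sketch; the description strings are placeholders you would word yourself):
coreml_model.short_description = 'Classifies a photo of a dish into one of 101 food categories'
coreml_model.license = 'BSD'
coreml_model.input_description['data'] = '227x227 BGR image of a dish'
coreml_model.output_description['classLabel'] = 'Most likely dish name'
coreml_model.output_description['prob'] = 'Probability for each of the 101 dishes'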
Save run.py
and run it from the console:
python run.py
This process will take a few minutes. Once completed you will have a food.mlmodel
in the same folder as run.py
. You’re now ready to integrate your trained CoreML model into an iOS project.
Click here to download the CoreML model
Part 3: Integrate
Start a new Xcode project. Select iOS and Single View App.
Name your app “SeeFood” and click next. Select the folder where you wish to create your new project and click Create.
Once you’ve created your project, right click on the SeeFood folder under Project Navigator on the left and select Add files to ‘SeeFood’…
Find and select food.mlmodel
from the folder where we ran run.py
. Make sure Copy items if needed under Options is selected and click add.
After a few seconds food.mlmodel should appear under your Project Navigator. Click on the file and you should see the model’s details.
Make sure under inputs you see data Image<BGR 227,227>
and under outputs you see prob Dictionary<String,Double>
and classLabel String
. If you see anything else, remove the model and insert it again. If that doesn’t work, check the run.py
we created earlier and make sure it is exactly what’s used in this example and that it has access to all the files it references. Also ensure that the target membership for the project is checked in the File Inspector.
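As a side note, adding the model makes Xcode 9 generate a Swift class named after the file, with a typed prediction method. The snippet below is only a rough sketch of how you could call it directly (the generated names come from the file name and the labels, and pixelBuffer is a hypothetical CVPixelBuffer you would have to scale to 227x227 yourself); in this guide we will use the Vision framework instead, which handles the resizing and conversion for us.
let classifier = food()
// pixelBuffer: a 227x227 CVPixelBuffer prepared elsewhere (hypothetical, not shown here)
if let output = try? classifier.prediction(data: pixelBuffer) {
    print("Top label: \(output.classLabel)")
    print("Confidence: \(output.prob[output.classLabel] ?? 0)")
}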
We’re now ready to work with our CoreML model. Open ViewController
and add the following before the class definition
import UIKit
import CoreML
import Vision
Then add the following function at the bottom of ViewController. (It references the settingImage flag, the iSee label and the addEmoji helper, all of which we will add shortly when we flesh out the view controller.)
func detectScene(image: CIImage) {
guard let model = try? VNCoreMLModel(for: food().model) else {
fatalError()
}
// Create a Vision request with completion handler
let request = VNCoreMLRequest(model: model) { [unowned self] request, error in
guard let results = request.results as? [VNClassificationObservation],
let _ = results.first else {
self.settingImage = false
return
}
DispatchQueue.main.async { [unowned self] in
if let first = results.first {
if Int(first.confidence * 100) > 1 {
self.iSee.text = "I see \(first.identifier) \(self.addEmoji(id: first.identifier))"
self.settingImage = false
}
}
}
}
let handler = VNImageRequestHandler(ciImage: image)
DispatchQueue.global(qos: .userInteractive).async {
do {
try handler.perform([request])
} catch {
print(error)
}
}
}
To test our CoreML model, add some sample images that were not part of the training set to our Assets.xcassets
folder
Go back to the ViewController
and add the following line in the viewDidLoad
method after super.viewDidLoad()
if let uiExample = UIImage(named:"pizza"),
let example = CIImage(image: uiExample) {
self.detectScene(image: example)
}
Change pizza
to the name of whatever image file you added to Assets.xcassets
. Build and run the app. Ignore the Simulator and take a look at the output in the console.
Great! Our CoreML model is working. However, classifying an image from Assets.xcassets
is not super useful. Let’s build out the app so it continuously takes a frame from the camera, runs it through our classifier and displays on the screen what it thinks it sees. CoreML is pretty fast, and this makes for a much better user experience than having to take a picture and then run it through the classifier.
Click on Main.storyboard
. Add a UIImageView
and a UILabel
to the ViewController
and link them back to outlets on ViewController.swift
@IBOutlet weak var previewImage: UIImageView!
@IBOutlet weak var iSee: UILabel!
To capture individual frames from the camera we’re going to use the FrameExtractor class described here by Boris Ohayon. The original class is written in Swift 3; I have made the necessary changes and converted it to Swift 4, so you can copy and paste it directly from here. I do suggest going through it to understand how AVFoundation works. I’m not going to get into too much detail since it’s outside the scope of CoreML and this guide, but AVFoundation is definitely worth exploring. If you wish to dive into it, this is a good place to start
import UIKit
import AVFoundation
protocol FrameExtractorDelegate: class {
func captured(image: UIImage)
}
class FrameExtractor: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
private var position = AVCaptureDevice.Position.back
private let quality = AVCaptureSession.Preset.medium
private var permissionGranted = false
private let sessionQueue = DispatchQueue(label: "session queue")
private let captureSession = AVCaptureSession()
private let context = CIContext()
weak var delegate: FrameExtractorDelegate?
override init() {
super.init()
checkPermission()
sessionQueue.async { [unowned self] in
self.configureSession()
self.captureSession.startRunning()
}
}
public func flipCamera() {
sessionQueue.async { [unowned self] in
self.captureSession.beginConfiguration()
guard let currentCaptureInput = self.captureSession.inputs.first else { return }
self.captureSession.removeInput(currentCaptureInput)
guard let currentCaptureOutput = self.captureSession.outputs.first else { return }
self.captureSession.removeOutput(currentCaptureOutput)
self.position = self.position == .front ? .back : .front
self.configureSession()
self.captureSession.commitConfiguration()
}
}
// MARK: AVSession configuration
private func checkPermission() {
switch AVCaptureDevice.authorizationStatus(for: AVMediaType.video) {
case .authorized:
permissionGranted = true
case .notDetermined:
requestPermission()
default:
permissionGranted = false
}
}
private func requestPermission() {
sessionQueue.suspend()
AVCaptureDevice.requestAccess(for: AVMediaType.video) { [unowned self] granted in
self.permissionGranted = granted
self.sessionQueue.resume()
}
}
private func configureSession() {
guard permissionGranted else { return }
captureSession.sessionPreset = quality
guard let captureDevice = selectCaptureDevice() else { return }
guard let captureDeviceInput = try? AVCaptureDeviceInput(device: captureDevice) else { return }
guard captureSession.canAddInput(captureDeviceInput) else { return }
captureSession.addInput(captureDeviceInput)
let videoOutput = AVCaptureVideoDataOutput()
videoOutput.setSampleBufferDelegate(self, queue: DispatchQueue(label: "sample buffer"))
guard captureSession.canAddOutput(videoOutput) else { return }
captureSession.addOutput(videoOutput)
guard let connection = videoOutput.connection(with: AVFoundation.AVMediaType.video) else { return }
guard connection.isVideoOrientationSupported else { return }
guard connection.isVideoMirroringSupported else { return }
connection.videoOrientation = .portrait
connection.isVideoMirrored = position == .front
}
private func selectCaptureDevice() -> AVCaptureDevice? {
return AVCaptureDevice.default(for: .video)
}
// MARK: Sample buffer to UIImage conversion
private func imageFromSampleBuffer(sampleBuffer: CMSampleBuffer) -> UIImage? {
guard let imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return nil }
let ciImage = CIImage(cvPixelBuffer: imageBuffer)
guard let cgImage = context.createCGImage(ciImage, from: ciImage.extent) else { return nil }
return UIImage(cgImage: cgImage)
}
// MARK: AVCaptureVideoDataOutputSampleBufferDelegate
func captureOutput(_ captureOutput: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
guard let uiImage = imageFromSampleBuffer(sampleBuffer: sampleBuffer) else { return }
DispatchQueue.main.async { [unowned self] in
self.delegate?.captured(image: uiImage)
}
}
}
Go back to ViewController
. To get an image, classify it and display the top prediction on the screen, change the ViewController
so it looks like this:
import UIKit
import CoreML
import Vision
import AVFoundation
class ViewController: UIViewController, FrameExtractorDelegate {
var frameExtractor: FrameExtractor!
@IBOutlet weak var previewImage: UIImageView!
@IBOutlet weak var iSee: UILabel!
var settingImage = false
var currentImage: CIImage? {
didSet {
if let image = currentImage{
self.detectScene(image: image)
}
}
}
override func viewDidLoad() {
super.viewDidLoad()
frameExtractor = FrameExtractor()
frameExtractor.delegate = self
}
func captured(image: UIImage) {
self.previewImage.image = image
if let cgImage = image.cgImage, !settingImage {
settingImage = true
DispatchQueue.global(qos: .userInteractive).async {[unowned self] in
self.currentImage = CIImage(cgImage: cgImage)
}
}
}
func addEmoji(id: String) -> String {
switch id {
case "pizza":
return "🍕"
case "hot dog":
return "🌭"
case "chicken wings":
return "🍗"
case "french fries":
return "🍟"
case "sushi":
return "🍣"
case "chocolate cake":
return "🍫🍰"
case "donut":
return "🍩"
case "spaghetti bolognese":
return "🍝"
case "caesar salad":
return "🥗"
case "macaroni and cheese":
return "🧀"
default:
return ""
}
}
func detectScene(image: CIImage) {
guard let model = try? VNCoreMLModel(for: food().model) else {
fatalError()
}
// Create a Vision request with completion handler
let request = VNCoreMLRequest(model: model) { [unowned self] request, error in
guard let results = request.results as? [VNClassificationObservation],
let _ = results.first else {
self.settingImage = false
return
}
DispatchQueue.main.async { [unowned self] in
if let first = results.first {
if Int(first.confidence * 100) > 1 {
self.iSee.text = "I see \(first.identifier) \(self.addEmoji(id: first.identifier))"
self.settingImage = false
}
}
// results.forEach({ (result) in
// if Int(result.confidence * 100) > 1 {
// self.settingImage = false
// print("\(Int(result.confidence * 100))% it's \(result.identifier) ")
// }
// })
// print("********************************")
}
}
let handler = VNImageRequestHandler(ciImage: image)
DispatchQueue.global(qos: .userInteractive).async {
do {
try handler.perform([request])
} catch {
print(error)
}
}
}
}
The code above is fairly straightforward: we set our ViewController
to conform to FrameExtractorDelegate
. We create an instance of FrameExtractor
named frameExtractor
. We set its delegate
to self
and implement func captured(image: UIImage)
to complete the delegation implementation.
We declare a CIImage variable named currentImage and set its value whenever captured returns an image. We add a didSet observer to currentImage so that whenever its value changes we call detectScene with the new image. Since captured takes less time than detectScene (frames arrive faster than they can be classified), we use a boolean flag called settingImage to prevent continuous calls into detectScene before it’s done. The flag is set to true when a new image has been captured and back to false once detectScene has classified it; while it’s true, newly captured frames are skipped.
Build and deploy the app on a device running iOS 11. The first time the app runs it will ask for permission to use the camera. If you’ve been following this guide so far, your app will most likely crash with the error “The app’s Info.plist must contain an NSCameraUsageDescription key.” Since iOS 10 you need to provide a usage description, which is shown to the user when the popup for a specific permission is displayed. To fix this, add a string value describing your reason to the Info.plist file under the NSCameraUsageDescription key.
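If you prefer editing Info.plist as source (right-click it and choose Open As > Source Code), the entry looks like this; the description string is only an example, word it however you like:
<key>NSCameraUsageDescription</key>
<string>SeeFood uses the camera to identify the dish in front of you.</string>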
Build and run. The app should launch; once permission to use the camera is given, you should see what the camera sees and the label should update to whatever CoreML thinks is in front of it.
Congratulations, you just trained and integrated your own image classifier and deployed it to an iOS device! As lengthy and convoluted as the process might appear, once you go through it you’ll realize it’s fairly simple. Thanks to Caffe, DIGITS and now CoreML, the hard parts have been figured out; it’s up to you to collect your dataset, train your models and build awesome apps. The amount of coding is minimal and the power is immense. Machine learning is the future and I would love to see what you do with it. Feel free to hit me up on Twitter and show me your creations!
Click here to view repo for the full project
Glossary
- AlexNet: a convolutional neural network which competed in the ImageNet Large Scale Visual Recognition Challenge in 2012. AlexNet was designed by Alex Krizhevsky.
- Anaconda Navigator: a desktop graphical user interface (GUI) that is included in Anaconda® and allows you to launch applications and easily manage conda packages, environments and channels without using command-line commands.
- Caffe: a deep learning framework made with expression, speed, and modularity in mind. It is developed by [Berkeley AI Research](http://bair.berkeley.edu/)
- [CoreML](https://developer.apple.com/documentation/coreml): Apple's machine learning framework. It lets you integrate trained machine learning models into your app.
- coremltools: a Python package for creating, examining, and testing models in the .mlmodel format.
- DIGITS: NVIDIA's Deep Learning GPU Training System. It simplifies common deep learning tasks such as managing data, designing and training neural networks on multi-GPU systems, monitoring performance in real time with advanced visualizations, and selecting the best performing model from the results browser for deployment. DIGITS is completely interactive so that data scientists can focus on designing and training networks rather than programming and debugging.
- EC2: Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.
- Food-101: “Food-101 – Mining Discriminative Components with Random Forests” by Lukas Bossard, Matthieu Guillaumin and Luc Van Gool.
Further reading
If you’ve made it this far, congratulations again. Although we’ve covered a lot, we haven’t even scratched the surface. Here are some suggestions as to where to go from here:
WWDC 2017 - Introducing Core ML
WWDC 2017 - Core ML in depth
WWDC 2017 - Vision Framework
Apple CoreML Documentation
Caffe: Convolutional Architecture for Fast Feature Embedding Caffe paper
ImageNet Classification with Deep Convolutional Neural Networks AlexNet paper
DIY Deep Learning for Vision: A hands on tutorial with Caffe
Caffe Tutorial
Deep Learning for Computer Vision with Caffe and cuDNN
Hacker’s guide to Neural Networks Great article by Andrej Karpathy
Deep Learning using NVIDIA DIGITS 3 on EC2 by Satya Mallick.
Matthijs Hollemans’ blog on Machine Learning.