Getting Started with Machine Perception Using the Mobile Vision API (Google I/O '17)

[MUSIC PLAYING]

HSIU WANG: I'm Hsiu.

YULONG LIU: Hi, I'm Yulong.

HSIU WANG: We are engineers from the mobile vision team. The mobile vision team is about providing you with the latest and greatest computer vision algorithms that run privately on your device, with low latency and no internet access required. Our API also works on both Android and iOS. Now let's take a look at how it works.

[VIDEO PLAYBACK]

[END PLAYBACK]

[APPLAUSE]

HSIU WANG: What you have just seen is an application that demonstrates how to use the mobile vision face, barcode, and text APIs. First, the application scans a barcode from a paper ad, bringing the user directly to a product page. Then it uses the face API to virtually try sunglasses on the user's face. Last, he uses the text API to scan a credit card quickly to make a payment.

The mobile vision API is quite popular: we have 125 million 30-day active users, a number contributed by more than 15,000 applications. In our community, we see a lot of very interesting use cases. For example, with the face API, we see photo-correction applications, like face blur detection and red-eye removal. With the barcode API, we see wedding registry applications, as well as applications that track 2D barcodes in space to place 3D objects in the air. With the text API, we see applications that do quick payment processing, like what we showed in the video, as well as applications that help blind users see their surroundings. After today's talk, you can join our active community as well to build awesome applications.

The mobile vision API consists of four major components: the common utility API and three specialized detection APIs. The common utility API provides the infrastructure and building blocks that help you construct a streaming pipeline. The face API detects faces; it also understands landmarks, like the nose and eyes, as well as facial classification, like whether you are smiling or whether your eyes are closed. As for the barcode API,
it can detect 1D and 2D barcodes in multiple formats and in different orientations at the same time. The text API can detect Latin-based languages, and when detecting text, it also understands its structure, like paragraphs and lines.

Before we take a deep dive into the code, I would like to talk about some basic concepts and use cases. The mobile vision API works on both static images and a streaming pipeline. The easiest use case is static images. First, you create a detector; then you provide it with an image; you run the detection algorithm, and it generates detection results.

This is what a pipeline looks like. The camera source, which uses the camera API internally, streams camera frames to the detector. The detector then runs the detection algorithm
to generate detection results. After that, it hands the detection results over to a processor. A processor is a first step of post-processing; it is responsible for discarding, merging, or delivering detected items to their associated trackers. A tracker is an event-based object listener that notifies you about a tracked item over time. The camera source, detector, and processor are provided by the mobile vision API. The only portion you need to worry about is the tracker: the tracker is the piece of code you write to implement your business logic.

Now that you understand the basic concepts, let's take a deep dive with the barcode API. The barcode API sounds boring, but it's surprisingly awesome if you think about it: barcodes are everywhere, and they track everything. The airplane ticket you purchased is tracked by a barcode, and so is the cookie you ate today.

The barcode API for Android is provided by Google Play services. You declare the dependency in your Gradle file, and also, for runtime, in your manifest file. The first time the mobile vision API runs on your device, it needs to download some additional vision models, but only the first time. Once the download is finished, no internet access is required.

In this slide, we'll talk about how it's used in the static image case. There are three steps. First, you instantiate a barcode detector using a builder pattern. The builder lets you specify which barcode formats you are interested in; in this example, we specify QR_CODE and UPC_A, so any other formats are ignored. Next, you provide it with an image, and then you run the detection algorithm, which gives you the detected barcodes. Once you have the barcodes, you can access their properties. Every barcode has some common properties, like the raw value (the encoded barcode value), the corner points (where the barcode is located in the image), and the barcode format: is it UPC_A? Is it QR_CODE?
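As a concrete illustration of those three steps, here is a minimal static-image sketch against the Android Mobile Vision classes. It assumes a `Context` and a `Bitmap` are already in hand, and skips error handling:

```java
import android.content.Context;
import android.graphics.Bitmap;
import android.graphics.Point;
import android.util.SparseArray;
import com.google.android.gms.vision.Frame;
import com.google.android.gms.vision.barcode.Barcode;
import com.google.android.gms.vision.barcode.BarcodeDetector;

public class StaticBarcodeScan {
    static void scan(Context context, Bitmap bitmap) {
        // Step 1: build a detector restricted to QR codes and UPC-A, as in the talk.
        BarcodeDetector detector = new BarcodeDetector.Builder(context)
                .setBarcodeFormats(Barcode.QR_CODE | Barcode.UPC_A)
                .build();

        // On first run, the vision models mentioned above may still be downloading.
        if (!detector.isOperational()) return;

        // Step 2: wrap the image in a Frame. Step 3: run detection.
        Frame frame = new Frame.Builder().setBitmap(bitmap).build();
        SparseArray<Barcode> barcodes = detector.detect(frame);

        for (int i = 0; i < barcodes.size(); i++) {
            Barcode barcode = barcodes.valueAt(i);
            // Common properties: raw value, corner points, and format.
            String raw = barcode.rawValue;
            Point[] corners = barcode.cornerPoints;
            boolean isQr = (barcode.format == Barcode.QR_CODE);
        }

        detector.release();
    }
}
```

The `isOperational()` check matters on first run, while the downloaded models are not yet available.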
A 2D barcode can actually contain structured data, and we parse that data for you. What you want to do is check the value format; in this particular case, it's a phone number. There are other value formats too, like contact, address, and others. Once the data is parsed, you can access it through the objects we provide. So that is how you do static image detection.

Now let's expand that to a pipeline. Remember, we talked before about the four parts of the pipeline: the camera source, the detector, the processor, and the tracker. The first thing you do is instantiate a barcode detector, just like before. Next, you instantiate a camera source and provide the detector to it; when the camera starts, it automatically delivers camera frames to the detector to run detection.

Next, you instantiate a tracker. Remember, the tracker is where you put your business logic. It's an event-based listener, and these are the methods you can override. onNewItem is called when a detected barcode is seen for the first time in the pipeline; if you want to add an overlay graphic to your application, like in the video we showed, this is where you add it. onUpdate is usually called on every frame, when you get an updated location for the detected barcode; in the overlay example, this is where you update the overlay's location. onMissing is called when a tracked barcode has been missing for a couple of frames; this can be due to occlusion, or because the frame is blurry. After several frames in which the barcode is nowhere to be seen, onDone is called. onDone means the pipeline is releasing all the tracking resources related to this barcode, so this is also where you clean up your overlay graphics.
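As a sketch, the four callbacks just described might look as follows, wired up with a focusing processor and a camera source as the talk goes on to explain. The overlay bookkeeping is left as comments, and error handling and camera permissions are omitted:

```java
import android.content.Context;
import com.google.android.gms.vision.CameraSource;
import com.google.android.gms.vision.Detector;
import com.google.android.gms.vision.FocusingProcessor;
import com.google.android.gms.vision.Tracker;
import com.google.android.gms.vision.barcode.Barcode;
import com.google.android.gms.vision.barcode.BarcodeDetector;

public class BarcodePipeline {

    // One instance per tracked barcode; this is where the app's business logic lives.
    static class OverlayBarcodeTracker extends Tracker<Barcode> {
        @Override public void onNewItem(int id, Barcode barcode) {
            // Barcode seen for the first time: create its overlay graphic here.
        }
        @Override public void onUpdate(Detector.Detections<Barcode> detections, Barcode barcode) {
            // Updated location on (almost) every frame:
            // move the overlay to barcode.getBoundingBox().
        }
        @Override public void onMissing(Detector.Detections<Barcode> detections) {
            // Temporarily missing (occlusion, blur): perhaps hide the overlay.
        }
        @Override public void onDone() {
            // Tracking resources are being released: remove the overlay graphic.
        }
    }

    static CameraSource buildPipeline(Context context) {
        BarcodeDetector detector = new BarcodeDetector.Builder(context).build();
        // A focusing processor picks one barcode and keeps delivering it to the tracker.
        detector.setProcessor(
                new FocusingProcessor<Barcode>(detector, new OverlayBarcodeTracker()) {
            @Override public int selectFocus(Detector.Detections<Barcode> detections) {
                // Called only when nothing is currently focused;
                // here we simply take the first detected barcode's id.
                return detections.getDetectedItems().keyAt(0);
            }
        });
        // The camera source streams frames into the detector once started.
        return new CameraSource.Builder(context, detector).build();
    }
}
```

Starting the returned `CameraSource` (with a surface to preview into) sets the whole pipeline in motion.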

Next, we instantiate a processor. This is where we hook the detection results up to a tracker. The mobile vision API provides two flavors of processor: the focusing processor and the multi-processor. We'll talk about the multi-processor later in the presentation; in this use case, we're using a focusing processor. A focusing processor lets you select one barcode and focus on it, continuing to deliver notifications until that barcode is no longer seen by the pipeline.

This is what the code looks like. You override the selectFocus method. selectFocus is only called when there is no item currently being focused on: when a batch of barcode detections comes in, you select which barcode you want to focus on. That barcode keeps sending notifications to the tracker until onDone is called on it. After that, selectFocus gets called again, and you select the next barcode you want to track.

So that's it; that's how you construct a pipeline. It's very easy. Four steps: you instantiate a detector, a tracker, and a processor, and then you hook it all together using the camera source.

The barcode API also works on iOS. The barcode APIs for Android and iOS share the same underlying algorithms, so when you use them, you get consistent results on both platforms. Our iOS API is provided through CocoaPods. If you only want to use the detector, you can specify the barcode detector pod; if you also want to use the pipeline, add the MVDataOutput pod as well.

Again, in this slide, we look at how to use the iOS barcode detector for static image detection. First, we instantiate the barcode detector with a factory pattern. Just like before, you can specify which barcode formats you are interested in. Next, you provide it with an image, and you get the detected barcodes back. Once you have a barcode, you can access its properties. All the properties we talked about before are
available here: the raw value, the corner points, and the value format.

This is what the iOS pipeline looks like. It should seem familiar; it has four steps, just like before. However, to respect iOS conventions, we made some changes. Instead of the camera source, we use an AVCaptureSession to interact with the camera. Instead of the processor, we use a data output, which is a delegate on the AVCaptureSession, to hook the detection results up to the tracker. The rest should be fairly familiar: you instantiate a detector; you instantiate a tracker, where you put your business logic; you use a data output to hook your detection results up to your tracker; and then you use the AVCaptureSession to stream the camera into the detector. Next, Yulong will talk about the face API.

YULONG LIU: Thanks. Hello. Now it's the face API's turn. The face API makes detecting faces in a static image, a video, or a camera stream really easy. Using the face API, you can build all kinds of fun apps that detect faces. In the previous demo, we showed that you can use the face API to build a sunglasses try-on app: users take a picture of themselves, or point the camera at themselves, and they can see what they would look like wearing sunglasses. You can, of course, do more with the face API. For example, you can build an avatar app: the user inputs a photo, and you use the face API to generate an avatar for them.

One thing to note about the face API is that this is a face detection API, not face recognition. It can detect all the faces inside an image or video, but it has no idea who those faces are. The face API works very well on human faces: whether the face has an extreme expression or is partially obstructed, the face API can accurately locate it in the image. When detecting a face, the face API also
reports a list of positions on the face, which we call facial landmarks, including the eyes, nose, mouth, and so on. These facial landmarks help you better understand positions on the face and the angle of the face. The face API doesn't require the subject to face the camera directly to work: we support multiple angles, and the face API reports which angle the face is at. The face API also supports some facial activity classification: it can detect whether a person's eyes are open, as you can see in the video, and whether the person is smiling.

To use the face API, you use the same Gradle dependency as for barcodes, and in the runtime dependency in your AndroidManifest.xml, you specify that you are using the face API so that our API can download the necessary files for you. Once you have the dependencies set up, the usage is very easy. The first step is to instantiate a face detector using the builder pattern. You can give the builder a list of parameters. For example, you can say you only want to detect faces larger than 10% of the image size, or that you are only interested in the largest face in the image. If you want to do classification (is the eye open, is the person smiling), you need to enable classification on the builder. The same goes for landmarks: if you want to detect landmarks, you need to enable them on the builder. You can also tell the face detector to run in accurate mode or fast mode; this is a tradeoff between accuracy and speed. Once you have a face detector instance, you can apply it to a static image and get all this information about the face, including its position. And if you enabled classification, you get the probabilities: is the eye open? Is the person smiling?
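The builder options just listed map onto code roughly as follows. This is a sketch for the static-image case, assuming a `Context` and a `Bitmap` are available:

```java
import android.content.Context;
import android.graphics.Bitmap;
import android.util.SparseArray;
import com.google.android.gms.vision.Frame;
import com.google.android.gms.vision.face.Face;
import com.google.android.gms.vision.face.FaceDetector;
import com.google.android.gms.vision.face.Landmark;

public class StaticFaceScan {
    static void scan(Context context, Bitmap bitmap) {
        FaceDetector detector = new FaceDetector.Builder(context)
                .setMinFaceSize(0.1f)                 // ignore faces under 10% of image size
                .setProminentFaceOnly(true)           // only the largest face
                .setClassificationType(FaceDetector.ALL_CLASSIFICATIONS) // eyes open / smiling
                .setLandmarkType(FaceDetector.ALL_LANDMARKS)             // eyes, nose, mouth, ...
                .setMode(FaceDetector.ACCURATE_MODE)  // accuracy vs. speed tradeoff
                .setTrackingEnabled(false)            // tracking is for video, not static images
                .build();

        SparseArray<Face> faces =
                detector.detect(new Frame.Builder().setBitmap(bitmap).build());
        for (int i = 0; i < faces.size(); i++) {
            Face face = faces.valueAt(i);
            // Classification results come back as probabilities.
            float smiling = face.getIsSmilingProbability();
            float leftEyeOpen = face.getIsLeftEyeOpenProbability();
            for (Landmark landmark : face.getLandmarks()) {
                // landmark.getType() / landmark.getPosition() locate eyes, nose, mouth, ...
            }
        }
        detector.release();
    }
}
```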
Of course, if you enabled landmark detection, you also get a list of landmarks on the face.

Like the barcode API, the face API supports video analysis. A quick refresher on the pipeline: we have four components. The camera source is responsible for getting frames from the camera, and it passes those frames to the face detector, where faces are detected. The focusing processor is responsible for selecting the right face and routing it to a tracker. The tracker is where you put your business logic; we will show how to implement a tracker class in the next slide.

Configuring the whole pipeline is very straightforward. You instantiate a tracker; you instantiate a detector, the same as for a static image; and you tell the detector that you are using, in this example, the largest-face focusing processor, which routes only the largest face to the tracker and omits the rest. Then you build the CameraSource class, and when you start it, the whole pipeline starts working.

This is what an implementation of the tracker class looks like. Most importantly, two methods are needed. The first is onNewItem, which is called every time a new face is detected by the face detector. If you wanted to build the same app as in the video, which scans a face and puts sunglasses on it, this is where you would create the sunglasses overlay. The second is onUpdate, which is called every time an existing face has an update: the position of the face changed, or the person smiled when he or she wasn't smiling before. This is where you would move the sunglasses overlay to the proper location.

So far, we have only discussed tracking a single object, whether a barcode or a face. But our API, of course, supports tracking multiple objects. The pipeline looks very similar, with the same four components.
The only difference, highlighted here, is that instead of a focusing processor, you use a multi-processor.
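A sketch of that multi-processor variant for faces, with the overlay logic again left as comments:

```java
import android.content.Context;
import com.google.android.gms.vision.CameraSource;
import com.google.android.gms.vision.Detector;
import com.google.android.gms.vision.MultiProcessor;
import com.google.android.gms.vision.Tracker;
import com.google.android.gms.vision.face.Face;
import com.google.android.gms.vision.face.FaceDetector;

public class MultiFacePipeline {

    // The factory creates one tracker instance per detected face.
    static class FaceTrackerFactory implements MultiProcessor.Factory<Face> {
        @Override
        public Tracker<Face> create(Face face) {
            return new Tracker<Face>() {
                @Override public void onNewItem(int id, Face item) {
                    // New face: create its own overlay (e.g. sunglasses).
                }
                @Override public void onUpdate(Detector.Detections<Face> detections, Face item) {
                    // The face moved or its expression changed: update the overlay.
                }
            };
        }
    }

    static CameraSource buildPipeline(Context context) {
        FaceDetector detector = new FaceDetector.Builder(context).build();
        // Unlike a focusing processor, the multi-processor fans detections
        // out to a separate tracker per face.
        detector.setProcessor(
                new MultiProcessor.Builder<>(new FaceTrackerFactory()).build());
        return new CameraSource.Builder(context, detector).build();
    }
}
```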

Unlike the focusing processor, which routes only one element to the tracker, the multi-processor creates a tracker instance for every element detected by the face detector. The code looks very similar. The only difference, highlighted here, is that you create a multi-processor with a Factory class, which creates a tracker instance for every face the face detector finds. The rest of the pipeline is the same as for single-face tracking.

The face API also has iOS support, available through CocoaPods. If you only need to run the face API on static images, the face detector is the right CocoaPod to use. If you also want video support, you need to include the [INAUDIBLE] data output CocoaPod. Using the API on iOS is very similar to Android. The first step is to instantiate a face detector with all the desired configurations. Then you run the detector on a static image, and you can access the properties of the face in a very simple way.

Again, the iOS API supports the same video analysis as the Android API. The pipeline is the same, except the class names change. For example, the CameraSource is responsible for providing frames on Android, but on iOS it's the AVCaptureSession. The face detector is responsible for detecting faces, while the focusing data output is basically the focusing processor. And the tracker is the same: it's where you put your business logic. To configure the pipeline, the first step is to instantiate the face detector and a tracker. You instantiate a largest-face focusing data output, which is basically the same as the largest-face focusing processor on Android. Then you use the AVCaptureSession to get frames from the camera, and the whole pipeline starts working. The implementation of the tracker is very similar, except of course the language is different. There are two methods: the first is called when a new face is detected, and the second is called when an existing face has some
updates.

We have the same multi-face tracking on iOS as on Android. The only difference is that you use a multi-data output instead of a focusing data output, so that it creates a tracker instance for every face that is detected. The code change is very straightforward, only one class changes, so I'll skip this slide.

The next API I'm going to talk about is the text API. Personally, I find the text API very interesting; you can build a lot of useful apps with it. For example, if you are building a payment app, you can do what we did in the video demo: the user points the camera at their credit card, and you use the text API to extract the necessary information, like the card number, the cardholder name, or the expiration date, so the user doesn't have to type it all in on a tiny keyboard. Or you can build a business card scanning app: take a picture of the business card, extract the email address, phone number, and name, and save that information to contacts.

This is a short video demo of the text API. As you can see, we support multiple color schemes; it's very robust. And we support multiple languages: currently, all Latin-character languages, which is more than 20. When detecting and recognizing text, the text API doesn't just return the content of the text; it also keeps the structure of the original text. That means the return value of the text API has three levels of objects. The top level is the text block, which is essentially a paragraph in the original image. In each text block, there are multiple lines, each corresponding to a single line in the original image. And in each line, we report the words inside that line.

You use the same Gradle dependency if you want to use the text API on Android, and in the runtime dependency, you want
to specify that you are using the text API so we can download the necessary files. Although developing the text API was quite difficult, using it is actually very easy. The first step is to instantiate a builder and build the text recognizer. Then you run it on a static image, and you get a lot of useful information about the text. For example, you get the language of the text. As we mentioned before, the text API supports more than 20 languages, and we don't require you to tell it which language you are detecting: the text API automatically detects the language and tells you which one it is. You also get the position of the text and, of course, most importantly, its value. You can call the getComponents method on a text block, which is a paragraph, to get the lines in the paragraph, and you can call the same method on a line to get all the words inside it.

Like the other two APIs, the text API also supports video analysis. The pipeline is the same, so I'll skip this slide to save time.

A quick summary. We currently have three APIs. We have the face API, which detects faces as well as the associated facial landmarks and facial activities. We have the barcode API, which decodes both 1D and 2D barcodes. And we have the text API, which recognizes text from an image or video; it supports Latin-character-based languages, and it keeps the structure of the text for you. All three APIs follow a very similar pattern, so if you know how to use one of them, it's very easy to use the rest. And all the computation of these APIs happens on your device, so you don't need to worry about network bandwidth.

Now that you know how to use the APIs, here are some tips on making the best use of them. The first tip: always run the detector on a background thread. The latency of these three APIs is higher than 16 milliseconds, so if you run them on your UI thread, your UI will be laggy, and you will have a very bad user
experience. The second tip: if possible, do image pre-processing. For example, if you are running the face API on a video and you know that part of the video is blurred, whether from motion blur or because the light is so dark that it's difficult to recognize a face, you might want to skip those frames. By doing that, you can save a lot of battery.

To make the best use of the mobile vision API, you can also use it together with the Cloud Vision API. For those who don't know much about it, the Cloud Vision API is another vision-related API developed by Google; it runs on Google Cloud and provides more information about detected items. For example, the Cloud Vision face API provides emotion detection, which the mobile vision API doesn't: it can tell you whether a person looks joyful. But the Cloud Vision API has higher latency, because it requires a network round trip to the cloud.

Say you want to detect the emotion of a person inside a video. You can combine the mobile vision API and the Cloud Vision API: use the mobile vision API as a coarse detector, or pre-processor, running on every frame to check whether there is a face in the frame, and whether that face is large enough for the Cloud Vision API to do emotion detection. Only when you find a large enough face do you pass that frame to the Cloud Vision API for emotion detection. By doing this, you reduce the latency of your app while getting the same result as if you had used the Cloud Vision API alone. I'm going to play a video that might give you a better idea of what you can do by combining local detection and cloud detection.

[VIDEO PLAYBACK]

[END PLAYBACK]
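A rough sketch of that coarse-to-fine gating. Note that `detectEmotionInCloud` is a hypothetical placeholder for a real Cloud Vision request, and the 30% size threshold is an arbitrary example, not a recommended value:

```java
import android.content.Context;
import android.graphics.Bitmap;
import android.util.SparseArray;
import com.google.android.gms.vision.Frame;
import com.google.android.gms.vision.face.Face;
import com.google.android.gms.vision.face.FaceDetector;

public class CoarseToFineEmotion {
    // Only frames with a face at least this fraction of the frame width go to the cloud.
    static final float MIN_FACE_FRACTION = 0.3f;

    static void processFrame(Context context, FaceDetector detector, Bitmap frameBitmap) {
        // Cheap on-device pass: is there a face in this frame at all?
        SparseArray<Face> faces =
                detector.detect(new Frame.Builder().setBitmap(frameBitmap).build());
        for (int i = 0; i < faces.size(); i++) {
            Face face = faces.valueAt(i);
            // Gate: is the face large enough to be worth a network round trip?
            if (face.getWidth() >= MIN_FACE_FRACTION * frameBitmap.getWidth()) {
                detectEmotionInCloud(frameBitmap); // hypothetical Cloud Vision call
                return;
            }
        }
        // No sufficiently large face: skip the expensive cloud request for this frame.
    }

    static void detectEmotionInCloud(Bitmap frame) {
        // Placeholder: send the frame to the Cloud Vision API on a background thread.
    }
}
```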

Thanks.

[APPLAUSE]

Inside this demo, the on-device detection is responsible for detecting whether the object of interest is inside the frame. If it is, we pass that frame to the cloud for a more detailed detection. That's the mobile vision API.

Here are some useful links for people who want to know more about the mobile vision API. The first link is our official website, where we put our documentation. The three Codelabs were developed for the Android API; they have step-by-step instructions covering everything from creating an Android project, to using the API, to building the app. So if you are new to Android development and want to try our API, these three links are definitely very helpful. We also mentioned two GitHub repositories, one for Android and one for iOS, where we put some sample code. If you want to see how we use the API, you can check those two repositories, and if you find any bug in the API, feel free to open an issue there. In case you have more general questions, like how to use the API, or what an exception you got means, feel free to ask under the Android Vision or Google iOS Vision tags on Stack Overflow; we have developers monitoring those tags periodically and answering questions.

The mobile vision API has powered more than 15,000 apps with the cool features of detecting faces and recognizing barcodes and text. Now it's waiting for you to think of how to use these features inside your app. On behalf of the mobile vision team, Hsiu and I, and Hsiu's dog, would like to thank you all for coming to this presentation.

[APPLAUSE]