Team Knock-Knock Answers the “Who’s There?” Question With AR

Written by Nathan Wilkerson, VP of Engineering | Jan 15, 2019 12:00:00 AM

Metal Toad’s winter 2018 hackathon was all about Virtual Reality and Augmented Reality (VR and AR). At its most basic, AR means taking a camera feed and layering extra information on top of it, as in Pokémon Go or the AR Stickers in the Pixel camera.

The idea
When the theme of VR/AR was announced in September, I was reading an article about vOICe technology and was intrigued. What if we could overlay audio on what the camera saw? Then you could have audio-based AR for everyone without needing to look at a screen all the time.

Upon researching this, I discovered that using a sensory substitution technology can take weeks, months, or even years to fully train on—not exactly ideal for a two-day hackathon. But the idea of using AR with audio was stuck in my head, and I wanted to use it.

Eventually, I realized that this idea could offer a solution to an everyday problem: remembering people’s names (something I’m not great at). So I settled on using image recognition to give audio feedback of who you were looking at. Knock-Knock was born.

Implementation
Team Knock-Knock consisted of Vaughn Hawk, Oren Goldfarb, and me. We didn’t have any knowledge of how to do facial recognition or video processing. But we leaned into what we did know: Amazon Web Services (AWS). We discovered they had a facial recognition service, Rekognition (part of the AWS Machine Learning suite of products). Best of all, it could take a video stream directly from Kinesis Video Streams.

Now that we had tools, we found some tutorials and AWS Training documents. We divided the AWS stack and started building. What we built would eventually look like the illustration below.

Once the AWS infrastructure was running, we set about pair programming on the Raspberry Pi. We put together a simple yet effective device: aim its camera at a person, and it recognizes their face and reads out their name.

Along the way we decided that returning the name wasn’t enough. We wanted to get information about the person we were interacting with. So we set up a simple DynamoDB database with Metal Toad names and titles. Then when the recognition data came back, the device would check DynamoDB and read out the name and title as well.
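
The lookup step might look roughly like the sketch below. It assumes the incoming record has the shape a Rekognition stream processor emits (a `FaceSearchResponse` list containing `MatchedFaces`), and that each face was indexed with an `ExternalImageId` that doubles as the staff key. A plain dictionary stands in for the DynamoDB table. The `STAFF` data, the function names, and the 80 percent similarity threshold are illustrative assumptions, not the team's actual code.

```python
import json

# Stand-in for the DynamoDB table of names and titles (hypothetical data).
STAFF = {
    "jdoe": {"name": "Jane Doe", "title": "Developer"},
}

def best_match(record, min_similarity=80.0):
    """Return the ExternalImageId of the most similar matched face, or None."""
    best_id, best_score = None, min_similarity
    for response in record.get("FaceSearchResponse", []):
        for match in response.get("MatchedFaces", []):
            score = match.get("Similarity", 0.0)
            face_id = match.get("Face", {}).get("ExternalImageId")
            # Keep only matches that clear the threshold and beat the best so far.
            if face_id is not None and score >= best_score:
                best_id, best_score = face_id, score
    return best_id

def announcement(raw_record, staff=STAFF):
    """Build the text to speak for one JSON record, or None if nobody is recognized."""
    person = staff.get(best_match(json.loads(raw_record)))
    if person is None:
        return None
    return f"{person['name']}, {person['title']}"
```

Against a real table, the dictionary lookup would presumably become a DynamoDB `get_item` call keyed on the same ID. Picking the highest-similarity match rather than the first face in the record is one way to handle frames with several people in them.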

Tough problems
We had two major problems working on this stack.

As soon as the video went to the cloud it entered a black box. We had visibility into some metrics, but troubleshooting where data was blocked or why it was slow was difficult.
Kinesis Firehose can be configured with how much data to dump at a time, but the smallest increments were 60 seconds or 1MB, and the Rekognition data was only a few KB every minute. This meant that people would need to look at the camera for over 60 seconds between name pickups.

We thought about using a different tool, but with limited time, we came up with a different method to speed this up. We doctored some fake JSON records and put them directly into Kinesis Firehose. These records filled up the 1MB buffer and caused Kinesis Firehose to output a file. The fake records were marked in a way that let us filter them out from the real data. This cut the delay to about 20 seconds. Still slow, but much better for our demo.
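
A minimal sketch of that padding trick follows, assuming newline-delimited JSON records. The marker field `is_padding`, the 50 KB filler size, and both function names are hypothetical choices made for illustration; the post does not say how the real fake records were tagged.

```python
import json

PADDING_FLAG = "is_padding"   # hypothetical marker field for fake records
BUFFER_BYTES = 1024 * 1024    # the 1MB buffer size described above

def make_padding_records(target_bytes=BUFFER_BYTES, record_bytes=50_000):
    """Build filler records (as bytes) totalling at least target_bytes."""
    records, total = [], 0
    while total < target_bytes:
        line = json.dumps({PADDING_FLAG: True, "filler": "x" * record_bytes}) + "\n"
        records.append(line.encode("utf-8"))
        total += len(records[-1])
    return records

def real_records(lines):
    """Parse delivered lines and drop anything carrying the padding marker."""
    parsed = (json.loads(line) for line in lines if line.strip())
    return [rec for rec in parsed if not rec.get(PADDING_FLAG)]
```

Each filler record could then be pushed to the delivery stream, for example with the Firehose `put_record_batch` API, and the consumer would run `real_records` over every delivered file before acting on it.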

Next steps
If we keep working on this project for real world distribution, there are some problems we encountered that need to be solved:

Lighting conditions can create problems seeing the face, particularly when the target is backlit. In a controlled environment we worked around this problem, but it would be interesting to work through the challenges in a less controlled setting
Data handling needs to be cleaned up. For testing and demo purposes, we had the device say the first person it saw. If there are multiple people present or there is bad data, like not knowing who is pictured, that creates problems. In a controlled data environment we could avoid this, but there are more considerations for real-world use
The voice we had sounded ridiculous. It came from a great Python library with a pre-generated voice, but it sounded very computer-generated, nothing like the more natural voices that virtual assistants present us with today
Cost for this in full production could be prohibitive for large-scale use. The most expensive part of this would be the Rekognition service, which costs $0.12/minute. That’s $7.20 an hour, or around $1,756/month if you were to run it 8 hours a day for a month. But as this technology becomes more mature, the cost may go down
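
The cost estimate in the last item can be reproduced with a few lines of arithmetic. The rate and the 8 hours a day come from the text above; the 30.5-day month is an assumption chosen because it matches the quoted monthly figure.

```python
RATE_PER_MINUTE = 0.12   # USD per minute, as quoted above
HOURS_PER_DAY = 8
DAYS_PER_MONTH = 30.5    # assumed average month length

hourly = RATE_PER_MINUTE * 60
monthly = hourly * HOURS_PER_DAY * DAYS_PER_MONTH

print(f"${hourly:.2f}/hour, ${monthly:,.2f}/month")
# prints "$7.20/hour, $1,756.80/month"
```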

Learning for the future
This was the team’s first time developing something that used Kinesis Video Streams, Kinesis Data Streams, Kinesis Firehose, DynamoDB, and Rekognition all together. Within just 48 hours, we were able to get familiar and comfortable working with these technologies. Knowing that we can leverage our existing AWS expertise to integrate more Amazon services, and do it quickly, gave us all a lot of confidence going forward.

During the demos, Metal Toad’s CTO Tony said, “We have clients who want to try using facial recognition technology, but they’re hesitant because it seems too hard.” Maybe it was a little hard, but I didn’t really focus on the difficulty. I simply saw a problem that sounded fun to solve, broke it down into manageable tasks, and worked through each step. We had problems, and we worked through them. Working as a team, we found that the difficulty of a problem is irrelevant; it’s all about the process and an Agile approach that makes even large challenges manageable.

Whether we use the facial recognition features again or not, the experience with moving data in the suite of Kinesis tools will be a step forward as Metal Toad continues to build more IoT and data analysis projects for our clients.
