Most HCI algorithms (computer vision, motion/positioning, speech, ...) work great in the lab, but fail when deployed as a full-scale product or brought into the wild for consumers.

Most of the people designing these algorithms and devices came out of universities where they were trained on theoretical algorithms, OpenCV, university data sets and team projects. They are not trained in measuring living rooms, coffee shops, parties, triathlons and rain storms.
Developers usually start with a theoretical mindset and a data set recorded by universities or by "well-mannered" employees during early development. When the product is then used in beta testing, completely different issues arise that were never accounted for in the architecture of the algorithms.
Real-world applications never look as clean as official data sets.
- Google's classification algorithms classified people with dark skin as "gorillas" because they weren't trained on enough instances of non-Caucasian humans.
- A Tesla on Autopilot drove itself under a big rig, killing the driver, because it didn't expect such a large gap between the trailer's wheels.
- A crime-fighting robot ran over a toddler at a mall because it assumed people would be taller.
- Microsoft's chatbot Tay was trained to be racist by outside users shortly after it was released.
Below are 5 takeaways I have seen help in creating real-world algorithms for HCI products. A lot of this comes from messing it up myself, or from lessons my teams learned as we developed products over the last decade.
1. Embrace the noise: Use EVERYTHING you record
Visual tracking... of the entire environment
At Sticky, our webcam eye tracking algorithms are the most accurate in the world not just because our tracking of the face, eyes, pupils, etc. is the best it can be, but because we also take into account movement, lighting, head pose, subtle facial features and a myriad of other dimensions which maximize the ability to track people in the real world.

By tuning our algorithms to the environment people are really in, we were able to increase our data quality/accuracy by 8x.
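To make the idea concrete, here is a minimal sketch (with made-up feature names and weights, not Sticky's actual algorithm) of how per-frame environmental signals can down-weight frames where tracking is likely to be unreliable:

```python
import numpy as np

def weighted_gaze_estimate(frames):
    """Fuse per-frame gaze estimates, down-weighting frames where the
    environment (lighting, motion, head pose) degrades tracking quality.
    All feature names and weights here are illustrative."""
    estimates, weights = [], []
    for f in frames:
        # Penalize dim/overexposed lighting, fast head motion, extreme head pose.
        lighting_q = max(1.0 - abs(f["brightness"] - 0.5) * 2, 0.0)  # best near mid-range
        motion_q   = np.exp(-f["head_speed"] / 50.0)                 # pixels per frame
        pose_q     = np.cos(np.radians(min(abs(f["head_yaw"]), 89)))
        estimates.append(f["gaze_xy"])
        weights.append(lighting_q * motion_q * pose_q)
    weights = np.array(weights)
    if weights.sum() < 1e-6:
        return None, 0.0                  # not enough trustworthy frames
    gaze = np.average(np.array(estimates), axis=0, weights=weights)
    return gaze, float(weights.mean())    # estimate plus an overall quality score
```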
Brainwaves, heart rate, movement and all the noise
At EmSense, when we started developing our integrated EEG, accelerometer, pulse-ox, EMG, etc. headset for video games, the algorithms worked great when people sat still, but if they moved at all, or even if a person across the room moved, we would get interference. The reason was that our amplifiers had 35 nano-volt resolution, so literally waving your hand 5 feet away changes the voltage at the EEG sensors on someone's head. For the accelerometer, leaning back in the chair or having a train go by also impacts quality. With pulse-ox, moving around shifts the sensor on the head, creating changes that have nothing to do with blood flow.

The end result is that we built a product used by EA, THQ, Activision and many other top-tier studios to predict the performance of their video games in market and optimize the experience to meet people's expectations.
To predict the engagement of video games, really simple metrics like these actually worked, in addition to complex integrated dimensions:
- How often did someone blink? Did they move their head? (The less often, the more engaged.)
- Was there a delta from resting heart rate? (An adrenaline-inducing event increases heart rate.)
Much more complex algorithms around EEG, heart rate variability and muscle tension were also combined with segmentation of the actual experience. This enabled predictions of the impact of each element.
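As a rough illustration, here is how those simple signals might be combined into an engagement score. This is a sketch with invented thresholds and weights, not EmSense's actual model:

```python
import numpy as np

def simple_engagement_score(blink_times, head_speed, heart_rate, resting_hr, duration_s):
    """Rough engagement heuristics in the spirit of the text: fewer blinks and
    less head movement suggest focus; a delta above resting heart rate suggests
    an adrenaline-inducing event. All constants are illustrative."""
    blink_rate = len(blink_times) / (duration_s / 60.0)           # blinks per minute
    movement   = float(np.mean(head_speed))                       # average head speed
    hr_delta   = float(np.mean(heart_rate)) - resting_hr          # bpm above resting

    focus   = 1.0 / (1.0 + blink_rate / 15.0 + movement / 5.0)    # lower when fidgety
    arousal = min(max(hr_delta, 0.0) / 20.0, 1.0)                 # ~20 bpm delta -> 1.0
    return 0.5 * focus + 0.5 * arousal
```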
2. Classify/segment the world before analyzing it
Machine learning seems like a magic tool, but if it isn't used correctly (and it usually isn't), the wrong data gets analyzed and trained on, creating a poor predictor of the real world.
Therefore, it is really important to understand, analyze and prepare your data inputs in a logical way beforehand and be able to explain what the SVM or other algorithm is actually doing. Otherwise, it is probably doing something you don't want it to.
One of the most famous examples of this is a paper whose authors thought they had trained a neural network to correctly segment tanks from background images. What they didn't realize was that all of the tank images were taken on a single, cloudy day, and the forest images were taken on another day. The paper actually made it through publication before this was identified. What they had really built was a classifier for the weather.
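A cheap defense is to test whether a nuisance variable alone can predict the labels before trusting the model. A minimal sketch, assuming scikit-learn and a list of grayscale images as NumPy arrays:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def confound_check(images, labels):
    """Before celebrating a 'tank detector', test whether a nuisance feature
    such as mean image brightness already predicts the label. If it does,
    the model may be learning the weather, not the tanks."""
    brightness = np.array([img.mean() for img in images]).reshape(-1, 1)
    score = cross_val_score(LogisticRegression(), brightness, labels, cv=5).mean()
    if score > 0.8:
        print(f"Warning: brightness alone predicts labels ({score:.0%}), likely a confound")
    return score
```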
Understand people's behavior and their world
For Fitbit, when we were designing an algorithm and API for geo-tracking and segmentation of people's lives, it became very clear we had to create a classification or understanding of the world before we could extract valuable data from the details.

The reason is that complex outliers would end up dominating our results, so the best- and worst-rated days would happen not because of true fitness but because of tangential data errors:
- Spurious readings from people going into tunnels and GPS updating with large jumps that looked like car rides would create "events" that seemed important but actually needed to be removed.
- Some days, people would have good Fitbit data but the phone (GPS) was left in an office, or whole days of Fitbit data would be missing while the GPS data existed. This would make us rate that day as a 0 day.
- Some office buildings are large enough to be a university campus, so it would seem like someone was outside taking 12 walks even though they were inside going from meeting to meeting all day.
- Shorter people on a jog may look like taller people on a walk.
- Part of the development team was in Ukraine, where GPS is significantly less accurate than in San Francisco, resulting in almost unintelligible raw data. Very different smoothing and clustering was necessary.
- Buses in traffic can look like a stationary location or walking based on just GPS.
- Multiple contradictory events can happen within a single minute: getting on a bus, walking out a door, having your cell phone on you for only 30 seconds of that time.
- Additionally, given a smartphone's small battery, it isn't possible to leave high-resolution GPS on at all times or the battery dies in 3-4 hours, so a set of location hacks had to be implemented to lower the system's power drain by 10x.
The system then ran the real algorithms, created segments of the day that shared the same type of activity/location/etc., and used the improved raw data sets to calculate the final result.
By correctly taking into account the randomness of the environment, the system can cleanly segment and classify the actual experience of a user.
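To illustrate the two-step idea of cleaning raw location data and then segmenting the day, here is a rough sketch. The thresholds and activity labels are invented for illustration and this is not Fitbit's actual pipeline:

```python
import math

def clean_and_segment(points, max_speed_mps=70.0):
    """points: list of (timestamp_s, lat, lon) samples from the phone.
    Pass 1 drops impossible GPS jumps (tunnel exits, multipath); pass 2
    segments the remainder into coarse activity types by speed."""

    def dist_m(a, b):  # equirectangular approximation, fine for short hops
        dlat = math.radians(b[1] - a[1])
        dlon = math.radians(b[2] - a[2]) * math.cos(math.radians(a[1]))
        return 6371000.0 * math.hypot(dlat, dlon)

    # Pass 1: drop points that imply impossible speeds.
    cleaned = [points[0]]
    for p in points[1:]:
        dt = max(p[0] - cleaned[-1][0], 1e-3)
        if dist_m(cleaned[-1], p) / dt <= max_speed_mps:
            cleaned.append(p)

    # Pass 2: label each hop by speed, then merge consecutive hops of the same type.
    segments = []
    for prev, cur in zip(cleaned, cleaned[1:]):
        speed = dist_m(prev, cur) / max(cur[0] - prev[0], 1e-3)
        label = "stationary" if speed < 0.3 else "walking" if speed < 2.5 else "vehicle"
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], cur[0])
        else:
            segments.append((label, prev[0], cur[0]))
    return cleaned, segments
```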
The reason we knew all of these things very early on is that Fitbit does a great job of field testing new algorithms and systems in the rough, real world, which let the team start the project with multiple real, multi-week data sets of full data.
Segment signals at a high level to understand their context before analyzing
Especially with chaotic signals like EEG or 3D depth maps, understanding why and how you are segmenting really matters for creating a stable output. Many times the output from a sensor may be identical in two very different situations because of a weakness in the sensor, so you need a third dimension to identify how to interpret it.
For example, alpha-band EEG power is commonly used as an indicator of relaxation. But if someone has their eyes open vs. closed, the alpha frequency content changes drastically due to multiple factors in visual processing. An algorithm that didn't know the eyes were closed would assume the person had completely relaxed and changed state, when the actual answer is that they simply shut their eyes and are still in the same state; there is just an offset in the amplitude of those frequencies.
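A tiny sketch of the idea: interpret alpha power relative to the detected eye state instead of in isolation. The fixed offset here is purely illustrative; in practice it would be calibrated per person:

```python
import numpy as np

def relaxation_index(alpha_power, eyes_closed, closed_offset_db=6.0):
    """Interpret alpha-band power differently depending on eye state.
    Closing the eyes raises alpha amplitude on its own, so subtract an
    (illustrative, ideally per-person calibrated) offset before comparing
    against the eyes-open baseline."""
    power_db = 10.0 * np.log10(alpha_power)
    if eyes_closed:
        power_db -= closed_offset_db   # remove the eyes-closed amplitude offset
    return power_db
```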
3. Calculate "confidence" in your result as a core part of your algorithm
One of the most embarrassing failures I have seen is when an algorithm falls apart and can't compute the right answer, but the implementation doesn't flag this, so a number still comes out, and it is completely wrong.

What if you get incomplete data?
An example of this was Emotiv's GDC demo of their headsets (I think it was in 2008). They booked a huge auditorium and invited hundreds of people. When the demo started, it seemed to kind of work, but not really, and then characters started jumping around and it stopped working.

The reason, in the end, was that their algorithm relied on the timing alignment of packets of data from the headset (for frequency analysis, etc.), but the wireless microphones used for the presentation interfered with data transmission. A subset of the packets was lost, the algorithm didn't know, and the entire presentation and introduction of their product was a failure.
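The general fix is cheap: validate the incoming stream and attach a confidence to it before any analysis runs. A minimal sketch, assuming packets carry sequence numbers (this is not Emotiv's actual protocol):

```python
def validate_packet_stream(packets):
    """packets: list of (sequence_number, samples) from a wireless headset.
    Before running any frequency analysis, check for dropped packets so the
    algorithm can flag low confidence instead of silently producing garbage."""
    dropped = 0
    for prev, cur in zip(packets, packets[1:]):
        gap = cur[0] - prev[0]
        if gap != 1:                       # sequence numbers should be consecutive
            dropped += gap - 1
    total_expected = packets[-1][0] - packets[0][0] + 1
    confidence = 1.0 - dropped / max(total_expected, 1)
    usable = confidence > 0.95             # refuse to report results below this
    return confidence, usable
```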
Track confidence end-to-end
At Sticky, our API automatically calculates the convergence of the data and the confidence in the result for understanding engagement. This is very important because two elements in a data set can have results that differ numerically but are statistically the same; to declare the slightly higher one the winner would be incorrect.

Many algorithms output just a single number as a result. This leads to a significant weakness in the stability of outputs, as humans will interpret results that are actually all the same as differentiated, with a clear winner.
The same thing can happen with code that interprets an output. It is massively important that a system understands its confidence and differentiation in each dimension it analyzes.
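One simple way to avoid declaring false winners is to compare confidence intervals rather than raw means. A small, conservative sketch (not Sticky's actual API):

```python
def report_winner(name_a, mean_a, ci_a, name_b, mean_b, ci_b):
    """Each result is a mean plus a confidence-interval half-width.
    Only declare a winner when the intervals do not overlap; otherwise
    report the two results as statistically indistinguishable."""
    if abs(mean_a - mean_b) > (ci_a + ci_b):
        winner = name_a if mean_a > mean_b else name_b
        return f"{winner} is the clear winner"
    return f"{name_a} and {name_b} are statistically the same"
```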
Speech recognition needs confidence
One of the funniest things I have seen Amazon's Alexa do is reply, when it doesn't understand what you said, with "I will add [object] to your list." They have chosen a default action that is triggered when there is not enough confidence that something else was said.
Instead, it would be better to model the confidence in a statement and reply in a subtly human way: not just "I didn't understand that", but identifying the words that were understood and using them to move the conversation forward:
- "I missed some of the words you said, but was the gist that you would like an Uber sent to John's house?"
- "I missed some of the words you said. You said you want to order a pepperoni pizza. What was the second one you said?"
If the confidence in each measurement is attached to it, statements with partial confidence can actually be used successfully, just as humans do.
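For example, a recognizer that exposes per-word confidences can echo back the confident part and ask only about the gap. A hypothetical sketch:

```python
def clarify_or_act(words, act, threshold=0.8):
    """words: list of (word, confidence) pairs from the recognizer.
    If every word is confident, act on the utterance; otherwise echo back the
    confident part and ask about the missing piece, the way a person would."""
    confident = [w for w, c in words if c >= threshold]
    uncertain = [w for w, c in words if c < threshold]
    if not uncertain:
        return act(" ".join(confident))
    gist = " ".join(confident)
    return f"I missed some of what you said, but was the gist: '{gist}'?"
```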
Note: I have seen the number of times this happens go down massively since Alexa came out.
Takeaway: Track confidence all the way
The best way to implement this is to think of every output from an algorithm as both a value and a confidence interval that is passed forward. If the system understands this, it can both make automated guesses and also inform the user when it does not have enough confidence to continue.
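A minimal sketch of the pattern: every stage carries a value plus a confidence, downstream combinations can only lose confidence, and the system acts automatically only above a threshold:

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    value: float
    confidence: float        # 0..1, carried with every output

def combine(a: Estimate, b: Estimate, fn) -> Estimate:
    """Derive a new value from two upstream estimates; the result can never be
    more trustworthy than its weakest input (a simple, conservative rule)."""
    return Estimate(fn(a.value, b.value), min(a.confidence, b.confidence))

def act_on(est: Estimate, threshold=0.8):
    """Only act automatically when confidence is high; otherwise tell the user."""
    if est.confidence >= threshold:
        return f"Result: {est.value:.2f}"
    return "I don't have enough confidence in this result to continue."
```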
4. Teach the user to be successful
Every new type of technology has a period of time where its use transitions from novel to part of the human vocabulary. Until people inherently understand how it works, they will accidentally make it fail on a continuous basis.
This is why video game accessories have been a scary prospect for the last 20 years. Even if an accessory works perfectly, there are many cases where the support cost of users calling in to figure out how to plug it in, turn it on and hold it was higher than the revenue from the product.
Therefore, sometimes, rather than trying to perfect an algorithm to work every time, it is easier and more correct to train a user to not create an environment where the algorithm will fail.
- Optical mice don't work on many reflective surfaces, but we don't notice because we are trained to adjust quickly.
- Cameras don't work when pointed at the sun, but we move to get a better picture.
- Websites freeze many times per day and fail, but we just click to reload them.
- Kinect tells you that it can't find you in the image or to come closer.
The challenge is that with this new technology, we need to create a language to describe augmented reality interactions: a vocabulary both of standard ways to interact and of ways to fix the issues that arise.
Red shirts can ruin a Playstation demo

When I was interning at Playstation with Rick Marks in 2001, we were developing demos and use cases for the new EyeToy. The cameras were 640x480 RGB and did not have depth sensing or any of the high-resolution sensors we have now, so tracking objects was a major challenge. In some lighting, a red object might look grey or even blue. The actual controllers we had were colored balls or retroreflective strips: really basic by today's standards.

The problem is: what happens if someone walks into the room with a shirt or an object colored roughly the same as the controller you are using? Or what if the lighting changes?

Therefore, as part of the EyeToy setup, a simple gimmick was added where people center their colored controller on the screen and the system memorizes its color histogram relative to the rest of the scene. Not only did this remove the need for automatic object detection, it also trained people to understand that they needed good enough lighting and positioning to make the games work. They could even use a tennis ball to control the games if they lost their controller.

This then allowed the algorithm to focus on efficiently tracking a single object at high frame rates without significantly impacting game rendering frame rates (a key constraint for video games), rather than also searching for and trying to recognize other objects.
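For a modern flavor of the same trick, here is a sketch using OpenCV's histogram back-projection and CamShift. It is not Sony's original implementation, but it shows how a one-time user calibration keeps per-frame tracking cheap:

```python
import cv2

def calibrate(frame, box):
    """One-time user calibration: the player centers the colored controller in
    `box` (x, y, w, h) on screen, and we memorize its hue histogram."""
    x, y, w, h = box
    hsv = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

def track(frame, hist, window):
    """Per-frame tracking: back-project the memorized histogram and let
    CamShift follow the single calibrated object, leaving most of the
    compute budget for the game itself."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, window = cv2.CamShift(back_proj, window, term_crit)
    return window
```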
Calibrating the algorithm for EyeToy took into account the weakness of color/object recognition, teaching people to easily calibrate it themselves rather than trying to automatically figure it out.
Rick demoing EyeToy spell casting in 2001.

The world has come a long way in 16 years, and what was complex in 2001 is now trivial in 2016, but with higher expectations, the computational power necessary to perform real-time 3D analysis of scenes, voice recognition, etc. is more than 100x higher than it was then and growing rapidly. Rather than expecting our algorithms to be 100% magic, having the human involved can be the difference between success and failure.
Build it into your experience
Therefore, when designing algorithms, taking into account how you can correctly guide the user to make them successful can make a hard problem much easier than trying to work around it.
5. Design for your use case specifically
Many algorithms are designed to be the newest/greatest/perfect segmenter or analyzer of a data set rather than to match the actual use case. This isn't always the best way to do it.
A great example of someone doing this right is Leap Motion. They used a simple stereo pair of cameras and infrared LEDs (both have been around for 20+ years), and rather than trying to create a 3D image of the whole environment, they used the LEDs to illuminate just the person's hands and filtered out everything else they could. From the ground up, they designed both the hardware and algorithms to do one thing and do it well.
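The underlying idea can be sketched in a few lines: if the IR LEDs only light up nearby hands, a plain brightness threshold throws away most of the room before any expensive 3D work happens. This is an illustration of the design principle, not Leap Motion's actual pipeline:

```python
import numpy as np

def segment_hands(ir_left, ir_right, brightness_thresh=200):
    """Illustrative hand segmentation for a stereo pair of IR images.
    With the LEDs lighting only nearby hands, the hands are the brightest
    regions, so a simple threshold discards most of the room up front."""
    mask_left  = ir_left  > brightness_thresh   # keep only bright pixels, left view
    mask_right = ir_right > brightness_thresh   # keep only bright pixels, right view
    return mask_left, mask_right                # masks fed to the stereo matcher
```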
Leap Motion achieves a very high-resolution measurement of people's hands due to correctly linked hardware and software designed to take advantage of each other.
In the past, at the MIT Media Lab Robotic Life Group, I used Point Grey's stereo pair cameras for interaction analysis at both room level and hand level. It was never possible to get as high accuracy as we would have liked, because the system was designed to measure everything in the room and we then had to segment and filter out all the elements we didn't want. This caused significant aliasing, rough edges and much lower resolution data.
A 3D stereo pair or Kinect has much lower accuracy and detail because the hardware and software are a general solution, not tuned to a specific task.
By choosing to narrowly solve a problem and pair both the algorithms and hardware, Leap Motion was able to get 100x+ performance and accuracy out of a much cheaper system.
A parting thought on big data and schools
Just as we teach people geometry to help them understand logic, we now need to start people off young in understanding what real-world digital data means: both for their professional careers and, on the personal side, to know how to keep their own data safe.
It would be great to see us train the next generation on how to understand the world digitally and realistically - without cleaned data sets, without simplified word problems and without hiding the (very cool) realities of what we know about the world.