Why HCI products fail: 5 ways to do environment-first algorithm design

With VR and AR products coming to the mass market for the first time, we are going to run into a new generation of "almost awesome" experiences. Many of these will fail, or be delayed in going to market, because the algorithms that power the human-computer interaction won't work well enough in the real world. We need to change how we design HCI algorithms and systems to put the environment and the real world first.

Most of the people designing these algorithms and devices have come out of universities where they were trained in theoretical algorithms, OpenCV, university data sets and team projects. They are not trained in measuring living rooms, coffee shops, parties, triathlons and rain storms.

Most HCI algorithms (computer vision, motion/positioning, speech, ...) work great in the lab, but fail when deployed as a full-scale product or brought into the wild for consumers.

Developers usually start with a theoretical mindset and a data set recorded by universities or by "well-mannered" employees during early development. When the product is then used during beta testing, completely different issues arise that were never accounted for in the architecture of the algorithms.
Real-world applications never look as clean as official data sets.
There have been some painful algorithm failures in the last few years.
We need to teach developers how to create algorithms that, from the beginning, assume the world is a challenging place to measure and fully embrace the weaknesses of each sensor.

Below are 5 takeaways I have seen help in creating real-world algorithms for HCI products. A lot of this comes from messing it up myself, or from lessons one of my teams learned as we developed products over the last decade.

1. Embrace the noise: Use EVERYTHING you record

Visual tracking... of the entire environment

At Sticky, our webcam eye tracking algorithms are the most accurate in the world not just because our tracking of the face, eyes, pupils, etc. is the best it can be, but because we also take into account movement, lighting, head pose, subtle facial features and a myriad of other dimensions that maximize the ability to track people in the real world.
By tuning our algorithms to the environment people are really in, we were able to increase our data quality/accuracy by 8x.
People surf the internet with the lights off, walking around, on their bed, watching TV, etc - and the algorithm still has to work.
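To make the idea concrete, here is a minimal, hypothetical sketch (not Sticky's actual algorithm) of folding environment dimensions like lighting, head pose and motion into the output instead of assuming lab conditions; the feature names and thresholds are invented for illustration:

```python
# Hypothetical sketch: weight per-frame gaze estimates by environment quality
# instead of assuming lab conditions. Not Sticky's actual algorithm.
from dataclasses import dataclass


@dataclass
class Frame:
    gaze_x: float          # estimated gaze position (normalized screen coords)
    gaze_y: float
    brightness: float      # mean image brightness, 0..1
    head_yaw_deg: float    # head pose relative to the camera
    motion: float          # frame-to-frame pixel motion, 0..1


def frame_quality(f: Frame) -> float:
    """Down-weight frames recorded in poor conditions instead of discarding them."""
    lighting = min(f.brightness / 0.4, 1.0)            # dim rooms hurt pupil detection
    pose = max(0.0, 1.0 - abs(f.head_yaw_deg) / 45.0)  # extreme yaw hurts eye geometry
    stability = max(0.0, 1.0 - f.motion)               # walking/laptop bounce adds blur
    return lighting * pose * stability


def weighted_gaze(frames: list[Frame]) -> tuple[float, float]:
    """Quality-weighted average gaze over a short window of frames."""
    weights = [frame_quality(f) for f in frames]
    total = sum(weights) or 1e-9
    x = sum(w * f.gaze_x for w, f in zip(weights, frames)) / total
    y = sum(w * f.gaze_y for w, f in zip(weights, frames)) / total
    return x, y
```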

Brainwaves, heart rate, movement and all the noise

At EmSense, when we started developing our integrated EEG, accelerometer, pulse-ox, EMG, etc. headset for video games, the algorithms worked great when people sat still, but if they moved at all, or even if a person across the room moved, we would get interference. The reason was that our amplifiers had 35 nanovolt resolution, so literally waving your hand 5 feet away changes the voltage at the EEG sensors on someone's head. For the accelerometer, leaning back in the chair or having a train go by also impacts quality. With pulse-ox, moving around shifts the sensor on the head, creating changes that have nothing to do with blood flow.
Many signals need to be decomposed into multiple dimensions to split noise from signal and also isolate valuable deviations which predict a user's state. The biggest challenge for EEG is extracting stable data when users are moving and interacting with the real world.
One of the most interesting learnings was with respect to blinks and movement. In the beginning, we had set the goal of removing all noisy data so we would have a clean EEG trace for analysis. What we learned through testing across our first hundreds of people was that in many cases the noise was just as predictive as the brainwave content, because it was directly correlated with responses to the game.

The end result is we built a product used by EA, THQ, Activision and many other top-tier studios to predict the performance of their video games in market and optimize the experience to meet people's expectations.

To predict the engagement of video games, really simple metrics like these actually worked in addition to complex integrated dimensions:
  • How often did someone blink? Did they move their head? (The less often, the more engaged.)
  • Was there a delta from resting heart rate? (An adrenaline-inducing event increases heart rate.)
Much more complex algorithms around EEG, heart rate variability and muscle tension were also combined with segmentation of the actual experience, enabling predictions of the impact of each element of the experience.
By assuming a noisy environment, the EmSense EmBand took advantage of the environment to maximize both the quality and stability of sensing.
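A hedged sketch of how the simple metrics above could be combined; the thresholds and weighting are invented for illustration and are not EmSense's production model:

```python
# Hypothetical sketch of the "really simple metrics": blink rate, head movement,
# and heart-rate delta from a resting baseline. The point is that the "noise"
# channels themselves carry signal about engagement. Thresholds are made up.
import statistics


def engagement_score(blinks_per_min: float,
                     head_motion_per_min: float,
                     heart_rate: list[float],
                     resting_hr: float) -> float:
    """Combine simple noise-derived metrics into a rough engagement score (0..1)."""
    # Fewer blinks and less head movement -> more engaged.
    blink_term = max(0.0, 1.0 - blinks_per_min / 20.0)
    motion_term = max(0.0, 1.0 - head_motion_per_min / 10.0)
    # Adrenaline-inducing events push heart rate above the resting baseline.
    hr_delta = statistics.mean(heart_rate) - resting_hr
    hr_term = min(max(hr_delta / 20.0, 0.0), 1.0)
    return (blink_term + motion_term + hr_term) / 3.0
```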

2. Classify/segment the world before analyzing it

Machine learning seems like a magic tool, but if it is not used correctly (and it usually isn't), the wrong data gets analyzed and trained on, creating a poor predictor of the real world.

Therefore, it is really important to understand, analyze and prepare your data inputs in a logical way beforehand and be able to explain what the SVM or other algorithm is actually doing. Otherwise, it is probably doing something you don't want it to.

One of the most famous examples of this was a paper whose authors thought they had trained neural networks to correctly segment tanks from a background image. What they didn't realize was that all of the images with tanks were taken on a single, cloudy day, and the images of the forest alone were taken on another day. The paper actually made it through publication before this was identified. The result was that they had actually built a classifier for the weather.

Understand people's behavior and their world

For Fitbit, when we were designing an algorithm and API for geo-tracking and segmentation of people's lives, it became very clear that we had to create a classification, or understanding, of the world before we could extract valuable data from the details.

The reason is that complex outliers would end up dominating our results, so the best- and worst-rated days would happen not because of true fitness but because of tangential data errors:
  • Spurious readings from people going into tunnels and GPS updating with large jumps that looked like car rides would create "events" that seemed important but actually needed to be removed.
  • Some days, people would have good data on a Fitbit but the phone (GPS) was left in an office; other days were missing Fitbit data entirely while the GPS data existed. This would make us rate that day as a 0 day.
  • Some office buildings are large enough to be a university campus, so it would seem like someone was outside taking 12 walks in a day even though they were inside going from meeting to meeting.
  • Shorter people on a jog may look like taller people on a walk.
  • Part of the development team was in Ukraine, where GPS is significantly less accurate than in San Francisco, resulting in almost unintelligible raw data. Very different smoothing and clustering was necessary.
  • Buses in traffic can look like a stationary location or walking based on just GPS.
  • Multiple contradictory events can happen within a single minute of time - getting on a bus, walking out a door, having your cell phone on you for only 30 seconds of that minute.
  • Additionally, given the smartphone's very small battery, it isn't possible to leave high-resolution GPS on at all times or the battery dies in 3-4 hours, so a set of location hacks had to be implemented to lower the system's power drain by 10x.
Even though tracking algorithms work in the lab, real-world anomalies introduce errors that mask the actual state of the person. For example, the raw data from a person sitting in their office coding for 6 hours can look like they left the office multiple times.
The result was that the team had to implement a multi-stage filtering and clustering algorithm which removed outliers and conservatively filled in missing elements to create a base data set for the actual algorithm to analyze. Incorporating filters based on workplace, home, travel routes and other key predictive elements enabled additional cleaning of the data to a quality level that would not have been possible by looking at raw data alone.

The system then ran the real algorithms, created segments of the day that shared the same type of activity/location/etc., and used the improved raw data sets to calculate the final result.
By correctly taking into account the randomness of the environment, the system can cleanly segment and classify the actual experience of a user.
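A minimal sketch of the multi-stage idea (not Fitbit's production pipeline): first drop physically impossible GPS jumps, then segment the cleaned track into stationary vs. moving stretches before any higher-level analysis. The speed thresholds are illustrative assumptions:

```python
# Minimal sketch of a two-stage GPS cleanup: outlier removal, then segmentation.
# Not Fitbit's production pipeline; thresholds are illustrative.
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in meters."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def drop_impossible_jumps(points, max_speed_mps=45.0):
    """Stage 1: remove readings implying faster-than-highway speeds (tunnel exits, etc.).

    points is a list of (timestamp_seconds, lat, lon) tuples.
    """
    cleaned = [points[0]]
    for t, lat, lon in points[1:]:
        t0, lat0, lon0 = cleaned[-1]
        dt = max(t - t0, 1e-6)
        if haversine_m(lat0, lon0, lat, lon) / dt <= max_speed_mps:
            cleaned.append((t, lat, lon))
    return cleaned


def segment_by_movement(points, stationary_mps=0.5):
    """Stage 2: group consecutive readings into stationary vs. moving segments."""
    segments, current, current_label = [], [points[0]], None
    for prev, cur in zip(points, points[1:]):
        speed = haversine_m(prev[1], prev[2], cur[1], cur[2]) / max(cur[0] - prev[0], 1e-6)
        label = "stationary" if speed < stationary_mps else "moving"
        if current_label in (None, label):
            current.append(cur)
            current_label = label
        else:
            segments.append((current_label, current))
            current, current_label = [cur], label
    segments.append((current_label or "stationary", current))
    return segments
```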
The reason we knew all of these things very early on is that Fitbit does a great job of field testing new algorithms and systems in the rough, real world, which let the team start the project with multiple real, multi-week data sets of full data.

Segment signals at a high level to understand their context before analyzing

Especially with chaotic signals like EEG or 3D depth maps, understanding why and how you are segmenting objects really matters for creating a stable output. Many times the output from a sensor may be identical in two different situations because of a weakness in the sensor, so you need a third dimension to identify how to interpret the output.

7-12 Hz energy in EEG changes drastically both with how much people think and with whether their eyes are open or closed, so both dimensions need to be measured to create a stable predictor of cognitive state.
One example is the frequency content of EEG. One of the simplest measures of EEG is the alpha wave content (7-12 Hz). It is an indicator of cognitive load (how much someone is thinking) which is both easy to compute with a Fourier transform and predictive of things in the real world.

BUT... if someone has their eyes open vs. closed, the alpha frequency content changes drastically due to multiple factors related to visual processing. If an algorithm didn't know the eyes were closed, it would assume the person had just relaxed completely and changed their state. The actual answer is that they shut their eyes and are still in the same state; there is just an offset in the amplitude of those frequencies.
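A small sketch of the alpha-band idea: estimate 7-12 Hz power with a Fourier-based power spectral density and interpret it differently depending on an eyes-open/closed flag (which in a real system would come from a blink detector or camera). The correction factor is an invented placeholder:

```python
# Sketch: alpha-band (7-12 Hz) power as a load indicator, conditioned on eye state.
import numpy as np
from scipy.signal import welch


def alpha_power(eeg: np.ndarray, fs: float) -> float:
    """Mean power spectral density in the 7-12 Hz alpha band."""
    freqs, psd = welch(eeg, fs=fs, nperseg=int(fs * 2))
    band = (freqs >= 7) & (freqs <= 12)
    return float(psd[band].mean())


def cognitive_load(eeg: np.ndarray, fs: float, eyes_closed: bool,
                   closed_eye_gain: float = 3.0) -> float:
    """Crude load estimate: higher alpha ~ more relaxed, so load is inverse alpha.

    closed_eye_gain is a hypothetical correction: closed eyes inflate alpha power,
    so without the flag the algorithm would wrongly report deep relaxation.
    """
    power = alpha_power(eeg, fs)
    if eyes_closed:
        power /= closed_eye_gain
    return 1.0 / (1.0 + power)
```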

3. Calculate "confidence" in your result as a core part of your algorithm

One of the most embarrassing failures I have seen is when an algorithm falls apart and can't compute the right answer, but the implementation doesn't flag it, so a number comes out that is completely wrong.

What if you get incomplete data?

An example of this was Emotiv's GDC demo of their headsets (I think it was in 2008). They bought out a huge auditorium and invited hundreds of people. When the demo started, it seemed to kind of work, but not really, and then characters started jumping around and it stopped working.

The root cause was that their algorithm relied on the timing alignment of packets of data from the headset (for frequency analysis, etc.), but the wireless microphones being used for the presentation interfered with data transmission. A subset of the packets was lost, the algorithm did not know it, and the entire presentation and introduction of their product was a failure.

Track confidence end-to-end

At Sticky, our API automatically calculates the convergence of the data and the confidence in the result when measuring engagement. This is very important because two elements can produce results that look different but are statistically the same; to call the slightly higher one a winner would be incorrect.

Many algorithms output just a single number as a result. This is a significant weakness, because humans will interpret results that are actually all the same as differentiated, with a clear winner.

The same thing can happen with code that interprets an output. It is massively important that a system understands its confidence and differentiation in each dimension it analyzes.
The above graph shows the amount of visual engagement (actual seconds of eye tracking on each ad across an audience) for 4 Jose Cuervo ads. The best ad received 75% longer engagement, which will lead to a significantly more successful campaign. The second most successful ad (one from the right side) is actually statistically identical (look at the error bars).
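A minimal sketch of reporting a result as a value plus a confidence interval, and refusing to declare a winner when the intervals overlap; the statistics here are deliberately simplified and are not Sticky's actual methodology:

```python
# Sketch: report mean engagement with a confidence interval instead of a bare number,
# and don't call a winner when the intervals overlap. Simplified for illustration.
import math
import statistics


def mean_with_ci(samples: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return (mean, ~95% confidence half-width) for per-viewer engagement seconds."""
    m = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return m, half_width


def compare_ads(ad_a: list[float], ad_b: list[float]) -> str:
    mean_a, ci_a = mean_with_ci(ad_a)
    mean_b, ci_b = mean_with_ci(ad_b)
    if abs(mean_a - mean_b) <= ci_a + ci_b:
        return "statistically identical - do not call a winner"
    return "ad A wins" if mean_a > mean_b else "ad B wins"
```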

Speech recognition needs confidence

One of the funniest things I have seen Amazon's Alexa do is reply "I will add [object] to your list" when it doesn't understand what you said. They have chosen a default action that is triggered when there is not enough confidence that something else was said.

Instead, it would be better to focus on modeling the confidence in a statement and to reply in a subtly human way: not just "I didn't understand that", but identifying the words that were understood and using them to move the conversation forward:
  • "I missed some of the words you said, but was the gist that you would like an Uber sent to John's house?"
  • "I missed some of the words you said. You said you want to order a pepperoni pizza. What was the second one you said?"
If the confidence in each measurement is attached to it, statements with partial confidence can actually be used successfully, just as humans do.
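A hypothetical sketch of that approach, assuming the recognizer returns per-word confidences; the threshold and phrasing are illustrative, not any specific assistant's behavior:

```python
# Sketch: use per-word confidence to keep the conversation moving instead of
# discarding the whole utterance. Threshold and wording are illustrative.
def respond(words: list[tuple[str, float]], min_conf: float = 0.6) -> str:
    """words is a list of (word, confidence) pairs from the recognizer."""
    confident = [w for w, c in words if c >= min_conf]
    if not confident:
        return "Sorry, I didn't catch that. Could you say it again?"
    if len(confident) < len(words):
        gist = " ".join(confident)
        return f"I missed some of the words you said, but was the gist: '{gist}'?"
    return f"Got it: '{' '.join(confident)}'."
```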

Note: I have seen the number of times this happens go down massively since Alexa came out.

Takeaway: Track confidence all the way

The best way to implement this is to think of every output from an algorithm as both a value and a confidence interval that is passed forward. If the system understands this, it can both make automated guesses and inform the user when it does not have enough confidence to continue.
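A minimal sketch of what passing a value and confidence forward could look like; the names and threshold are assumptions for illustration:

```python
# Sketch: every stage returns (value, confidence); downstream code decides whether
# to act, guess, or ask the user. Names and threshold are illustrative.
from dataclasses import dataclass


@dataclass
class Estimate:
    value: float
    confidence: float  # 0..1, carried forward through every stage


def combine(stages: list[Estimate]) -> Estimate:
    """A pipeline's confidence is only as good as its weakest stage."""
    value = stages[-1].value
    confidence = min(s.confidence for s in stages)
    return Estimate(value, confidence)


def act_on(result: Estimate, threshold: float = 0.7) -> str:
    if result.confidence >= threshold:
        return f"proceed with {result.value:.2f}"
    return "ask the user instead of guessing"
```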

4. Teach the user to be successful 

Every new type of technology has a period of time where its use transitions from novel to part of the human vocabulary. Until people inherently understand how it works, they will accidentally make it fail on a continuous basis.

This is why video game accessories have been a scary prospect for the last 20 years. Even if they work perfectly, there are many cases where the support cost of users calling in to figure out how to plug it in, turn it on and hold it is actually higher than the revenue from the product.

Therefore, sometimes, rather than trying to perfect an algorithm to work every time, it is easier and more correct to train a user to not create an environment where the algorithm will fail.

  • Optical mice don't work on many reflective surfaces, but we don't notice because we are trained to adjust quickly.
  • Cameras don't work when pointed at the sun, but we move to get a better picture.
  • Websites freeze many times per day and fail, but we just click to reload them.
  • Kinect tells you that it can't find you in the image, or asks you to come closer.

The challenge with this new technology is that we need to create a language to describe augmented reality interactions, and a vocabulary both of standard ways to interact and of ways to fix the issues that arise.

Red shirts can ruin a PlayStation demo

When I was interning at PlayStation with Rick Marks in 2001, we were developing demos and use cases for the new EyeToy. The cameras were 640x480 RGB and did not have depth sensing or any of the high-resolution sensors we have now, so tracking objects was a major challenge. In some lighting, a red object might look grey or even blue. The actual controllers we had were colored balls or retroreflective strips - really basic by today's standards.

The problem is what happens if someone walks into the room with a shirt or object colored roughly the same as the controller you are using? Or what if the lighting changes?

Therefore, as part of the EyeToy setup, a simple gimmick was added: people center their colored controller on the screen, and the system memorizes its color histogram relative to the rest of the scene. Not only did this remove the need for automatic object detection, it also ended up training people to understand that they needed good enough lighting and positioning to make the games work. They could even use a tennis ball to control the games if they lost their other controller.

Calibrating the EyeToy algorithm took the weakness of color/object recognition into account, teaching people to easily calibrate it themselves rather than trying to figure everything out automatically.
This then allowed the algorithm to focus on efficiently tracking a single object at high frame rates without massively impacting game rendering frame rates (a key requirement for video games), rather than also searching for and trying to recognize other objects.
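A rough sketch of that style of calibration using OpenCV (my reconstruction, not the actual EyeToy code): memorize a hue histogram from a calibration box the player centers the object in, then track only that histogram with back-projection and mean shift each frame:

```python
# Sketch of EyeToy-style calibration and tracking, reconstructed with OpenCV.
import cv2
import numpy as np


def calibrate(frame_bgr: np.ndarray, box: tuple[int, int, int, int]) -> np.ndarray:
    """Build a hue histogram from the calibration box the player centers the object in."""
    x, y, w, h = box
    hsv = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [32], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist


def track(frame_bgr: np.ndarray, hist: np.ndarray,
          window: tuple[int, int, int, int]) -> tuple[int, int, int, int]:
    """Follow the calibrated color with back-projection + mean shift; cheap per frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, window = cv2.meanShift(backproj, window, criteria)
    return window
```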

Rick demoing EyeToy spell casting in 2001.
The world has come a long way in 16 years, and what was complex in 2001 is now trivial in 2016. But with higher expectations, the computation power necessary to perform real-time 3D analysis of scenes, voice recognition, etc. is more than 100x higher than it was then and growing rapidly. Rather than expecting our algorithms to be 100% magic, having the human involved can be the difference between success and failure.

Build it into your experience

Therefore, when designing algorithms, taking into account how you can correctly manipulate the user into being successful can make a hard problem much easier, rather than trying to work around it.

5. Design for your use case specifically

Many algorithms are designed to be the newest/greatest/perfect segmenter or analyzer of a data set rather than to match the actual use case. This isn't always the best way to do it.

A great example of someone doing this right is Leap Motion. They used a simple stereo pair of cameras and infrared LEDs (technology that has been around for 20+ years) and, rather than trying to create a 3D image of the whole environment, used the LEDs to illuminate just the person's hands and started by filtering out everything else they could. From the ground up, they designed both the hardware and algorithms to do one thing and do it well.

Leap Motion gets a very high-resolution measurement of people's hands because the hardware and software were designed together to take advantage of each other.
In the past, at the MIT Media Lab's Robotic Life Group, I used Point Grey's stereo-pair cameras for interaction analysis at both room level and hand level. It was never possible to get as high an accuracy as we would have liked, because the system was designed to measure everything in the room and we then had to segment and filter out all the elements we didn't want. This caused significant aliasing, rough edges and much lower-resolution data.
A 3D stereo pair or Kinect has much lower accuracy and detail because the hardware and software are a general solution - not tuned to a specific task.
By choosing to narrowly solve a problem and pair both the algorithms and hardware, Leap Motion was able to get 100x+ performance and accuracy out of a much cheaper system.

A parting thought on big data and schools

Just as we teach people geometry to enable them to understand logic, we now need to start people off young understanding what real-world digital data means - both for their professional careers and, on the personal side, so they know how to keep their own data safe.

It would be great to see us train the next generation on how to understand the world digitally and realistically - without cleaned data sets, without simplified word problems and without hiding the (very cool) realities of what we know about the world.