We sought to explore how much better participants could understand intelligent, decision-based applications when provided with explanations. In particular, we investigated differences in understanding and resulting trust when participants received one of four types of explanations compared to receiving no explanations (None). The four explanation types correspond to answers to the following question types:
- Why did the application do X?
- Why did it not do Y?
- How (under what conditions) does it do Y?
- What would happen if there were a change W?
We showed participants an online abstracted application with anonymous inputs and outputs and asked them to learn how the application makes decisions after viewing 24 examples of its performance. The 158 recruited participants were evenly divided into five groups: four groups each received one of the four explanation types, and one group received no explanations. We subsequently measured their understanding by testing whether they could predict missing inputs and outputs in 15 test cases, and by asking them to explain how they thought the application reasons. We also measured their level of trust in the application's output.
We found that participants who received Why and Why Not explanations understood the application better and trusted it more than those who received How To and What If explanations.
Method
To simulate a context-aware application, we defined a wearable device that detects whether a user is exercising, based on three sensed contexts: Body Temperature, Heart Rate, and Pace. The application uses a decision tree to make its decisions.
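As a rough illustration, such a decision tree can be pictured as a few nested threshold tests over the sensed contexts. The structure and threshold values in this sketch are illustrative assumptions, not the actual tree used in the study:

```python
# Minimal sketch of the kind of decision tree the simulated application uses.
# The structure and threshold values are illustrative assumptions, not the
# actual tree from the study.

def is_exercising(body_temp: float, heart_rate: float, pace: float) -> bool:
    """Classify whether the user is exercising from three sensed contexts."""
    if heart_rate > 100:          # elevated heart rate
        if pace > 6:              # moving briskly (e.g., jogging)
            return True
        return body_temp > 37.5   # slow or stationary but warm (e.g., lifting weights)
    return False

# Example reading from the three sensors
print(is_exercising(body_temp=38.0, heart_rate=120, pace=8.0))  # True
```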
Since we are concerned with the information the application provides rather than its user interface, we showed participants a simple “black box” user interface (UI) that displays only the inputs and outputs.
We found that, with the concrete domain information, users depended heavily on their prior knowledge of how a device could detect exercise and gained little from the various explanations the application provided. We therefore abstracted away the domain information and represented the inputs and outputs as anonymous symbolic labels. The following figure shows the UI with the anonymized model.
To allow participants to learn how the application makes decisions, we showed them the black box interface with 24 examples of different inputs and the resulting outputs. Some participants received one of the four explanation types (Why, Why Not, How To, What If), and those in the baseline condition received no explanations (None). Why and Why Not explanations are plain textual descriptions, while How To and What If explanations require users to interact with the interface to construct a query (see the sketch after the table below).
| Explanation type | Example shown to participants |
| --- | --- |
| Why | Output classified as b because A = 5, and C = 3 |
| Why Not | Output not classified as a because A = 5, but not C > 3 |
| How To | (screenshot of interactive query interface) |
| What If | (screenshot of interactive query interface) |
| None | No explanation provided |
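To make the difference between the two textual explanation types concrete, the sketch below shows one way Why and Why Not text could be generated from a rule trace over the anonymized inputs. The rule set here is an assumption chosen only to match the example wording in the table, not the study's actual model:

```python
# Hypothetical rule set over the anonymized inputs A, B, C; the conditions are
# chosen only to mirror the example explanations in the table above.
RULES = {
    "a": [("A", ">=", 5), ("C", ">", 3)],
    "b": [("A", ">=", 5), ("C", "<=", 3)],
}

OPS = {">=": lambda x, y: x >= y, ">": lambda x, y: x > y, "<=": lambda x, y: x <= y}

def classify(inputs):
    """Return the first output label whose conditions all hold."""
    for label, conditions in RULES.items():
        if all(OPS[op](inputs[var], val) for var, op, val in conditions):
            return label
    return None

def explain_why(inputs, label):
    """Why was the output classified as `label`? Cite the relevant input values."""
    facts = ", ".join(f"{var} = {inputs[var]}" for var, _, _ in RULES[label])
    return f"Output classified as {label} because {facts}"

def explain_why_not(inputs, other):
    """Why was the output NOT classified as `other`? Cite the failed conditions."""
    held = [f"{var} = {inputs[var]}" for var, op, val in RULES[other] if OPS[op](inputs[var], val)]
    failed = [f"{var} {op} {val}" for var, op, val in RULES[other] if not OPS[op](inputs[var], val)]
    return ("Output not classified as " + other + " because " + ", ".join(held)
            + ", but not " + " and not ".join(failed))

inputs = {"A": 5, "B": 1, "C": 3}
print(explain_why(inputs, classify(inputs)))  # Output classified as b because A = 5, C = 3
print(explain_why_not(inputs, "a"))           # Output not classified as a because A = 5, but not C > 3
```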
Screenshot of an example during the learning phase, showing an explanation and a note-taking text box (provided for participants’ convenience).
After the learning phase, participants were tested on their understanding with a two-part quiz. First, they answered 15 test cases in which at least one input or output was left blank, filling in the value they thought should fit.
Second, they were shown three complete examples and asked to explain how they thought the application decided on the outputs.
| Fill-in-the-Blanks Test | Reasoning Test |
| --- | --- |
| (screenshot) | (screenshot) |
Measures
We are interested in measuring the extent of participants’ understanding, and how that impacts their task (quiz) performance and their trust in the application’s output.
| Measure | Operationalization |
| --- | --- |
| Performance | Quiz answer correctness; quiz completion time |
| Understanding | Posited reasons; Why and Why Not reasoning; mental model of how the application decides |
| Trust | Reported trust |
User reasoning and mental models were coded with the following coding scheme:
| Code | Description |
| --- | --- |
| Guess / Unintelligible | No reason given, guessed, or reason incoherent |
| Some Logic | Some logic posited, e.g., inputs odd/even, one input largest |
| Inequality | Mentioned inequalities, but with wrong values or relations |
| Partially Correct | At least one wrong or extraneous inequality |
| Fully Correct | Two correct inequalities |
Participants
We recruited 53 participants for Experiment 1 and 158 participants for Experiment 2 from Amazon Mechanical Turk.
Results
For the concrete exercise-detection application, the results indicate that participants understood the application’s decisions better when provided with explanations. However, there were no significant differences across explanation types.
When more explanation types were tested with the abstract application, participants who received Why and Why Not explanations understood the application better than those who received How To and What If explanations, and their trust was similarly affected.
Discussion
Why vs. Why Not
Examining the users’ reasons, we found that the automatically generated Why explanations allowed users to understand more precisely how the system functions for individual instances than Why Not explanations did. Why Not participants tended to learn only part of the reasoning trace; they did not associate the two rules together but treated them separately. This failure of rule conjunction could be due to the negative wording (i.e., “but” and “not”) in the Why Not explanation. The mental effort needed to understand the Why Not explanation and form such a rule conjunction is greater than that required in the Why condition, which could explain the differences we observed.
Why & Why Not vs. How To & What If
We believe participants did not perform as well with the interactive explanation facilities because of their unfamiliarity with, and the complexity of, these interfaces. Some participants indicated that they did not understand how to use them.
Our results suggest that developers should provide Why explanations as the primary form of explanation, and Why Not as a secondary form if one is provided. Our results may suggest that How To and What If explanations are ineffective, but these intelligibility types may be more useful for other kinds of tasks, particularly those related to figuring out how to execute certain system functionality rather than interpreting or evaluating it.
Impact of Prior Knowledge
We found in Experiment 1 (the concrete application) that participants formed less accurate and less precise mental models of the system than those in Experiment 2 (the abstract application). This could be due to participants applying their prior knowledge of exercising to understand how the system works and not paying careful attention to the explanations, as evidenced by the reasons they provided.
From the Lab to the Real World
Even though Why Not explanations were not as effective as Why explanations, we believe such explanations would be important for real-world applications. In reality, users would ask Why questions when they lack an understanding of how the application works, but Why Not questions when they expect certain results that the application does not produce.
Implications for Context-Aware Applications
(See publication).
Publication
Lim, B. Y., Dey, A. K., Avrahami, D. 2009.
Why and Why Not Explanations Improve the Intelligibility of Context-Aware Intelligent Systems.
In Proceedings of the 27th International Conference on Human Factors in Computing Systems (Boston, MA, USA, April 04 – 09, 2009). CHI ’09. ACM, New York, NY, 2119–2128. DOI=10.1145/1518701.1519023.
Best Paper Honourable Mention (Top 5%).