
10. Embodied Interaction

Published on Apr 27, 2020

In the physical world, our interactions are embodied. Much of our communication is not in the words we speak, but in our gestures, movements, and facial expressions. Even before we say a word, our appearance tells much about who we are: faces often reveal gender, age, and race, while clothing and hairstyle convey economic status and cultural affiliations. Our movements communicate. We stand close to people we feel very comfortable with, and we back away from those we find off-putting or aggressive. Our expressions help convey how we feel about what we are saying or hearing, and they provide important cues for others about how to respond. Am I looking intently at you as you speak, or is my attention wandering as I steal glances at my watch?

In previous chapters, we focused on words, looking at ways of visualizing and extracting patterns from text conversations—a form of communication in which gaze, gesture, and other forms of nonverbal expression are for the most part missing. This chapter looks at the issues involved in bringing these inherently physical forms of interaction to the mediated world.

Text is an excellent medium. It is easy and convenient to both read and write; it is good for reviewing one’s words before sending them off; and it affords anonymity if needed. But the information we convey by text is very different from what we communicate by nodding, by looking in a particular direction, and by the subtle changes in our expression as we listen or our voice as we speak. Text is deliberate, and although it can convey nuance beautifully, making it do so requires considerable effort. With our nonverbal physical actions, we express attention, emotion, and so on almost subconsciously; indeed, it requires more effort to conceal our feelings than to reveal them.

Interfaces in which each participant has some visible representation—what I am calling “embodied interactions”—provide a medium for conveying these nonverbal aspects of communication. These representations also convey presence, the feeling that you are “with” the others in a common space. But, along with their richer communication channel and sense of presence, embodied interfaces bring new design challenges and problems. The first challenge is sensing the actions: how does the medium sense the person’s emotions or attention? The second is display: how does the medium convey what was sensed from one person to another?

We will first discuss approaches based on sensing and conveying the participants’ appearance. Most straightforward is video, recording how one person appears and sending this image to another; it transmits emotions, for example, by showing the person’s face and expressions. It holds the promise of recreating the experience of “being there” as faithfully as possible. Yet simple video has a big drawback for interactive communication—it lacks a common virtual space, and thus gaze and its attendant cues of attention and interest are misleading. The ultimate solution is to create a virtual reality (VR) system, which senses the movements of each participant, and re-renders them in a common, virtual space. Along with solving the common space problem, VR opens the possibility of transformation: changing the person’s appearance to mask identity or to explore creative representations. However, VR is burdensome, requiring participants to don special equipment and be in a controlled space; it is fascinating for special purposes, but impractical for everyday communication.

We will next look at approaches that sense nonverbal communication in other ways, ranging from keyboard input to biometric sensing. Here, the participants are represented by avatars, and questions about avatar design, such as the benefits of realism versus abstraction and the representation of nonverbal cues, are central. A problem with most avatar interfaces is the mismatch of input to display: they convey more than the user can control. Humanlike forms in particular portray a tremendous amount of social information, but when the avatar has expressive features (e.g., a smiling face, a slouchy posture) that the user did not prompt, it can be worse than no image at all, for it gives a misleading impression. We will look at an alternative approach: very simple, minimalist avatars that existing input streams can fully control. Our challenge here is designing the interface to make these expressive.

Ultimately, communication is about conveying the thoughts, feelings, and intentions of one person to another. We gather clues by reading—and reading into—people’s words and by scrutinizing their expressions. But whether we gather them at a distance or face to face, from deliberately typed text or a subconscious frown, these cues are still surrogates, indicators of what we want to know, not the thing in itself. What if we could perceive the feelings of others more directly? Arguably, we can, using biosensing techniques such as measuring skin conductivity, heart rate, or, most directly, brain activity.

A communication system that uses biosensed data as its input is radically different from traditional communication forms. Unlike gestures and expressions, which evolved to communicate, these forms of direct sensing did not. Though it may take effort, we can usually master our expressions—and many other common media, such as written text, are far easier to control. In contrast, our bodies’ internal physiological response to—indeed its formation of—our thoughts and feelings is far less manipulable. Does such a medium open a new pathway to empathy, or does it invade the privacy of anyone who uses it? We can design embodied interactions with a wide range of input controls, and the chapter will conclude with a discussion of the social and communicative implications of the range of possible input technologies for controlling online representations.

The Allure and Challenge of “Being There”

People like seeing other people. Though we may complain about crowds, most of us are drawn to places buzzing with human activity. We derive comfort from the presence of others and stimulation from seeing and processing the wealth of information that people’s faces, clothing, and actions provide. When we speak in the physical world, we know who is listening, whether it is one companion or an audience of hundreds.1 We can see if our listeners are attentive or distracted, whether they are nodding in agreement or looking doubtful. In face-to-face conversation, listeners as well as speakers continuously convey information.

Online text conversations are very different. The audience is invisible. There is no shared space, no gestures, and no visible identity. Even in synchronous text chats, where all the participants are online at the same time, there is relatively little sense of presence. You are visibly present so long as you are typing, but quickly fade away if you are inactive, even if you are in fact paying close attention. In some forums, an effectively invisible audience of lurkers makes up about 90 percent of the participants (Nonnecke and Preece 2003). Thus, although text conversations have many benefits, from their simplicity and clarity to their freedom from distracting and potentially misleading social cues, there are also reasons to want interfaces with embodied interactions, ones that approach the richness of being in the presence of other people.


Reducing information—and thus changing the available social cues and mores—is another way for video conferencing to go “beyond being there.” In 2010 a site called “Chatroulette,” a video chat system that connected people with random strangers, enjoyed sudden popularity. You saw yourself in one window and the stranger in the other. You could chat, and either party could disconnect with a “next” button, at which point the system connected you to a new random stranger. Simple as it was, Chatroulette fascinated people: it was the fastest rising search term on Google in 2010. What was its appeal? It resembled the random text chat-rooms that were popular in the early days of online socializing, where people went online using easily replaceable pseudonyms and chatted with strangers, making up identities as they went. Here, there was a voice and face, but the absence of names, the searchable identifiers that tie an online presence to a real-world person, lent it an aura of anonymity.
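The matching mechanism described here is simple enough to sketch in a few lines. The Python below is only an illustration under stated assumptions, not Chatroulette's actual implementation: the class `RandomPairing`, its method names, and the opaque session identifiers are all hypothetical. What it tries to show is how little the system needs to know: it records who is currently connected to whom and nothing else, so a "next" simply discards the pairing and draws another stranger at random.

```python
import random
from typing import Dict, List, Optional


class RandomPairing:
    """Toy model of random stranger matching (hypothetical, for illustration).

    The only state is the current pairing. No names, history, or reputation
    are stored, so nothing carries over from one encounter to the next.
    """

    def __init__(self, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self._waiting: List[str] = []        # sessions with no partner
        self._partner: Dict[str, str] = {}   # session -> current partner

    def _match(self, session: str, exclude: Optional[str] = None) -> Optional[str]:
        """Pair `session` with a random waiting stranger, or make it wait."""
        candidates = [s for s in self._waiting if s != exclude]
        if not candidates:
            self._waiting.append(session)
            return None
        other = self._rng.choice(candidates)
        self._waiting.remove(other)
        self._partner[session] = other
        self._partner[other] = session
        return other

    def join(self, session: str) -> Optional[str]:
        """A new visitor arrives; return the partner found, if any."""
        return self._match(session)

    def press_next(self, session: str) -> Optional[str]:
        """Either party may end the chat; both are then re-matched."""
        old = self._partner.pop(session, None)
        if old is None:
            return None  # not in a conversation, so nothing to skip
        self._partner.pop(old, None)
        new = self._match(session)
        # The abandoned partner is re-queued, but never straight back
        # to the person who just skipped them.
        self._match(old, exclude=session)
        return new

    def partner_of(self, session: str) -> Optional[str]:
        """Return the current partner of `session`, or None."""
        return self._partner.get(session)
```

For instance, after `join("a")`, `join("b")`, and `join("c")`, sessions `a` and `b` are paired and `c` is waiting; a call to `press_next("a")` then connects `a` to `c` and returns `b` to the waiting pool.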

Chatroulette required little work. You didn’t need to be witty, or make the effort to invent a character, or even type. You just needed to appear, and there was enough interest in the appearance of others to make it work. Of course, not everyone is interested in everyone else. One visiting journalist (Anderson 2010) reported how dispiriting it was to get onto Chatroulette and then be repeatedly “nexted” as people swiftly found nothing of interest in his single, male, older-than-thirty self (his wife was much more successful). Chatroulette’s combination of video and anonymity meant that the deviant behavior of similar online text spaces could play itself out visually and vividly: there were many exhibitionists, seekers of anonymous virtual sex, and people hoping to shock or disgust anyone who chanced to be connected to them. Although Chatroulette’s moment of popularity has ended, it is relevant here because it is an intriguing example of how removing information—in this case, the identity of the other, including any way of building up a history or reputation—creates a very new-seeming medium. The newness here is the opportunity to see and interact with others, minus the ability to affect them or know much of anything about them.


Remote presence captures our imagination. One of the most famous scenes in the Star Wars movies shows Princess Leia as a holographic image, begging Obi-Wan Kenobi for help. This archetypal science-fiction technology also appeared in the 2008 election coverage, when CNN showed a “hologram” of distant guests who appeared on host Wolf Blitzer’s set as if present. Though shown in the supposedly factual world of news coverage, it was fiction: actual holographic presence is well beyond current technology (the guests were actually blue-screened into the video image using numerous camera views; the host simply saw a red dot that helped him direct his gaze as if he were seeing a projection of his guest [Welch 2008]). Although this episode did not display an existing (or even a foreseeably plausible) technology, it did demonstrate our fascination with the dream of realistic telepresence.

In practice, realistic telepresence—faithfully recreating the experience of “being there”—is technologically very difficult. Making a real two- (or more) way holographic communication system would require capturing highly detailed three-dimensional visual information about every participant and transmitting it to all the others, appropriately reconstructed for their perspective. Recording and transmitting the data is challenging but solvable. The problem of creating a common space and perspective, however, presents basic conceptual difficulties.

The fictional “holograms” we have seen in Star Wars and on television dazzle the viewer, but they omit a key element: what does the holographic visitor see? In CNN’s sketch, what was Jessica Yellin, the reporter, supposedly seeing when her hologram stood in front of Wolf Blitzer? To experience “being there,” she would need a representation of everyone at the studio: if that freestanding, three-dimensional “hologram” were real, the distant guest herself would see nothing unless there was equivalent “hologramming” of the host and his surroundings. As depicted, even if Wolf Blitzer actually saw Yellin instead of the red circle on the floor that, in fact, stood in for her, it would be a one-way experience with no shared space, no common ground, no way for gaze to be meaningful.

Indeed, these fictional depictions of telepresence may seem intuitive to the casual viewer, but they are conceptually based on teleportation, a technology that is even more fantastical (and far from achievable) than holographic communication. They function as if what had been transmitted was not simply one’s appearance and sound but a stand-in for the actual body (albeit one with scan-lines and blurry edges).

Teleportation, the ability to conquer distance by instantly appearing in another place, has been a dream of people from ancient to modern times. Several hundred years ago, the genies in the Arabic tale Aladdin and the Magic Lamp transported people instantly from one place to a distant other. More recently, the TV show Star Trek frequently featured teleportation, and in the popular Harry Potter books expert wizards can “apparate”—move instantaneously from one place to another. Yet, outside of fiction, the laws of physics make teleporting humans impossible (Davis 2003).

Holographic communication does not suffer from the fundamental physical impossibilities that plague teleportation, but it is still an extremely distant goal. In the meantime, people do the next-best thing given existing technologies, and use video to recreate the experience of “being there.”

The Next-Best Thing to “Being There”?

The telephone was invented in 1876; three years later, Punch’s Almanack published a cartoon envisioning a “telephonoscope” that would transmit pictures as well as sound (see figure 10.1). Seeing distant friends and family, as well as hearing them, was a popular element in futuristic scenarios. A 1910 French postcard from a series called En l’an 2000 (In the year 2000) shows a man conversing with a far-off woman whose image is projected on a screen (see figure 10.2). Many other predictions of its day still exist only in fantasy, but video telephony arrived early, well before 2000 (Fischer 1994). Prototypes existed in Germany in the late 1930s; the first commercial version, the Picturephone, came out in the 1960s. Although the Picturephone never became successful and was discontinued in the 1970s, there have since been many technological improvements, and today video telephony is commonplace. For business, there are elaborate conferencing rooms with large screens and high bandwidth. For everyday use, webcams make video chat easy and inexpensive, and camera-equipped mobile phones make video calls possible everywhere.

Figure 10.1

George du Maurier, Edison’s Telephonoscope (1879). From Punch’s Almanack for 1879. The caption reads:

(Every evening, before going to bed, Pater- and Materfamilias set up an electric camera-obscura over their bedroom mantel-piece, and gladden their eyes with the sight of their Children at the Antipodes, and converse gaily with them through the wire.)

Paterfamilias (in Wilton Place). “Beatrice, come closer. I want to whisper.”

Beatrice (from Ceylon). “Yes, Papa dear.”

Paterfamilias. “Who is that charming young Lady playing on Charlie’s side?”

Beatrice. “She’s just come over from England, Papa. I’ll introduce you to her as soon as the Game’s over!”

Video telephony has several noted benefits. Seeing the other is a great advantage when there is a deep emotional bond. Here, where one would ideally have the real person to see and hold, the increase in media richness—in getting a picture as well as sound—is valuable (Ames et al. 2010; O’Hara, Black, and Lipson 2006). It is also useful when the other is a stranger, to get a sense of meeting him or her. It is popular with the deaf community, who communicate via sign language. And it does help people understand and react to pauses and other disruptions in a conversation and be more interactive listeners, able quietly to convey understanding and reaction (Hirsh, Sellen, and Brokopp 2005; Isaacs and Tang 1994; O’Hara, Black, and Lipson 2006; Whittaker 1995). Finally, it is useful for showing things to the other person, rather than simply describing them with words.

Yet, while certainly more popular than in the days of the Picturephone, video telephony remains a minor form of communication (Hirsh, Sellen, and Brokopp 2005). This seems, at first glance, to be surprising. The face is the locus of identity, and its expressions, along with our range of other gestures, are richly communicative. The ability to see each other while speaking should greatly improve our interactions. Why, then, is a technology that provides this ability so modestly successful? Is the flaw technical—a problem of cost or bandwidth? Is it that there are too few people to call, the sort of flaw that better and more widespread technology would fix? Or is there a deeper issue in the design and experience of remote video communication?

Figure 10.2

Villemard, Imagining Video-Telephony as It Would Be in 2000 (1910).

Conceptually, video telephony is very appealing, but in practice, it is awkward and often not well liked. Video reveals where you are and how you are dressed. When we meet face to face, we are, by definition, in the same place. But on the phone, our situations may be mismatched: you may be in your office, dressed for work, while I am working at home in sweats, with a pile of dirty dishes in the background. In an audio call, I can project a professional aura with a businesslike tone and command of the discussion material, but for a video call, I would need to prepare my appearance and background as well. This demand may be reasonable for some situations, such as when the video call’s purpose is to meet a new person virtually, but it is too onerous for other conversations. Video also constrains our motion. In face-to-face meetings, we can look around the room; sharing the environment, our companions can follow our gaze and understand our shifting attention. A video meeting roots us to our desks, staring at the camera. On an audio-only call, people walk and talk; they wash dishes, water plants, walk to work. Although technically one could participate in a video call while walking around with a mobile phone, as one participant in a study noted, no one wants to “be a prat and walk into a lamp post” (O’Hara, Black, and Lipson 2006).

Coherent communication requires common ground. In the physical world, our interactions occur in a shared space. If a passing distraction interrupts your thoughts as you speak, I understand what is happening since I can see it too. The shared environment is a stage for our interactions. We move within the space to enter and leave conversations; we face people to pay attention to them and turn our back on those we want to ignore. In audio calls, people create a common space in their imagination: watch someone talking on the phone as he nods and gestures at his partner, who of course cannot see him. Mobile phone users often struggle to manage their dual existence in the real space of their physical surroundings and the conceptual space of the call (Ling 2002).

Video does create common ground when used specifically to invite the other to see something in one’s own space: a prototype you are both designing, the cute new kitten, scenes from a concert. Here it functions as a one-way transmission of images rather than a conversation, a function that has proved to be relatively successful (O’Hara, Black, and Lipson 2006). When video is used for two-way conversation, however, it introduces a new and conflicting visual space, common to neither (Hirsh, Sellen, and Brokopp 2005). This conflict is most apparent in problems with gaze in video communication.

One of the most important nonverbal cues is gaze.2 Gaze conveys personality. Someone who is shy or otherwise uncomfortable may have trouble making eye contact, whereas someone who is more aggressive or intent on appearing sincere may hold eye contact for uncomfortably long stretches. Gaze can convey aggression or modesty: it shows how willing you are to follow social rules. It is rude, for instance, to focus on your conversational partner’s cleavage, no matter how alluring. Gaze also helps manage conversations, and its choreography goes far beyond simply looking at the person you are conversing with. We use gaze to manage turn-taking, to show that we are considering what to say next, to wordlessly convey that we are listening. In a two-person conversation, the speaker usually looks at her companion only about one-third of the time, while the listener looks at the speaker almost twice as much, but far from a nonstop stare (Argyle 1993). This is partly because looking at another person is cognitively demanding: there is a lot to observe and think about when we see a face. Thus, as we speak, we often look away in order to focus on preparing our words rather than assessing our listener. Though the listener may look away for a moment, if his gaze does not return once the speaker has finished it will seem as if he had ceased paying attention and was lost in his own thoughts.

Gaze shows our reactions to the world around us. If I hear a noise, I look in that direction; my gaze provides a cue about where my attention lies. In person, I barely notice if you look down as you pick up a pencil or glance at the window as a truck goes by; these changes of gaze are unremarkable given what I can see of our shared environment. If we were talking via video, however, you would seem to me to be looking inexplicably away or to have disappeared from the screen entirely. Seen through the limited window of the screen, your ordinary responses to your environment seem disruptively odd. To seem attentive on video, we need to be more still and focused than we normally are in unmediated encounters. We also need to look into the camera, which means that the camera then becomes our focus, not the face of our interlocutor. Video calls tether us to the screen and camera.

If we accept constrained activity, elaborate technological setups can improve mediated gaze (Carson et al. 2000; Dumont et al. 2009; Lanier 2001; Vertegaal, Weevers, and Sohn 2002). In these systems, multiple cameras capture each participant and eye-trackers follow their gaze. The system also knows the location of the other participants’ images on each person’s screen. By tracking where a person is looking, it can deduce whom she is looking at and then, using the images from the various cameras, recreate her image showing her looking at the object of her attention.
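The deduction step in such systems, going from a tracked gaze point to the person being looked at, can be sketched as a simple hit test. The Python below is a minimal illustration under stated assumptions, not the method of any of the cited systems: the `Tile` type, the `gaze_target` function, and the pixel coordinates are hypothetical, and a real eye tracker would deliver noisy estimates that need smoothing. It covers only the "whom is she looking at" step; re-rendering her image from the appropriate camera is a separate and much harder problem.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tile:
    """Screen rectangle, in pixels, where one remote participant is shown."""
    participant: str
    left: float
    top: float
    width: float
    height: float

    def contains(self, x: float, y: float) -> bool:
        """True if the screen point (x, y) lies inside this rectangle."""
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def gaze_target(x: float, y: float, tiles: Sequence[Tile]) -> Optional[str]:
    """Return the participant whose image contains the gaze point.

    Returns None when the viewer is looking at no one's image, for
    example at a blank part of the screen or away from it entirely.
    """
    for tile in tiles:
        if tile.contains(x, y):
            return tile.participant
    return None
```

With two participants shown side by side, say `Tile("Ana", 0, 0, 640, 360)` and `Tile("Ben", 640, 0, 640, 360)`, a gaze point at (700, 50) resolves to Ben, while a point below both images resolves to no one. That second case is the one that makes ordinary glances at one's own surroundings look inexplicable to the other side.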

Creating a common virtual space remains a challenge. Commercial videoconferencing services achieve the illusion of one by building carefully matched, austere rooms and having the participants travel to these sites for their virtual meetings. Still, distractions and other events in one space are not shared with the others. And, in practice, people choose media that are easy to use. One often-cited reason why high-quality videoconferencing is used so infrequently is that it requires too much effort. People don’t want to go to a special room to make a call (Hirsh, Sellen, and Brokopp 2005); they like technologies that are lightweight and simple.

The ultimate solution to “being there” is immersive virtual reality. Here, you shut out all views of the physical world and see only a synthetic environment, a shared “third space” that is neither your actual surroundings nor your partner’s, but instead an artificial world you both inhabit (Bailenson and Beall 2006). All of your movements must be tracked by sensors (which might be video cameras); if you appear as yourself in this world, your image must have been previously recorded: it is synthetically re-rendered (see figure 10.3).

Figure 10.3

Virtual reality suit, eventLAB, University of Barcelona (2011).

Yet if we are going to go through all the effort of participating in a highly instrumented and synthetic experience, why limit it to recreating the mundane experience of being there? A virtual environment has infinite possibilities: the “table” can visualize the conversational patterns, there can be interactive objects in the setting, and the people themselves can be transformed (Bailenson and Beall 2006; Yee and Bailenson 2007). Why not explore what else we can do in a fully computational and synthetic environment? Once we have all the data and instrumentation we need to fully recreate “being there,” we also have the potential to go far beyond.

Avatars and the Worlds beyond Being There

Virtual worlds are shared online environments. Participants move through a simulated geography in which they can communicate with each other and affect the environment in various ways. Virtual reality is one form of virtual world—one in which the participants’ every move is tracked and recreated in the space, and where each person is immersed (using gear such as helmets and goggles) so that all they perceive is the synthetic world. But virtual worlds can also be quite simple.

Some are text based; words describe the environment and the users navigate via text commands. Others (the ones we will focus on here) are graphical worlds, in which people appear in the guise of avatars (graphical images representing the user). Avatars can range from quite realistic human forms to abstract shapes. They can provide a sense of presence, expressivity, and other features of embodiment, while allowing identity to be fluid. As a medium for embodied online communication, graphical worlds are the opposite of video: the shared space is inherent to the medium, but appearance is arbitrary, and conveying expression is challenging.

In this section, we will look at virtual worlds that enable various aspects of embodied interaction. Our focus will be on avatar design, on the images and behaviors representing the user and the ways of communicating through them. We will start by looking at existing avatar implementations, both in games, where they are quite popular, and in more “serious” and social settings, which have been rather less successful. Are they inherently suited only for fantasy?

To use avatars as a communicative medium, the key problem we must solve is how to match the expressive capability of the representation with the user’s means of controlling it. I will argue that the problem with typical humanlike avatars is that their appearance and movements implicitly convey a tremendous amount of social information, but the user’s input is very limited. The avatar has eyes: Where should it be looking? What expression should its face have, what mood should its gait convey?

One approach is to program the avatar to be “smarter,” able to carry out complex behaviors programmatically, without detailed instructions from the user. This makes for a more lively and expressive avatar, but whose feelings is it conveying? We will look at the distinction between creating an autonomous character and creating a personal representation.

A different approach is to design simpler avatars, in which the amount of detail in the representation matches the input the user provides. Starting with a very basic interface in which the avatars are colored circles, we will explore ways to build increasingly expressive systems from a simple foundation.

The communicative capacity of an interface ultimately depends on the extent and subtlety of the user’s input to it. If typing is the only input, graphics may illustrate the words, but the user’s message is still bound by what the text conveys. Interfaces that sense other actions—for example, that record and then represent gestures or facial expressions—can convey additional and potentially more candid meaning, especially when they are easier and more intuitive actions than typing words. We will start by exploring ways to use common input devices to convey expressive gesture, and will then examine the issues that arise with measurement of subconscious and private reactions.

Humanlike Avatars

By far the most popular use of virtual worlds and avatars is in online games. Millions of people play MMORPGs, “massively multiplayer online role playing games.”3 In these fictional worlds, players wander about, find companions, seek treasure, fight battles, and so on. Role-playing an imaginary character is not only accepted but required. There is a strong social component to many of these games: the players need to work closely in teams to achieve their goals and sometimes develop friendships that transcend the game.4 However, although these games feature detailed graphics and complex strategy, the usual communication medium is traditional text or audio chat, and the avatar design has focused on role-based costume, not the subtleties of social interaction (see figure 10.4).

There have been repeated attempts to make social (as opposed to quest-oriented gaming) graphical worlds. In the mid-1980s the first commercial one, Habitat, was developed (Morningstar and Farmer 2008). Several, like WorldsAway and The Palace, were created in the mid-1990s as home computer use became increasingly widespread. Though these sites debuted with much excitement, interest in them soon faded. In 2003, a more technologically complex graphical world called Second Life launched. Its users could wander about a three-dimensional landscape and sculpt the face, hair, and clothing of their detailed avatar. Second Life included a virtual economy: users could build and own objects, and buy and sell them in an internal market. As Second Life grew in popularity, enthusiasts predicted that it would be the successor to the Web. Many companies (perhaps remembering how irrelevant they had thought the Web would be when it appeared in the early 1990s, and how unprescient they had seemed when, soon after dismissing it, they needed to buy back the URL with their company’s name for their now clearly obligatory corporate homepage) were quick to jump onto the Second Life bandwagon. They bought up virtual islands, built virtual showrooms, and gave out virtual branded gifts. Years later, Second Life, though still in existence, is fading out of sight. It is likely that at some point it will disappear entirely, or at least drastically transform itself. It is equally likely that in a few years another avatar-based world will appear, again promising to be the ultimate future of online communication.

Figure 10.4

Avatar (“Undead”) from World of Warcraft, Blizzard Entertainment.

What makes the concept of an imaginary social world seem so intriguing, yet in practice not particularly compelling, if not outright unappealing, for most people? Is it a matter of getting the technology and design right? Or is there a deeper problem with the concept of inhabiting a virtual avatar?

Avatars let you be anything; an avatar need not, and frequently does not, resemble the user. Whether this free identity is desirable depends on the situation. In the fantasy game worlds that make up the vast majority of avatar use, the ability to play as an imaginary and fantastical self is a key part of the game. But here, while the avatar is nonrealistic, it is not free form. Role-playing games usually assert strict control over an avatar’s appearance, which functions like a professional uniform, displaying its player’s class and other rigidly maintained aspects of role. Only through achieving a series of goals can the player display various badges and other marks of status.

In the online social realm, the value of free-form identity is ambivalent. For those who see these virtual worlds as primarily fantasy spaces, places to try out different imagined characters or characteristics, they can be quite appealing. Notably, there has been considerable use of Second Life by disabled users, who find the ability to be physically attractive and mobile in a world that values appearance to be very empowering (Cassidy 2008). These cases are very vivid, and they epitomize what many people enjoy about the site: the ability to interact in a setting where physical attractiveness counts highly, while having the freedom to specify their own appearance. Others find it less appealing. Chimerical avatars are the visual equivalent of the text world’s “cheap pseudonym” (Friedman and Resnick 2001). Where it is easy to appear as anything you want to be, appearance loses its significance.

Communicating via avatar changes the social dynamics. It introduces an element of fantasy that keeps interactions in the space at a certain remove. When using text alone, you can interact with others while remaining ignorant of their appearance and all that it implies about them. When interacting via avatar, although you are consciously aware that the other may look quite different in real life, having a moving, acting, vivid image in front of you makes it hard not to think of it as a lifelike depiction of the other. This simultaneous belief and nonbelief keeps the space suspended in a limbo between realism and imagination.

Figure 10.5

Avatar faces from Second Life. We read meaning from faces (Bruce and Young 1998; Donath 2001; Zebrowitz 1997), including from the faces of avatars. Imagine each of these saying: “I have some land to sell to you” or “Do you want to go to a party?” Even though both faces could belong to anyone, and the difference between them was made in a couple of minutes by changing a few parameters in Second Life’s face-editing program, we interpret the words through the context of the personalities we read into these two different, fictional faces. Whereas it is equally easy to make fictional self-descriptions in text, the vividness of visual imagery makes it harder to stand back and remember that it may be a completely imaginary representation.

Though it is technically possible to have any appearance in an avatar site, stylistic norms exist. Second Life avatars tended to be young and hip looking: the female avatars wore tight jeans and midriff-baring outfits, and the males, who were also in tight-fitting clothes, looked as if they spent hours in the gym every day.5 For the business users eager to embrace Second Life as the next platform for commerce, this presented a dilemma. You go to your corporate job dressed conservatively; it is the rule and the standard. Then you are sent to attend a virtual meeting, for which you need to create an avatar. Because it is for work, you should go as a virtual version of yourself; in this context, it is not a fantasy space. (The colleague who shows up to a virtual business meeting as his favorite after-hours wizard avatar, with blue flowing hair and a wand, will be quite out of place.) However, if you create a too realistic version of yourself, it also could seem oddly, almost freakishly out of place. You certainly do not want to wear the slightly sci-fi, sexy clothes that are the avatar norm, but an avatar in a business suit appears to be in corporate drag. Creating a body is complicated. If in real life you are rather plump, showing up online as a skinny, shapely avatar is problematic. Does it mean you think you look like that, wish you looked like that, or feel that you must portray yourself like that even if you are quite happy with how you actually look? But depicting yourself with your real-life shape can also be awkward. Though it might be quite unremarkable in the physical world, it would stand out as hyper-realistic in this land of air-brushed fantasy. The avatar worlds are fantastical at heart, well suited for situations that thrive on imaginary experience; those that do not can be at odds with their inherent playfulness.

A fundamental problem with contemporary avatars is that they are not very responsive. Avatars have faces, arms, and bodies. We thus expect them to move and interact as people do, but their expressions and reactions are limited. Increasingly realistic avatars, with highly detailed faces and bodies, set up increasing expectations of responsiveness and subtlety that they are not equipped to fulfill. Perhaps most disturbing is their blank stare; set to look in one direction, they do so until the next time the user remembers to move their virtual gaze point.

The source of the problem is that the avatars have expressive detail beyond their users’ means to control. A typical computer has a keyboard, mouse or other pointing device, and perhaps a camera. The keyboard and mouse are the primary inputs; they are ubiquitous, and interpreting their simple and unambiguous input is easy. But there is little in the way of graceful and intuitive mapping when going from these devices to moving an avatar. Many systems employ a combination of input techniques; you can control the avatar by typing the name or shortcut for a gesture as part of the text (e.g., “\laughs at a story”) or you can pick it from a menu. This is far from the spontaneous expression of feeling that gestures and expressions communicate when we perform them with our bodies. Interacting via avatar is not so much like inhabiting another body as it is like manipulating a puppet. It may be fun, but it is work; it may be expressive, but it is not spontaneous.

Without either internally generated intelligence or externally applied control, the avatar stands around looking dumb. A person surrounded by other people is seldom still. Even if he is not talking, he is nodding at this one, making room for that one, watching the action over there, smiling at something here. Contemporary avatars, however, simply stand still. They are vacant, like dolls waiting for a child to pick them up and bring them briefly to life. Their appearance is much more sophisticated than the inputs that control their behaviors. To solve the vacant doll problem we can program autonomous behaviors to make smarter avatars or implement new input pathways to create avatars that are more communicative. Or, we can avoid the problem by designing simpler avatars.

Autonomous Behaviors

The simplest avatars are just pictures; they have no movements or behaviors. One moves them by dragging them across the screen. More advanced avatars have algorithms for complex actions built into them. For example, to get the avatar to walk somewhere, the user simply indicates a destination, and the avatar’s walking programs animate its gait. Doing so by hand would be extremely tedious.

Figure 10.6

Comic Chat (1996). Software by David Kurlander, Microsoft. Artist: Jim Woodring.

An avatar can also be programmed to perform social actions. For example, upon greeting another avatar, it could give a slight bow or shake hands, and its face and eyes could be animated to look lifelike when it is speaking. An avatar can have programs for greeting, leaving, appearing raptly interested, or rudely bored. In Cassell and Vilhjálmsson’s Body Chat program, if two avatars were talking and the user of one texted “goodbye,” the program would have the avatar look at the other, nod its head, and wave (Cassell and Vilhjálmsson 1999; Vilhjálmsson and Cassell 1998). Similarly, an avatar can have programs that give it a set of actions to perform when idle (check its virtual phone, take out a magazine and appear to read, do its virtual nails). Microsoft’s Comic Chat not only gave the avatars social expressions, it placed them on the screen as if in a comic book (see figure 10.6). Adopting the sequential art form (Eisner 2001) solved the problem of what to do with the avatars between moments of dialogue: instead of continuous motion, Comic Chat displayed a series of still images, rendered only as needed (Kurlander, Skelly, and Salesin 1996).
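The kind of text-triggered social behavior described above can be sketched in a few lines. This is only an illustration of the idea, not Body Chat's actual implementation: the keyword sets, the gesture names, and the function `gestures_for` are all assumptions made for the example.

```python
# Hypothetical keyword sets and gesture names, invented for illustration only.
FAREWELL_WORDS = {"goodbye", "bye"}
GREETING_WORDS = {"hello", "hi"}


def gestures_for(message: str) -> list:
    """Return the sequence of gestures an avatar might perform for a typed message."""
    # Normalize each word: drop surrounding punctuation and ignore case.
    words = {w.strip(".,!?").lower() for w in message.split()}
    if words & FAREWELL_WORDS:
        # The farewell sequence described in the text: look, nod, wave.
        return ["look_at_partner", "nod", "wave"]
    if words & GREETING_WORDS:
        # An assumed greeting sequence; the text mentions a slight bow as one option.
        return ["look_at_partner", "bow"]
    return []
```

Even in so small a sketch, the gesture is derived from the words rather than from the user's own body, which is the property the following paragraphs question.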

Virtual worlds have markets for behaviors, where one can buy algorithms for one’s avatar to make it bow more gracefully, obsequiously, or minimally. One can purchase a high level of politeness, turn it on, and forget about it. Similarly, one can purchase an aggressive persona or a suggestively flirtatious one. Cheap behaviors are poorly rendered, and truly sophisticated interactions cost quite a bit. Though these algorithmic behaviors make avatar interaction smoother and more entertaining, what can we really learn about our companions through such an interface? In our unmediated encounters, politeness and other social behaviors tell us a great deal about each other. It takes work to be extremely gracious; one must both know the social rules and make the effort to perform them consistently and well. Acting graciously signals one’s knowledge of and commitment to the rules and mores of society.

If the avatars and their algorithmic behaviors are made well, they can evoke personality and character quite convincingly. We form strong impressions of others based on their social actions without necessarily being aware of what has influenced us (Ambady and Rosenthal 1992). However, displaying these traits via a “smart” avatar only shows that the person behind it can afford to buy the social behavior program; it says nothing about whether he has these traits in real life. The “cheap pseudonym” problem becomes one of cheap affect. In a world where politeness and aggression are commercially available styles that one can put on or remove at will, these behaviors have no deeper significance than easy verbal claims like “I’m a nice guy.”

Autonomous conversational agents are not new: chatterbots—programs designed to converse with people and often to try to pass as human—have been taking part in online interactions since the early 1990s (Mauldin 1994) and are increasingly common in commercial settings as companies automate consumer relations. Adding graphics and behavioral mimicry to a chatterbot can make it more convincing, for the avatar’s image and gestures can distract the viewer, and they are easier to synthesize than sustained and believable verbal interaction (a problem that is still unsolved; Gratch et al. 2002).

For games and fantasy spaces, this ability to create vivid characters that not only look but also act their role is exciting. For social interactions, its benefit is more dubious. We respond to the avatar using the social knowledge we have developed through years of interacting with other people, easily forgetting that a program, rather than an emotion or intention, motivates its behavior. Once the avatar is smart enough, there need not be any human acting behind it.

Sometimes we care only about external behavior, and sometimes we care about the motivation behind it. If we think of the other as just an agent acting for our benefit, whether to buy a ticket or to entertain us, then the smart avatar, or perhaps even the autonomous one with no person actually directing it, is desirable. In this case, we want the experience to be smooth and enjoyable, and it is fine if the smiles and nods are merely simulations of sociability.

Yet if the purpose of our interactions with the other is, at least in part, social (we want to get to know another person, create a social tie, or see how others respond to us), then the avatar with simulated social behaviors is a barrier. How long someone holds your gaze, whether they say “thank you” and with what expression, the minute lift or frown of their brow—these innumerable social gestures, both big and small, provide us with cues about the other’s thoughts and opinions that we cannot perceive otherwise. If the gestures are simulated, we may feel that we are getting to know the other, but we are actually responding to a mask.

Simulated and algorithmic behavior can make an interaction seem smoother, but they make it less communicatively reliable. A wholly autonomous, “intelligent” avatar can be quite an entertaining performer, but it conveys little about its user other than the taste she demonstrated by choosing it as her representative. To be communicative, the user must guide the avatar.6

Simpler Avatars in Abstract Spaces

Humanlike avatars have faces and bodies that give off social signals, often disconnected from their users’ intentions. Their familiar forms set up expectations of humanlike behavior that are often unobtainable.7

The less realistic the avatar, the lower our expectations are for it to behave in a humanlike way, and the more likely it is to satisfy these expectations. Simpler avatars have fewer details that require continuous updating. Yet this does not mean that the representation cannot be expressive: even simple lines and shapes can be eloquent. The faces of Charlie Brown and the other characters in the very successful Peanuts series are circles, with dots for eyes and lines for mouths.

In the late 1990s, the increased popularity of online socializing and the growing power of home computers prompted a wave of graphical chat sites. Users chose avatars to represent themselves and then went from room to room looking for people to chat with. A big appeal of these worlds was the sense of presence they provided (Biocca 1997; Nowak and Biocca 2003). In nongraphical text-only chats, one sees other people only when they talk, and though the invisible listeners might be listed somewhere, a list of names does not feel like a group of people surrounding you. In the graphical chat sites, as in real life, everyone present was visible.

Yet the design of these virtual environments was awkward. The avatars—which could be human forms, frogs, cars, hearts—floated like haphazardly placed paper cutout dolls against backdrops such as a living room, a palace throne room, or outer space. Neither the shape of the avatar nor the contents of the room had functional meaning. The avatars had limbs but no gestures, faces but no changing expression. They portrayed vivid characters, but there was little meaning to being a wizard, a queen, a fox, or a little girl when each was just an arbitrary picture. You could be a grapefruit or a judge; it made no difference. The environments, too, were all surface representation without any deeper significance. One could have a setting of medieval riches, depicting thrones and knights and hanging tapestries, or one of urban blight, showing a run-down street with burnt-out buildings and rusting cars; functionally, both were the same. The freedom to be anything and anywhere meant that there was little significance to any of it.

Chat Circles

My students and I created Chat Circles* in response to these avatar worlds (Donath, Karahalios, and Viégas 1999; Donath and Viégas 2002; Viégas and Donath 1999). We liked the idea of having visible presence and a graphical context for interaction, but not the imagery of the avatar worlds. So we set out to design a very simple graphical social space, based on the idea of form following function (Sullivan 1896), in which the visuals would reflect what the user could convey.

Figure 10.7

Fernanda Viégas and Judith Donath, Chat Circles (1999).

In Chat Circles, the “avatar” was simply a colored circle (see figure 10.7). When you typed, your words filled the circle, which expanded to hold them, and then they would slowly fade and the circle would shrink. To see what other people were saying, you needed to be close to them. If someone was outside of your “hearing range” you would see their circle only as a hollow outline. You could not see the words when they spoke, but you could see the circle expanding and contracting. Thus, you could see activity at other parts of the space, but could only participate in conversations near you. Hearing range was symmetric: if you saw someone else as a solid circle (and could thus see his words), you knew he could see yours. If he was just a muted outline to you, then you were the same to him. In The Palace and other avatar worlds of the time, there were seating areas for conversation; but avatars do not need to sit, since their virtual legs do not get tired. Since the seats and specific areas had no function, people just floated anywhere on the screen. Chat Circles’ hearing range gave meaning to proximity. You needed to show that you were interested in a conversation to join it. And if someone irritated you, you could, in effect, walk away.
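The hearing-range rule can be expressed as a simple distance test. The sketch below is a minimal illustration under stated assumptions, not the original Chat Circles code: the radius value, the `Circle` class, and the function names are hypothetical.

```python
import math
from dataclasses import dataclass

HEARING_RANGE = 150.0  # assumed radius in screen pixels; the text gives no value


@dataclass
class Circle:
    name: str
    x: float
    y: float
    text: str = ""


def within_hearing_range(a: Circle, b: Circle, radius: float = HEARING_RANGE) -> bool:
    """Distance is symmetric, so if a can hear b, then b can hear a."""
    return math.hypot(a.x - b.x, a.y - b.y) <= radius


def render_for(viewer: Circle, other: Circle) -> dict:
    """Describe how `other` would be drawn on `viewer`'s screen."""
    if within_hearing_range(viewer, other):
        return {"style": "solid", "text": other.text}
    # Out of range: the circle is still drawn, but only as an outline without words.
    return {"style": "outline", "text": None}
```

Because the test depends only on the distance between two circles, symmetry comes for free: no extra bookkeeping is needed to guarantee that whoever can read your words knows you can read theirs.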

Chat Circles was designed to be a starting point from which we and others could build worlds of increasing complexity and greater functionality. There were numerous directions to pursue. “Hearing range” functioned as a simple sensory organ for the basic circle avatar. What other senses might we create? The space itself was empty; how could we create functional environments in which to act? The avatars were simple circles; how might we make their appearance richer and more expressive?

Talking in Circles* was a project we built on the Chat Circles foundation (Rodenstein and Donath 2000). Here the users communicated via voice instead of text. The basic look was similar: you appeared as a colored circle that grew and shrank with the amplitude of your voice. While visually simple, it added several useful features to the typical phone conference. You could easily see who was present; even a listener who never said a word was clearly there. You could also easily tell who was speaking (which can be quite difficult in audio-only conferences, unless you know the people well enough to distinguish their voices). In addition, it made it very easy to have quick side conversations: if you and I needed to discuss something together, we could just move our circles to another part of the screen and talk, returning to the main group as soon as we were done, much as people do in real conversations.

A graphical conversation interface turns the screen, which in traditional chats is still a typewriter-like linear stream of text, into a two-dimensional inhabitable space. To make use of this space, we need to design avatars with senses—such as hearing range—and environments with spatially varying functionality. With these features, people have reason to move about the space and to congregate in a location; they have the foundations to create their own social mores.

Talking in Circles had listening areas, places where one could go to listen to music or a newscast. They were meant as gathering places; for example, people could get together to listen to and discuss current events (later versions of Chat Circles similarly had pictures in the background; Donath and Viégas 2002). However, they had no interaction function; other than providing information, they did not change the affordances of the space. A later project, Information Spaces (Harry and Donath 2008), experimented with functional areas (see figure 6.16). For example, in some areas, conversations were archived, while other spaces were designated for ephemeral discussions. One could make some areas anonymous, where all circles/avatars would be identical and nameless. Some areas could require an invitation to enter.

Another variation of Chat Circles experimented with contagious appearance: the avatars were simple shapes and colors, which “wore off” on each other. If you spent a lot of time talking with one person, your colors would both start moving toward an intermediate point. In an alternative version, one retained an “inner core” of original color while changing “social color” in an external ring. Such imitation could have a subtle effect on how people act. When we see people presenting a common appearance, whether in the long term (e.g., similar clothing styles) or short term (similar gestures), we think of them as having something in common. Kurzban, Tooby, and Cosmides (2001) have claimed that we encode coalitions by arbitrary markers; race is a familiar one in everyday life, but they can be anything, even the color of one’s avatar. For the participants, reactions would depend on many things, including how divided the user population was. If the interface was being used to stage discussions between hostile groups, participants might refuse to go near opposing members, appalled at the idea of the other’s shape and color rubbing off on them. In a more sociable setting, people might find it entertaining that their appearance shifted to resemble the people with whom they had spent time. Alternatively, we could make such imitation volitional: one might need to be both close to someone and indicate that one wanted to adopt some of the other’s appearance. Here, it gains communicative value and the potential to evolve social meanings; would it seem rude to not adopt something of your companion’s appearance? Would some people seem fawningly imitative? In real life, “mirroring” is hypothesized to be at the heart of our ability to empathize with others (Bailenson and Yee 2005; Chartrand and Bargh 1999; Heyes 2001; Meltzoff and Decety 2003; Sebanz, Bekkering, and Knoblich 2006). 
We imitate others’ facial expressions and gestures, which may help us empathize and feel that we are experiencing what they experience.
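One way to picture contagious appearance is as a small blending step applied while two people converse. The following is a sketch under assumptions, not the project's actual code: colors are taken to be RGB tuples, and the `rate` parameter and function names are invented for illustration.

```python
def blend_toward(color, other, rate=0.1):
    """Move `color` a fraction `rate` of the way toward `other` (RGB tuples)."""
    return tuple(c + rate * (o - c) for c, o in zip(color, other))


def converse(color_a, color_b, rate=0.1):
    """One time step of conversation: each color drifts toward the other."""
    return blend_toward(color_a, color_b, rate), blend_toward(color_b, color_a, rate)


# Two participants who keep talking converge on an intermediate color.
red, blue = (255.0, 0.0, 0.0), (0.0, 0.0, 255.0)
for _ in range(50):
    red, blue = converse(red, blue)
```

With a rate below 0.5, each step shrinks the difference between the two colors while keeping their average fixed, so both drift toward the midpoint. The "inner core" variant could be modeled by storing a second, unchanging color alongside the one that blends, and the volitional variant by calling `converse` only when both parties opt in.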

In a system that gives people the ability to control appearance and action, cultural meanings evolve with use. In a world where, for example, height is an easily modified attribute, we can imagine a culture developing expressions of politeness related to height. This could manifest in how you show awareness of your current status in a particular context, perhaps by growing when you wish to take the floor in a discussion or shrinking as a form of respect to another.

We can extend the simple circle in many other ways. One could encode history by making a space where everyone starts as a plain-colored circle, but gains visual complexity over time. Algorithms might encode participation patterns in shapes, evolving eventually from generic circle to individual data portrait.

The plain circle was not set forth as the ultimate representation of the online human, but as a foundation on which to build. It was able to go “beyond being there” not by adding complexity, but by taking it away—by thinking of the avatar in terms of communicative and sensing functions, rather than as a humanlike representation.

The Palette of Representational Choices

Embodied interactions include a broad palette of representational choices. Appearing as your recognizable physical-world self is inherent to video, but once we move from video to the computationally rendered world of avatar interfaces, the choices multiply. Your avatar could be realistic, looking exactly like you, or it could be a distinctive human that differs from your real-world appearance. You could be some other humanlike form, cartoonlike but socially legible, or you could be a fanciful form or an abstract shape.

Identity is one criterion for choosing between a realistic or fantastic representation. When we see another’s face, even briefly, we learn a lot about his or her social identity: race, age, and gender, as well as social affiliations. Does she have a conservative haircut? Multiple piercings? This provides a useful context for interpreting her words and developing a relationship with her; it can also provide the basis for distorting the meaning of her words because of stereotyped social models (see chapter 9). In person, we have little choice; our faces reveal many identity cues. Online, however, we can choose: we can use media, such as video, that display these cues, or ones that do not. We can also design media that provide alternative identity cues, avatars that reveal the user’s recent interactions or that the user shapes to appear as she wishes others to see her.

Different representational choices have a variety of other social effects. Here we will discuss two of them: how people react to faces and how transforming basic body features, such as height, can alter social dynamics.

A Face in the Interface

Faces humanize the interface. We react socially to faces—we are nicer, but also less forthright and revealing, when talking to a face. This socialized behavior occurs even when the “face” is the interface to a machine.

In one experiment, people were administered a questionnaire by a computer, which was sometimes simply a text interface, sometimes a neutral face, and sometimes a stern face. People said they did not like the stern face, but they took more time responding to the questions posed by it and answered them more thoroughly, even though they reported finding the experience less comfortable (Sproull et al. 1996). They also presented themselves in a more positive light when the computer had a face.8 In all cases the subjects typed their answers. The difference in how forthright and engaged they were was due not to their own communication medium, but to the appearance of their conversational partner. One explanation is that when communicating with a face, we bring our sense of sociability to the interaction. Although the subjects knew that they were corresponding with a machine, they still attempted to make a good impression on it when it was more humanlike.9

Similar effects occur when the interface connects two people. When a conversational partner appears only through words, there is less sense of being connected with that person. In an experiment that compared how people respond when communicating via video and audio or with audio only, the subjects reported feeling a greater sense of the presence of others when they could see them, but also disclosed less about themselves (Bailenson et al. 2006). People find it easier to be rude to another whose face they do not see. Thus, if you are building an interface where you want people to enter their medical history, a text interface is better: you want people to disclose their full history and not be embarrassed about past conditions. If you want people to behave politely and strive to make a good impression, seeing the other person (even if synthesized) can help.

The Social Impact of Physical Transformation

In the physical world, we choose our clothes to present a certain image: a man in a suit projects far more authority than does one in a T-shirt and swim trunks. Similarly, modifications to how we appear in a virtual world can strongly affect the impression we make. Jeremy Bailenson at Stanford University has carried out a series of experiments in immersive virtual settings examining how even subtle transformations of appearance affect social interactions (Yee and Bailenson 2007).10

In one experiment, Bailenson and colleagues demonstrated that people are more trusting of others who resemble themselves. They took pictures of political candidates and, unknown to the subject, morphed the subject’s face with the candidate’s, subtly enough that few detected it. The subjects showed a measurable preference for candidates when they were more similar to themselves (Bailenson et al. 2009). Another experiment showed that programming avatars to automatically imitate another’s gestures makes them more influential (Bailenson and Yee 2005).11 The subtlety of the modifications makes them even more powerful, for the receiver is not likely to notice the manipulation. Indeed, it can be invisible: in another experiment, one subject’s avatar was made to appear taller than the other avatars, but only to that subject; the others saw it as the same height as the rest. The subjects then engaged in a negotiation. The ones who perceived themselves to be taller did better: simply seeing yourself with this advantage changes how you act toward others. These experiments demonstrate the potential of virtual spaces to, in Bailenson’s words, “transform social interaction,” and they provide a rigorous and provocative foundation for thinking about design.

Let us consider the height example. In the face-to-face world, height influences both personal and professional success. Yet, although tall people are perceived to be more competent and effective, height does not predict better actual performance, and for most jobs, “the practice of favoring tall individuals amounts to little more than pure bias” (Judge and Cable 2004). One of the big advantages of the text-based online world is that these influential but irrelevant physical features are invisible: on the Internet, no one knows you are short. If we reintroduce these features in embodied interfaces, can we do so in a way that is socially beneficial?

In the novel Snow Crash, Neal Stephenson posits a world in which avatars are legally required to be the same height as the person they represent. But we can design any rules. We can make worlds in which everyone’s height is strictly equal, or where height is determined by how long one has been a participant or by how esteemed one is by others. We could make a world where the quietest people are taller, to give them more confidence. We could make ones in which height is randomly distributed, changes every day, or is available for purchase. The avatar is a canvas for data portraiture, and Bailenson’s studies show the power of these representational changes.
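Each of these rules is, in effect, a function from data about a participant to a height. The sketch below shows how interchangeable such rules are; every field name, scaling constant, and function name is a hypothetical choice made for the example.

```python
import random
from dataclasses import dataclass

BASE_HEIGHT = 1.0  # arbitrary avatar units (assumed)


@dataclass
class Participant:
    name: str
    days_active: int   # how long the person has been a participant
    endorsements: int  # stand-in for how esteemed the person is by others
    words_spoken: int  # stand-in for how talkative the person is


def equal_height(p: Participant) -> float:
    return BASE_HEIGHT


def height_by_tenure(p: Participant) -> float:
    return BASE_HEIGHT + 0.001 * p.days_active


def height_by_esteem(p: Participant) -> float:
    return BASE_HEIGHT + 0.01 * p.endorsements


def height_for_quiet(p: Participant) -> float:
    # The fewer words spoken, the larger the bonus (at most 0.5).
    return BASE_HEIGHT + 0.5 / (1 + p.words_spoken / 1000)


def daily_random_height(p: Participant, date: str) -> float:
    # Seeding with name and date gives a height that is stable within a day
    # but changes from one day to the next.
    rng = random.Random(f"{p.name}:{date}")
    return BASE_HEIGHT + rng.uniform(-0.2, 0.2)
```

Swapping one rule for another is a one-line change in code, while the social consequences of the swap may be considerable.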

A key point of Bailenson’s experiment is that transformations of physical appearance have significant social effects even if visible to only one participant: seeing yourself as taller (or more attractive) makes you more confident and authoritative, even if no one else sees it. This raises the question of whether an interface should be objective, showing a common view to all participants, or subjective, adjusting to the preferences of each. Should I be able to depict, for my own private viewing, the others in my group as unattractive in order to boost my confidence? Alternatively, should people be able to specify how they appear to others, and if so, with what if any constraints? Or should the system determine each participant’s appearance? In a face-to-face meeting, if I nod with vigorous agreement to one person, everyone sees that I have aligned myself with her; if I subsequently express the same support to someone expressing the opposite opinion, I will seem to be either exceedingly persuadable or excessively sycophantic. Yet one could design a virtual meeting system where I can designate who sees me agreeing with one speaker, and a different group would see me nodding with another. Or, as in Bailenson’s study, each person could see a version of me that slightly resembles themselves, and that subtly mimics their gestures.
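The difference between an objective and a subjective interface comes down to whether rendering depends on who is looking. As a minimal sketch, assuming a shared table of heights and a hypothetical `self_boost` parameter, a subjective view of the height manipulation might look like this:

```python
def height_seen_by(viewer: str, subject: str, heights: dict, self_boost: float = 0.1) -> float:
    """Height of `subject` as drawn on `viewer`'s own screen."""
    shared = heights[subject]
    if viewer == subject:
        # Only the subject sees the enhanced version of their own avatar.
        return shared + self_boost
    # Everyone else sees the common, unmodified height.
    return shared


heights = {"ann": 1.0, "bo": 1.0}
ann_sees_self = height_seen_by("ann", "ann", heights)  # boosted, private to ann
bo_sees_ann = height_seen_by("bo", "ann", heights)     # the shared value
```

An objective interface would simply ignore the `viewer` argument. Every other per-viewer transformation discussed here, such as selective nodding or resemblance to the viewer, would add a similar dependence on who is watching.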

Attempting to make ourselves appear more attractive, powerful, or persuasive has been an integral part of human culture throughout history. Archaeologists have found both terrifying masks and pigments for cosmetics in prehistoric sites. Venetian women in the Renaissance risked blindness in the pursuit of beauty when they dilated their pupils with drops of belladonna (Feinsod 2000). Today, magazines and lifestyle coaches tell us how to “dress for success”—what styles and colors make us appear more powerful and intelligent; lawyers instruct their clients on how to look innocent and law-abiding. We learn to control the tone of our voice to convey authority or to mask our emotions. Executives gathering for an important meeting seek seats that place them at the greatest advantage. Our daily life comprises numerous ways in which we work to burnish our image in the eyes of others, perhaps subconsciously, perhaps quite deliberately. Are the computer-aided transformations we describe above a continuation of this striving for self-enhancement, or are they something new? Are they within the boundaries of what we consider acceptable, or are they too insidious or deceptive?

There are no easy answers to these questions. Much depends on the intention of the person performing the transformation—is it primarily for his personal gain? Would the receivers agree to it if they were informed? A teacher running an online classroom and using these techniques to motivate students and give confidence to shy ones is quite different from a politician persuading voters to elect him by making himself appear more trustworthy through personalized manipulation.

BioSensing: Inner State as Input

Communication can be frustrating. We misunderstand and are not always straightforward with each other. The thoughts of others are in many ways unknowable, and often this is by design: we deliberately conceal much of what is on our minds. Even when two people want to know each other better—lovers, for example—it is a fraught and difficult process.

People often imagine that communication would be much easier if we could telepathically read one another’s thoughts. No more misunderstandings, no more reticence holding us back from something we ought to say.

Without telepathy,12 we rely on cues to others’ emotional state and intentions. We hear their words and observe their gestures, expressions, and tone of voice. Yet these cues are not in themselves the states of mind; they are signals of them, of greater and lesser reliability (Donath forthcoming). A smile, for example, generally denotes happiness or amusement, though there are also polite smiles, sarcastic smiles, and smile-like grimaces. What we often want to know is the feeling behind it; is it a genuine smile of warmth, or a forced smile masking discomfort?13

An intriguing possibility for improving communication is technological telepathy: using sensors to “read” what someone is thinking or feeling.14 These biosensors measure bodily changes that are associated with different affective states, cues that are not ordinarily perceivable in our everyday interactions. These include heart rate, skin conductivity (how sweaty your hands are), and, ultimately, the electrical and chemical activity in your brain.

One use for this data is to supplement a less expressive medium (Picard and Cosier 1997). Conductive Chat (Fiore, Lakshmipathy, and DiMicco 2002), for example, combined text chat with affective sensing. Users wore gloves that measured galvanic skin response (GSR) as they typed their messages. When their skin response became elevated, a sign of increased emotional intensity, the application made the text they were sending larger and brighter.
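The mapping from skin response to text appearance can be sketched in a few lines. This is only an illustration of the general idea, not the Conductive Chat implementation: the function name, the calibration range (`baseline`, `peak`), and the size and brightness limits are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TextStyle:
    font_size_pt: float   # rendered size of the chat message
    brightness: float     # 0.0 (dim) to 1.0 (fully bright)


def style_for_arousal(gsr: float, baseline: float, peak: float,
                      min_size: float = 12.0, max_size: float = 24.0,
                      min_brightness: float = 0.5) -> TextStyle:
    """Map a skin-conductance reading onto message size and brightness.

    All parameters are hypothetical: `baseline` and `peak` stand for a
    per-user calibration range, in the same units as `gsr`.
    """
    if peak <= baseline:
        raise ValueError("peak must be greater than baseline")
    # Normalize the reading to 0..1 and clamp readings outside the range.
    level = (gsr - baseline) / (peak - baseline)
    level = max(0.0, min(1.0, level))
    return TextStyle(
        font_size_pt=min_size + level * (max_size - min_size),
        brightness=min_brightness + level * (1.0 - min_brightness),
    )
```

A calm reading at the baseline yields the smallest, dimmest text; a reading at or above the peak yields the largest, brightest. Because the sender does not choose the reading, the styling is applied without any deliberate act on their part.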

Supplementing text chat with biosensed emotional information is an appealing idea. People use this medium for social interaction, where emotion is an important part of the message, but its speed and brevity make conveying the subtleties of affect quite difficult. Emoticons, which extend punctuation beyond the exclamation point and the question mark to convey sadness, happiness, irony, and so on, evolved and proliferate in this sparse medium, but they are still a limited emotive supplement. Because people often use text chat on mobile devices and in other circumstances where ease is important, a simple sensor that sends affective information and provides the chat with a layer of extra meaning without imposing any extra effort on the user is appealing.

Yet this is a significantly different type of affective communication from ordinary nonverbal expression and gesture. The main issue is the degree of control you have over what you convey. Our face and voice are under our control, though imperfectly. Learning this restraint is part of socialization; babies express their feelings nakedly; by even a few years old, children learn to mask particular emotions in certain contexts (Zeman and Garber 1996). In contrast, pulse, skin conductance, and other such bodily manifestations of emotional state are generally not under our control: they did not evolve for communication, and we are not socialized to manage them in the same way.

Biosensing raises privacy concerns. Today, such sensors are used mostly in coercive situations, such as lie detection. Here the person has information he wants to keep to himself, and interrogators use sensors against his will, attempting to gain access to his private thoughts. Literary accounts of mind reading often portray it as aggressive and invasive. For example, “legilimency” in the Harry Potter books is the ability to read the thoughts of others against their will, and to plant false memories in them.

Yet there are also positive accounts that portray it as an enviable skill, generally between people who are already very close, such as twins. A key criterion for evaluating direct affective communication is whether it is coerced or free, whether the participants are willing to experience this more direct and intimate connection.

Privacy is less of a concern, too, when the results are collective rather than individual. A promising scenario for collective sensing is distance lecturing. When you give a talk in person, you get feedback from the audience; you see people nodding or laughing. You can also notice if you are losing your audience; you see people moving about, rustling papers, perhaps getting ready to leave. Lectures given at a distance, however, receive little ongoing audience feedback. Sensing devices could convey the level of group-wide attentiveness.
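As a rough sketch of what "collective rather than individual" could mean in practice, the function below reports only an audience-wide average. The per-listener scores in the range 0 to 1 and the minimum group size are assumptions introduced for this example; the chapter does not specify how attentiveness would be measured.

```python
from statistics import mean
from typing import Optional, Sequence


def group_attentiveness(scores: Sequence[float],
                        min_group_size: int = 5) -> Optional[float]:
    """Return the audience's mean attentiveness, or None for small groups.

    `scores` holds one hypothetical per-listener value in 0..1; only the
    average is ever returned, never an individual's reading.
    """
    if len(scores) < min_group_size:
        # With very few listeners, an average could expose one person.
        return None
    clamped = [max(0.0, min(1.0, s)) for s in scores]
    return mean(clamped)
```

The small-group cutoff is a design choice of this sketch: an "average" over two or three listeners says nearly as much about each of them as an individual reading would.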

Biosensing could also be used to influence people’s affective states. For example, online discussions often become vitriolic forums where people get angry and abusive. Imagine an interface that conveyed reliable affective information about each participant—and the system made your words smaller and blurrier the angrier you became. You would need to work on being calm, rather than loud and aggressive, in order to get your point across. An experimental game uses brainwave sensing to teach people to relax: only when the subject is calm do the controls work as desired (Nijholt, Bos, and Reuderink 2009). Conversely, one could design rather nefarious games in which players need to work themselves into an actual fury to win a battle.

Affective sensing is still in its early stages. It includes the measurement of both social signals, communicative expressions that have evolved to convey emotion (like facial expression), and bodily responses that correspond to different affective states but that neither evolved for communication nor have previously been used for it. We are far from a time when people will put on lightweight, stylish helmets and convey their feelings directly to each other. But, as different types of sensing do become available, it is useful to think through the potential impact of such communication. To understand affective sensing, we need to have a more nuanced understanding of how and why our ability to convey emotion, using our normal communicative expressions as social signals, evolved as it did.

Further, the amount of control we have over our expressions is an open question. A popular view of expressions (both facial expressions and gestures) is that their function is to be a readout of our internal state. As we become socialized, our visible expressions are filtered through social convention, so while the “natural” expression would be the pure emotion readout, the socialized, filtered expression represses much of that. To maintain social correctness, we struggle to keep the expression of felt emotions at bay. The psychologist Alan Fridlund has argued against this view, which deems “fake” those expressions that result from complying with social rules or otherwise conveying a false image (Fridlund 1997). Fridlund critiques what he calls the “crypto-moralistic view of deception”: the idea that an authentic self is hidden under false surface expressions, but detectable through “leaked” expressions (Ekman 1992). Fridlund argues that we would not have evolved a communicative function that was often at odds with what we wanted and with our best interests. Instead, he posits a model based on behavioral ecology: our expressions and gestures evolved as communicative functions, coevolving with responsiveness to them. As communication, their function is to cause a response from the recipient; for example, babies cry in order to get a response from a caretaker. Although some inner state (discomfort, hunger, fear) may trigger the crying, the baby may not be consciously trying to attract attention. The purpose of the crying, and the reason it survived as an evolved trait, is to gain care and attention. The response to babies’ cries coevolved with them. In this model, the “leaked expressions” that Ekman claims are cues to deception are rather the result of being conflicted about what message to send. People who have no ethical conflict about the deception they are committing, like sociopathic liars and people who believe that the lie they are telling is morally justified, do not show this “leakage” of truth. This is unsurprising in the behavioral ecology view, as they have no internal conflict about what they are saying.

Framing expressions and gestures as communicative acts, rather than outward displays of inner feelings, puts more starkly the difference between sensing our traditional expressions and biosensing our internal states. The biosensors measure internal states that have to do with our preparedness to respond: fight or flight, for example. They provide a readout of emotion that reflects the physiology of our emotional experience. In this view, the function of our facial expression and gestures is quite different. Our emotional state shapes our facial expression, but so does the social context: whom we are with, the situation, our history with this person, who else is here, what we want to happen, what we want the other person to think, how we want to influence him or her, and so on.

Face and voice are under our control—imperfectly, but trying to control them is part of our socialization process. The physiological correlates of emotion that biosensors measure are not under our control, nor did they evolve for communication. Today’s heart rate and GSR sensors—even today’s fMRIs and other brain activity monitors—produce crude measurements, but they will improve. As a thought experiment, imagine a biosensor capable of “mind-reading,” something that gives us an accurate readout of someone’s internal state. Would we want this?

Of course, we also do not know what such a readout would be. Our mind and memory shape our experience of how we think, even how we see our surroundings, into a coherent narrative. When I look at the world around me, it seems stable, familiar. But the unfiltered, unprocessed perception I have of it is one where my eye is in constant saccadic movement, where the colors change radically with changes in light, where perspectives shift as I move. Our minds take care of smoothing all of this into the familiar sensations we have of seeing a panorama of solid surroundings. Similarly, our thoughts and emotions are much more chaotic than we consciously experience them to be. For this thought experiment, let’s imagine a sensor that can detect the thoughts we have at a conscious level—something perhaps like the stream of consciousness that modern writers such as James Joyce and William Faulkner have tried to capture. This is the interior space of our unshared thoughts; it is our experience of being bored but looking attentive, of noticing the egg stain on a colleague’s shirt. What is most noticeable is the great discrepancy between our thoughts and our normal outward expression. Though one might call it deceptive, that word is appropriate only if we believe that the purpose of our social interactions is to perceive the inner core of the other.

Such naked revelation is far from social. Sociability requires a balance between revealing and restraint. Face to face, we see this in the struggle we have at times to conceal our emotions. We have a great deal of control over what we reveal, but it is far from complete. On the whole, society functions with this balance. We cannot always get away with hiding our thoughts or motivations, but neither are they open for all to read. As we design new social media, we need to think about where we want to set this balance (Farah 2005; Racine and Illes 2007). At one end of the spectrum are interfaces where you can be anything, where concealing much of yourself is easy. At the other end is the futuristic world where direct neural interfaces remove the social barriers we place between our thoughts and what we communicate to others.

Why should we be concerned about these distant future issues? One reason is that considering them helps us to see the inadvertently revealing aspects of our normal communication as part of a continuum. Even in text, a medium in which we have much editorial control, we reveal a lot, especially if we are writing quickly without careful review. Face to face, our voice, facial expressions, and so on reveal quite a bit, and we inevitably read character into faces (even if we tend to be wrong; Zebrowitz 1997).

Another reason is that at least some versions of biosensing are not all that far off in the future—and are possible without one’s consent. Computer analysis of video images of one’s face on a webcam can reveal blood pressure, pulse, and breathing rate; these biometrics can be used to infer affect (Picard, Vyzas, and Healey 2001; Poh, McDuff, and Picard 2011). What are the implications if, unknown to you, the person you are chatting with via video is conducting this type of analysis?
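To give a sense of how little is needed, here is a minimal sketch of pulse estimation from video. It is a simplified stand-in, not the method of the cited studies. It assumes that some other step has already located the face and reduced each frame to one number, the mean green value of the facial region, and that the small periodic variation in that number tracks the heartbeat. The function name, the search range, and the input format are all hypothetical.

```python
import math
from typing import Sequence


def estimate_pulse_bpm(green_means: Sequence[float], fps: float,
                       low_bpm: int = 45, high_bpm: int = 180) -> int:
    """Estimate pulse from per-frame mean green values of a face region.

    Scans whole-number rates from low_bpm to high_bpm and returns the one
    whose sinusoid correlates most strongly with the signal.
    """
    n = len(green_means)
    if n < 2 or fps <= 0:
        raise ValueError("need at least two samples and a positive fps")
    avg = sum(green_means) / n
    centered = [v - avg for v in green_means]  # remove constant skin tone

    best_bpm, best_power = low_bpm, -1.0
    for bpm in range(low_bpm, high_bpm + 1):
        freq_hz = bpm / 60.0
        # Single-frequency Fourier projection (real and imaginary parts).
        re = sum(c * math.cos(2 * math.pi * freq_hz * i / fps)
                 for i, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * freq_hz * i / fps)
                 for i, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm
```

Real footage is far noisier than this sketch allows for: head movement, lighting changes, and video compression all distort the signal. Even so, the core computation is short and needs nothing beyond an ordinary webcam stream, which is what makes analysis without consent plausible.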

In the next chapter, we look at further privacy issues that arise with new technologies.
