Editor’s Note: This post was written collaboratively by Brennan Jones, Sunny Zhang, Priscilla Wong, and Sean Rintel and told from the first-person perspective of Brennan Jones.
One of my life missions is to connect people, and I’ve been pursuing this mission through research projects that bring remote friends, couples, conference attendees, emergency workers, and search and rescue volunteers together. So when I joined the Future of Work theme at Microsoft Research Cambridge for a summer internship in 2019, I was excited. The theme is led by Abigail Sellen, Deputy Director of the lab and a pioneer in video-mediated communication, and I’d be supervised by Senior Researcher Sean Rintel, who leads the Socially Intelligent Meetings workstream. Of course, the irony wasn’t lost on me that I had to travel to the United Kingdom from my home in Vancouver, Canada, to work on video collaboration. This also meant being over 4,600 miles and an eight-hour time difference away from my girlfriend, Sunny Zhang, a Microsoft Software Development Engineer in Vancouver.
We stayed in touch through daily video chats and messaging and even took advantage of a more advanced way of connecting: telepresence robots. Effectively video chat on wheels, a telepresence robot allows a remote individual to drive around another place and see what’s going on from the robot’s camera while people in the space can see the remote individual on the robot’s screen. The Cambridge lab had a Suitable Technologies Beam robot, so late one afternoon, during the first week of my internship, Sunny “beamed in” for a tour. Rather than me carrying Sunny around on my phone or laptop, she “walked” with me; the robot gave her physical and mobile autonomy. She even made a special friend—a mini Wall-E robot sitting on my colleague Martin Grayson’s desk. Martin made Wall-E dance, and in response, Sunny rotated her robot body to dance, cementing their robot friendship.
We took a selfie as a memento of our time there together. But Sunny was still trapped and flattened on the robot’s monitor, much like video chat on a laptop or phone. And from her perspective, I was trapped and flattened on her screen. There was a wall between us. I wanted it to feel more like she was there with me and also wanted her to feel more like she was present.
This desire to connect in meaningful ways, both in our personal and professional lives, is in our nature and is the motivation behind Virtual Robotic Overlay for Online Meetings, or VROOM for short, an ACM CHI Conference on Human Factors in Computing Systems (CHI 2020) Late-Breaking Work. VROOM is a two-way telepresence system that has two aims. The first is to help a remote individual feel like a remote physical place belongs as much to them as the local people in it. The second is to help local people feel that a remote individual is with them in the same physical space—be they colleagues, friends, or partners. VROOM is our story of making being there remotely a reality.
The ingredients: Mobility, immersion, and presence
Traditional video chat has enabled people to attend meetings remotely, partake in virtual classroom activities, and connect with family members overseas. However, it has obvious physical and spatial limitations. Static cameras with small fields of view restrict how much we can see of one another, make it difficult to refer to things in the other space, and—of course—deny us the choice of looking and moving around one another’s space. To overcome these limitations, exotic solutions have been explored, such as combining 360° cameras with virtual reality (VR), using augmented reality (AR) headsets to see full-body avatars of others in one’s own space, and robotic telepresence.
Usually, virtual space is created via 3D modeling, like in fantasy scenes in VR games, and the people in it are embodied in 3D-illustrated avatars. But it can also be a real place captured by a 360° camera, such as in immersive 360° VR films. The 360° camera works as a “remote eye” for the user, providing the feeling of being in the place. Further, instead of a static camera, researchers have explored attaching 360° cameras onto local individuals, enabling remote people to share the local person’s perspective, or onto a telepresence robot to enable both immersion and autonomy for the remote individual.
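To make the "remote eye" idea a little more concrete: a common way to present a 360° feed in VR (the usual technique in 360° players, not a detail spelled out in our work) is to map the camera's equirectangular video onto the inside of a sphere centered on the VR camera, so that wherever the viewer turns their head, they see the corresponding slice of the captured scene. Below is a minimal, hypothetical Unity C# sketch of that idea; the inverted sphere, its inward-facing material, and the stream URL are all assumed stand-ins for illustration.

```csharp
using UnityEngine;
using UnityEngine.Video;

// Hedged sketch: plays an equirectangular 360° video on the inside of a
// sphere surrounding the VR camera, the common trick behind "remote eye"
// immersion. Assumes the sphere mesh has inward-facing normals (or a
// shader that renders back faces), so the video is visible from inside.
[RequireComponent(typeof(VideoPlayer))]
public class RemoteEyeSphere : MonoBehaviour
{
    public Renderer sphereRenderer;   // inverted sphere centered on the camera
    public string videoUrl;           // stream or file from the 360° camera (assumed)

    void Start()
    {
        var player = GetComponent<VideoPlayer>();
        player.source = VideoSource.Url;
        player.url = videoUrl;

        // Render the video frames into a texture and show it on the sphere.
        var texture = new RenderTexture(3840, 1920, 0);
        player.renderMode = VideoRenderMode.RenderTexture;
        player.targetTexture = texture;
        sphereRenderer.material.mainTexture = texture;

        player.Play();
    }
}
```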
While VR and 360° cameras can give remote individuals the illusion they’re in the local space, AR can give local individuals the illusion the remote person is there in the space with them. AR technology maps the physical environment and overlays digital content on top of it. A well-known research example is Holoportation, in which an individual in a space surrounded by cameras is seen live in photorealistic full-size 3D video by a person in another space using a Microsoft HoloLens. A commercial application is Spatial, in which people’s facial selfies are mapped onto 3D-illustrated avatars. Individuals see one another’s avatars via a HoloLens or other head-mounted display.
As impressively convincing as these experiences are, there are two common limitations regarding mobility. First, since individuals’ rooms are laid out differently, avatars may appear to walk through walls or stand on tables, breaking the illusion of presence. There are approaches that address this using an appropriately mapped mutual space, but they don’t solve the second limitation: no matter how good the mutual mapping, remote individuals can’t explore a remote environment by themselves. They must be in a meeting with others, only those others can see them, and only within the mapped mutual space.
These three technologies (telepresence robots, VR, and AR) feel so close to what we want but, without a way to integrate them, just miss the mark. What if they could be combined? Life sometimes prepares all the ingredients for you; you just need a little motivation to put them together, and with some added spice, you discover a great new dish.
This is exactly what happened for us. Our motivation had begun building a year before my Microsoft internship. During a hike, Sunny asked about attaching an avatar of a remote person to a telepresence robot to help local individuals feel like the remote person was there with them. I loved the idea and built on it: What if you also attached a 360° camera to the robot to livestream the local space to the remote person in VR to help them feel like they were there? As can happen when you get tied up with life and other research, we didn’t talk much about the idea after that. Little did we know, an opportunity to gather the essential ingredients would present itself: the annual internal Microsoft Hackathon.
Microsoft Hackathon: The perfect opportunity to put an idea into action
While our telepresence tour of the Cambridge lab that first week of my internship may have helped bring the idea back to mind, the hackathon at the end of July spurred us into action. During the hackathon, every Microsoft employee has free time to work on any project they want, with anyone in the company. We told Sean our idea, and he was immediately excited about it. We quickly formed a team of colleagues from Cambridge and Vancouver and turned our eight-hour time difference into an advantage. When the Cambridge group finished working for the day, we handed everything over to the Vancouver group. There was always someone working on the project!
By the end of the week, we had a demo experience hard-coded with Sean as our test remote user. A local user wearing a HoloLens could see a cartoon avatar of Sean standing on a hoverboard, moving with the telepresence robot via marker tracking. On the remote side, Sean wore a Windows Mixed Reality VR headset, through which he got a 360° view of the remote space streaming live from a camera on the robot. Although local individuals needed a HoloLens to see Sean’s avatar, he could drive the robot freely around the local space without needing to be in a meeting with anyone. This brought the space to Sean, helping him feel a sense of ownership akin to that of the local people actually in it. The space was his to explore, his to take in, and his to be present in.
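On the local side, the key trick of the hackathon demo was keeping Sean's avatar registered to the moving robot. The post doesn't detail the implementation beyond "marker tracking," but since the avatar work was done in Unity, the anchoring logic might look something like the following sketch; the `robotMarker` transform (assumed to be updated by whatever marker tracker is in use) and the standing offset are illustrative assumptions, not our exact code.

```csharp
using UnityEngine;

// Hedged sketch: keeps the remote person's avatar glued to the telepresence
// robot as seen through the local user's HoloLens. We assume a marker
// tracker updates the hypothetical `robotMarker` transform with the robot's
// pose each frame; the specific tracking library isn't specified here.
public class AvatarFollowsRobot : MonoBehaviour
{
    [Tooltip("Transform updated by the marker tracker with the robot's pose.")]
    public Transform robotMarker;

    [Tooltip("Root transform of the remote person's avatar.")]
    public Transform avatarRoot;

    [Tooltip("Offset so the avatar appears to stand on the robot's base.")]
    public Vector3 standingOffset = new Vector3(0f, -1.2f, 0f);

    void LateUpdate()
    {
        if (robotMarker == null || avatarRoot == null) return;

        // Place the avatar at the robot's tracked position (plus offset) and
        // keep it facing the way the robot faces, so head and arm gestures
        // later line up with where the remote person appears to be standing.
        avatarRoot.position = robotMarker.position + robotMarker.rotation * standingOffset;
        avatarRoot.rotation = Quaternion.Euler(0f, robotMarker.eulerAngles.y, 0f);
    }
}
```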
We were closer to our vision, but not quite there yet. The avatar was static and non-expressive, and the live video stream had low quality and high latency. Luckily, Sean approached me after the hackathon about pivoting the remainder of my internship to improving VROOM. I jumped at the chance. Sunny was on board, too, and we were joined by Sean’s Research Assistant, Priscilla Wong, who would help us manage a study comparing standard robotic telepresence to VROOM telepresence (expected to be published later this year).
Stepping out of the monitor
Over the next two months, we upgraded the system and ran the user study. The improved VROOM incorporated the following important adjustments:
- We used a newer 360° camera to increase the quality and reduce the latency of the video stream.
- We increased the fidelity of the avatar, changing it from a cartoon to a photorealistic representation of the remote individual. We used the Avatar Maker Pro Unity library to create the head and combined it with a standard animated Unity body.
- We made the avatar more expressive by animating it and rigging the head and arms to move according to remote individuals’ actions, as detected by a gyroscope in the VR headset and the handheld VR controllers, respectively. When remote individuals looked around in the VR view, their avatar’s head turned; when their hands moved, their avatar’s arms moved. We also gave remote individuals a first-person view of their avatar body. When they looked down, they could see their shoulders, arms, torso, legs, and feet. (A simplified sketch of this mapping follows the list.)
- Simple actions initiated by remote individuals triggered animations on their avatar. When they drove the robot, their avatar’s legs walked; when they spoke, the avatar’s mouth opened and closed.
- Some canned animations, such as blinking and slight body movements when the avatar was idle, rounded out the illusion of an embodied version of the remote individual.
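As a rough illustration of the expressiveness mapping described above (not our exact VROOM implementation), here is a hedged Unity C# sketch: the headset's rotation drives the avatar's head bone, the controllers drive arm targets, and microphone loudness opens a mouth blend shape. The bone, blend-shape index, and `micLevel` fields are hypothetical placeholders, since the post doesn't specify the rig.

```csharp
using UnityEngine;
using UnityEngine.XR;

// Hedged sketch of the expressiveness mapping: headset rotation drives the
// avatar's head bone, controller poses drive the hand/arm targets, and
// microphone volume opens and closes a mouth blend shape.
public class AvatarExpressions : MonoBehaviour
{
    public Transform headBone;            // avatar head bone (hypothetical rig)
    public Transform leftHandTarget;      // IK target for the left arm
    public Transform rightHandTarget;     // IK target for the right arm
    public SkinnedMeshRenderer face;      // mesh with a mouth-open blend shape
    public int mouthBlendShapeIndex = 0;  // index of that blend shape

    public float micLevel;                // 0..1 loudness, fed by an audio component

    void LateUpdate()
    {
        // Head: copy the VR headset's orientation onto the head bone.
        InputDevice hmd = InputDevices.GetDeviceAtXRNode(XRNode.Head);
        if (hmd.TryGetFeatureValue(CommonUsages.deviceRotation, out Quaternion headRot))
            headBone.localRotation = headRot;

        // Hands: copy controller poses onto the arm IK targets.
        UpdateHand(XRNode.LeftHand, leftHandTarget);
        UpdateHand(XRNode.RightHand, rightHandTarget);

        // Mouth: open the blend shape in proportion to microphone loudness.
        face.SetBlendShapeWeight(mouthBlendShapeIndex, Mathf.Clamp01(micLevel) * 100f);
    }

    void UpdateHand(XRNode node, Transform target)
    {
        InputDevice device = InputDevices.GetDeviceAtXRNode(node);
        if (device.TryGetFeatureValue(CommonUsages.devicePosition, out Vector3 pos))
            target.localPosition = pos;
        if (device.TryGetFeatureValue(CommonUsages.deviceRotation, out Quaternion rot))
            target.localRotation = rot;
    }
}
```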
For us, the improvements showed the potential to bring telepresence to a whole new level. Remote individuals were finally able to “step out” from the monitor, having the freedom to explore and more fully immerse themselves in the distant space while also expressing themselves more. They could “walk around”, clap, high-five their local counterparts, extend their arms, and move their head. In turn, those in the local space could better understand their intent thanks to nonverbal cues like head direction and arm gestures.
While the underlying technologies still need more unification, we think of VROOM and similar VR and telepresence technologies as a bridge between people, environments, and experiences. VROOM is our story of connecting people.
Special thanks to all the hackathon team members (from left to right in the photo below), who have helped tremendously with this work: Software Development Engineer Leon Lu, Brennan Jones, Sunny Zhang, Software Engineer Minnie Liu, Software Engineer Xu Cao, Sean Rintel, Software Engineer He Huang, Software Engineer Zhao Jun, Senior Researcher James Scott, and Software Development Engineer Matthew Gan (not pictured).