
The Dream of a “VR Mall” is a Fantasy

There’s an old and tired mental picture that a lot of people have about VR and commerce: the “virtual mall” — walking from virtual store to virtual store in some Second Life-style 3D environment, going over to items, picking them up with gesture controls, trying them on, and all that.

I think this is precisely the wrong mental model.

GeoCities was built around the idea of websites inhabiting a place in cyberspace that was “near” other similar sites.

Back when people were starting to figure out how the Web worked, there was a lot of focus on “proximity” that mimicked the real world. GeoCities was built on the idea of neighborhoods, each neighborhood devoted to a particular subject, with each site getting a numbered address. There was an idea of being “next” to other people’s webpages, and you could browse the way you would drive down a street, looking at each building in sequence.

There were e-commerce approaches that would attempt to monetize webspace based on whether your site was positioned “next” to a popular site. This is similar to the idea of physical malls, where a few big stores drive traffic to the whole area and the smaller stores benefit from others browsing.

The reason this existed at the time was that search engines had not yet matured; instead, there were hierarchical directories of sites that were maintained by hand. With modern search engines, we don’t have that limitation. When we go onto Amazon to buy something, we aren’t clicking through hyperlinked image maps of a store in some skeuomorphic e-commerce version of Microsoft Bob. We see text and quickly jump back and forth, in and out, open a bunch of tabs.

Microsoft Bob.

We teleport. We link to multiple places at once. We have gained something by doing shopping online that we don’t get in brick-and-mortar stores. When price comparing in real life, can you quickly hop between two stores? Do you find value in walking between different areas of the store instead of having everything a few clicks away?

Janus VR represents webpages as virtual spaces, and links as doorways.

I like systems like Janus VR because they acknowledge the modern Web-using person’s ability to jump around at will. However, I don’t see this as an effective replacement for our normal modes of online interaction, and I don’t think that VR commerce is going to look like this. Any effective attempt at using VR for shopping must enhance the experience, not try to mimic reality with all its disadvantages in an attempt at perfect fidelity.

For this reason, I absolutely dismiss the idea of the “VR store” in most cases. There may be some uses, like clothes shopping, where you want a virtual environment to move around in, but I think that getting that sort of body-scanning “fit” to the point where it’s usable, rather than troublesome and glitchy like a lot of augmented reality applications, is further off.

Now, I do think there is value in VR for shopping in the form of product visualization. I recently saw some neat experimental work using WebGL to show off products for Best Buy, such as a fridge and a washing machine.

Here’s how I predict this will happen. First, over time, some companies will add special features for certain products in their online stores, where you can click and see an in-browser WebGL visualization of the product, like the ones I linked above. This takes development time, so we’re not going to see it at scale, done for every single product, until 3D scanning becomes a lot cheaper, faster, and better — and for many products there’s no additional benefit to a 3D Web visualization (though for some there will be). Think of something like Sketchfab’s viewer, applied to websites that offer 3D printing services. These are companies that already have the 3D assets to show something — that’s the main bottleneck.

An example of a simple WebGL scene (using three.js) being given WebVR support.

Next, as the work of the WebVR group matures, VR integration in the browser (meaning the head tracking and lens distortion) will be accessible via JavaScript APIs, without needing the special dev builds of the browser that you need today.

As the consumer version of the Rift and other VR devices become more widespread, driven by gaming, these existing WebGL visualizations will gain a “go into VR mode” button. There are already some nice boilerplate projects that make it easy to support WebVR in a WebGL project and enter and exit VR mode with a button press. This will initially be a promotional thing or a special feature for a small audience, but it will eventually be built into the 3D Web visualization libraries as default “VR mode” support.

So the model I’m seeing for VR-enabled shopping is definitely not a “walking through a virtual mall and staring at kiosks” simulacrum of reality. And it’s also not “looking at Amazon search results in a bunch of floating 3D windows in VR.” I imagine that the bulk of VR-enabled shopping will be navigation using normal Web pages, some of which will have 3D real-time visualizations with “VR mode” enabled by default. In these cases, you’ll be able to press a button, put on your Rift, look at the object in question from different angles, maybe see it in a sample environment, get a sense of scale, and then take off the Rift and order it from the site.

Item-focused VR visualization is the probable approach, not space-focused VR stores. Otherwise, we would have all already transitioned to something like Second Life’s virtual malls by now.

Interview on Assistive Technology Update about Project Tango for Navigation

I was recently part of an interview on the Assistive Technology Update podcast, discussing the work I’ve been involved with in using the Project Tango tablet for navigation for people who are visually impaired. You can listen to it, or read a transcript, here.

Virtual annotations of the surgical field through an augmented reality transparent display

My first publication is now available online:

Andersen, D.; Popescu, V.; Cabrera, M.; Shanghavi, A.; Gomez, G.; Marley, S.; Mullis, B. & Wachs, J. Virtual annotations of the surgical field through an augmented reality transparent display. The Visual Computer, Springer Berlin Heidelberg, 2015, 1-18.

Abstract

Existing telestrator-based surgical telementoring systems require a trainee surgeon to shift focus frequently between the operating field and a nearby monitor to acquire and apply instructions from a remote mentor. We present a novel approach to surgical telementoring where annotations are superimposed directly onto the surgical field using an augmented reality (AR) simulated transparent display. We present our first steps towards realizing this vision, using two networked conventional tablets to allow a mentor to remotely annotate the operating field as seen by a trainee. Annotations are anchored to the surgical field as the trainee tablet moves and as the surgical field deforms or becomes occluded. The system is built exclusively from compact commodity-level components—all imaging and processing are performed on the two tablets.

The final publication is available at http://link.springer.com/article/10.1007/s00371-015-1135-6.

 

Navigation for the Visually Impaired Using a Google Tango RGB-D Tablet

Introduction

As part of my work at Purdue’s Envision Center, I’ve been investigating the potential of RGB-D sensors for navigation for the visually impaired. There has been some prior work in this area, such as smart canes that contain a depth sensor to give haptic feedback when the cane virtually “touches” objects that are far away. Other efforts involve placing a Kinect onto a helmet and converting the depth image to sound.

So far, most approaches have focused on immediate memory-less navigation aids; whatever is captured by the sensors at the current moment is what is communicated to the user. The approach I’m interested in is accumulating and merging data about the environment into a more complete map of one’s surroundings.

For this project, I’m using the Google Project Tango tablet, a 7-inch Android tablet development kit that Google has released. It contains an IR emitter and sensor — think of it as a slower and lower-res Kinect in a tablet. But where the Tango really shines is in its pose estimation, which combines descriptor matching on images from a fisheye lens with the tablet’s inertial sensors to determine the tablet’s relative orientation and translation. It uses a callback-based API to inform apps of its latest pose, telling what direction the tablet is pointing relative to when it first started recording, and where it is located in XYZ world-space coordinates.

Using this system, my approach is to collect depth information about the environment, preserve it in a chunk-based voxel representation, and use OpenAL to generate 3D audio for sonification, in order to make meaningful sounds that are useful for a visually impaired person who wants to use this system for navigation.

Voxelization

Using the Tango API, I set up an OnTangoUpdateListener that overrides onPoseAvailable and onXyzIjAvailable. The first callback gives me a TangoPoseData containing the translation and rotation, relative to whatever reference frame pair I want (I use COORDINATE_FRAME_AREA_DESCRIPTION and COORDINATE_FRAME_DEVICE as the pair, so I can take advantage of any area learning in the future).

The onXyzIjAvailable callback provides a byte array that lists the device-space coordinates of each point in the latest point cloud, as floats (x, y, z, x, y, z, …). There is no particular order to this; the Tango API describes an “ij” array that would let you, given a point in a depth map, easily look up the xyz coordinate of that point. But that “ij” array isn’t implemented yet.
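
The callback wiring looks roughly like this. It’s a hedged sketch: the Tango class and field names (TangoPoseData, TangoXyzIjData, translation, rotation, xyz, xyzCount) are from the Tango Java API as I remember it, while DepthAccumulator, PointSink, and addPoints are illustrative stand-ins for my own plumbing and for the ChunkManager described below.

```java
import com.google.atap.tangoservice.TangoPoseData;
import com.google.atap.tangoservice.TangoXyzIjData;

// Sketch of what my two listener callbacks do; connecting the listener to the
// Tango service is omitted. PointSink stands in for the ChunkManager.
public class DepthAccumulator {

    public interface PointSink {
        void addPoints(float[] deviceSpacePoints, double[] translation, double[] rotation);
    }

    private final PointSink chunkManager;
    private volatile TangoPoseData latestPose;

    public DepthAccumulator(PointSink chunkManager) {
        this.chunkManager = chunkManager;
    }

    // Latest device pose (area-description frame to device frame).
    public void onPoseAvailable(TangoPoseData pose) {
        latestPose = pose;
    }

    // Latest point cloud: device-space floats packed as (x, y, z, x, y, z, ...).
    public void onXyzIjAvailable(TangoXyzIjData xyzIj) {
        TangoPoseData pose = latestPose;
        if (pose == null) return;

        float[] points = new float[xyzIj.xyzCount * 3];
        xyzIj.xyz.rewind();
        xyzIj.xyz.get(points);

        // Hand the raw device-space points and the pose off to the chunk manager,
        // which transforms them to world space and bins them into voxels.
        chunkManager.addPoints(points, pose.translation, pose.rotation);
    }
}
```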

I then hand off the raw points to my ChunkManager. I based a lot of my architecture here off of the very helpful “Let’s Make a Voxel Engine” guide. This guide is useful for helping to understand the complexity and computation tradeoffs that need to be made when doing a voxel representation of an environment.

The ChunkManager, first and foremost, keeps track of Chunks in a SparseArray. Each Chunk represents the voxel blocks in a certain region of space, and contains the logic for buffering that data and sending it to the GPU for rendering. My collection of Chunks needs a good way to create only the Chunks that are needed, rather than statically allocating a large number of Chunks that may go unused, given that the input on every update is a collection of unordered points.

But before we can get to that, these points need to be put into the right coordinate space. By default, they are all in device space, and they need to be in world space. Here OpenCV’s Android support comes to the rescue, since doing the matrix multiplication for each point in sequence would be too slow. OpenCV comes with a gemm() function for general matrix multiplication. I first construct a 3×N matrix out of the array of point coordinates (now converted to a float array), tack on an extra row of all 1’s to make the multiplication work, and then multiply it by the device-to-world matrix. The resulting points are all in world space.
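
A sketch of that batched transform using OpenCV’s Java bindings (the method name is illustrative, and the 3×N construction plus the row of 1’s are folded into a single packing loop here):

```java
import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;

// Batch-transform device-space points (x, y, z, x, y, z, ...) into world space
// with a single gemm() call instead of a per-point Java loop. deviceToWorld is
// a 4x4 row-major matrix derived from the Tango pose (its assembly is omitted).
public static float[] toWorldSpace(float[] points, float[] deviceToWorld) {
    int n = points.length / 3;

    // Build the homogeneous 4xN matrix: three rows of coordinates plus a row of 1's.
    float[] homoData = new float[4 * n];
    for (int i = 0; i < n; i++) {
        homoData[i]         = points[3 * i];      // row 0: x
        homoData[n + i]     = points[3 * i + 1];  // row 1: y
        homoData[2 * n + i] = points[3 * i + 2];  // row 2: z
        homoData[3 * n + i] = 1f;                 // row 3: 1
    }
    Mat homo = new Mat(4, n, CvType.CV_32F);
    homo.put(0, 0, homoData);

    Mat m = new Mat(4, 4, CvType.CV_32F);
    m.put(0, 0, deviceToWorld);

    // world = deviceToWorld * homo (general matrix multiply on the native side)
    Mat world = new Mat();
    Core.gemm(m, homo, 1.0, Mat.zeros(4, n, CvType.CV_32F), 0.0, world);

    // Read the top three rows back into a flat (x, y, z, ...) array.
    float[] worldData = new float[4 * n];
    world.get(0, 0, worldData);
    float[] out = new float[3 * n];
    for (int i = 0; i < n; i++) {
        out[3 * i]     = worldData[i];
        out[3 * i + 1] = worldData[n + i];
        out[3 * i + 2] = worldData[2 * n + i];
    }
    return out;
}
```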

Now, back to the Chunk organization. Given a bunch of unordered points, we need to know which Chunk each new point should go into. When a Chunk is created, it keeps track of its index value, X/Y/Z each being 1 byte. I need a hash function that, given an input X/Y/Z index, will return the Chunk I already created for that index, or inform me if there is no such Chunk yet.

I use a Morton encoding for this, interleaving the bits of the X/Y/Z bytes to get a hash function. Many thanks to Jeroen Baert’s blog for helping me understand how to do this quickly.
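
A minimal sketch of that interleaving, using the standard “magic bits” spreading (the x/y/z ordering is arbitrary, and this isn’t necessarily the exact code I use):

```java
public final class MortonHash {

    // Interleave the bits of three 8-bit chunk indices into one integer key
    // for the SparseArray of Chunks.
    public static int encode(int x, int y, int z) {
        return (spread(x) << 2) | (spread(y) << 1) | spread(z);
    }

    // Spread the low 8 bits of v so that bit i lands at bit 3*i.
    private static int spread(int v) {
        v &= 0xFF;
        v = (v | (v << 16)) & 0xFF0000FF;
        v = (v | (v << 8))  & 0x0300F00F;
        v = (v | (v << 4))  & 0x030C30C3;
        v = (v | (v << 2))  & 0x09249249;
        return v;
    }
}
```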

Each Chunk contains a 3D array of integers, each side having a length of the chunk size (I use 32 to a side). This gets initialized the first time a point needs to be placed in the voxel structure but there is not yet a Chunk for it.

Assuming there is a Chunk created, the Chunk logic determines, given the world space coordinates, where in the Chunk this point should go, and increments the corresponding value in the 3D array. It also marks the Chunk as dirty, meaning that it needs to rebuild its data for rerendering.

When rebuilding, the Chunk class iterates through each of its blocks in the 3D array, and adds vertex data for a block only if that block’s integer value is above a certain threshold. That threshold is based on a counter that is incremented each time the Chunk is rebuilt. This is a simple form of denoising, which is necessary if you want to avoid a system that thinks there are floating blocks in the air that you might collide with. You don’t want false positives like that in a navigation system.

In short, the denoising principle is this: if a Chunk has been rebuilt C times, then that means there were C point clouds that resulted in at least one block being written to that Chunk. If a certain block in that Chunk has been written to B times, then the value B/C should be high if that block is really there, and B/C should be low if that block was just noise that happened a few times at most. I set the threshold to 20%, so B/C > 0.2 to write the block data.
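
In code, that test amounts to something like the following (a minimal sketch; the field and method names are illustrative, not the project’s actual code):

```java
// Per-Chunk denoising bookkeeping: C counts rebuilds, B counts writes per block.
public class ChunkDenoising {
    private static final float THRESHOLD = 0.2f;  // B/C must exceed 20%

    private final int size;        // blocks per side (32 in my case)
    private final int[][][] hits;  // B: how many times each block was written
    private int rebuildCount;      // C: how many times this Chunk has been rebuilt

    public ChunkDenoising(int size) {
        this.size = size;
        this.hits = new int[size][size][size];
    }

    public void recordHit(int x, int y, int z) { hits[x][y][z]++; }

    public void onRebuild() { rebuildCount++; }

    // A block contributes geometry only if it was written often enough,
    // relative to the number of rebuilds, to be real rather than sensor noise.
    public boolean isSolid(int x, int y, int z) {
        return rebuildCount > 0
                && (float) hits[x][y][z] / rebuildCount > THRESHOLD;
    }
}
```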

Rendering

The block data itself is just the block index information. When rendering all the Chunks, each Chunk sends the chunk index and the block size as part of the uniforms to the shader programs, and uses instancing to pass in constant vertices for a single cube, which get used for each visible block in the Chunk.

As for rendering the scene, I render it 4 times, on square viewports to the front/left/back/right of the user/device’s current position in the scene. This effectively creates a box that surrounds the user, with projected images of the world outside on each side of the box. Top and bottom are ignored, assuming that up/down are less important for navigation. It is rendered as a depth map, with nearby blocks light and far blocks dark.
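
A sketch of how those four views can be set up with android.opengl.Matrix (the method name and the eye/forward/right inputs are illustrative); each viewport is then paired with a 90-degree perspective projection so the four faces close up into a box around the user:

```java
import android.opengl.Matrix;

// Build the four view matrices for the front/left/back/right renders around the
// device's current world-space position. "forward" and "right" are the device's
// horizontal heading vectors; up is world +Y, and top/bottom faces are skipped.
public static float[][] buildViewMatrices(float[] eye, float[] forward, float[] right) {
    float[][] views = new float[4][16];
    float[][] dirs = {
        forward,                                   // front
        { -right[0], -right[1], -right[2] },       // left
        { -forward[0], -forward[1], -forward[2] }, // back
        right                                      // right
    };
    for (int i = 0; i < 4; i++) {
        Matrix.setLookAtM(views[i], 0,
                eye[0], eye[1], eye[2],                                        // eye
                eye[0] + dirs[i][0], eye[1] + dirs[i][1], eye[2] + dirs[i][2], // look target
                0f, 1f, 0f);                                                   // world up
    }
    return views;
}
```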

When rendering this depth map, there will be many areas that are simply black, because no blocks or chunks in that part of the world have been mapped yet. This results in holes that, when converted to sound, may give the impression that there is no obstacle in a particular area. To resolve this, I run each rendered viewport through an OpenCV algorithm that fills in black areas based on nearby pixels. It is currently a computationally expensive operation and could use further optimization, but it helps to fill holes in a way that improves as more of the environment is mapped.
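
One plausible way to do that kind of fill with OpenCV (not necessarily the exact routine used here) is inpainting over a mask of the unmapped pixels; a minimal sketch, assuming an 8-bit single-channel depth render:

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Scalar;
import org.opencv.photo.Photo;

// Fill unmapped (black) pixels in an 8-bit depth render by propagating
// neighboring depth values into them via inpainting.
public static Mat fillHoles(Mat depth8u) {
    // Mask of pixels that are exactly 0, i.e. no chunk data rendered there.
    Mat mask = new Mat();
    Core.compare(depth8u, new Scalar(0), mask, Core.CMP_EQ);

    Mat filled = new Mat();
    Photo.inpaint(depth8u, mask, filled, 3.0, Photo.INPAINT_TELEA);
    return filled;
}
```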

Sonification

Of course, the entire point of this rendering is to get a depth map to convert to sound (I admit that this is a roundabout way of getting depth information, which could be otherwise accomplished using more of an octree structure, but I had already developed the chunk-based rendering at the time).

Sonification is done by sampling a set of points on each viewport and determining a ray from the user’s current position through each point; the color sampled at each point determines the depth along that ray. At this point, a set of positions relative to the user has been found, and the next step is to create sound based on them.
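
As a concrete sketch of that step (the linear depth mapping and the names here are illustrative assumptions):

```java
// Turn one depth-map sample into a 3D position relative to the listener.
// brightness is the sampled pixel value in [0, 1] (1 = nearest, 0 = unmapped/far),
// rayDir is the unit direction from the user through that pixel, and maxRange
// is the distance that maps to black.
public static float[] sampleToPosition(float brightness, float[] rayDir, float maxRange) {
    float distance = (1f - brightness) * maxRange;  // invert the near-light/far-dark encoding
    return new float[] {
        rayDir[0] * distance,
        rayDir[1] * distance,
        rayDir[2] * distance
    };
}
```

Each resulting position then drives the placement of one of the audio sources described next.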

I use an Android port of the OpenAL 3D audio library to generate sounds for navigation. OpenAL allows me to establish the location and orientation of a listener and a certain number of audio sources located around the listener. The gain and pitch of each audio source are adjusted based on the distance between the listener and the source. Each source plays a constant tone, which is then processed by OpenAL’s HRTF code to produce audio that sounds 3D and binaural.

The result of this is a system that is portable and able to be used with a pair of headphones to give audio cues when navigating an environment in real time. When approaching a wall, I hear a sound in front of me get louder. When I turn to the right, I perceive that sound moving to my left. Because the environment is also rendered behind me, I can have audio eyes on the back of my head.

Video

Here is a quick-and-dirty video recording of the system. I apologize in advance for the poor audio — you will need to have your volume up very high in order to hear the sounds. This is because I attempted to record the audio by placing headphones over the microphone of a Google Glass camera.

Future Work

Future work will involve improving and optimizing various aspects of this sonification pipeline. In addition, I plan to investigate bone-conduction headphones in order to deliver audio cues without covering up the normal hearing that is so important for people with visual disabilities. Furthermore, this system does not currently address the difference in 3D audio pose between a person’s body and the person’s head/ears. One solution is to mount the tablet hardware on a helmet; another is to keep the Tango tablet chest-mounted, use an additional body-tracking system, and add an extra view matrix from chest to head.

STAR: System for Telementoring with Augmented Reality

Over this past year, I’ve been working with an interdisciplinary team on a research project to use augmented reality to enhance surgical telementoring. With the System for Telementoring with Augmented Reality (STAR), we want to increase the mentor’s and trainee’s sense of co-presence through an augmented visual channel, which we expect to lead to measurable improvements in the trainee’s surgical performance.

Here is a video demonstrating the current system (as of April 2015):


You can find more information on our project page.

Architecture

One difficult transition when going from industry to the academic world is the difference in software architecture. When you’re a software engineer, you need to write maintainable code that others will inherit, and you inherit existing codebases yourself, so it’s important to structure your applications clearly, use design patterns, and generally pay attention to the shape of your code, since the code is the deliverable.

In research, the code is only a stepping stone toward the data that you need. As a result, it seems to be important just to build something that gets the job done; it’s hard to properly find time to architect things formally.

I’m not sure how I feel about this. There is something pleasing about quickly putting together Python or MATLAB code to get the results you need, but I feel a sense of “wrongness” when I don’t get the opportunity to make my code good and extensible. There’s a tendency in academic coding to incur high technical debt, on the assumption that you won’t need the code much in the future anyway. When you’re rewarded not on the quality of the code but on the output, the publication, how can code quality not suffer?

Toolboxes

One of the interesting challenges of academic research is just becoming familiar with the mass of prior research that’s out there. It’s helpful to know that someone out there has almost certainly had almost the same problem as you — but knowing how to find it is difficult. Often it’s a matter of terminology; knowing what the name is for some technique that would be helpful, or even knowing the name for the problem you’re trying to solve, is the key to unlocking so much.

Several of my courses this semester, particularly my graphics course, are focusing on toolboxes: collections of existing methods that might be useful for certain classes of problems. What’s funny is that as soon as you learn about something, you start to see it everywhere (the “Baader-Meinhof phenomenon,” as it’s been called).

For example, knowing that my current research would involve learning more about computer vision, I started reading Programming Computer Vision with Python, which introduced concepts like normalized cross-correlation for finding similarity between portions of images. Around the same time, my advisor tasked me with investigating point-to-point matching between a reference frame and a current frame. Simultaneously, my graphics course began covering convolutions and correlations. This convergence of information from different sources has helped me understand the concept better, and also see potential ways to apply it in my research.
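
For reference, normalized cross-correlation itself is a small formula; here is a minimal sketch over two flattened, equal-size grayscale patches:

```java
// Normalized cross-correlation of two equal-size patches: subtract each patch's
// mean, then divide the dot product by the product of the standard deviations.
// The result is in [-1, 1]; 1 means identical up to brightness and contrast.
public static double ncc(double[] a, double[] b) {
    int n = a.length;
    double meanA = 0, meanB = 0;
    for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
    meanA /= n;
    meanB /= n;

    double num = 0, varA = 0, varB = 0;
    for (int i = 0; i < n; i++) {
        double da = a[i] - meanA, db = b[i] - meanB;
        num  += da * db;
        varA += da * da;
        varB += db * db;
    }
    return num / Math.sqrt(varA * varB);
}
```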

GPU-Accelerated 3D Ant Colony Simulation and Visualization


Project available on GitHub

This is the final project I did during my first semester of graduate school, for Purdue University’s Fall 2014 CS535 (Interactive Computer Graphics) course.

Ant colony optimization algorithms are useful for pathfinding, such as in robotics and distributed networks. In this simulated scenario, a collection of ants can converge on common solutions to finding nearby food, despite each ant only having knowledge of its immediate surroundings. This particular quality makes ant colony simulation suitable for parallelization and GPGPU.

The basic setup of the simulation is that ants wander randomly from a central nest point, laying a pheromone trail behind them. They pick up food when they encounter it and return it to the nest. Other ants will follow existing trails, strengthening them over time.

The world is simulated using a pair of ping-ponged 3D textures (of size N x N x N). The world state for each cell is encoded in the RGBA values of each pixel: red means there is a nest in the cell, the green level shows how much food remains at the cell, the blue level shows the pheromone trail strength, and alpha is used to mark if an ant is currently present in the cell.

The world simulation is done with a fragment shader that increases the pheromone trail if an ant is in the cell, and otherwise fades the pheromone trail over time.

Simulation of ants is done with a separate pair of 2D textures (of size M x 1). Each pixel represents an ant. The RGB value represents the ant’s XYZ position in the world, and the alpha channel uses bitmasks to encode information about the direction the ant is facing, and whether or not the ant is carrying any food.
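
As an illustration of that packing (the real version lives in the shader, and the exact bit layout here is an assumption: say, five bits of facing direction and one food-carrying flag):

```java
// Hypothetical layout of the per-ant state stored in the alpha channel:
// bits 0-4 hold a facing-direction index, bit 5 flags whether food is carried.
public final class AntState {
    private static final int DIRECTION_MASK = 0x1F; // bits 0-4
    private static final int FOOD_BIT       = 0x20; // bit 5

    public static int pack(int direction, boolean carryingFood) {
        int state = direction & DIRECTION_MASK;
        if (carryingFood) state |= FOOD_BIT;
        return state;
    }

    public static int direction(int state) {
        return state & DIRECTION_MASK;
    }

    public static boolean carryingFood(int state) {
        return (state & FOOD_BIT) != 0;
    }
}
```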

The fragment shader that simulates ant behavior does texture lookups on the current world texture to find valid and desired movement locations in a cone in front of the ant. Depending on whether the ant is at food or the nest, it will pick up or drop off food. If there is a trail in front of the ant, the ant will move to the cell in front of it with the strongest trail (though there is an option for the ant to move randomly instead, thus preventing the ant from getting stuck in loops of its own trail).

The volume is rendered using a geometry shader, which takes as input a single vertex at each grid point and outputs a set of triangles for a volumetric mesh near that point. It uses the marching cubes algorithm, which evaluates the trail/nest/food/ant value at the 8 corners of each grid cell, looks up one of 256 possible triangle configurations, and uses linear interpolation to determine vertex locations.

Swarm behavior is sensitive to variation in the ants’ path-finding algorithm: ants can get stuck in their own trails — though this can also happen in the real world (e.g., ant mills).

The simulation runs easily at real-time rates at a world size of 128 x 128 x 128.

Oculus Rift 3D Medical Scan Viewer — University of Utah Bench-2-Bedside

For the past few months I have been working on a WebGL-based Chrome packaged app for viewing medical CT scans in a virtual environment using the Oculus Rift. This is for the University of Utah Bench-2-Bedside competition, which will be held April 9th. Here is a video of a couple of my team members demonstrating the system.