Hand-held transparent displays are important infrastructure for augmented reality applications. Truly transparent displays are not yet feasible in hand-held form, and a promising alternative is to simulate transparency by displaying the image the user would see if the display were not there. Previous simulated transparent displays have important limitations, such as being tethered to auxiliary workstations, requiring the user to wear obtrusive head-tracking devices, or lacking the depth acquisition support that is needed for an accurate transparency effect for close-range scenes.
We describe a general simulated transparent display and three prototype implementations (P1, P2, and P3), which take advantage of emerging mobile devices and accessories. P1 uses an off-the-shelf smartphone with built-in head-tracking support; P1 is compact and suitable for outdoor scenes, providing an accurate transparency effect for scene distances greater than 6m. P2 uses a tablet with a built-in depth camera; P2 is compact and suitable for short-distance indoor scenes, but the user has to hold the display in a fixed position. P3 uses a conventional tablet enhanced with on-board depth acquisition and head tracking accessories; P3 compensates for user head motion and provides accurate transparency even for close-range scenes. The prototypes are hand-held and self-contained, without the need of auxiliary workstations for computation.
As part of my work at Purdue’s Envision Center, I’ve been investigating the potential of RGB-D sensors for navigation for the visually impaired. There has been some prior work in this area, such as smart canes that contain a depth sensor to give haptic feedback when the cane virtually “touches” objects that are far away. Other efforts involve placing a Kinect onto a helmet and converting the depth image to sound.
So far, most approaches have focused on immediate memory-less navigation aids; whatever is captured by the sensors at the current moment is what is communicated to the user. The approach I’m interested in is accumulating and merging data about the environment into a more complete map of one’s surroundings.
For this project, I’m using the Google Project Tango tablet, which is a 7-inch Android tablet development kit that Google has released. It contains an IR emitter and sensor — think of it as a slower and lower-res Kinect in a tablet. But where the Tango really shines is in its pose estimation, which uses descriptor-matching techniques from images via a fisheye lens, combined with the inertial sensors on the tablet, to determine the relative orientation and translation of the tablet. It uses a callback-based API to inform apps of its latest pose, telling what direction the tablet is pointing relative to when it first started recording, and also where it is located in XYZ world-space coordinates.
Using this system, my approach is to collect depth information about the environment, preserve it in a chunk-based voxel representation, and use OpenAL to generate 3D audio for sonification, in order to make meaningful sounds that are useful for a visually impaired person who wants to use this system for navigation.
Using the Tango API, I set up an OnTangoUpdateListener that overrides onPoseAvailable and onXyzIjAvailable. The first callback gives me a TangoPoseData containing the translation and rotation, relative to whatever reference frame I want (I choose to base everything between the COORDINATE_FRAME_AREA_DESCRIPTION and the COORDINATE_FRAME_DEVICE, so I can take advantage of any area learning in the future).
The onXyzIjAvailable callback provides a byte array that lists out the device-space coordinates of each point in the latest point cloud, in floats (x, y, z, x, y, z, …). There is no particular order to this; the Tango API describes an “ij” array that would allow one to, given a point in a depth map, easily access what the xyz coordinate is of that point. But that “ij” array isn’t implemented yet.
I then hand off the raw points to my ChunkManager. I based a lot of my architecture here off of the very helpful “Let’s Make a Voxel Engine” guide. This guide is useful for helping to understand the complexity and computation tradeoffs that need to be made when doing a voxel representation of an environment.
The ChunkManager, first and foremost, keeps track of Chunks in a SparseArray. Each Chunk will represent the voxel blocks in a certain region of space, and will contain the logic for buffering this data and sending it to the GPU for rendering. My collection of Chunks needs a good way to only create the Chunks that are needed, rather than statically allocating a large number of Chunks that may be unused. Given that the input every update is a collection of unordered points
But before we can get to that, these points need to be put into the right coordinate space. By default, these points are all in device space, and they need to be in world space. Here OpenCV’s Android support comes to the rescue, as this would be too slow to do matrix multiplication for each point in sequence. OpenCV comes with a gemm() function to do general matrix multiplication. I first construct a matrix 3*N out of the array of point coordinates (now converted to a float array), then tack on an extra row of all 1’s to make the multiplication work, then multiply it by the device-to-world matrix. Now the resulting points are all in world space.
Now, back to the Chunk organization. Given a bunch of unordered points, we need to know which Chunk each new point should go into. When a Chunk is created, it keeps track of its index value, X/Y/Z each being 1 byte. I need a hash function that, given an input X/Y/Z index, will return the Chunk I already created for that index, or inform me if there is no such Chunk yet.
I use a Morton encoding for this, interleaving the bits of the X/Y/Z bytes to get a hash function. Many thanks to Jeroen Baert’s blog for helping me understand how to do this quickly.
Each Chunk contains a 3D array of integers, each side having a length of the chunk size (I use 32 to a side). This gets initialized the first time a point needs to be placed in the voxel structure but there is not yet a Chunk for it.
Assuming there is a Chunk created, the Chunk logic determines, given the world space coordinates, where in the Chunk this point should go, and increments the corresponding value in the 3D array. It also marks the Chunk as dirty, meaning that it needs to rebuild its data for rerendering.
When rebuilding, the Chunk class iterates through each of its blocks in the 3D array, and adds vertex data for that block only if that block’s integer value is above a certain threshold. That threshold is based on a value that each time the Chunk is rebuilt. This is a simple form of denoising, which is necessary if you want to avoid a system that thinks there are floating blocks in the air that you may collide with. You don’t want false positives like that in a navigation system.
In short, the denoising principle is this: if a Chunk has been rebuilt C times, then that means there were C point clouds that resulted in at least one block being written to that Chunk. If a certain block in that Chunk has been written to B times, then the value B/C should be high if that block is really there, and B/C should be low if that block was just noise that happened a few times at most. I set the threshold to 20%, so B/C > 0.2 to write the block data.
The block data itself is just the block index information. When rendering all the Chunks, each Chunk sends the chunk index and the block size as part of the uniforms to the shader programs, and uses instancing to pass in constant vertices for a single cube, which get used for each visible block in the Chunk.
As for rendering the scene, I render it 4 times, on square viewports to the front/left/back/right of the user/device’s current position in the scene. This effectively creates a box that surrounds the user, with projected images of the world outside on each side of the box. Top and bottom are ignored, assuming that up/down are less important for navigation. It is rendered as a depth map, with nearby blocks light and far blocks dark.
When rendering this depth map, there will be many areas that are simply black, because there are no blocks or chunks in that area of the world that have been mapped yet. This results in holes that, when converted to sound, may give the impression that there is no obstacle in a particular area. To resolve this, I run each rendered viewpoint through an OpenCV algorithm that fills in black areas based on nearby pixels. It is currently a computationally expensive operation and could use further optimization, but it helps to fill holes in a way that improves as more of the environment is mapped.
Of course, the entire point of this rendering is to get a depth map to convert to sound (I admit that this is a roundabout way of getting depth information, which could be otherwise accomplished using more of an octree structure, but I had already developed the chunk-based rendering at the time).
Sonification is done by sampling a set of points on each viewport, determining a ray from the user’s current position through each point. The color sampled at each point determines the depth on that ray. At this point, a set of positions relative to the user have been found, and the next step is to create sound based on this.
I use an Android port of the OpenAL 3D audio library to generate sounds for navigation. OpenAL allows me to establish the location and orientation of a listener and a certain number of audio sources located around the listener. The gain and pitch of each audio source is adjusted based on the distance between the listener and the source. Each source is given a constant tone sound to play, which is then modulated by OpenAL’s HRTF code to output sound that sounds 3D and binaural.
The result of this is a system that is portable and able to be used with a pair of headphones to give audio cues when navigating an environment in real time. When approaching a wall, I hear a sound in front of me get louder. When I turn to the right, I perceive that sound moving to my left. Because the environment is also rendered behind me, I can have audio eyes on the back of my head.
Here is a quick-and-dirty video recording of the system. I apologize in advance for the poor audio — you will need to have your volume up very high in order to hear the sounds. This is because I attempted to record the audio by placing headphones over the microphone of a Google Glass camera.
Future work will involve improving and optimizing various aspects of this sonification pipeline. In addition, I plan to investigate using bone conduction headphones in order to deliver audio cues without covering up the normal hearing that is important for people with visual disabilities. Furthermore, this system does not currently address the difference in 3D audio pose between a person’s body and the person’s head/ears. One solution is to mount the tablet hardware as a helmet, and another is to use additional body tracking systems to keep the Tango tablet chest-mounted while adding an additional view matrix from chest to head.
Over this past year, I’ve been working with an interdisciplinary team on a research project to use augmented reality to enhance surgical telementoring. With the System for Telementoring with Augmented Reality (STAR), we want to increase the mentor and trainee sense of co-presence through an augmented visual channel that will lead to measurable improvements in the trainee’s surgical performance.
Here is a video demonstrating the current system (as of April 2015):
Ant colony optimization algorithms are useful for pathfinding, such as in robotics and distributed networks. In this simulated scenario, a collection of ants can converge on common solutions to finding nearby food, despite each ant only having knowledge of its immediate surroundings. This particular quality makes ant colony simulation suitable for parallelization and GPGPU.
The basic setup of the simulation is that ants wander randomly from a central nest point, laying a pheromone trail behind them. They pick up food when they encounter it and return it to the nest. Other ants will follow existing trails, strengthening them over time.
The world is simulated using a pair of ping-ponged 3D textures (of size N x N x N). The world state for each cell is encoded in the RGBA values of each pixel: red means there is a nest in the cell, the green level shows how much food remains at the cell, the blue level shows the pheromone trail strength, and alpha is used to mark if an ant is currently present in the cell.
The world simulation is done with a fragment shader that increases the pheromone trail if an ant is in the cell, and otherwise fades the pheromone trail over time.
Simulation of ants is done with a separate pair of 2D textures (of size M x 1). Each pixel represents an ant. The RGB value represents the ant’s XYZ position in the world, and the alpha channel uses bitmasks to encode information about the direction the ant is facing, and whether or not the ant is carrying any food.
The fragment shader that simulates ant behavior does texture lookups on the current world texture to find valid and desired movement locations in a cone in front of the ant. Depending on whether the ant is at food or the nest, it will pick up or drop off food. If there is a trail in front of the ant, the ant will move to the cell in front of it with the strongest trail (though there is an option for the ant to move randomly instead, thus preventing the ant from getting stuck in loops of its own trail).
The volume is rendered using a geometry shader, taking as input a single vertex at each grid point, and outputting a set of triangles for a volumetric mesh near that point. It uses the marching cubes algorithm, which evaluates the trail/nest/food/ant value at a point and at 8 surrounding points, does a lookup of 256 possible triangle configurations, and uses linear interpolation to determine vertex locations.
Swarm behavior of ants is sensitive to variation in ant path-finding algorithm: ants can get stuck in their own trail — though this can also happen in the real world (e.g. ant mills).
The simulation runs easily at real-time rates at a world size of 128 x 128 x 128.