Navigation for the Visually Impaired Using a Google Tango RGB-D Tablet

Introduction

As part of my work at Purdue’s Envision Center, I’ve been investigating the potential of RGB-D sensors for navigation for the visually impaired. There has been some prior work in this area, such as smart canes that contain a depth sensor to give haptic feedback when the cane virtually “touches” objects that are far away. Other efforts involve placing a Kinect onto a helmet and converting the depth image to sound.

So far, most approaches have focused on immediate, memoryless navigation aids: whatever the sensors capture at the current moment is what gets communicated to the user. The approach I'm interested in is accumulating and merging data about the environment into a more complete map of one's surroundings.

For this project, I'm using the Google Project Tango tablet, a 7-inch Android tablet development kit that Google has released. It contains an IR emitter and sensor; think of it as a slower, lower-res Kinect in a tablet. But where the Tango really shines is its pose estimation: it matches visual feature descriptors from a fisheye camera and fuses them with the tablet's inertial sensors to determine the tablet's relative orientation and translation. A callback-based API informs apps of the latest pose, telling what direction the tablet is pointing relative to when it first started recording, and where it is located in XYZ world-space coordinates.

Using this system, my approach is to collect depth information about the environment, preserve it in a chunk-based voxel representation, and use OpenAL to generate 3D audio for sonification, in order to make meaningful sounds that are useful for a visually impaired person who wants to use this system for navigation.

Voxelization

Using the Tango API, I set up an OnTangoUpdateListener that overrides onPoseAvailable and onXyzIjAvailable. The first callback gives me a TangoPoseData containing the translation and rotation relative to whatever reference frame pair I request (I use COORDINATE_FRAME_AREA_DESCRIPTION as the base frame and COORDINATE_FRAME_DEVICE as the target, so I can take advantage of any area learning in the future).
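For reference, the wiring looks roughly like this. This is only a minimal sketch: the exact methods on OnTangoUpdateListener varied between Tango SDK releases (later versions also require onFrameAvailable), the classes live in com.google.atap.tangoservice, and chunkManager is a hypothetical field standing in for my own handoff code.

```java
private void connectTangoCallbacks(Tango tango) {
    // Ask for device poses relative to the area-description frame.
    ArrayList<TangoCoordinateFramePair> framePairs = new ArrayList<TangoCoordinateFramePair>();
    framePairs.add(new TangoCoordinateFramePair(
            TangoPoseData.COORDINATE_FRAME_AREA_DESCRIPTION,
            TangoPoseData.COORDINATE_FRAME_DEVICE));

    tango.connectListener(framePairs, new Tango.OnTangoUpdateListener() {
        @Override
        public void onPoseAvailable(TangoPoseData pose) {
            // pose.translation is a double[3]; pose.rotation is a double[4] quaternion.
            chunkManager.updateDevicePose(pose);   // hypothetical handoff
        }

        @Override
        public void onXyzIjAvailable(TangoXyzIjData xyzIj) {
            chunkManager.addPointCloud(xyzIj);     // hypothetical handoff
        }

        @Override
        public void onTangoEvent(TangoEvent event) { /* not used here */ }
    });
}
```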

The onXyzIjAvailable callback provides a byte array listing the device-space coordinates of each point in the latest point cloud as packed floats (x, y, z, x, y, z, …). The points come in no particular order; the Tango API describes an "ij" array that would let you look up the xyz coordinate of any given point in the depth image, but that "ij" array isn't implemented yet.
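Unpacking that buffer is straightforward. A sketch, assuming the data arrives as a raw byte array as described above (some SDK versions expose it as a FloatBuffer already):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// xyzBytes: the raw buffer from onXyzIjAvailable, packed as x0,y0,z0,x1,y1,z1,...
static float[] unpackPoints(byte[] xyzBytes) {
    FloatBuffer floats = ByteBuffer.wrap(xyzBytes)
            .order(ByteOrder.nativeOrder())
            .asFloatBuffer();
    float[] points = new float[floats.remaining()];
    floats.get(points);
    // points.length / 3 is the number of depth points in this cloud.
    return points;
}
```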

I then hand the raw points off to my ChunkManager. I based much of the architecture here on the very helpful "Let's Make a Voxel Engine" guide, which is a good introduction to the complexity and computation tradeoffs involved in representing an environment with voxels.

The ChunkManager, first and foremost, keeps track of Chunks in a SparseArray. Each Chunk represents the voxel blocks in a certain region of space and contains the logic for buffering that data and sending it to the GPU for rendering. Since every update delivers a collection of unordered points, my collection of Chunks needs a good way to create only the Chunks that are actually needed, rather than statically allocating a large number of Chunks that may go unused, and to quickly find the right Chunk for each incoming point.

But before we can get to that, the points need to be put into the right coordinate space. By default they are in device space, and they need to be in world space. Here OpenCV's Android support comes to the rescue, since multiplying each point by the matrix one at a time in Java would be too slow. OpenCV provides a gemm() function for general matrix multiplication: I first construct a 3×N matrix out of the point coordinates (now converted to a float array), tack on an extra row of all 1's so the homogeneous multiplication works, and multiply it by the device-to-world matrix. The resulting points are all in world space.
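In code, that step looks something like the sketch below. Method and variable names are mine, and deviceToWorld is assumed to be the 4x4 row-major pose matrix built from the latest TangoPoseData.

```java
import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;

// points: packed x,y,z floats in device space; deviceToWorld: 4x4 row-major matrix.
static float[] pointsToWorldSpace(float[] points, float[] deviceToWorld) {
    int n = points.length / 3;

    // Lay the points out as a 4xN matrix: rows of x, y, z, then a row of 1's
    // so the homogeneous multiplication works.
    float[] homogeneous = new float[4 * n];
    for (int i = 0; i < n; i++) {
        homogeneous[i]         = points[3 * i];       // row 0: x
        homogeneous[n + i]     = points[3 * i + 1];   // row 1: y
        homogeneous[2 * n + i] = points[3 * i + 2];   // row 2: z
        homogeneous[3 * n + i] = 1.0f;                // row 3: 1
    }
    Mat devicePoints = new Mat(4, n, CvType.CV_32F);
    devicePoints.put(0, 0, homogeneous);

    Mat m = new Mat(4, 4, CvType.CV_32F);
    m.put(0, 0, deviceToWorld);

    // One gemm call transforms every point at once: world = M * devicePoints.
    Mat worldPoints = new Mat();
    Core.gemm(m, devicePoints, 1.0, new Mat(), 0.0, worldPoints);

    // Rows 0..2 of the result hold the world-space x, y, z of each point.
    float[] world = new float[4 * n];
    worldPoints.get(0, 0, world);
    return world;
}
```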

Now, back to the Chunk organization. Given a bunch of unordered points, we need to know which Chunk each new point should go into. When a Chunk is created, it keeps track of its index value, X/Y/Z each being 1 byte. I need a hash function that, given an input X/Y/Z index, will return the Chunk I already created for that index, or inform me if there is no such Chunk yet.

I use a Morton encoding for this, interleaving the bits of the X/Y/Z bytes to get a hash function. Many thanks to Jeroen Baert’s blog for helping me understand how to do this quickly.
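A sketch of that interleaving, using the standard "spread the bits, then OR" trick for 8-bit indices; the resulting key is what gets used for the SparseArray lookup:

```java
// Spread the low 8 bits of v so that two zero bits separate each original bit.
private static int spreadBits(int v) {
    v &= 0xFF;
    v = (v | (v << 8)) & 0x00F00F;
    v = (v | (v << 4)) & 0x0C30C3;
    v = (v | (v << 2)) & 0x249249;
    return v;
}

// Interleave the three 1-byte chunk indices into a 24-bit Morton key.
static int mortonKey(int x, int y, int z) {
    return (spreadBits(x) << 2) | (spreadBits(y) << 1) | spreadBits(z);
}
```

Looking up a Chunk, or lazily creating one that doesn't exist yet, is then a single get()/put() on the SparseArray with that key.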

Each Chunk contains a 3D array of integers, with each dimension equal to the chunk size (I use 32 blocks to a side). The array gets initialized the first time a point needs to be placed in a region of the voxel structure that does not yet have a Chunk.

Assuming there is a Chunk created, the Chunk logic determines, given the world space coordinates, where in the Chunk this point should go, and increments the corresponding value in the 3D array. It also marks the Chunk as dirty, meaning that it needs to rebuild its data for rerendering.
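A hypothetical sketch of that per-point bookkeeping follows; the 0.1 m block size is an assumed value, not one stated above.

```java
class Chunk {
    static final int CHUNK_SIZE = 32;      // blocks per side
    static final float BLOCK_SIZE = 0.1f;  // meters per block (assumed value)

    final int chunkX, chunkY, chunkZ;      // this Chunk's index in the grid
    final int[][][] hits = new int[CHUNK_SIZE][CHUNK_SIZE][CHUNK_SIZE];
    boolean dirty = false;
    int rebuildCount = 0;                  // used by the denoising check below

    Chunk(int chunkX, int chunkY, int chunkZ) {
        this.chunkX = chunkX;
        this.chunkY = chunkY;
        this.chunkZ = chunkZ;
    }

    // Record one world-space point that the ChunkManager routed to this Chunk.
    void addPoint(float wx, float wy, float wz) {
        // Which block inside this chunk does the point fall into?
        int bx = (int) Math.floor(wx / BLOCK_SIZE) - chunkX * CHUNK_SIZE;
        int by = (int) Math.floor(wy / BLOCK_SIZE) - chunkY * CHUNK_SIZE;
        int bz = (int) Math.floor(wz / BLOCK_SIZE) - chunkZ * CHUNK_SIZE;
        if (bx < 0 || bx >= CHUNK_SIZE
                || by < 0 || by >= CHUNK_SIZE
                || bz < 0 || bz >= CHUNK_SIZE) {
            return;            // the point belongs to a neighboring chunk
        }
        hits[bx][by][bz]++;    // accumulate evidence for this block
        dirty = true;          // needs a rebuild before the next render
    }
}
```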

When rebuilding, the Chunk class iterates through each of its blocks in the 3D array and adds vertex data for a block only if that block's integer value is above a certain threshold. That threshold is based on a counter that is incremented each time the Chunk is rebuilt. This is a simple form of denoising, which is necessary if you want to avoid a system that thinks there are floating blocks in the air that you might collide with. You don't want false positives like that in a navigation system.

In short, the denoising principle is this: if a Chunk has been rebuilt C times, then there were C point clouds that resulted in at least one block being written to that Chunk. If a certain block in that Chunk has been written to B times, then B/C should be high if that block is really there, and low if the block was just noise that appeared a few times at most. I set the threshold to 20%, so a block's vertex data is written only when B/C > 0.2.
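Continuing the hypothetical Chunk sketch above, the check at rebuild time is just:

```java
static final float NOISE_THRESHOLD = 0.2f;

// B = times this block was written, C = times this chunk has been rebuilt.
static boolean blockIsSolid(int blockHits, int chunkRebuilds) {
    return chunkRebuilds > 0
            && (float) blockHits / chunkRebuilds > NOISE_THRESHOLD;
}
```

The rebuild loop walks all 32x32x32 entries and emits cube vertex data only for blocks that pass this test.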

Rendering

The block data itself is just the block index information. When rendering all the Chunks, each Chunk sends the chunk index and the block size as part of the uniforms to the shader programs, and uses instancing to pass in constant vertices for a single cube, which get used for each visible block in the Chunk.
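A sketch of the per-Chunk draw call is below. The uniform names and attribute layout are my own assumptions, and the program/buffer setup is omitted; the point is just the shape of an instanced draw on GLES 3.0.

```java
import android.opengl.GLES30;

// Assumes the shader program and the chunk's buffers are already bound:
// attribute 0 holds the 36 constant cube vertices, attribute 1 the per-block offsets.
void drawChunk(int program, int chunkX, int chunkY, int chunkZ,
               float blockSize, int visibleBlockCount) {
    GLES30.glUseProgram(program);
    GLES30.glUniform3i(GLES30.glGetUniformLocation(program, "uChunkIndex"),
            chunkX, chunkY, chunkZ);
    GLES30.glUniform1f(GLES30.glGetUniformLocation(program, "uBlockSize"), blockSize);

    // Attribute 1 advances once per instance (per visible block), not per vertex.
    GLES30.glVertexAttribDivisor(1, 1);

    // One cube (36 vertices), instanced once per visible block in this chunk.
    GLES30.glDrawArraysInstanced(GLES30.GL_TRIANGLES, 0, 36, visibleBlockCount);
}
```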

As for rendering the scene, I render it four times, onto square viewports facing the front, left, back, and right of the user/device's current position in the scene. This effectively creates a box surrounding the user, with projected images of the outside world on each side of the box. The top and bottom are ignored, on the assumption that up/down matter less for navigation. Each view is rendered as a depth map, with nearby blocks light and far blocks dark.
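A sketch of that four-view pass, under a couple of simplifications: "forward" is fixed to the world -Z axis for brevity (in practice the directions would follow the device's current heading), and drawAllChunks is a hypothetical helper.

```java
import android.opengl.GLES20;
import android.opengl.Matrix;

// position: the device's world-space translation; viewSize: viewport edge in pixels.
void renderSurroundViews(float[] position, int viewSize) {
    float[] proj = new float[16];
    // 90-degree square frusta: four of them tile the full horizon.
    Matrix.perspectiveM(proj, 0, 90.0f, 1.0f, 0.1f, 20.0f);

    for (int i = 0; i < 4; i++) {
        // Front, right, back, left: rotate the look direction 90 degrees each time.
        double yaw = Math.toRadians(90.0 * i);
        float dx = (float) Math.sin(yaw);
        float dz = (float) -Math.cos(yaw);

        float[] view = new float[16];
        Matrix.setLookAtM(view, 0,
                position[0], position[1], position[2],            // eye
                position[0] + dx, position[1], position[2] + dz,  // look target
                0f, 1f, 0f);                                      // world up

        GLES20.glViewport(i * viewSize, 0, viewSize, viewSize);   // four tiles side by side
        drawAllChunks(view, proj);                                // hypothetical helper
    }
}
```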

When rendering this depth map, many areas will be simply black, because no blocks or chunks in that part of the world have been mapped yet. This results in holes that, when converted to sound, may give the impression that there is no obstacle in a particular area. To resolve this, I run each rendered viewport through an OpenCV pass that fills in black areas based on nearby pixels. It is currently a computationally expensive operation and could use further optimization, but it fills the holes in a way that improves as more of the environment is mapped.
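The post doesn't name the exact OpenCV routine; one way to get this kind of behavior is inpainting over a mask of the still-black pixels, roughly:

```java
import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;
import org.opencv.photo.Photo;

// depth: an 8-bit single-channel render of one viewport, where 0 means "unmapped".
static Mat fillHoles(Mat depth) {
    // Mark the exactly-black pixels as the holes to fill.
    Mat holeMask = new Mat();
    Imgproc.threshold(depth, holeMask, 0, 255, Imgproc.THRESH_BINARY_INV);

    // Fill each hole from its surrounding known pixels.
    Mat filled = new Mat();
    Photo.inpaint(depth, holeMask, filled, 5.0, Photo.INPAINT_TELEA);
    return filled;
}
```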

Sonification

Of course, the entire point of this rendering is to get a depth map to convert to sound. (I admit this is a roundabout way of getting depth information, which could otherwise be accomplished using more of an octree structure, but I had already developed the chunk-based rendering at the time.)

Sonification is done by sampling a set of points on each viewport and determining a ray from the user's current position through each point. The color sampled at each point gives the depth along that ray. At this point, a set of positions relative to the user has been found, and the next step is to create sound from them.
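A sketch of that conversion for one sample; the names are mine, (u, v) is the sample's offset from the viewport center in [-1, 1], and viewDir/rightDir/upDir are assumed to be the basis vectors of that viewport.

```java
// Hypothetical helper: turn one sampled pixel into a listener-relative position.
// brightness is the sampled pixel value; maxRange is the far distance in meters.
static float[] sampleToSourcePosition(float u, float v, int brightness, float maxRange,
                                      float[] viewDir, float[] rightDir, float[] upDir) {
    // Nearby blocks were rendered light and far blocks dark, so invert the shade.
    float distance = maxRange * (1.0f - brightness / 255.0f);

    // Ray direction through this sample point on the viewport.
    float dx = viewDir[0] + u * rightDir[0] + v * upDir[0];
    float dy = viewDir[1] + u * rightDir[1] + v * upDir[1];
    float dz = viewDir[2] + u * rightDir[2] + v * upDir[2];
    float len = (float) Math.sqrt(dx * dx + dy * dy + dz * dz);

    // Scale the unit ray by the recovered distance to get the source position.
    return new float[] { dx / len * distance, dy / len * distance, dz / len * distance };
}
```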

I use an Android port of the OpenAL 3D audio library to generate sounds for navigation. OpenAL lets me establish the location and orientation of a listener and a number of audio sources placed around the listener. The gain and pitch of each audio source are adjusted based on the distance between the listener and the source. Each source plays a constant tone, which OpenAL's HRTF code then renders as binaural 3D audio.
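A heavily hedged sketch of one update pass: "AL" here stands in for whatever JNI binding of OpenAL is actually in use (the post doesn't name it), the constants and call shapes mirror the standard OpenAL C API, and the gain/pitch formulas are only illustrative ways of scaling by distance.

```java
// Listener sits at the origin of the listener-relative coordinates computed above.
AL.alListener3f(AL.AL_POSITION, 0f, 0f, 0f);
// Orientation is an "at" vector followed by an "up" vector.
AL.alListenerfv(AL.AL_ORIENTATION, new float[] {
        forward[0], forward[1], forward[2], 0f, 1f, 0f });

for (int i = 0; i < sources.length; i++) {
    float[] p = sourcePositions[i];   // from the viewport sampling step above
    float dist = (float) Math.sqrt(p[0] * p[0] + p[1] * p[1] + p[2] * p[2]);

    AL.alSource3f(sources[i], AL.AL_POSITION, p[0], p[1], p[2]);
    AL.alSourcef(sources[i], AL.AL_GAIN, 1.0f / (1.0f + dist));          // quieter when farther
    AL.alSourcef(sources[i], AL.AL_PITCH, 1.0f + 1.0f / (1.0f + dist));  // higher pitch when closer
}
```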

The result of this is a system that is portable and able to be used with a pair of headphones to give audio cues when navigating an environment in real time. When approaching a wall, I hear a sound in front of me get louder. When I turn to the right, I perceive that sound moving to my left. Because the environment is also rendered behind me, I can have audio eyes on the back of my head.

Video

Here is a quick-and-dirty video recording of the system. I apologize in advance for the poor audio — you will need to have your volume up very high in order to hear the sounds. This is because I attempted to record the audio by placing headphones over the microphone of a Google Glass camera.

Future Work

Future work will involve improving and optimizing various aspects of this sonification pipeline. In addition, I plan to investigate bone-conduction headphones as a way to deliver audio cues without covering up the normal hearing that is so important for people with visual disabilities. Furthermore, the system does not currently account for the difference in pose between a person's body and their head/ears when placing 3D audio. One solution is to mount the tablet hardware on a helmet; another is to keep the Tango tablet chest-mounted and use an additional tracking system to supply an extra view matrix from chest to head.
