As part of my research work using the Microsoft HoloLens, I recently had to find a way to render different imagery to each eye (for example, running a shader with a different texture for geometry depending on whether I was viewing it with the left or right eye).
Initially, I had tried the approach discussed by davidlively in the Unity forums, where we take advantage of the fact that OnPreRender() is called twice every frame when rendering in stereo. In this approach, we keep track of a variable for the active eye (0 for left, 1 for right), and alternate between those values (and update a flag in a shader) on each call of OnPreRender(). However, while this is said to work for the Oculus Rift, the HoloLens is different (some pre-optimization?) and this approach had no effect.
Thanks to huntertank on the HoloLens forum for mentioning an approach that seems to work: in Unity, create two separate Camera objects, and make sure they are placed on separate layers. In the Camera settings in the Inspector, set “Target Eye” to Left for one eye, and Right for the other. Attach a script with an OnPreRender() function to each Camera, and in that OnPreRender() function set the uniform value in your shader to left or right.
The result of this is that I can, for example, render scene geometry in red in one eye and render it in blue in the other eye.
I recently started using the Microsoft HoloLens for some graphics research work. As part of this, I needed to capture imagery of the display by placing a camera behind the display and recording while the HoloLens was in a dark environment (the typical mixed-reality capture was unsuitable for my needs).
However, because the HoloLens uses the RGB cameras as part of its spatial mapping features, placing the HoloLens into a dark environment causes its spatial tracking to fail. This would not be an issue for my immediate research needs, except that when spatial tracking fails in a Unity HoloLens application, the application pauses and displays a “Trying to map your surroundings…” splash screen until spatial mapping is possible again (i.e. until the lights are turned back on).
Normally, to suppress this splash screen, one would go into Edit -> Project Settings -> Player, and uncheck the “On Tracking Loss Pause and Show Image” checkbox (as described here).
However, because I am using a Unity Personal license, this checkbox was disabled. This is a consequence of the fact that all splash-screen-related settings are disabled in Unity Personal, and this developer build of Unity had placed this rather important setting into that same category.
Thankfully, I was able to find a workaround that allowed me to suppress that splash screen despite tracking loss.
Under [project directory]\ProjectSettings\ProjectSettings.asset, find the line that says:
And change it to:
I hope that in future versions of the Unity HoloLens build, this setting is made editable in the editor even for Unity Personal users. The spirit and rationale of the splash-screen limitation shouldn’t extend to a feature that is a vital piece of control in a HoloLens application.
Hand-held transparent displays are important infrastructure for augmented reality applications. Truly transparent displays are not yet feasible in hand-held form, and a promising alternative is to simulate transparency by displaying the image the user would see if the display were not there. Previous simulated transparent displays have important limitations, such as being tethered to auxiliary workstations, requiring the user to wear obtrusive head-tracking devices, or lacking the depth acquisition support that is needed for an accurate transparency effect for close-range scenes.
We describe a general simulated transparent display and three prototype implementations (P1, P2, and P3), which take advantage of emerging mobile devices and accessories. P1 uses an off-the-shelf smartphone with built-in head-tracking support; P1 is compact and suitable for outdoor scenes, providing an accurate transparency effect for scene distances greater than 6m. P2 uses a tablet with a built-in depth camera; P2 is compact and suitable for short-distance indoor scenes, but the user has to hold the display in a fixed position. P3 uses a conventional tablet enhanced with on-board depth acquisition and head tracking accessories; P3 compensates for user head motion and provides accurate transparency even for close-range scenes. The prototypes are hand-held and self-contained, without the need of auxiliary workstations for computation.
I was recently part of an interview on the Assistive Technology Update podcast, discussing the work I’ve been involved with in using the Project Tango tablet for navigation for people who are visually impaired. You can listen to it, or read a transcript, here.
Existing telestrator-based surgical telementoring systems require a trainee surgeon to shift focus frequently between the operating field and a nearby monitor to acquire and apply instructions from a remote mentor. We present a novel approach to surgical telementoring where annotations are superimposed directly onto the surgical field using an augmented reality (AR) simulated transparent display. We present our first steps towards realizing this vision, using two networked conventional tablets to allow a mentor to remotely annotate the operating field as seen by a trainee. Annotations are anchored to the surgical field as the trainee tablet moves and as the surgical field deforms or becomes occluded. The system is built exclusively from compact commodity-level components—all imaging and processing are performed on the two tablets.
As part of my work at Purdue’s Envision Center, I’ve been investigating the potential of RGB-D sensors for navigation for the visually impaired. There has been some prior work in this area, such as smart canes that contain a depth sensor to give haptic feedback when the cane virtually “touches” objects that are far away. Other efforts involve placing a Kinect onto a helmet and converting the depth image to sound.
So far, most approaches have focused on immediate memory-less navigation aids; whatever is captured by the sensors at the current moment is what is communicated to the user. The approach I’m interested in is accumulating and merging data about the environment into a more complete map of one’s surroundings.
For this project, I’m using the Google Project Tango tablet, which is a 7-inch Android tablet development kit that Google has released. It contains an IR emitter and sensor — think of it as a slower and lower-res Kinect in a tablet. But where the Tango really shines is in its pose estimation, which uses descriptor-matching techniques from images via a fisheye lens, combined with the inertial sensors on the tablet, to determine the relative orientation and translation of the tablet. It uses a callback-based API to inform apps of its latest pose, telling what direction the tablet is pointing relative to when it first started recording, and also where it is located in XYZ world-space coordinates.
Using this system, my approach is to collect depth information about the environment, preserve it in a chunk-based voxel representation, and use OpenAL to generate 3D audio for sonification, in order to make meaningful sounds that are useful for a visually impaired person who wants to use this system for navigation.
Using the Tango API, I set up an OnTangoUpdateListener that overrides onPoseAvailable and onXyzIjAvailable. The first callback gives me a TangoPoseData containing the translation and rotation, relative to whatever reference frame I want (I choose to base everything between the COORDINATE_FRAME_AREA_DESCRIPTION and the COORDINATE_FRAME_DEVICE, so I can take advantage of any area learning in the future).
The onXyzIjAvailable callback provides a byte array that lists out the device-space coordinates of each point in the latest point cloud, in floats (x, y, z, x, y, z, …). There is no particular order to this; the Tango API describes an “ij” array that would allow one to, given a point in a depth map, easily access what the xyz coordinate is of that point. But that “ij” array isn’t implemented yet.
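As a rough illustration of that layout, here is a minimal Python sketch (not the Android code from the app) that splits a packed buffer of 32-bit floats into (x, y, z) triples. The function name unpack_points and the little-endian byte order are assumptions made for the example; the real buffer's byte order depends on the device.

```python
import struct

def unpack_points(raw: bytes, byteorder: str = "<"):
    """Split a packed float32 buffer (x, y, z, x, y, z, ...) into (x, y, z) tuples.

    byteorder uses struct notation: "<" little-endian (assumed here), ">" big-endian.
    """
    if len(raw) % 12 != 0:
        raise ValueError("buffer length must be a multiple of 12 bytes (3 float32 values)")
    count = len(raw) // 4                      # number of floats in the buffer
    flat = struct.unpack(f"{byteorder}{count}f", raw)
    # Group the flat sequence into consecutive triples, one per point.
    return [flat[i:i + 3] for i in range(0, count, 3)]

# Example buffer holding two points.
raw = struct.pack("<6f", 0.5, -1.0, 2.0, 0.25, 0.0, 1.5)
points = unpack_points(raw)
```

The points come out in whatever order the sensor reported them, which is why the later steps cannot rely on any ordering.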
I then hand off the raw points to my ChunkManager. I based a lot of my architecture here off of the very helpful “Let’s Make a Voxel Engine” guide. This guide is useful for helping to understand the complexity and computation tradeoffs that need to be made when doing a voxel representation of an environment.
The ChunkManager, first and foremost, keeps track of Chunks in a SparseArray. Each Chunk will represent the voxel blocks in a certain region of space, and will contain the logic for buffering this data and sending it to the GPU for rendering. My collection of Chunks needs a good way to only create the Chunks that are needed, rather than statically allocating a large number of Chunks that may be unused. Given that the input every update is a collection of unordered points, it also needs a fast way to find which Chunk each point belongs to.
But before we can get to that, these points need to be put into the right coordinate space. By default, the points are all in device space, and they need to be in world space. Here OpenCV’s Android support comes to the rescue, since multiplying each point by the transformation matrix one at a time would be too slow. OpenCV comes with a gemm() function for general matrix multiplication. I first construct a 3×N matrix out of the array of point coordinates (now converted to a float array), then tack on an extra row of all 1’s to make the points homogeneous, so that the whole set can be multiplied by the device-to-world matrix in a single call. The resulting points are all in world space.
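To make the shape of that multiplication concrete, here is a small pure-Python sketch of the same arithmetic. It is not the OpenCV call itself: the helper names matmul and device_to_world, and the example transform, are made up for illustration, and a real implementation would hand the two matrices to an optimized routine rather than use nested loops.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[r][k] * b[k][c] for k in range(inner)) for c in range(cols)]
            for r in range(rows)]

def device_to_world(points, transform):
    """Transform a list of (x, y, z) points by a 4x4 matrix in one multiplication."""
    # Build the 3xN matrix: one column per point, one row per axis.
    columns = [[p[axis] for p in points] for axis in range(3)]
    # Extra row of 1s so the translation part of the transform is applied.
    columns.append([1.0] * len(points))
    world = matmul(transform, columns)
    return [(world[0][i], world[1][i], world[2][i]) for i in range(len(points))]

# Hypothetical device-to-world matrix: rotate 90 degrees about Z, then translate by (1, 2, 3).
transform = [
    [0.0, -1.0, 0.0, 1.0],
    [1.0,  0.0, 0.0, 2.0],
    [0.0,  0.0, 1.0, 3.0],
    [0.0,  0.0, 0.0, 1.0],
]
world_points = device_to_world([(1.0, 0.0, 0.0), (0.0, 1.0, 0.5)], transform)
```

Without the row of 1s the product would apply only the rotation and drop the translation.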
Now, back to the Chunk organization. Given a bunch of unordered points, we need to know which Chunk each new point should go into. When a Chunk is created, it keeps track of its index value, X/Y/Z each being 1 byte. I need a hash function that, given an input X/Y/Z index, will return the Chunk I already created for that index, or inform me if there is no such Chunk yet.
I use a Morton encoding for this, interleaving the bits of the X/Y/Z bytes to get a hash function. Many thanks to Jeroen Baert’s blog for helping me understand how to do this quickly.
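For readers who have not seen it before, a minimal Python sketch of the bit interleaving looks like the following. It uses a plain loop for readability rather than the faster tricks covered in that blog, and the dict-based chunks table and get_or_create_chunk helper are stand-ins invented for the example, not the app's SparseArray code.

```python
def morton_encode(x: int, y: int, z: int) -> int:
    """Interleave the bits of three byte-sized indices into one 24-bit key."""
    for value in (x, y, z):
        if not 0 <= value <= 0xFF:
            raise ValueError("each index must fit in one byte (0..255)")
    code = 0
    for bit in range(8):
        code |= ((x >> bit) & 1) << (3 * bit)       # x takes bits 0, 3, 6, ...
        code |= ((y >> bit) & 1) << (3 * bit + 1)   # y takes bits 1, 4, 7, ...
        code |= ((z >> bit) & 1) << (3 * bit + 2)   # z takes bits 2, 5, 8, ...
    return code

def morton_decode(code: int):
    """Recover the (x, y, z) byte indices from a 24-bit Morton key."""
    x = y = z = 0
    for bit in range(8):
        x |= ((code >> (3 * bit)) & 1) << bit
        y |= ((code >> (3 * bit + 1)) & 1) << bit
        z |= ((code >> (3 * bit + 2)) & 1) << bit
    return x, y, z

chunks = {}  # hypothetical stand-in for the sparse Chunk table, keyed by Morton code

def get_or_create_chunk(x: int, y: int, z: int) -> dict:
    """Return the chunk record for an index, creating it only on first use."""
    key = morton_encode(x, y, z)
    if key not in chunks:
        chunks[key] = {"index": (x, y, z)}
    return chunks[key]
```

Because each distinct X/Y/Z triple maps to a distinct 24-bit key, the key can be used directly as the lookup index with no collisions.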
Each Chunk contains a 3D array of integers, each side having a length of the chunk size (I use 32 to a side). This gets initialized the first time a point needs to be placed in the voxel structure but there is not yet a Chunk for it.
Assuming there is a Chunk created, the Chunk logic determines, given the world space coordinates, where in the Chunk this point should go, and increments the corresponding value in the 3D array. It also marks the Chunk as dirty, meaning that it needs to rebuild its data for rerendering.
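A minimal Python sketch of that bookkeeping is below. The 32-block chunk size comes from the description above; the block edge length (BLOCK_SIZE_M), the offset that keeps chunk indices non-negative (CHUNK_OFFSET), and the names locate and add_hit are assumptions chosen for the example.

```python
import math

CHUNK_SIZE = 32        # blocks per chunk side, as described above
BLOCK_SIZE_M = 0.125   # hypothetical block edge length in meters
CHUNK_OFFSET = 128     # hypothetical: chunk index 128 sits at the world origin

def locate(point):
    """Map a world-space (x, y, z) to (chunk index, block index inside that chunk)."""
    block = [math.floor(c / BLOCK_SIZE_M) for c in point]          # global block coordinates
    chunk_index = tuple(b // CHUNK_SIZE + CHUNK_OFFSET for b in block)
    local_index = tuple(b % CHUNK_SIZE for b in block)             # always in 0..31
    return chunk_index, local_index

class Chunk:
    def __init__(self):
        # One hit counter per block in the chunk.
        self.counts = [[[0] * CHUNK_SIZE for _ in range(CHUNK_SIZE)]
                       for _ in range(CHUNK_SIZE)]
        self.dirty = False

    def add_hit(self, local_index):
        i, j, k = local_index
        self.counts[i][j][k] += 1
        self.dirty = True   # vertex data must be rebuilt before the next render

chunk_index, local_index = locate((1.0, -0.25, 4.5))
chunk = Chunk()
chunk.add_hit(local_index)
```

Floor division and modulo keep negative world coordinates consistent: a point just below zero lands in the last block of the neighboring chunk rather than wrapping into the wrong one.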
When rebuilding, the Chunk class iterates through each of its blocks in the 3D array, and adds vertex data for that block only if that block’s integer value is above a certain threshold. That threshold is based on a counter that is incremented each time the Chunk is rebuilt. This is a simple form of denoising, which is necessary if you want to avoid a system that thinks there are floating blocks in the air that you may collide with. You don’t want false positives like that in a navigation system.
In short, the denoising principle is this: if a Chunk has been rebuilt C times, then that means there were C point clouds that resulted in at least one block being written to that Chunk. If a certain block in that Chunk has been written to B times, then the value B/C should be high if that block is really there, and B/C should be low if that block was just noise that happened a few times at most. I set the threshold to 20%, so B/C > 0.2 to write the block data.
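The test itself is tiny. As a sketch in Python, assuming the per-block hit counts are held in a dict keyed by block index (a simplification of the 3D array, with a made-up function name):

```python
def visible_blocks(hit_counts, rebuild_count, threshold=0.2):
    """Return the block indices whose hit ratio B/C exceeds the threshold.

    hit_counts maps a block index to B, the number of times it was written.
    rebuild_count is C, the number of times the chunk has been rebuilt.
    """
    if rebuild_count == 0:
        return []
    return [index for index, hits in hit_counts.items()
            if hits / rebuild_count > threshold]

# A chunk rebuilt 10 times: a solid block, two borderline blocks, and one noisy block.
hits = {(0, 0, 0): 9, (1, 2, 3): 1, (4, 4, 4): 2, (5, 5, 5): 3}
kept = visible_blocks(hits, rebuild_count=10)
```

Here the block seen 9 times and the block seen 3 times are kept, while the blocks seen once or twice out of 10 rebuilds are treated as noise (2/10 equals the threshold but does not exceed it).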
The block data itself is just the block index information. When rendering all the Chunks, each Chunk sends the chunk index and the block size as part of the uniforms to the shader programs, and uses instancing to pass in constant vertices for a single cube, which get used for each visible block in the Chunk.
As for rendering the scene, I render it 4 times, on square viewports to the front/left/back/right of the user/device’s current position in the scene. This effectively creates a box that surrounds the user, with projected images of the world outside on each side of the box. Top and bottom are ignored, assuming that up/down are less important for navigation. It is rendered as a depth map, with nearby blocks light and far blocks dark.
When rendering this depth map, there will be many areas that are simply black, because there are no blocks or chunks in that area of the world that have been mapped yet. This results in holes that, when converted to sound, may give the impression that there is no obstacle in a particular area. To resolve this, I run each rendered viewpoint through an OpenCV algorithm that fills in black areas based on nearby pixels. It is currently a computationally expensive operation and could use further optimization, but it helps to fill holes in a way that improves as more of the environment is mapped.
Of course, the entire point of this rendering is to get a depth map to convert to sound (I admit that this is a roundabout way of getting depth information, which could be otherwise accomplished using more of an octree structure, but I had already developed the chunk-based rendering at the time).
Sonification is done by sampling a set of points on each viewport, determining a ray from the user’s current position through each point. The color sampled at each point determines the depth on that ray. At this point, a set of positions relative to the user have been found, and the next step is to create sound based on this.
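One way to picture this step is the Python sketch below. Everything specific in it is an assumption made for illustration: a 90-degree field of view per viewport, axes with x to the right, y up and z forward, a linear mapping from brightness to distance, and a hypothetical maximum range of 5 m.

```python
import math

MAX_RANGE_M = 5.0  # hypothetical distance represented by a fully dark pixel

def brightness_to_distance(brightness: int) -> float:
    """Nearby blocks are light and far blocks are dark, so invert the 0..255 value."""
    return (1.0 - brightness / 255.0) * MAX_RANGE_M

def sample_to_position(yaw_degrees: float, u: float, v: float, brightness: int):
    """Turn one viewport sample into a position relative to the user.

    yaw_degrees selects the viewport: 0 front, 90 left, 180 back, 270 right.
    u and v are the sample's coordinates on the square viewport, each in [-1, 1].
    """
    # Ray through the sample point in the viewport's own frame, normalized.
    length = math.sqrt(u * u + v * v + 1.0)
    dx, dy, dz = u / length, v / length, 1.0 / length
    # Rotate the ray about the vertical axis to face the viewport's direction.
    yaw = math.radians(yaw_degrees)
    rx = dx * math.cos(yaw) - dz * math.sin(yaw)
    rz = dx * math.sin(yaw) + dz * math.cos(yaw)
    distance = brightness_to_distance(brightness)
    return (rx * distance, dy * distance, rz * distance)

front = sample_to_position(0, 0.0, 0.0, 0)
left = sample_to_position(90, 0.0, 0.0, 0)
```

With these conventions, a dark sample at the center of the front viewport lands 5 m straight ahead, and the same sample on the left viewport lands 5 m to the user's left.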
I use an Android port of the OpenAL 3D audio library to generate sounds for navigation. OpenAL allows me to establish the location and orientation of a listener and a certain number of audio sources located around the listener. The gain and pitch of each audio source is adjusted based on the distance between the listener and the source. Each source is given a constant tone sound to play, which is then modulated by OpenAL’s HRTF code to output sound that sounds 3D and binaural.
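The exact mapping from distance to gain and pitch is a tuning choice. Purely as an illustration, and not the curve used in the app, a linear version in Python might look like this; the function name, the 5 m range, and the pitch range of 0.5 to 1.5 are all invented for the example. The resulting values are the kind of numbers that would be written to a source's AL_GAIN and AL_PITCH properties.

```python
def source_params(distance_m: float, max_range_m: float = 5.0):
    """Map a listener-to-source distance to a (gain, pitch) pair.

    Closer obstacles are louder and higher pitched; anything at or beyond
    max_range_m is silent. Pitch is a multiplier on the base tone.
    """
    # 1.0 when the obstacle is at the listener, 0.0 at max range or beyond.
    closeness = 1.0 - min(max(distance_m / max_range_m, 0.0), 1.0)
    gain = closeness
    pitch = 0.5 + closeness   # ranges from 0.5 (far) to 1.5 (near)
    return gain, pitch

near = source_params(0.0)
middle = source_params(2.5)
far = source_params(10.0)
```

Clamping the ratio keeps both values in a valid range even when a sample reports a distance past the assumed maximum.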
The result of this is a system that is portable and able to be used with a pair of headphones to give audio cues when navigating an environment in real time. When approaching a wall, I hear a sound in front of me get louder. When I turn to the right, I perceive that sound moving to my left. Because the environment is also rendered behind me, I can have audio eyes on the back of my head.
Here is a quick-and-dirty video recording of the system. I apologize in advance for the poor audio — you will need to have your volume up very high in order to hear the sounds. This is because I attempted to record the audio by placing headphones over the microphone of a Google Glass camera.
Future work will involve improving and optimizing various aspects of this sonification pipeline. In addition, I plan to investigate using bone conduction headphones in order to deliver audio cues without covering up the normal hearing that is important for people with visual disabilities. Furthermore, this system does not currently address the difference in 3D audio pose between a person’s body and the person’s head/ears. One solution is to mount the tablet hardware as a helmet, and another is to use additional body tracking systems to keep the Tango tablet chest-mounted while adding an additional view matrix from chest to head.
One difficult transition when going from industry to the academic world is the difference in software architecture. When you’re a software engineer and you need to make maintainable code that others will inherit, and when you’re inheriting existing codebases, it’s important to structure your applications in a clear way, use design patterns, and generally focus on the format of your code, since the code is the deliverable.
In research, the code is only a stepping stone toward the data that you need. As a result, it seems to be important just to build something that gets the job done; it’s hard to properly find time to architect things formally.
I’m not sure how I feel about this. There is something pleasing about quickly putting together Python or MATLAB code to get the results that you need, but I feel a sense of “wrongness” when I don’t get the opportunity to make my code good and extensible. There’s a tendency in academic coding to incur high technical debt, with the assumption that you won’t be needing the code as much in the future anyway. When you’re rewarded not for the quality of your code but for the output, the publication, how can code quality not suffer?
One of the interesting challenges of academic research is just becoming familiar with the mass of prior research that’s out there. It’s helpful to know that someone out there has almost certainly had almost the same problem as you — but knowing how to find it is difficult. Often it’s a matter of terminology; knowing what the name is for some technique that would be helpful, or even knowing the name for the problem you’re trying to solve, is the key to unlocking so much.
Several of my courses this semester, particularly my graphics course, are focusing on toolboxes, or collections of existing methods that might be useful in certain classes of problems. What’s funny is that as soon as you learn about something, you start to see it everywhere (the “Baader-Meinhof Phenomenon,” as it’s been called).
For example, knowing that my current research would involve learning more about computer vision, I started reading Programming Computer Vision with Python, where I came across concepts like normalized cross-correlation as a way of measuring similarity between portions of images. Around the same time, I was tasked by my advisor with investigating point-to-point matching given a reference frame and a current frame. Simultaneously, my graphics course began teaching about convolutions and correlations. This merger of information from different sources has helped me understand the concept better, and also see potential ways to apply it in my research.