Updated: 22nd March 2018

Pushing the limits of the HoloLens with enhanced hand tracking

The HoloLens

Microsoft’s HoloLens packs a slew of capabilities, and investigating its various uses is an exciting endeavor. However, in its current form the device’s potential is severely limited, as it supports only one hand gesture for input and interaction. This post explains my attempt to find a workaround for this restriction, and the solution I came up with using a Leap Motion Controller.

The single gesture supported by the HoloLens out of the box provides capabilities roughly equivalent to a one-button mouse. Consider that on traditional computers it is normal to use a mouse with 2–4 buttons and a scroll wheel, as well as a keyboard, and you can see why a single-button-mouse equivalent falls a bit short by today’s standards.

But the problem isn’t just that the device is restricted to a single gesture: though the gesture itself is quite simple in theory, it has proven quite hard for first-time users to perform in practice. I experienced the lack of gestures firsthand (pun not intended) while experimenting with the HoloLens at FutuLabs, our internal unit dedicated to the exploration of emerging technologies, and had several ideas die already at the concept stage because of this restriction.

The Leap Motion Controller in action

At FutuLabs we have also been experimenting with the Leap Motion Controller (LMC), a USB peripheral that provides quite accurate hand tracking data in real time. So it occurred to me to ask whether it would be possible to combine the LMC with the HoloLens to enable the creation of custom gestures. All of this happened to play out at a time when I was looking for a topic for my Master’s thesis, and this seemed like an ideal thesis project. After discussing it with my professor and my supervisor, I set out to build the system.

A few months of effort later, I was able to complete an initial version. The result is a system where hand tracking data is continuously streamed to the HoloLens, allowing for the implementation of custom gestures. Just recently we decided to open-source the entire project, hopefully opening the door for other developers and researchers to explore a wider range of interactions in mixed reality. This post provides an insight into the workings of the developed system, the end result, and thoughts about the future.

The data must flow

Physical setup

The above image shows the setup I used during development. The LMC is mounted on top of the HoloLens with a small angled block in between, so that the LMC has a better view of the region where hands are typically used.

As can be seen, the LMC isn’t connected directly to the HoloLens, but to a laptop instead. The reason is that the LMC has to be physically connected with a USB cable, both to be powered and to ensure a good enough transfer speed. A separate computer was required since the HoloLens only has one Micro-USB port, which cannot be used with peripherals. What I needed to do, then, was to establish a connection between the computer and the HoloLens in order to stream the sensor data produced by the LMC.

In the end I chose to set up two parallel connections, one for streaming data and one for control messages. The former uses UDP to keep latency to a minimum. The risk associated with UDP is that there is no guarantee that packets will arrive in the order they were sent, or that they will arrive at all.

Fortunately, the LMC runs at a high frequency (roughly 100 fps) and each message has a unique, incrementing ID, which makes it relatively simple to account for these risks. For the control messages I used TCP, since speed isn’t crucial but guaranteed delivery of the messages is.
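To illustrate the idea (the actual project streams between a desktop application and the HoloLens; the addresses, ports, and message layout below are made up for the sketch, and it is written in Python rather than the languages the project itself uses), a minimal version of such an ID-stamped UDP stream could look like this:

```python
import socket
import struct

# Hypothetical sketch: stream hand-tracking frames over UDP with an
# incrementing ID so the receiver can discard late or duplicated packets.
HOLOLENS_ADDR = ("192.168.1.50", 9000)   # assumed address and port

def send_frame(sock, frame_id, payload: bytes):
    # 4-byte big-endian frame ID followed by the serialized hand data.
    sock.sendto(struct.pack(">I", frame_id) + payload, HOLOLENS_ADDR)

def receive_frames(port=9000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    latest_id = -1
    while True:
        packet, _ = sock.recvfrom(65535)
        frame_id = struct.unpack(">I", packet[:4])[0]
        if frame_id <= latest_id:
            continue                      # stale or out-of-order packet: drop it
        latest_id = frame_id
        yield frame_id, packet[4:]        # pass the newest frame on to the gesture logic
```

Because new frames arrive roughly every 10 ms, dropping the occasional late packet costs almost nothing.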

A change of perspective

Except for the most basic gestures, where the hand’s position and orientation aren’t taken into account, it isn’t enough to use the data provided by the LMC as-is. All the positions and rotations it provides are from the LMC’s point of view, but we need to know what they are from the perspective of the HoloLens. I wanted to create a way of calibrating the two devices (i.e. determining their position and rotation relative to each other) without any additional and/or special equipment (e.g. custom markers or a printed pattern), so that it would be simple for anybody to take the system into use. It also eliminates the risk of the equipment failing, or of forgetting to bring it along at the most inopportune moment.

The Perspective-n-Point problem. Source: opencv.org

The starting point for determining the relationship between the two devices was to find some common features that both of them can recognise. Because of the specialised nature of the LMC, it makes sense to pick some part of the hands. Going through the data provided by the LMC, one feature in particular stood out: the fingertips.

On the HoloLens side, the only sensor directly available to developers is the camera mounted on the front of the device. Fortunately, there is a lot of research on how to detect hands in images.

This combination of information (3D points from the LMC and 2D points in an image from the HoloLens) lines up perfectly with a well-known problem in computer vision and augmented reality called the Perspective-n-Point (PnP) problem. In short, if you take an image of some 3D points and can determine which 2D point in the image corresponds to which 3D point, then you can work out the position and rotation of the camera.
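To make this concrete, here is a minimal sketch of what solving PnP can look like with OpenCV’s Python bindings. The point values, camera intrinsics, and solver flag below are placeholders for illustration, not numbers or settings from the actual project:

```python
import cv2
import numpy as np

# Placeholder correspondences: fingertip positions in the LMC's coordinate
# system (metres) and the same fingertips located in the HoloLens camera
# image (pixels). In the real system these come from the calibration images.
object_points = np.array([[-0.10,  0.02, 0.30],
                          [-0.06, -0.02, 0.29],
                          [-0.02, -0.04, 0.28],
                          [ 0.02, -0.04, 0.28],
                          [ 0.06, -0.02, 0.29],
                          [ 0.10,  0.02, 0.30]], dtype=np.float32)
image_points = np.array([[ 172, 478],
                         [ 362, 275],
                         [ 565, 164],
                         [ 779, 164],
                         [ 982, 275],
                         [1172, 478]], dtype=np.float32)

# Illustrative intrinsics for the front camera (not the real calibration values).
camera_matrix = np.array([[1500.,    0., 672.],
                          [   0., 1500., 378.],
                          [   0.,    0.,   1.]])
dist_coeffs = np.zeros(5)  # assume negligible lens distortion for this sketch

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix,
                              dist_coeffs, flags=cv2.SOLVEPNP_EPNP)
# rvec (rotation) and tvec (translation) describe the camera pose relative to
# the 3D points, i.e. how to map LMC coordinates into camera coordinates.
```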

Example of an image used for calibration

The calibration process I ultimately ended up with works as follows. You begin by taking one or more images of your hands held as shown in the above image. The more images you take, the better your calibration result should be, although it is also more tedious to do. The only real requirement here is that the hands need to be held in the shown pose. This allows the system to later determine which fingertip in the image belongs to which finger, and to pair that point with the corresponding fingertip from the LMC. At the same time as the image is taken, the 3D data from the LMC is stored for later use.
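As a rough illustration of why the fixed pose matters (a simplified sketch, not the project’s actual matching logic): with both hands held flat and the fingers spread, the fingertips can simply be ordered left to right in both data sources and paired up by position.

```python
def pair_fingertips(image_tips, lmc_tips):
    """Pair 2D fingertips found in the image with 3D fingertips from the LMC.

    Assumes the calibration pose (hands flat, fingers spread, not overlapping),
    so sorting both sets left to right gives matching order. Illustrative only.
    image_tips: list of (x, y) pixel coordinates.
    lmc_tips:   list of (x, y, z) LMC coordinates, with x growing to the right.
    """
    image_sorted = sorted(image_tips, key=lambda p: p[0])
    lmc_sorted = sorted(lmc_tips, key=lambda p: p[0])
    return list(zip(image_sorted, lmc_sorted))
```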

Left: A hand detected in an image. Right: The detected fingertips and center of mass

The images are all sent to the computer that the LMC is attached to. In principle, the processing could also be done on the HoloLens, but since I already had to use a separate computer for the LMC, I decided I might as well make use of it.

Once all the images have been received, I have to find the hands in them before I can find the fingertips. This proved to be the most difficult task, and it required me to develop my own method for detecting skin based on colour, combining research from several different papers. Once the hands have been found, the fingertips are then located using the method described by Prasertsakul and Kondo (2014).
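For reference, a very basic colour-threshold approach looks something like the sketch below. This is a generic example (YCrCb thresholding with commonly cited skin ranges), not the combined method developed for the thesis, and the thresholds would need tuning per camera and lighting:

```python
import cv2
import numpy as np

def skin_mask(bgr_image):
    """Rough colour-based skin segmentation (generic sketch only).

    Thresholds in YCrCb space, then cleans the mask up with
    morphological opening and closing.
    """
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)    # assumed Cr/Cb skin range
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask

# The largest connected regions of the mask are then taken as hand candidates,
# e.g. via cv2.findContours, before locating the fingertips on each contour.
```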

With all the pieces in place, all that remains is to actually solve PnP. As luck would have it, this is such a well-known problem, with so many ways of solving it, that OpenCV, an open-source computer vision library, contains a function for it. The result of running the function is the transformation required to move the LMC data into the perspective of the HoloLens. The transformation is then sent to the HoloLens, where it is used to transform all data received from the LMC.
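Continuing the earlier sketch (again with hypothetical names, and in Python rather than the languages used on the HoloLens itself), the rotation vector and translation returned by solvePnP can be packed into a single homogeneous transform and applied to every incoming LMC point:

```python
import cv2
import numpy as np

def to_transform(rvec, tvec):
    """Build a 4x4 homogeneous transform from solvePnP's output
    (LMC coordinates -> HoloLens camera coordinates). Sketch only."""
    rotation, _ = cv2.Rodrigues(rvec)     # convert rotation vector to 3x3 matrix
    transform = np.eye(4)
    transform[:3, :3] = rotation
    transform[:3, 3] = tvec.ravel()
    return transform

def lmc_to_camera(points_lmc, transform):
    """Apply the transform to an (N, 3) array of LMC points."""
    homogeneous = np.hstack([points_lmc, np.ones((len(points_lmc), 1))])
    return (transform @ homogeneous.T).T[:, :3]
```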

What has been achieved and looking forward

Results of the calibration, captured with Microsoft’s Mixed Reality Capture

The above image shows the result of running the calibration, with the red spheres showing the positions of the fingertips after applying the computed transformation to the LMC data. As can be seen, the results are decent but not perfect. They are, however, quite consistent.

It is hard to pinpoint the exact source of the error, though a couple of possibilities are that the physical and virtual cameras of the HoloLens (the latter being the one used to determine where to draw objects) are not perfectly aligned, or that the points identified as fingertips in the images are not exactly the same points the LMC considers to be the fingertips. But the accuracy and consistency together should be sufficient to allow the development of custom gestures.

There are two things I would like to improve, the first one being how the LMC is mounted. A better mount that attaches to the HoloLens would make it possible to place the LMC in almost exactly the same position each time, which in turn would make it possible to reuse calibration results even if the LMC is removed and reattached between uses.

The second area of improvement is hand detection. Colour-based detection is too unreliable: there are simply too many skin-like colours commonly found in ordinary backgrounds to reliably extract just the hands. Looking at current research, a neural-network-based approach seems to be the most promising general-purpose solution.

By open-sourcing this project I really hope it enables others to look deeper into how we can interact in an environment where the virtual and the real are truly mixed together. When talking about devices like the HoloLens, the first shortcoming most people want to point out is the technological one.

It is certainly true that the hardware still has a way to go, e.g. in terms of improving the field of view and processing power, but I feel most people overestimate our current understanding of how to design for this new environment. It is by no means clear that the lessons learned in a traditional 2D context will translate to mixed reality. On the other hand, you can’t rely on taking cues from physical design either, since the objects you interact with are still just virtual.

It isn’t even clear whether gestures are the best option in the first place: when developing, you discover quite quickly how tired your arms get when there are a lot of interactions. With so many open questions, I hope this system can contribute one piece of the puzzle, so that we can answer these questions in time.