Kinect from Mr Kinect himself at Microsoft
Man or mouse?
Are you a man or a mouse? You can now be both, as Kinect gives your body control of the computer. Screens are 2D but the world is 3D, and that’s why there’s a mismatch. Computers are poor at dealing with real-world space, so we have to tell them who we are and what we want them to do through keyboards, mice, joysticks and touchscreens. But with Kinect, the real world, with you in it, is now an operating environment. Kinect-like interfaces allow 3D interaction without all of those helmets, gloves and body sensors. It’s virtual reality without the hassle of gadget armour. It allows the real person to operate within the computer and within environments generated by the computer. Just step up and off you go.
You as interface
How does it do this? Well, I lucked out this week as I had two sessions with the guy who heads up the Kinect technical team. It was like speaking to someone from the far future. A true ‘Me-interface’ has to recognise you as a body, along with your voice and what you say. This ain’t easy. In Kinectimals, you can stroke, feed and train animals. You stroke your chosen pet, see it respond, then give it a name by saying it out loud. You can toss a ball to your cub and he’ll nod it back, or use voice commands such as ‘play dead’ and he’ll drop.
The first problem with body position is size: we’re fat, thin, tall, short. On top of this we come in lots of different shapes. Then there’s appearance, in terms of hair, clothes, glasses etc. Now add in the clutter of a background. How do you pick bodies out? Kinect’s cameras peel you away from your background. Note that 2D doesn’t do it for this task. You need 3D, as depth images allow you to recognise body parts.
To understand how Kinect works, you need to see it as a database of over 1 million body positions that is rapidly compared with the output of the depth cameras (infrared plus monochrome). The infrared laser projects a grid of 50,000 dots and the monochrome camera picks up the depth of each dot through parallax differences. The body is then reduced to around 30 body parts based on joint positions, i.e. reduced to angles and positions. The database is interrogated and your position inferred. In that respect it’s more Deep Blue than a pure rules set. But the software also learns and this is the key to its success. It can track six people but only cope with two serious game players at a time. It works at a range of 1.2-3.5 metres, and the tilting motor adjusts the sensor by up to 27 degrees. The Kinect software takes up around 190MB, which is a compromise, as the games guys want most of the available space for their games software.
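To make the parallax idea concrete, here is a minimal sketch of the standard triangulation formula, where depth is focal length times baseline divided by the sideways shift (disparity) of a dot. It is not Kinect’s actual code. The focal length and baseline values are illustrative assumptions, and the function names are made up for this example. Only the 1.2-3.5 metre range comes from the figures above.

```python
def depth_from_disparity(disparity_px, focal_length_px=580.0, baseline_m=0.075):
    """Estimate the distance to a projected dot from its parallax shift.

    disparity_px    -- how far the dot appears shifted sideways, in pixels
    focal_length_px -- camera focal length in pixels (assumed value)
    baseline_m      -- gap between projector and camera in metres (assumed value)
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    # Classic triangulation: nearer objects shift more, so depth falls as disparity rises.
    return focal_length_px * baseline_m / disparity_px


def in_working_range(depth_m, near_m=1.2, far_m=3.5):
    """Check a depth against the 1.2-3.5 metre range quoted above."""
    return near_m <= depth_m <= far_m


if __name__ == "__main__":
    for shift in (10.0, 29.0, 60.0):
        depth = depth_from_disparity(shift)
        print(f"shift {shift:5.1f} px -> {depth:.2f} m, usable: {in_working_range(depth)}")
```

The point to take away is the inverse relationship: a dot that shifts twice as far is half as far away, which is why fine depth detail gets harder the further back you stand.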
Voice recognition is not as clever as the Peter Molyneux video suggests, as it’s limited to commands and is currently quite poor on natural language recognition. Just imagine the technical problems of isolating the sound from the background noise during a loud game and tracking different voices in a 3D environment. It does, however, have an array microphone, making it directional, so it can distinguish and isolate several different moving sound sources. You can use this for audio and video chat through Xbox Live.
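How can an array microphone be directional? One common textbook approach, and I’m assuming rather than claiming that Kinect does something like it, is to measure the tiny delay between a sound arriving at one microphone and the next. The sketch below turns that delay into a bearing for a single pair of microphones. The 0.2 metre spacing and the function name are hypothetical.

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # in air at roughly 20 degrees C


def bearing_from_delay(delay_s, mic_spacing_m=0.2):
    """Angle of a sound source, in degrees, from the arrival delay between two mics.

    0 means straight ahead; positive and negative values are to either side.
    mic_spacing_m is an assumed spacing, not a Kinect specification.
    """
    if mic_spacing_m <= 0:
        raise ValueError("microphone spacing must be positive")
    # Extra distance the sound travelled to the far mic, as a fraction of the spacing.
    ratio = SPEED_OF_SOUND_M_PER_S * delay_s / mic_spacing_m
    # Measurement noise can push the ratio just past +/-1, so clamp before asin.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


if __name__ == "__main__":
    for delay_us in (-300, 0, 150, 300):
        angle = bearing_from_delay(delay_us * 1e-6)
        print(f"delay {delay_us:5d} us -> bearing {angle:6.1f} degrees")
```

With more than two microphones you get several such pairs, which is what lets an array separate sources that are moving around the room.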
Looking to the future, natural language processing is notoriously difficult, but affordable software such as Dragon is around. Once this reaches a consumer price point and efficacy that allows it to be embedded in games consoles and mobile devices, another step will have been taken in terms of the ‘Me-interface’. Google Translate for Android has just been updated to include a live conversation translator. You click on the microphone, speak, and it reads aloud the translated text.
The development kits have not been released, except to current developers and a few universities. Indeed, there’s a debate going on within Microsoft about open v closed development. My money’s on ‘closed’ as it’s in the Microsoft DNA. The hacked MIT open source release is only for the depth camera, so there’s no body configuration stuff, and that’s what really matters. So what’s in the pipeline?
Microsoft licensed Primesense’s camera technology and has bought a couple of other advanced camera companies, one of which is Canesta, and that tells you what’s coming. The next version will increase all dimensions by 4. Remember that multiplying the current 50,000-60,000 dots by four along each axis gives you a quantum leap in fidelity. It will easily resolve fingers and other smaller objects (at the moment it recognises your hands only). Even more astounding is the fact that within two years the ‘parallax’ sensing of the current Kinect will be replaced with ‘speed of light’ Canesta sensing, where differences in the time light takes to travel out and back determine position. Now that’s not a step change, it’s a dimensional leap that gives us accuracy.
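Two bits of arithmetic sit behind that paragraph, and a rough sketch makes them easier to see. The first is what a fourfold increase per axis does to the number of depth samples, assuming ‘all dimensions’ means the two image axes. The second is the basic ‘speed of light’ (time-of-flight) calculation, where distance is half the round trip multiplied by the speed of light. The function names are mine, and this says nothing about how Canesta’s hardware actually measures the timing.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def scaled_sample_count(current_samples, per_axis_factor=4, axes=2):
    """Samples after multiplying resolution by per_axis_factor on each axis.

    axes=2 is an assumption: width and height of the depth image.
    """
    return current_samples * per_axis_factor ** axes


def distance_from_round_trip(round_trip_s):
    """Distance to an object from the time light takes to go out and come back."""
    if round_trip_s < 0:
        raise ValueError("round-trip time cannot be negative")
    # Halve it, because the light covers the distance twice.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


if __name__ == "__main__":
    print(scaled_sample_count(50_000), "samples after a 4x increase on two axes")
    for nanoseconds in (8, 20):
        metres = distance_from_round_trip(nanoseconds * 1e-9)
        print(f"{nanoseconds} ns round trip -> {metres:.2f} m")
```

Notice how small the numbers are: a player three metres away is only about 20 billionths of a second out and back, which is why this kind of sensing is such a hard engineering problem.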
Knowing where someone is in terms of body position and gesture has huge possibilities. A hugely accurate and high-fidelity system could replicate you elsewhere, either as a hologram or as a robot that mimics your every movement. This transportation can replace travel. Here’s a quick Kinect hack with Kinect as a robot (Kinectbot), where it moves around and recognises objects and people, along with gesture control. To give you some idea of the creativity unleashed by Kinect, see these 12 favourite Kinect hacks. In a car, it could know what you’re doing when driving, so that it could warn you when you’re using a mobile or nodding off. It could take gesture commands, rather than have you reach out to buttons on your radio or satnav. Surgeons in the operating theatre can use gestures to bring up X-rays or MRI scans during operations, as they can’t touch possibly infected keyboards or touchscreens.
Incidentally, Steve Ballmer has also announced that there’ll be a PC version.