I'm still thinking more about the feasibility of Project Natal.
Assuming you have a stereo camera setup with one infrared sensor and one camera with high dynamic range and a very high sample rate, then even if the resolution is piss-poor I can imagine depth sensing working quite well. If the system has a sufficient baseline between the sensors (which, judging by the pictures, it probably does not), then it should be able to get oblique enough views to correct for accidental occlusion of body parts, and for the inevitable depth errors when, for instance, a shirtless man in a room whose lighting is either too bright or too dim for his skin tone is doing things with his hands in front of his chest - or when he's wearing a shirt with emissive/reflective characteristics similar to his skin's (in both the infrared and visible spectra). Both of those should be considered worst-case scenarios for the purpose of resolving individual body parts.
The infrared emission and reflection characteristics of human flesh are fairly distinct from most natural-fiber clothing that people wear - unfortunately, some synthetics show up as almost exactly the same "color" of infrared as the various common human skin tones. Unless the infrared sensor has a good gamut in the infrared range, this could produce serious problems with discernment. Nudity, one would suppose, would present all sorts of problems. A terrible scenario would be someone's pasty-skinned child coming in from the sun to play a game, wearing a Lycra swimsuit and thick sunscreen and no shirt - especially if the shorts are baggy. Then you just have a confusing mass of light and dark spots slopping around in front of the camera in the occasional vague shape of a human child.
Assuming that the RGB camera data mixes well with the depth data from the infrared sensor - the best-case scenario that Microsoft is no doubt depending on - then the system could do a sort of depth sampling which would end up, even in less-than-ideal situations, looking a bit like that intentionally distorted LIDAR point cloud that Thom Yorke turns into in Radiohead's "House of Cards" video (only at much lower resolution). If that quality of data (or near it) can be extracted, and there's enough skew between the depth camera and the RGB camera, then you should be able to discern hands in front of the body with some depth accuracy - in which case the device will work quite well as a motion sensor, assuming the libraries which pick out which body part is which return consistent results, even if not always accurate ones. If game developers have to develop their own libraries for discerning anatomical features, then the system is going to flop hard, and I think MS knows this, so we'll probably start seeing XNA updates by the end of the year with the beginnings of motion-control libraries in them.
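To make the "hands in front of the body" idea concrete, here's a minimal sketch of how you might flag depth samples that sit noticeably in front of the body's main plane. The grid values, the `gap` threshold, and the median-plane trick are all my invention for illustration, not anything Natal actually does:

```python
from statistics import median

def hands_in_front(depth, gap=5):
    """Given a 2D grid of depth samples (smaller = closer to the sensor),
    flag cells noticeably in front of the median depth plane -- a crude
    stand-in for 'hand in front of chest' detection. The gap threshold
    is arbitrary and would need real-world tuning."""
    flat = [d for row in depth for d in row]
    body_plane = median(flat)
    return [[d < body_plane - gap for d in row] for row in depth]

# Tiny toy frame: torso at depth ~40, one fist poking out at depth 30.
frame = [[40, 40, 40],
         [40, 30, 40],
         [40, 40, 40]]
mask = hands_in_front(frame)   # mask[1][1] is True, the rest False
```

With a real 320x240 depth image you'd obviously want connected-component analysis rather than a per-pixel threshold, but the principle - separate "in front of the torso" from "part of the torso" using depth alone - is the same.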
Moving on to facial recognition: it's glitchy at best, but the additional depth data will help reduce false positives quite a bit. Assuming some intelligent use of "motion trails" alongside ordinary face recognition and depth data, something slightly more than trivial object and face recognition might be possible - like "Hey, who's that back on the couch? Is that Dave?" If the Natal system can ever pull that off, even occasionally hilariously incorrectly, then I'll be plenty impressed.
Facial recognition leads to the next problem, however, which is Microsoft's claim of emotion recognition. Assuming you use a gestalt of methods - erratic movements sensed by the depth camera, facial shifts, basic emotional grammars of the face, vocal tics and other vocal cues - it could probably manage to detect stress and amusement, but beyond that I have absolutely no confidence in the Natal sensor's ability to pick up emotion. Something like that ridiculous Vitality Sensor Nintendo is putting out would augment the data enough to give me greater confidence in the results, but it's hard enough to read emotion with a few hundred thousand years of behavioral evolution behind you, let alone with technology less than 30 years old.
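The "gestalt of methods" amounts to confidence-weighted fusion of several weak signals. A trivial sketch of that idea, with channel names and numbers invented by me:

```python
def fuse_emotion_signals(readings):
    """Combine per-channel (score, confidence) pairs -- e.g. erratic
    motion, facial cues, vocal tics -- into one estimate via a
    confidence-weighted average. A sketch of signal fusion in general,
    not of whatever Natal actually ships."""
    total_conf = sum(conf for _, conf in readings.values())
    if total_conf == 0:
        return 0.0
    return sum(score * conf for score, conf in readings.values()) / total_conf

stress = fuse_emotion_signals({
    "motion": (0.9, 0.8),   # jittery movement, fairly confident
    "face":   (0.4, 0.3),   # facial cue, low confidence
    "voice":  (0.8, 0.6),   # tense vocal cues
})
```

The weighting is the whole game: a low-confidence channel (a half-occluded face, say) shouldn't be allowed to drag the estimate around, which is exactly why I only trust this for coarse states like stress and amusement.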
This brings us to the other point: if you're paying attention to voices, you want more than emotion sensing - so let's talk about speech. Speech recognition in something like Natal is really only going to work in absolutely ideal situations, unless the multi-array mics are sufficient in number and precision to yield rough positional data which can be cross-referenced against the depth and camera data. If that can be done, then "source filtration" of the audio becomes possible, so that you're only trying to process sound from one location.
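"Source filtration" from a mic array is classically done with delay-and-sum beamforming: once you know roughly where the speaker is, you know how many samples later the sound reaches each mic, so you shift each channel to line them up and average. A toy integer-delay version (real arrays need fractional delays and calibration):

```python
def delay_and_sum(channels, delays):
    """Crude delay-and-sum beamformer. Each mic channel is advanced by
    its per-mic sample delay (derived elsewhere from the speaker's
    position) and the aligned channels are averaged, reinforcing sound
    from that location and smearing sound from everywhere else.
    Integer-sample delays only -- a sketch, not production DSP."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + i] for ch, d in zip(channels, delays)) / len(channels)
            for i in range(n)]

# A click that hits mic 1 first and mic 2 one sample later:
mic1 = [0, 1, 0, 0]
mic2 = [0, 0, 1, 0]
aligned = delay_and_sum([mic1, mic2], delays=[0, 1])  # -> [0.0, 1.0, 0.0]
```

The positional data from the depth camera is what would make this cheap: instead of scanning every candidate direction, you only steer the beam at locations where the camera already sees a person.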
I could see the developers at Microsoft creating a sort of data structure representing a fuzzy cloud of positional data which they identify as an "individual," whether it's a human or a cat or a Roomba. This entity would become an input source, at least from the perspective of developers working with the technology for games. From this source you could pull speech samples (with a confidence number), video data, generalized motion information, and a low-data-rate history of what it's been doing for the past few seconds, to make predicting and resolving actions a little easier.
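Such an entity might look something like the following sketch. Every field name here is my invention - this is what I'd guess the shape of the API would be, not any real Natal or XNA interface:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class TrackedEntity:
    """Hypothetical 'individual' input source: a fuzzy positional cloud
    reduced to a centroid, plus a short rolling history game code can
    query. All names are speculative."""
    entity_id: int
    centroid: tuple                  # rough (x, y, z) in sensor space
    speech_confidence: float = 0.0   # how sure we are recent audio was theirs
    history: deque = field(default_factory=lambda: deque(maxlen=180))  # ~3 s at 60 fps

    def update(self, new_centroid):
        """Push the old position into the rolling history, then move."""
        self.history.append(self.centroid)
        self.centroid = new_centroid

player = TrackedEntity(entity_id=1, centroid=(0.0, 0.0, 2.0))
player.update((0.1, 0.0, 2.0))
player.update((0.2, 0.0, 1.9))       # history now holds the two prior positions
```

The bounded `deque` is the point: the "low-data-rate history" has to be a fixed-size ring buffer, or the memory cost of tracking several bodies at 60 fps gets ugly fast - which matters a lot given the RAM arithmetic below.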
As much as remembering inputs matters in, say, an arcade fighting game, it will matter even more in a Natal-enabled game, because these games won't just have to interpret what the player is doing at each instant - they'll need to resolve the "intent" behind the player's current motions, in other words, what they're trying to do. If somebody's wimpy pasty-skinned kid (from my earlier example) tries to throw a punch, that shit is going to be all over the place. If the kid's exact motion shows up on the screen instead of the desired result (that being a nice punch), then the kid will get frustrated before long and not want to play. The same is true of the fat geek with no muscle tone. Nobody except a true athlete really wants an absolutely truthful representation of their physical prowess; in a video game they're going to want an idealized representation of their intent. So you average the samples over a fixed window of time, see that the kid is vigorously moving his fist (or at least his arm) forward, and make the character throw a punch toward whatever's closest to the apparent destination of the punch, relative to the character's position and orientation.
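The averaging step above can be sketched in a few lines: smooth the fist's velocity over a short window, and if the mean push toward the sensor clears a threshold, call it a punch regardless of how sloppy the individual frames were. Units, threshold, and coordinate convention (z shrinking = moving toward the sensor) are all assumptions of mine:

```python
def resolve_punch(fist_positions, threshold=0.8):
    """Average the fist's forward velocity over a window of (x, y, z)
    samples; if the mean motion toward the sensor beats the threshold,
    report a punch plus a rough lateral direction. Arbitrary units --
    a sketch of intent-smoothing, not real gesture tuning."""
    if len(fist_positions) < 2:
        return None
    pairs = list(zip(fist_positions, fist_positions[1:]))
    mean_vz = sum(b[2] - a[2] for a, b in pairs) / len(pairs)
    if mean_vz < -threshold:                      # strong net forward motion
        net_vx = sum(b[0] - a[0] for a, b in pairs)
        return ("punch", "left" if net_vx < 0 else "right")
    return None

# Wobbly but clearly forward motion still reads as a punch:
samples = [(0.0, 0, 10), (0.1, 0, 8), (0.2, 0, 6), (0.3, 0, 4)]
resolve_punch(samples)        # -> ("punch", "right")
```

That's the idealization the kid wants: the game reports "punch, roughly rightward," not the exact flailing trajectory.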
So, all told, Natal promises that it knows what you're doing, how you're doing it, when you did it, what you felt while you did it, who you are (at least within its predefined "people list"), what your facial expression was when you were doing it, and anything you may have already said or were saying when you did it. That's a pretty damned tall order. If they can pull it off with SUFFICIENT precision to make it a not-frustrating experience, then it's going to be THE must-have technology of this generation - but given the ridiculous amount of processing power something like this will require, I don't see how Microsoft can hope to deliver it with any degree of accuracy on the Xbox 360.
With a "theoretical peak performance" of 115-and-change gigaflops on the main processor and maybe a spare hundred or so gigaflops from the GPU (assuming the Xbox's rendering system tolerates that sort of tampering with the pipeline), you're looking at, after the hypervisor, probably about 150 gigaflops to work with. Assume a composite 320x240 depth image with, say, 40 levels of depth at 60 fps from the depth camera; even with 8-bit monochrome depth samples, you're looking at around 11 megaflops for the video positional data. Add "reflection," where it looks back on older data, and call it about 80 megaflops. Still under a gigaflop - not bad. Now let's do positional sound data at 40 kHz from (I'll assume) 4 microphones. Assuming that getting good positional data takes about 1,000 floating-point operations per sample (probably a conservative estimate), we're talking about 160 megaflops for that. Let's be charitable and say that with intelligent downsampling and some analog filtering we can cut that by a factor of 8, to 20 megaflops. We're up to about a gigaflop now.
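The back-of-the-envelope numbers above, computed out (the stated inputs are the author's estimates; the arithmetic is just arithmetic):

```python
# Depth camera: samples per second at the stated resolution and frame rate.
depth_samples_per_s = 320 * 240 * 60       # 4,608,000 depth samples/s

# Positional audio: 4 mics at 40 kHz, ~1,000 flops per sample.
audio_flops = 40_000 * 4 * 1_000           # 160,000,000 = 160 megaflops

# Charitable factor-of-8 reduction via downsampling + analog filtering.
audio_flops_cheap = audio_flops // 8       # 20,000,000 = 20 megaflops
```

So at a few floating-point operations per depth sample you land in the low tens of megaflops for the video positional data, and the audio side rounds the running total up to roughly a gigaflop, as claimed.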
Now let's reexamine our estimate of the Xbox 360's floating-point power. The system probably can't sustain anything more than about 60% of its theoretical peak for any length of time, and maybe 40% of what's left is going to be tied up with polling devices, the various live internet shit it does for XBL, and the hypervisor that slides in that sweet XBL blade whenever the menu is hit. We're really looking at about 50 gigaflops, then, for the whole system.
Great, so the motion tracking and positional audio only take 1/50th of the available power, right? What's the problem? Well, those are the easy parts, computationally speaking. Speech recognition is going to take another 3-4 gigaflops if it's handling multiple-source audio. Let's be kind and say speech, facial, and emotional recognition together will use 4 gigaflops, bringing the whole thing to about 10% of the total available processor power. That seems perfectly reasonable, actually! Feasible for use in games. Unfortunately, the Xbox 360 has only 512 MB of RAM... and the depth data alone - uncompressed, 20 depth layers at 320x240, 60 fps, one byte per voxel, with an 8-second history - takes up about 737 MB of RAM.
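That memory figure, computed out from the stated dimensions (it lands well past the console's 512 MB no matter how you round it):

```python
# width * height * depth layers * fps * seconds of history, one byte per voxel
depth_history_bytes = 320 * 240 * 20 * 60 * 8   # 737,280,000 bytes

xbox_ram_bytes = 512 * 1024 * 1024              # 536,870,912 bytes
overshoot = depth_history_bytes - xbox_ram_bytes  # ~200 MB over budget
```

And that's before the game itself, the OS reservation, or any of the RGB, audio, or skeletal data gets a single byte.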
There's a basic tradeoff in computer science: the more time-efficient an algorithm is, the less memory-efficient it tends to be, and vice versa. I can't imagine that really accurate, high-framerate data aggregating all of the positional, motion, gestural, facial, emotional, and speech data is going to use less than, say, 20% of the available memory and 30% of the available processor time on a 360, even with great technical wizardry on their part - meaning we're talking about game developers trying to create games with 30% less processor time and 20% less memory, while also managing input/wait cycles between their game engine's input layer and the Natal layer.
All of this adds up to a very risky proposition for Microsoft, especially if there's a large amount of R&D money tied up in it and they intend a full-scale launch before a significant number of A-list studios have signed on to do titles which require it - a feat which seems almost impossible without MS throwing a lot of M$ at them. Which means Microsoft really has a lot riding on this product if they're serious about selling it - and after the demos, they've already staked their credibility on a successful launch.
2010 is going to be one motherfucker of an interesting year for gaming.