I mentioned in the first post of this series that our initial intention was to make a teaching simulator using virtual reality, focusing on the Oculus Rift. I also mentioned that, given the early stage of that technology and the immediate and obvious difficulties with it, we decided to put that part of our plan on hold in order to explore other technologies that might be more useful. The first technology we chose to experiment with is the Microsoft Kinect 2.0.
Microsoft’s Xbox One was launched in 2013 and was intended to be a blend of a video game console and a complete home entertainment system. Prior to launch, the Xbox One was met with harsh criticism because it was said to require two things. First, it was supposed to be always on and always connected to the internet. Second, it had to be connected at all times to the Kinect, which initially shipped with every Xbox One. Microsoft has since changed its stance on both policies due to public backlash. The Xbox One can now be purchased with or without the Kinect and no longer requires it, or an internet connection, to function.
Why was the public so upset by the required Kinect? For two reasons, really. First, an always-on, always-connected Kinect camera presented a privacy risk, since it could potentially be activated remotely to watch whatever was going on in your living room. Second, the Kinect initially claimed a large portion of the processing power of the new, more powerful Xbox One. Buyers were concerned that it would take away processing power that could otherwise be dedicated to games that had no use for the Kinect at all; which is pretty much all of them.
Kinect adoption among third-party developers of home console games has been less than impressive. With the release of a Kinect SDK for Windows in 2012, companies outside of the gaming world have been able to experiment with the technology. Adoption there has been questionable as well, judging by the difficulty in finding examples and documentation that I will mention later.
The first thing I am going to experiment with using the Kinect is voice commands.
In a classroom, teachers interact with students primarily with their voice. So the most natural way of interacting with a teaching simulator would also be by talking. First, there were a couple of things I needed before I could start experimenting with voice command technology.
I needed a microphone. For this, I am using the Microsoft Kinect 2.0 device. Yes, the Kinect is a bit of overkill when you just need a microphone, but later I am going to be experimenting with gesture capture that will also be using the Kinect.
Next, I needed a PC running Windows 8 so that I could use the Kinect. I also needed something with significant power to handle 3D rendering, especially if we do eventually drift back towards VR. Sorry, but your Mac laptop just won’t cut it. Hello Alienware!
Now that I had the hardware, it was time to get the software working. I had never done anything with voice command technology before so I really didn’t have any clue where to start. I knew I wanted to use Kinect and Unity but that’s about it.
After scouring the internet for demos and examples, I found only one instance of potentially helpful experimentation. A gentleman named Rumen Filkov had published a package of demos using Unity and Kinect. That package can be found in the Unity Asset Store here.
I contacted Rumen directly, after learning that he was offering the package free to schools and universities. He responded very quickly and provided me a link to download them.
I was able to get most of the demos up and running very quickly; however, the speech recognition demo had some errors. I returned to the description of the package and found I needed to follow some additional installation steps in order to get it working.
First, I needed to install the Microsoft Speech Platform Runtime v11, which can be found here.
I also needed to install the language packs that would be used by the Speech Platform. Those language packs can be found here.
I chose to only install ‘MSKinectLangPack_enUS.msi’, which I assumed was the United States English version.
After installing these two additional pieces of software, Rumen’s speech recognition demo fired up and ran flawlessly!
I can’t thank Rumen enough for this start. Without his demos, the road to getting something up and running would have been a much, much slower one. A link to Rumen’s blog can be found here.
Thanks to Rumen, I now had a basic demo of how the Kinect, Unity and the Microsoft Speech Platform are able to talk to one another and how a user can talk to them. However, to really create something useful to me, I knew I needed a deeper understanding of what was going on here and how the technology was working.
As difficult as it was to find working examples of this technology combination, it was equally difficult, if not more so, to find solid documentation on how to make it work.
Through Rumen’s example, I was able to see that Microsoft’s Speech Platform relies for speech recognition on a Grammar File that must be imported at runtime. The Grammar File is a strictly formatted XML document. Examples and rules for formatting the Grammar File can be found here.
Inside the Grammar File you basically define words or phrases that can be verified by the Speech API. Once a word or phrase is verified, you can then return something to your code indicating which word or phrase was recognized. You can also write rules in the Grammar File that can expand the verification process through a tree of possible results.
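To make that concrete, here is a minimal sketch of a Grammar File in the flat style used by Microsoft’s Kinect speech samples; the phrases and tag values are my own placeholders, not anything from Rumen’s package:

```xml
<grammar version="1.0" xml:lang="en-US" root="classroomCommands"
         tag-format="semantics/1.0-literals"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="classroomCommands" scope="public">
    <one-of>
      <!-- The tag is the literal String handed back to your code -->
      <item>
        <tag>SIT DOWN</tag>
        <one-of>
          <item>sit down</item>
          <item>take your seat</item>
        </one-of>
      </item>
      <item>
        <tag>RAISE HAND</tag>
        <one-of>
          <item>raise your hand</item>
        </one-of>
      </item>
    </one-of>
  </rule>
</grammar>
```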
In Rumen’s example, the item returned from the Grammar File is always a String. However, the examples found in Microsoft’s documentation indicate that the Grammar File should be capable of returning an Object with Properties. I am still unclear on whether it is a limitation of the Kinect, Unity, or Rumen’s code example that restricts the return to a String.
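For illustration, this is roughly what that Object-style return would look like through the managed Microsoft.Speech API, assuming a grammar whose script tags assign out.action and out.name; I have not been able to get this path working through Unity, so treat it as a sketch of the documented behavior:

```csharp
using Microsoft.Speech.Recognition;

class ObjectReturnSketch
{
    // Sketch of the Object-with-Properties return Microsoft's docs describe.
    // Assumes the Grammar File's tags assign out.action and out.name.
    void SpeechRecognizedHandler(object sender, SpeechRecognizedEventArgs e)
    {
        string action = e.Result.Semantics["action"].Value.ToString();
        string name = e.Result.Semantics["name"].Value.ToString();
        // action and name arrive as separate properties;
        // no String parsing would be needed.
    }
}
```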
I realized this potential limitation was important when I determined that, for my purposes, I will often need to know more than one piece of information from the speech verification process.
A classroom is filled with multiple students. Our simulated classroom will be as well, so it seems important that the user of this simulation be able to talk both to the class as a whole and to any individual student. Therefore, after the user speaks, I must have the speech verification process return both who they are speaking to and what action they are requesting of them.
If the return from the Grammar File is limited to a String, the XML has the potential to get very lengthy, especially if I need to create tags that allow every individual to perform every action. What I mean by that is: say there are 10 possible actions (“Sit Down”, “Raise Hand”, “Answer Question”, etc.) and there are 30 students (“Steve”, “Bob”, “Suzy”, etc.) in the class. This means there are 300 possible String returns from the Grammar File. (“Sit Down Steve”, “Sit Down Bob”, “Sit Down Suzy”, “Raise Hand Steve”, “Raise Hand Bob”, “Raise Hand Suzy”, “Answer Question Steve”, “Answer Question Bob”, “Answer Question Suzy”, etc.)
Writing out all these possibilities in the XML by hand would not only be tedious but would also be bad programming since, as I mentioned before, you can create branching rules in the Grammar File to prevent this.
According to Microsoft’s documentation, you should be able to receive an Object from the Grammar File with properties for, in our case, action and name. Without knowing why I can only get a String value back from speech verification, I came up with a workaround: I grab the action and the name using branching grammar rules, then combine them into a single String with the action and name separated by a comma. (“Answer Question, Steve”)
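My actual file is longer, but the shape of the workaround looks roughly like this, switching the tag-format to semantics/1.0 so the tags can run small scripts; the names and phrases are placeholders:

```xml
<grammar version="1.0" xml:lang="en-US" root="command"
         tag-format="semantics/1.0"
         xmlns="http://www.w3.org/2001/06/grammar">

  <!-- Accept the action and the student in either order, then join
       the two results into one comma-separated String. -->
  <rule id="command" scope="public">
    <one-of>
      <item>
        <ruleref uri="#action"/> <ruleref uri="#student"/>
        <tag>out = rules.action + ", " + rules.student;</tag>
      </item>
      <item>
        <ruleref uri="#student"/> <ruleref uri="#action"/>
        <tag>out = rules.student + ", " + rules.action;</tag>
      </item>
    </one-of>
  </rule>

  <rule id="action">
    <one-of>
      <item>sit down<tag>out = "Sit Down";</tag></item>
      <item>raise your hand<tag>out = "Raise Hand";</tag></item>
      <item>answer the question<tag>out = "Answer Question";</tag></item>
    </one-of>
  </rule>

  <rule id="student">
    <one-of>
      <item>Steve<tag>out = "Steve";</tag></item>
      <item>Bob<tag>out = "Bob";</tag></item>
      <item>Suzy<tag>out = "Suzy";</tag></item>
    </one-of>
  </rule>
</grammar>
```

Ten action items and thirty student items would then cover all 300 combinations without writing any of them out by hand.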
I then wrote code in Unity to parse the returned String into an Array, using the comma as the separator. ([“Answer Question”, “Steve”]) Now I have a two-item list. One item in the list is an action and the other item is a name.
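Here is a minimal sketch of that Unity-side parsing; the lookup lists are placeholders standing in for the real roster and action set:

```csharp
using UnityEngine;

public class SpeechCommandParser : MonoBehaviour
{
    // Placeholder lookup tables; the real project would populate these
    // from the student roster and the supported action list.
    private readonly string[] knownActions = { "Sit Down", "Raise Hand", "Answer Question" };
    private readonly string[] knownNames = { "Steve", "Bob", "Suzy" };

    // Handles a verified phrase such as "Answer Question, Steve"
    // or "Steve, Answer Question" - order does not matter.
    public void OnPhraseVerified(string phrase)
    {
        string[] parts = phrase.Split(',');
        string action = null;
        string name = null;

        // Check every item against both lists, so either ordering works.
        foreach (string rawPart in parts)
        {
            string part = rawPart.Trim();
            if (System.Array.IndexOf(knownActions, part) >= 0) action = part;
            if (System.Array.IndexOf(knownNames, part) >= 0) name = part;
        }

        if (action != null && name != null)
        {
            Debug.Log(name + " should now: " + action);
            // ...dispatch the action to that student's game object here...
        }
    }
}
```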
But what about the order of the list? What if the name comes before the action, or vice versa?
Well, to my Unity code it doesn’t make any difference, since, as the sketch above shows, we look over the entire list for an action and again for a name. (It’s a small list.) But to speech verification, the order makes a big difference!
We will want the user of this simulation to feel as natural as possible. Part of the natural way we talk is by addressing who we are talking to at either end of a sentence. (“Could you raise your hand Steve?” or “Steve could you raise your hand?”)
To speech verification, these are two very different phrases, which further emphasizes the importance of using branching rules inside the Grammar File.
To further naturalize the manner in which the user can speak during the simulation, we need to include different options for saying the same thing. For example, “Steve” might also be addressed as “Steven”, and “Bob” might be addressed as “Robert.” Actions might be phrased differently as well: “Could you raise your hand” might mean the same thing as “Do you have a question.” We can handle these alternatives by using a “one-of” list inside the Grammar File.
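Continuing the placeholder sketch from above, those alternatives might look like this, with each spoken form resolving to the same returned value:

```xml
<rule id="student">
  <one-of>
    <!-- Either spoken form returns the same student -->
    <item>Steve<tag>out = "Steve";</tag></item>
    <item>Steven<tag>out = "Steve";</tag></item>
    <item>Bob<tag>out = "Bob";</tag></item>
    <item>Robert<tag>out = "Bob";</tag></item>
  </one-of>
</rule>

<rule id="action">
  <one-of>
    <!-- Two phrasings of the same request -->
    <item>could you raise your hand<tag>out = "Raise Hand";</tag></item>
    <item>do you have a question<tag>out = "Raise Hand";</tag></item>
  </one-of>
</rule>
```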
But what if a phrase is altered just slightly? Like “Could you raise your hand” spoken as “Could you please raise your hand?” Will this verify? Only testing will tell. However, there is another speech verification setting we can adjust to help ensure a recognized result: a confidence setting that controls how closely a piece of speech must match a phrase before returning a positive match. This setting, I have found, also helps keep background noise from returning false positives.
The confidence setting is a Float value and will require some experimentation on your part to get it right. Too low, and you will get false positives; too high, and you will get no positives at all from speech verification.
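As a rough sketch of where that check lives, assuming the managed Microsoft.Speech API from the Speech Platform SDK (Rumen’s package may expose the threshold differently):

```csharp
using Microsoft.Speech.Recognition;

public class ConfidenceFilter
{
    // Tune by experiment: too low invites false positives,
    // too high rejects everything.
    private const float RequiredConfidence = 0.5f;

    public void SpeechRecognizedHandler(object sender, SpeechRecognizedEventArgs e)
    {
        // Reject matches the engine is not sufficiently sure about;
        // this also filters out most background-noise false positives.
        if (e.Result.Confidence < RequiredConfidence)
            return;

        string phrase = e.Result.Semantics.Value.ToString();
        // ...hand the verified phrase to the command parser here...
    }
}
```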
That about covers, at a high level, what I have learned and experimented with so far about voice commands using Kinect, Unity and Microsoft’s Speech Platform. Below is a video demo of the project’s progress.