When I took a speech recognition class last semester, a guy from BBN (a subsidiary of Raytheon) gave a talk about large-scale, extremely fast audio transcription. As in, systems that could process audio 30,000-40,000 times faster than real time. They traded off recognition accuracy to get that speed, so their accuracy was around 50-60%, as I recall. I asked why something like that would be useful, and he said that if you're looking at a lot of data (which I heard as: eavesdropping on an entire telephone network), all you need is a general idea of what people are saying and a few keywords; then you can zoom in on specific clips for more thorough analysis.
So yes, I'm sure the NSA is interested.