Getting directions to the nearest Starbucks or Target is a task Apple’s virtual assistant can handle with ease. But what about local businesses with names Siri has never heard, names it might mistake for another phrase or for the user misspeaking? To handle these, Apple has created libraries of hyper-local place names so Siri never hears “Godfather’s Pizza” as “got father’s piece.”
Speech recognition systems have to be trained on large bodies of data, but while that makes them highly capable when it comes to parsing sentences and recognizing phrases, it doesn’t always teach them the kind of vocabulary that you and your friends use all the time.
When I tell a friend, “let’s go to St John’s for a drink,” they know I don’t mean some cathedral in the Midwest but the bar up the street. But Siri doesn’t really have any way of knowing that — in fact, unless the system knows that “Saint John’s” is a phrase in the first place, it might think I’m saying something else entirely. It’s different when you type it into a box — it can just match strings — but when you say it, Siri has to make her best guess at what you said.
But if Siri knew that in the Seattle area, when someone says something that sounds like St John’s, they probably mean the bar, then she could respond more quickly and accurately, without having to think hard or have you select from a list of likely saints. And that’s just what Apple’s latest research does. It’s out now in English, and other languages are likely only a matter of time.
To do this, Apple’s voice recognition team pulled local search results from Apple Maps, sorting out the “places of interest” — you (or an algorithm) can spot these, because people refer to them in certain ways, like “where is the nearest…” and “directions to…” and that sort of thing.
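Trigger phrases like these are easy to spot mechanically. Here is a minimal sketch of that kind of intent check; the phrase list below is illustrative, not Apple’s actual one:

```python
import re

# Hypothetical location-seeking phrases that flag a query as a
# point-of-interest (POI) search. Apple's mined phrase set is not public.
POI_TRIGGERS = [
    r"where is the nearest\b",
    r"directions to\b",
    r"navigate to\b",
    r"take me to\b",
]

_trigger_re = re.compile("|".join(POI_TRIGGERS), re.IGNORECASE)

def looks_like_poi_query(transcript: str) -> bool:
    """Return True if the transcript contains a location-seeking phrase."""
    return bool(_trigger_re.search(transcript))

print(looks_like_poi_query("Directions to Machiavelli's"))  # True
print(looks_like_poi_query("Who is Machiavelli"))           # False
```

In practice this classification would happen on recognition hypotheses rather than a clean transcript, but the idea is the same: the phrasing of the request, not the place name itself, is what signals a POI search.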
Obviously the sets of these POIs, once you remove national chains like Taco Bell, will represent the unique places that people in a region search for. Burger-seekers here in Seattle will ask about the nearest Dick’s Drive-in, for example (though we already know where they are), while those in L.A. will of course be looking for In-N-Out. But someone in Pittsburgh is likely never looking for either.
Apple sorted these into 170 distinct areas: 169 “combined statistical areas” as defined by the U.S. Census Bureau, which are small enough to have local preferences but not so small that you end up with thousands of them. The special place names for each of these were trained not into the main language model (LM) used by Siri, but into tiny adjunct models (called Geo-LMs) that are swapped in when the user appears to be looking for a POI, as signaled by those location-indicating phrases above.
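To make the routing concrete, here is a toy sketch of how a user’s coordinates might select the right adjunct model. The regions, centroids, and word lists are all made up for illustration: real CSAs are Census Bureau polygons, not centroids, and real Geo-LMs are full language models, not word lists.

```python
import math

# Toy stand-in for the combined statistical areas: each region is
# reduced here to a single centroid (lat, lon). Nearest-centroid lookup
# is only a placeholder for proper point-in-polygon geocoding.
REGION_CENTROIDS = {
    "seattle_csa": (47.61, -122.33),
    "los_angeles_csa": (34.05, -118.24),
    "pittsburgh_csa": (40.44, -79.99),
}

# One tiny adjunct vocabulary ("Geo-LM") per region.
GEO_LMS = {
    "seattle_csa": {"dick's drive-in", "st john's"},
    "los_angeles_csa": {"in-n-out"},
    "pittsburgh_csa": {"primanti bros."},
}

def nearest_region(lat: float, lon: float) -> str:
    """Pick the region whose centroid is closest to the user."""
    return min(
        REGION_CENTROIDS,
        key=lambda r: math.dist((lat, lon), REGION_CENTROIDS[r]),
    )

region = nearest_region(47.66, -122.31)  # a point in Seattle
print(region, GEO_LMS[region])
```

Because each Geo-LM only has to cover one region’s names, it can stay tiny, which is why this works as an adjunct to the main model rather than bloating it.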
So when you ask “who is Machiavelli,” you get the normal answer. But when you ask “where is Machiavelli’s,” that prompts the system to query the local Geo-LM (your location is known, of course) and check whether Machiavelli’s is on the list of local POIs (it should be, because the food is great there). Now Siri knows to respond with directions to the restaurant and not to the actual castle where Machiavelli was imprisoned.
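In a real recognizer the Geo-LM would be folded into decoding itself; as a crude stand-in, this sketch simply promotes any candidate transcript that names a known local POI when the query looks like a POI search. The hypothesis scores and place names are invented for the example.

```python
def pick_transcript(hypotheses, is_poi_query, local_pois):
    """
    hypotheses: list of (text, base_lm_score) pairs from the recognizer,
    best-scoring first. If the query looks like a POI search, prefer the
    first hypothesis that names a known local place; otherwise trust the
    base language model's ranking.
    """
    if is_poi_query:
        for text, _score in hypotheses:
            if any(poi in text.lower() for poi in local_pois):
                return text
    return hypotheses[0][0]

# The base LM slightly prefers a garbled spelling, but the local POI
# list rescues the right one for a "where is" query.
hyps = [("where is macchiavelli's", 0.9), ("where is machiavelli's", 0.8)]
print(pick_transcript(hyps, True, {"machiavelli's"}))
```

The key design point is that the general model is never modified: the local knowledge only kicks in when both the location and the phrasing say it should, so “who is Machiavelli” is untouched.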
Doing this cut the error rate by a huge amount: from as much as 25-30 percent down to 10-15 percent. That means getting the right result 8 or 9 times out of 10 rather than roughly 7; a qualitative improvement that could keep people from abandoning Siri queries in frustration when it repeatedly fails to understand what they want.
What’s great about this approach is that it’s relatively simple (if not trivial) to expand to other languages and domains. There’s no reason it wouldn’t work for Spanish or Korean, as long as there’s enough data to build it on. And for that matter, why shouldn’t Siri have a special vocabulary set for people in a certain jargon-heavy industry, to reduce spelling errors in notes?
This improved capability is already out, so you should be able to test it out now — or maybe you have been for the last few weeks and didn’t even know it.