Some interesting facts about why speech technologists might want to focus on India
India is the world’s richest and most diverse linguistic region. Here is a quote from a report published by the Census commission of India (www.censusindia.net):
It may be important to note that innumerable mother tongues are returned at every census. For example, in 1961 and 1971 censuses the total number of mother tongues returned was around 3,000, in 1981 around 7,000 and in 1991, these were more than 10,000. These vast raw returns need to be identified and classified in terms of actual languages and dialects to present a meaningful linguistic picture of the country. This operations of linguistic identification of raw mother tongue returns or linguistic rationalization and classification, produced a list of rationalized mother tongues in each census: For example, the list produced in 1961 had 1652 mother tongue names, in 1991 it was 1576. These 1576 rationalized mother tongues were further classified following the usual linguistic methods and grouped under appropriate languages. The total number of languages so arrived at was 114 in 1991 Census.
According to the census commission, over 10,000 distinct linguistic forms are spoken, which, even after extreme grouping, result in well over 100 linguistic groups that are spoken by several million people. The commission also reports that nearly 20% of the Indian population is bilingual, while about 7.5% is trilingual. The reality is more complex: language is a continuum in India, one that evolves over distances as short as a few hundred kilometers, often into an entirely new form that is unrecognizable to speakers of the language at the point of origin. In addition, multilinguality is often implicit rather than explicit: words, phrases, and grammatical structures from other languages are very frequently used within a language, often unwittingly. More interesting to a speech technologist, even when the linguistic structure follows that of a second, third or fourth language, the phonetic structure usually remains that of the native language.
For speech technologists, this is incredibly challenging turf. Building voice systems for English is hard enough. As a research area, speech recognition technology has largely been developed around Western languages. In recent years the spotlight has shifted to focus more clearly on Chinese, Japanese and other Far Eastern languages, and, since 9/11, on Arabic and the Middle East. More than ever, researchers are realizing how difficult it is to deploy systems that were developed primarily around Western languages for these languages, and have found that they have to rethink some of the basic algorithms and strategies used in the core systems for voice recognition. Consider the fact that, despite over five decades of research, speech recognition systems in English are still far from perfect. Porting these technologies to new and relatively poorly studied languages is far from straightforward, and the technologies will need significant innovation and improvement before they are suitable for widespread use in Indian languages.
