Automatic annotation of the Spoken BNC
This work is being carried out in a sequence of projects. After an initial pilot, we conducted "Mining a Year of Speech", a project that addressed the challenge of providing rich, intelligent data-mining capabilities for a substantial collection of spoken audio data in American and British English. We applied state-of-the-art techniques, primarily "forced alignment", to offer sophisticated, rapid and flexible access to a richly annotated corpus of a year of speech (about 9000 hours, 100 million words, or 2 terabytes of audio), derived from the Linguistic Data Consortium, the British National Corpus, and other existing resources. This is at least ten times more data than has previously been used by researchers in fields such as phonetics, linguistics, or psychology, and more than 100 times the amount used in common practice.
In an ESRC-funded follow-on project, "Word-joins in real-life speech", we are using the aligned, transcribed Spoken BNC for sociophonetic research on how words are joined together in ordinary speech (see separate project record).