Automatic annotation of the Spoken BNC

Linguistics, Philology & Phonetics Faculty, John Coleman and Mark Liberman - JISC

Principal Investigator / Director: John Coleman and Mark Liberman

Oxford participants: John Coleman; John Pybus

Other Participants: not specified

Project Webpage: http://www.phon.ox.ac.uk/mining

Start Date: 01/01/2010

End Date: 30/06/2011

Funder: JISC

Partner organizations (inside or outside Oxford): The Linguistic Data Consortium, University of Pennsylvania, and The British Library

Project Description:

This work is being carried out in a sequence of projects. After an initial pilot, we conducted "Mining a Year of Speech", a project that addressed the challenge of providing rich, intelligent data mining capabilities for a substantial collection of spoken audio data in American and British English. We applied state-of-the-art techniques - primarily "forced alignment" - to offer sophisticated, rapid and flexible access to a richly annotated corpus of a year of speech (about 9000 hours, 100 million words, or 2 terabytes of audio), derived from the Linguistic Data Consortium, the British National Corpus, and other existing resources. This is at least ten times more data than has previously been used by researchers in fields such as phonetics, linguistics, or psychology, and more than 100 times the amount used in common practice.

In an ESRC-funded follow-on project, "Word-joins in real-life speech", we are using the aligned, transcribed Spoken BNC for sociophonetic research on how words are connected in ordinary speech (see separate project record).
