
Project Details

Project Name: 
Automatic annotation of the Spoken BNC
Principal Investigator / Director: 
John Coleman and Mark Liberman
Oxford participants: 
John Coleman; John Pybus (Main Contact)
Other Participants: 
not specified
  • Division: Humanities
  • Unit: Linguistics, Philology & Phonetics Faculty
  • Sub-Unit: Phonetics Laboratory
Start Date: 
01/01/2010
End Date: 
30/06/2011
Partner organizations (inside or outside Oxford): 
The Linguistic Data Consortium, University of Pennsylvania, and The British Library
Funder: 
JISC
Subject Area: 
Linguistics
Project Description: 

This work is being carried out as a sequence of projects. After an initial pilot, we conducted "Mining a Year of Speech", a project that addressed the challenge of providing rich, intelligent data mining capabilities for a substantial collection of spoken audio data in American and British English. We applied state-of-the-art techniques, primarily "forced alignment" (automatically time-aligning an existing transcript with its audio recording), to offer sophisticated, rapid and flexible access to a richly annotated corpus of a year of speech (about 9000 hours, 100 million words, or 2 terabytes of audio), derived from the Linguistic Data Consortium, the British National Corpus, and other existing resources. This is at least ten times more data than has previously been used by researchers in fields such as phonetics, linguistics, or psychology, and more than 100 times the amount used in common practice.
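As a rough sanity check on the scale figures quoted above, the following sketch derives a few implied rates from them. It is illustrative only: the three input figures are the approximate ones given in the description, the function name `corpus_scale` is hypothetical, and a terabyte is assumed to mean 10^12 bytes.

```python
# Approximate figures quoted in the project description.
HOURS_OF_SPEECH = 9_000
WORDS = 100_000_000
TOTAL_BYTES = 2 * 10**12  # assumption: decimal terabytes (10^12 bytes each)


def corpus_scale(hours, words, total_bytes):
    """Derive implied rates from the quoted corpus size (hypothetical helper)."""
    return {
        "days": hours / 24,                            # continuous playback time
        "words_per_minute": words / (hours * 60),      # average transcribed word rate
        "bytes_per_second": total_bytes / (hours * 3600),  # average audio data rate
    }


scale = corpus_scale(HOURS_OF_SPEECH, WORDS, TOTAL_BYTES)
for name, value in scale.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions, 9000 hours corresponds to 375 days of continuous audio, which is consistent with describing the corpus as "a year of speech".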

In an ESRC-funded follow-on project, "Word-joins in real-life speech", we are using the aligned, transcribed spoken BNC for sociophonetic research on how words are connected together in ordinary speech (see separate project record).

ICT Methods: 
Category | Sub-Headings | Details
Data analysis | Audiovisual Analysis | Sound analysis
Data analysis | Searching/Linking | Data mining
Data structuring and enhancement | Audio-Visual Processing |
Last updated: 
25/06/2015 16:24:50
Updated by: 
martinw@ox.ac.uk