Vienna-Oxford
International Corpus of English

Corpus Information

To make the best use of VOICE as a research resource, users need to know what kind of data VOICE seeks to represent, how the data in the corpus were collected and transcribed, and how they relate to each other.

The present section therefore offers a concise overview and relevant details concerning the VOICE project, the corpus design, sampling principles, and the transcription of data. The information provided (together with details concerning speaker information) is duplicated in the Corpus Header, which is part of VOICE Online.

Vienna-Oxford International Corpus of English (VOICE)

Version 2.0 Online, January 2013

Project Director

  • Barbara Seidlhofer

Project Funding

  • VOICE is funded by FWF, the Austrian Science Fund (Project No. L448).
  • These funds were further supplemented by a contribution from Oxford University Press in 2008 and 2009. Supporting funds were also provided in the early pilot phase by Oxford University Press and by the Hochschuljubiläumsstiftung der Stadt Wien.

Size

1,023,082 orthographically defined words, totalling 110 hours 35 minutes and 56 seconds of recording.
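The two size figures above can be combined into a rough average speech rate. The following is a minimal sketch using only the numbers stated in this section; the function names are illustrative, and the result is an average over all recorded time (including pauses and overlaps), not a measure for any individual speaker.

```python
# Figures taken from the Size statement above.
WORDS = 1_023_082
HOURS, MINUTES, SECONDS = 110, 35, 56


def duration_in_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s duration into a total number of seconds."""
    return hours * 3600 + minutes * 60 + seconds


def words_per_minute(words: int, total_seconds: int) -> float:
    """Average number of words per minute of recording."""
    return words / (total_seconds / 60)


total_seconds = duration_in_seconds(HOURS, MINUTES, SECONDS)
rate = words_per_minute(WORDS, total_seconds)

print(f"{total_seconds} seconds of recording")
print(f"about {rate:.0f} words per minute on average")
```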

Source Description

VOICE is based on audio-recordings of 151 naturally occurring, non-scripted, face-to-face interactions involving 753 identified individuals from 49 different first language backgrounds using English as a lingua franca (ELF), i.e. English used as a common means of communication among speakers from different first-language backgrounds. The recordings were carried out between July 2001 and November 2007, usually using portable mini-disc recorders with external microphones. Most of the audio-recordings are supplemented by detailed field notes including information about the nature of the speech event and the interaction taking place as well as about the participants engaging in these ELF interactions. The interactions recorded are complete speech events from different domains (educational, leisure, professional) and of different speech event types (conversation, interview, meeting, panel, press conference, question-answer session, seminar discussion, service encounter, working group discussion, workshop discussion). The audio-recordings were transcribed, checked and proof-read by trained transcribers and researchers in accordance with the VOICE mark-up and spelling conventions [2.1] (see http://www.univie.ac.at/voice/page/transcription_general_information).

Details for each electronic text are given in the individual text headers.

The principles and practices underlying the selection and design of the corpus are documented in the project and sampling description.

Publication Statement

Barbara Seidlhofer, Stefan Majewski, Ruth Osimk-Teasdale, Marie-Luise Pitzl, Michael Radeka

Spitalgasse 2, AAKH Hof 8
1090 Vienna
Austria
Telephone: +43 1 427742446
Email: voice@univie.ac.at
Website: http://www.univie.ac.at/voice

The Vienna-Oxford International Corpus of English (VOICE) was created by Barbara Seidlhofer (project director) and Angelika Breiteneder, Theresa Klimpfinger, Stefan Majewski, Marie-Luise Pitzl (project researchers). Minor revisions were gathered by Ruth Osimk-Teasdale and Michael Radeka and corrections were made by Ruth Osimk-Teasdale. VOICE 2.0 Online (which is based on VOICE 2.0 XML) is freely available at the VOICE Project's website http://www.univie.ac.at/voice conditional on compliance with the Terms of Use specified there. The original audio files are held at the Department of English, University of Vienna. 23 selected audio files are available as audio streams in the VOICE Online interface at the VOICE Project's website http://www.univie.ac.at/voice.

The recommended citation for VOICE 2.0 Online is:

VOICE. 2013. The Vienna-Oxford International Corpus of English (version 2.0 Online). Director: Barbara Seidlhofer; Researchers: Angelika Breiteneder, Theresa Klimpfinger, Stefan Majewski, Ruth Osimk-Teasdale, Marie-Luise Pitzl, Michael Radeka. http://voice.univie.ac.at (date of last access).

The short citation for VOICE 2.0 Online is:

VOICE. 2013. The Vienna-Oxford International Corpus of English (version 2.0 Online). http://voice.univie.ac.at (date of last access).

For further information about availability and copyright permissions, please see the Terms of Use. For further enquiries please contact the VOICE Project at voice@univie.ac.at.

Project and Sampling Description

The most widespread contemporary use of English throughout the world is that of English as a lingua franca (ELF), i.e. English used as a common means of communication among speakers from different first-language backgrounds (see Seidlhofer 2005 and Seidlhofer 2011). Nevertheless, linguistic descriptions have as yet focused almost entirely on English as it is spoken and written by its native speakers. The VOICE project seeks to redress the balance by providing the first general corpus capturing spoken ELF interactions as they happen naturally in various contexts. VOICE was designed and compiled to make possible a linguistic description of this most common contemporary use of English by providing a corpus of spoken ELF interactions which is freely accessible to linguistic researchers all over the world. The corpus is stored in a TEI-based XML format and rendered into HTML online with a set of XSL Transformation stylesheets.

The unit chosen for sampling data for inclusion in VOICE is that of the speech event. Speech events are (as far as practicalities allowed) included in their entirety. The speech events were selected for inclusion in the corpus on the basis of a set of seven external, i.e. non-linguistic, criteria, which therefore define the target population. Accordingly, VOICE captures speech events that fulfil the following criteria:

As to the sampling method used, subgroups of the target population were identified at the level of domain, and target proportions were specified for them as follows: Educational 25%, Leisure 10%, Professional-business 20%, Professional-organizational 35%, Professional-research/science 10%.
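As a quick check, the five target proportions can be verified to sum to 100% and converted into approximate target sizes. This is only a sketch: this section does not state which unit the proportions apply to, so applying them to the total word count of 1,023,082 is an assumption made for illustration, and the function name is hypothetical.

```python
# Target proportions per domain, as listed above (in percent).
TARGET_PROPORTIONS = {
    "Educational": 25,
    "Leisure": 10,
    "Professional-business": 20,
    "Professional-organizational": 35,
    "Professional-research/science": 10,
}

# Assumption: the proportions are applied to the corpus word count.
CORPUS_WORDS = 1_023_082


def target_sizes(proportions: dict, total: int) -> dict:
    """Return the approximate target size per domain for a given total."""
    return {domain: round(total * pct / 100) for domain, pct in proportions.items()}


# The proportions should cover the whole target population.
assert sum(TARGET_PROPORTIONS.values()) == 100

targets = target_sizes(TARGET_PROPORTIONS, CORPUS_WORDS)
for domain, words in targets.items():
    print(f"{domain}: {TARGET_PROPORTIONS[domain]}% -> roughly {words:,} words")
```

These are target figures only; the proportions actually achieved in the corpus may differ.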

Short portions of some speech events were left untranscribed. Such gaps can occur for the following reasons: monologues exceeding ten minutes, scripted speech, sensitive content, non-English speech exceeding one utterance per speaker, unintelligible speech, and lengthy explanations by VOICE researchers who were present. Every gap is indicated in the transcript, together with the reason for the gap, the length of the untranscribed portion, and some contextual information about what happens during it.

Domains: definitions

Domains in VOICE denote socially defined situations or areas of activity.
ED (educational):
The educational domain includes all social situations connected with institutions or people involved in teaching, training or studying.
LE (leisure):
The leisure domain includes all social situations occurring during the time that is spent doing something one chooses to do when one is not working or studying.
P (professional):
The professional domain includes all social situations connected with an activity that needs special expertise.
PB (professional business):
The professional business domain includes all social situations connected with activities of making, buying, selling or supplying goods or services for money.
PO (professional organizational):
The professional organizational domain includes all social situations connected with activities of international organizations or networks which are not doing research or business.
PR (professional research and science):
The professional research/science domain includes all social situations connected with the careful study of a subject, especially in order to discover new facts or information about it.

Speech Event Types: definitions

Speech Event Types (SPETs) in VOICE refer to particular types of speech event which are defined on the basis of purpose, type, and number of participants.
con (conversation):
A conversation is defined as a speech event at which people interact without a predefined purpose.
int (interview):
An interview is defined as a speech event at which questions are being asked and answered.
mtg (meeting):
A meeting is defined as a speech event at which a clearly defined group of people meets to discuss previously specified matters.
pan (panel):
A panel is defined as a speech event at which a group of specialists give their advice or opinion on a specified topic to an audience.
prc (press conference):
A press conference is defined as a speech event at which somebody talks to a group of journalists in order to answer their questions and/or to make an official statement.
qas (question-answer session):
A question-answer session is defined as a speech event at which members of an audience ask questions which are answered by specialist speakers.
sed (seminar discussion):
A seminar discussion is defined as a speech event at which a group of people meets for systematic study and/or work under the direction of one or more experts.
sve (service encounter):
A service encounter is defined as a speech event at which somebody seeks a service which is provided by somebody else.
wgd (working group discussion):
A working group discussion is defined as a speech event at which a (temporarily formed) subgroup of a larger group discusses a particular problem or question in order to suggest ways of dealing with it.
wsd (workshop discussion):
A workshop discussion is defined as a speech event at which a specific group of people exchanges views, ideas or information on a particular topic.
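The domain and speech event type codes defined above lend themselves to simple lookup tables. The sketch below additionally assumes that a text identifier consists of a domain code, a SPET code, and a running number (for example "EDsed31"); this identifier format is an assumption not stated in this section, and the names `parse_text_id` and `describe` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Codes and labels as defined in the two lists above.
DOMAINS = {
    "ED": "educational",
    "LE": "leisure",
    "P": "professional",
    "PB": "professional business",
    "PO": "professional organizational",
    "PR": "professional research and science",
}
SPETS = {
    "con": "conversation",
    "int": "interview",
    "mtg": "meeting",
    "pan": "panel",
    "prc": "press conference",
    "qas": "question-answer session",
    "sed": "seminar discussion",
    "sve": "service encounter",
    "wgd": "working group discussion",
    "wsd": "workshop discussion",
}


class TextId(NamedTuple):
    domain: str
    spet: str
    number: int


# Longer domain codes come first so that "PB" is not read as "P".
_ID_PATTERN = re.compile(
    "^(" + "|".join(sorted(DOMAINS, key=len, reverse=True)) + ")"
    "(" + "|".join(SPETS) + r")(\d+)$"
)


def parse_text_id(text_id: str) -> Optional[TextId]:
    """Split an identifier into domain, SPET and number (assumed format)."""
    match = _ID_PATTERN.match(text_id)
    if match is None:
        return None
    return TextId(match.group(1), match.group(2), int(match.group(3)))


def describe(text_id: str) -> str:
    """Spell out the codes of an identifier in words."""
    parsed = parse_text_id(text_id)
    if parsed is None:
        return f"{text_id}: not recognised"
    return f"{text_id}: {DOMAINS[parsed.domain]} {SPETS[parsed.spet]} no. {parsed.number}"


print(describe("EDsed31"))
print(describe("PBmtg3"))
```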

Transcription

The speech events included in VOICE are transcribed according to the VOICE Transcription Conventions [2.1], comprising the VOICE mark-up conventions and the VOICE spelling conventions. With the exception of four widespread lexicalized phonological reductions (cos, gonna, gotta, wanna) and all standard contractions, words are represented in full standard orthographic form. Specific mark-up, e.g. for lengthening, emphasis, speaking modes, rising and falling intonation, allows for selected prosodic features to be included in the transcripts. All false starts and repetitions are represented in the transcripts.

Based on TEI Guidelines and for the purposes of this transcription, an utterance in a speech event is normally taken to be "a stretch of speech usually preceded and followed by silence or by a change of speaker".

The speech events in VOICE also include switches into non-English speech. Generally, one utterance per person in non-English speech is transcribed, but longer turns in non-English speech are left untranscribed. If the transcriber is familiar with the language, non-English utterances are transcribed in full standard orthographic form, but excluding diacritics, umlauts, and non-Roman characters. Whenever possible, an approximate translation into English is provided.

Words are represented in British English spelling, following the Oxford Advanced Learner's Dictionary (7th edition), with the exception of 12 words (as well as their derivatives) which are spelt according to American English usage: center, theater, behavior, color, favor, labor, neighbor, defense, offense, disk, program, and travel (traveled, traveler, traveling).

Additionally, all words (verbs, nouns, etc.) which can be spelt with either -ise or -ize are spelt with the -ize variant in the transcripts. For the rationale behind this decision see Breiteneder, Pitzl, Majewski and Klimpfinger 2006.
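The two spelling rules above can be illustrated with a small lookup. This is a sketch, not the procedure used by the VOICE team: the British source forms are supplied here for illustration, derivatives would each need their own entry, and the -ise/-ize table is a hypothetical three-word sample, since the rule only covers words that can be spelt either way (a word such as "advise" must stay unchanged).

```python
# The 12 exception words listed above, keyed by their British spelling.
# "travel" itself is spelt the same way; only its inflected forms differ.
AMERICAN_FORMS = {
    "centre": "center",
    "theatre": "theater",
    "behaviour": "behavior",
    "colour": "color",
    "favour": "favor",
    "labour": "labor",
    "neighbour": "neighbor",
    "defence": "defense",
    "offence": "offense",
    "disc": "disk",
    "programme": "program",
    "travelled": "traveled",
    "traveller": "traveler",
    "travelling": "traveling",
}

# Hypothetical sample of words that may be spelt with -ise or -ize.
IZE_FORMS = {
    "organise": "organize",
    "realise": "realize",
    "recognise": "recognize",
}


def to_voice_spelling(word: str) -> str:
    """Return the spelling this sketch would use for a single word."""
    key = word.lower()
    if key in AMERICAN_FORMS:
        return AMERICAN_FORMS[key]
    if key in IZE_FORMS:
        return IZE_FORMS[key]
    # Everything else keeps its (British) spelling.
    return word


for sample in ["colour", "organise", "advise", "travelling", "analyse"]:
    print(sample, "->", to_voice_spelling(sample))
```

An explicit table is used rather than a blanket suffix replacement so that words with only one valid spelling are never altered.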

In addition to manual checking and proof-reading, the individual transcripts were checked with the OpenOffice.org spellchecker.

Furthermore, spelling in the entire corpus was checked against the Oxford Advanced Learner's Dictionary lexicon.

Minor revisions and corrections in some of the corpus texts were made in July 2012.