International Corpus of English

Frequently Asked Questions

What does VOICE stand for?

VOICE stands for Vienna-Oxford International Corpus of English. ‘Oxford’ is a constituent of the name VOICE because the Oxford University Press supported the VOICE project financially in its initial phase. ‘Vienna’ points to the location of the corpus compilation at the University of Vienna.

What is a corpus?

“In the language sciences a corpus is a body of written text or transcribed speech which can serve as a basis for linguistic analysis and description.” (p.1)
Kennedy, Graeme. 1998. An introduction to corpus linguistics. London: Longman.

What is English as a lingua franca (ELF)?

English as a lingua franca (ELF) can be thought of as "any use of English among speakers of different first languages for whom English is the communicative medium of choice, and often the only option." (Seidlhofer 2011: 7)
ELF is currently the most common use of English world-wide. Millions of speakers from diverse cultural and linguistic backgrounds use ELF on a daily basis, routinely and successfully, in their professional, academic and personal lives.
Seidlhofer, Barbara. 2011. Understanding English as a Lingua Franca. Oxford: OUP.

Why does VOICE focus on spoken data?

Spoken interactions are immediate and at a remove from the stabilizing and standardizing influence of writing. They are overtly reciprocal and reveal the online negotiation of meaning in the production and reception of utterances, thus facilitating observations regarding mutual intelligibility among interlocutors.

How big is VOICE?

The current size of VOICE 2.0 Online is just over 1 million words of spoken ELF, equalling 110 hours and 35 minutes of recorded and transcribed interactions.

Which first languages are represented in VOICE?

Since the focus of VOICE at this stage is primarily, but not exclusively, on Europe, all major first languages spoken across Europe are represented in the corpus. In total, VOICE encompasses 49 different first languages, including non-European ones.

Does VOICE/ELF include native speakers of English?

ELF interactions often also include speakers from backgrounds where English is used as a first or second language. The VOICE project therefore works with a definition of ELF which includes English native speakers as well. Nevertheless, so-called non-native speakers of English commonly outnumber English native speakers in ELF interactions, a fact also reflected in VOICE: speakers who have English as a first language make up only about 7 per cent of all speakers recorded in VOICE.

Does VOICE Online include audio files?

As of 24 November 2010, 23 recordings of transcribed speech events can also be listened to. The anonymized audio material is freely accessible from within the VOICE Online interface after a free registration for the VOICE Online services. The audio material covers approximately 22 hours of field recordings, which equals about 20% of the entire corpus. We trust that this new feature will further increase the value of VOICE for research. For detailed information on using the new audio features, please refer to the subsection "audio files" in Using VOICE Online.

Is VOICE available for download?

As of 5 May 2011, VOICE XML has become available for download. VOICE XML is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License and includes all corpus texts in XML format as well as derived HTML and TXT versions of the corpus with reduced mark-up. Since 22 January 2013, the download package also includes VOICE POS XML, a part-of-speech tagged and lemmatized version of the corpus.

Can I search for the speakers' first languages in VOICE Online?

There is no automatic search function to detect utterances by speakers of a specific L1 in VOICE Online because the corpus is not balanced for L1s (see Corpus Statistics). However, the information on all speakers' L1s can be found in the header of each individual transcript and via the 'speaker information pop-up', which lists the occurrences of individual speakers in other speech events. When working with the XML versions of VOICE, individual search queries, such as for speaker-related information, can be created (see the README file in the XML download package for information on the coding format for the speakers' languages).

Where can I find which abbreviations for speakers' first languages correspond to which L1s? (e.g. L1=ger-IT)

The languages are abbreviated according to the ISO 639-2 Codes for the representation of names of languages. The corresponding countries are abbreviated according to the ISO 3166-1 alpha-2 codes. In the example L1=ger-IT, 'ger' is the language code for German and 'IT' is the country code for Italy. However, ISO codes are dynamic and some have changed since VOICE was first released; please see the ISO 639-2/RA Change Notice for any changes since the VOICE release.
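
To illustrate how such an abbreviation can be read, here is a minimal Python sketch. It assumes that the code consists of a language code, optionally followed by a hyphen and a country code, as in the example ger-IT. The function names and the two small lookup tables are hypothetical and cover only a handful of codes; they are not part of VOICE, and the full ISO lists should be consulted for real work.

```python
# Illustrative only: a tiny hand-picked subset of ISO codes, not a full table.
LANGUAGES = {"ger": "German", "fre": "French", "ita": "Italian"}  # ISO 639-2
COUNTRIES = {"IT": "Italy", "AT": "Austria", "FR": "France"}      # ISO 3166-1 alpha-2


def parse_l1_code(code):
    """Split a code such as 'ger-IT' into (language code, country code).

    Assumption: the country part is optional; if it is missing,
    the second element of the result is None.
    """
    language, sep, country = code.partition("-")
    return language, (country if sep else None)


def describe_l1(code):
    """Return a readable description of an L1 code (hypothetical helper)."""
    language, country = parse_l1_code(code)
    language_name = LANGUAGES.get(language, f"unknown language '{language}'")
    if country is None:
        return language_name
    country_name = COUNTRIES.get(country, f"unknown country '{country}'")
    return f"{language_name} ({country_name})"


print(describe_l1("ger-IT"))  # German (Italy)
```

Codes that are missing from the lookup tables are reported as unknown rather than guessed, which is useful because some ISO codes have changed over time.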

When a speaker has two L1s, for instance French and German, are the utterances made by this speaker counted for both languages?

Yes. The tokens of speakers who have more than one L1 are counted under each of those languages. This is also the reason why the token counts given in the corpus statistics for first languages add up to more than the total number of tokens in the corpus itself, and why the percentages given in the corpus statistics amount to more than 100%.
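
The effect of this counting method can be shown with a small worked example in Python. The speakers and token counts below are invented for illustration and do not come from VOICE; the sketch only demonstrates why counting a bilingual speaker under both L1s pushes the per-language totals above the corpus total.

```python
from collections import Counter

# Invented speakers and token counts, for illustration only.
speakers = [
    {"id": "S1", "l1s": ["fre", "ger"], "tokens": 400},  # two L1s
    {"id": "S2", "l1s": ["ger"], "tokens": 350},
    {"id": "S3", "l1s": ["ita"], "tokens": 250},
]


def tokens_per_l1(speakers):
    """Count each speaker's tokens once under every L1 the speaker has."""
    counts = Counter()
    for speaker in speakers:
        for l1 in speaker["l1s"]:
            counts[l1] += speaker["tokens"]
    return counts


corpus_total = sum(s["tokens"] for s in speakers)  # each token counted once
per_l1 = tokens_per_l1(speakers)                   # S1 counted under fre and ger
l1_total = sum(per_l1.values())
percentages = {l1: 100 * n / corpus_total for l1, n in per_l1.items()}

print(corpus_total)               # 1000
print(dict(per_l1))               # {'fre': 400, 'ger': 750, 'ita': 250}
print(l1_total)                   # 1400
print(sum(percentages.values()))  # 140.0
```

In this toy corpus the 400 tokens of the bilingual speaker S1 appear under both French and German, so the per-language counts sum to 1400 tokens (140%) although the corpus contains only 1000 tokens.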