International Corpus of English

As of January 2013, an updated XML version of VOICE is available for download. The download package now includes VOICE 2.0 XML, an updated version of the corpus with minor revisions in some of the corpus texts, as well as VOICE POS XML 2.0, the first part-of-speech tagged and lemmatized version of VOICE, which is based on the same source code as VOICE 2.0 XML. The previously released versions VOICE 1.0 XML (which corresponds to the first release of VOICE 1.0 Online in May 2009) and VOICE 1.1 XML (an updated version of VOICE, released 5 May 2011) are also included. VOICE XML is made available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) License.

The Vienna-Oxford International Corpus of English (VOICE) was created by Barbara Seidlhofer (project director) and Angelika Breiteneder, Theresa Klimpfinger, Stefan Majewski, Marie-Luise Pitzl (project researchers) at the University of Vienna in a time-, labour- and cost-intensive process. Revisions and corrections to the corpus texts were made by Ruth Osimk-Teasdale. The tokenisation was carried out by Stefan Majewski and Michael Radeka. The corpus was lemmatized by Michael Radeka and part-of-speech tagged by Ruth Osimk-Teasdale and Michael Radeka. We are making VOICE available to be used for non-commercial research purposes. It is in this spirit that access to VOICE XML is provided free of charge to all those who are interested in using the corpus for such purposes.

VOICE XML broke new ground in 2011 in that it was the first corpus of English as a lingua franca (ELF) to become publicly available for download. It is now also the first ELF corpus for which a lemmatized, part-of-speech tagged version can be downloaded. We have taken great care in compiling our corpus of just over 1 million words to meet the qualitative and technical standards of state-of-the-art corpus linguistics in data collection, transcription and encoding.

VOICE XML is released with a considerable amount of corpus documentation and additional materials for users' convenience (see the README file in the download package for details). Similarly, the corpus itself contains a substantial amount of meta-information on speech events and speakers in the corpus header as well as in individual text headers. We strongly encourage all users of VOICE to take note of this documentation and the meta-information when working with the corpus.

We would also like to encourage you to let us know when you publish or present work based on VOICE XML. Please send us a message at

Recommended Citations

For citing VOICE XML, see the subsection "Recommended Citations".

Feedback on the use of VOICE XML is always very welcome. Please contact us via the online contact form (subject: VOICE XML) for any questions, comments or suggestions you may have.