International Corpus of English


VOICE POS XML 2.0 is the first downloadable XML version of VOICE that is annotated with part-of-speech tags and lemmatization. VOICE thus constitutes the first publicly available corpus of spoken ELF to be annotated in this way. VOICE POS XML was published in January 2013 and is made available as a free-of-charge resource under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) License, as is VOICE XML. It is based on the same source data as VOICE 2.0 Online and VOICE 2.0 XML, though there are a number of differences with regard to the encoding scheme (cf. the README file in the download package for a list of these). VOICE POS XML can be downloaded as part of the current download package of VOICE.

This version of VOICE was created by Barbara Seidlhofer (project director), Stefan Majewski, Ruth Osimk-Teasdale, Marie-Luise Pitzl, Michael Radeka (project researchers) and Nora Dorn between June 2009 and January 2013 at the University of Vienna. The tokenisation was conducted by Stefan Majewski and Michael Radeka, lemmatization by Michael Radeka, and part-of-speech tagging by Ruth Osimk-Teasdale and Michael Radeka. Nora Dorn and Marie-Luise Pitzl contributed to the development of parts of the tagging methodology, and Leopold Lippert developed a categorisation scheme for the tagging of spelt items. The conversion to XML and TXT formats was done by Stefan Majewski and Michael Radeka. Henry Widdowson substantially contributed to the POS-tagging process with valuable ideas and helpful comments in many meetings and discussions, and through helping with the editing of numerous texts.

For the tokenisation, lemmatization and part-of-speech tagging of VOICE, available state-of-the-art tools and methodologies were considered and used. However, the unique data required novel combinations and extensions of these, and sometimes the development of a completely new, unconventional methodology. For a more detailed account of the tagging procedures and methodology used for VOICE POS, please consult the VOICE Part-of-Speech Tagging and Lemmatization Manual.

Before working with VOICE POS XML, we strongly recommend consulting the README file in the download package.

We are interested to learn about any work based on VOICE POS XML. Please do send us a message at

Recommended Citations

For citing VOICE POS XML 2.0, see the subsection "Recommended Citations".