Vienna-Oxford International Corpus of English (VOICE)

Using VOICE-Online

9 Additional information

9.1 Differences between VOICE style and the VOICE Transcription Conventions [2.1]

The speech events included in VOICE were transcribed according to the VOICE Transcription Conventions [2.1] in a txt-format in VoiceScribe. Subsequently, the transcripts were transferred into a TEI-based XML-format and rendered into HTML with a set of XSL-transformation stylesheets for online display. VOICE style, which is one of the styles used for displaying search results and texts in VOICE Online, generally corresponds to the mark-up and features specified in the VOICE Transcription Conventions [2.1]. However, there are some minor differences, which are listed below:

Utterance identifier: In VOICE style, each speaker output is preceded by an utterance identifier which locates the utterance in the corpus and/or text. These utterance numbers are an additional feature of VOICE style and VOICE Online. They are not specified in the VOICE Transcription Conventions [2.1].
Use of colours: While the VOICE Transcription Conventions [2.1] only specify a colour for overlapping speech (blue), most tags and mark-up features are indicated in different colours in VOICE style (similar to the coloured highlighting used in VoiceScribe).
Tags for non-English speech: The language abbreviations in the tags for non-English speech used in VOICE style conform to ISO 639-2, ‘Codes for the representation of names of languages’, i.e. they consist of 3-letter codes (e.g., fre for French, ger for German, and ita for Italian) instead of the 2-letter codes used in the VOICE Transcription Conventions [2.1]. This change was implemented because the 3-letter codes of the ISO 639-2 list are based on the English words for the respective languages (e.g., ger for German and spa for Spanish instead of de for Deutsch and es for Español) and were thus considered to be more transparent for an international audience. ISO 639-2 is also more comprehensive than ISO 639-1, on which the 2-letter codes are based, and includes more languages.
Tag for reading aloud: The speaking mode <reading_aloud> </reading_aloud> is written with an underscore in VOICE Online but with a space in place of the underscore in the VOICE Transcription Conventions [2.1].
Whitespaces: Due to the automatic conversion of transcripts into XML and the automatic online representation via XSLT stylesheets, additional whitespaces may occasionally occur in transcripts in VOICE Online (e.g. between an emphasized word and an intonation marker which follows the word). Since these whitespaces NEVER affect word boundaries (i.e. they never occur within words), they do not alter the content of the transcript and thus do not affect searches.
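
The change from 2-letter to 3-letter language codes described above can be illustrated with a short Python sketch. It is not part of the VOICE tooling: the function name to_voice_style and the mapping table are hypothetical, the table covers only the four languages mentioned above, and the 2-letter tag shapes (e.g. <L1de>) are assumed to mirror the <L1...> and <LN...> tags listed in section 9.2.

```python
import re

# ISO 639-1 (2-letter) to ISO 639-2 (3-letter) codes.
# Only the languages named in the text; a real table would be far longer.
ISO_639_1_TO_2 = {"fr": "fre", "de": "ger", "it": "ita", "es": "spa"}

# Opening or closing tag for non-English speech with a 2-letter code,
# e.g. <L1de>, </L1de>, <LNfr>, </LNfr> (assumed tag shapes).
LANG_TAG = re.compile(r"<(/?)(L1|LN)([a-z]{2})>")


def to_voice_style(text):
    """Rewrite 2-letter language tags as 3-letter ones; unknown codes stay as-is."""
    def replace(match):
        slash, kind, code = match.groups()
        return f"<{slash}{kind}{ISO_639_1_TO_2.get(code, code)}>"
    return LANG_TAG.sub(replace, text)


print(to_voice_style("<L1de> danke </L1de>"))  # <L1ger> danke </L1ger>
```

Tags that already carry a 3-letter code do not match the pattern and are left untouched, so running the function twice gives the same result as running it once.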

9.2 List of tags and mark-up

This list provides a short overview of the different tags and mark-up features used in VOICE transcripts. For more detailed information see the VOICE Transcription Conventions [2.1].

S1:, S2:, S3:, ...    Identified speakers
SS:    Group of speakers
SX:, SX-f:, SX-m:, SX-1:, ...    Speakers not identified
text?, text.    Intonation
TEXT    Emphasis
(.), (1), (3), ...    Pauses
<1> </1>, <2> </2>, ...    Overlaps
=    Other-continuation
te:xt    Lengthening
tex-    Word fragments
<@> </@>    Laughingly spoken
(text)    Uncertain transcription
<pvc> </pvc>    Pronunciation variations and coinages
<ono> </ono>    Onomatopoeic noises
<L1scc> </L1scc>, <LNfre> </LNfre>, ...    Non-English speech
<spel> </spel>    Spelling out
<fast> </fast>, <whispering> </whispering>, <imitating> </imitating>, ...    Speaking modes
hh, hhh    Breath
<coughs>, <applauds>, <clears throat>, ...    Speaker noises
[S1], [org1], [place1], [last name1], ...    Anonymization
{parallel conversation between S1 and S2 starts}, {telephone rings}, {S1 leaves the room}, ...    Contextual events
<un> xxxx </un>    Unintelligible speech
(gap 00:02:23), (nrec 00:50:00), ...    Transcription borders, untranscribed portions

9.3 Browser recommendations

Browsers which are currently available differ in their performance, their compliance with standards, and their rendering of layouts. Great care has been taken to ensure that VOICE Online adheres to the standards and requirements of the most popular browsers. Nevertheless, we cannot guarantee that the user experience is equally satisfactory with every browser.

VOICE Online should run quickly and reliably in Google Chrome, Mozilla Firefox, and Apple Safari. Minor performance issues concerning the display of whole corpus texts may be encountered with current versions of MS Internet Explorer, depending on the speed of the client computer. If Internet Explorer asks whether the browser should ‘stop running this script’, choose ‘No’ or try one of the browsers named above. All of them can be downloaded from their respective websites.

9.4 Word boundaries and word counting

For the purposes of counting the number of words in VOICE transcripts, words are defined as strings bounded by spaces and/or apostrophes.

The wordcount for VOICE transcripts treats every word and word-like form within an utterance as an individual word, i.e. everything between (but excluding) the speaker ID (e.g. S1:, SX-f:, or SX-13:) and the end of the utterance concerned. The following forms are thus also counted as words: word fragments, repeated words, lexicalized reduced forms, uncertain transcription, unintelligible parts, acronyms, pvcs, spelled items (the letters within one <spel> </spel> tag, e.g. <spel> w o r d </spel>, count as one word), anonymized items (one anonymized item such as [last name1] counts as one word), non-English speech, onomatopoeic sounds, sequences of laughter, and discourse markers.

Hyphenated words (e.g., university-level, semi-final, cross-exam) are counted as one word. Words separated by apostrophes (e.g., he’s, we’ve, doesn’t, university’s) are counted as two words.

All mark-up features, both within and outside the boundaries of utterances, are NOT included in the wordcount. This means that tags, speaker IDs, indication of transcription borders, gaps, and contextual information as well as pauses, breath, speaking modes, speaker noises, and non-verbal feedback are NOT counted as words.
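
As a rough illustration of the counting rules in this section, the following Python sketch counts the words in a single utterance given in plain-text transcription form. It is a simplified approximation under stated assumptions, not the procedure actually used for VOICE: the function name count_words and all regular expressions are hypothetical, each input is assumed to be one utterance starting with a speaker ID, and non-verbal feedback is not handled.

```python
import re

# Hypothetical patterns, modelled on the mark-up listed in section 9.2.
SPEAKER_ID = re.compile(r"^\s*S(?:\d+|S|X(?:-(?:[fm]|\d+))?):\s*")  # S1:, SS:, SX-f:
SPELLED = re.compile(r"<spel>.*?</spel>")       # spelled item, counts as one word
ANONYMIZED = re.compile(r"\[[^\]]*\]")          # [last name1], counts as one word
CONTEXT = re.compile(r"\{[^}]*\}")              # {telephone rings}, not counted
BORDER = re.compile(r"\((?:gap|nrec)\s+[\d:]+\)")  # (gap 00:02:23), not counted
PAUSE = re.compile(r"\((?:\.|\d+)\)")           # (.), (1), (3), not counted
TAG = re.compile(r"<[^>]*>")                    # any remaining tag, not counted
BREATH = re.compile(r"^h{2,}$")                 # hh, hhh, not counted


def count_words(utterance):
    """Count words in one utterance according to the simplified rules above."""
    text = SPEAKER_ID.sub("", utterance, count=1)
    text = SPELLED.sub(" SPELLED ", text)       # one placeholder per spelled item
    text = ANONYMIZED.sub(" ANON ", text)       # one placeholder per anonymized item
    for pattern in (CONTEXT, BORDER, PAUSE, TAG):
        text = pattern.sub(" ", text)
    text = re.sub(r"[()=]", " ", text)          # keep words in (uncertain) brackets
    tokens = re.split(r"[\s'’]+", text)         # spaces and apostrophes bound words
    return sum(
        1
        for token in tokens
        if token
        and not BREATH.match(token)
        and any(ch.isalnum() or ch == "@" for ch in token)
    )


example = "S1: he's at <spel> b b c </spel> with [last name1] (.) hh university-level"
print(count_words(example))  # 7
```

In the example, he's contributes two words, the spelled item and the anonymized item one word each, and university-level one word, while the speaker ID, the pause, and the breath are skipped.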

