vocd: A theoretical and empirical evaluation
Philip M. McCarthy
University of Memphis, USA, pmmccrth{at}memphis.edu
Scott Jarvis
Ohio University, USA
A reliable index of lexical diversity (LD) has remained stubbornly elusive for over 60 years. Meanwhile, researchers in fields as varied as stylistics, neuropathology, language acquisition, and even forensics continue to use flawed LD indices — often ignorant that their results are questionable and in some cases potentially dangerous. Recently, an LD measurement instrument known as vocd has become the virtual tool of the LD trade. In this paper, we report both theoretical and empirical evidence that calls into question the rationale for vocd and also indicates that its reliability is not optimal. Although our evidence shows that vocd's output (D) is a relatively robust indicator of the aggregate probabilities of word occurrences in a text, we show that these probabilities — and thus also D — are affected by text length. Malvern, Richards, Chipere and Durán (2004) acknowledge that D (as calculated by vocd's default method) can be affected by text length, but claim that the effects are not significant for the ranges of text lengths with which they are concerned. In this paper, we explain why D is affected by text length, and demonstrate with an extensive empirical analysis that the effects of text length are significant over certain ranges, which we identify.
References
- Arnaud, P.J.L. 1984: The lexical richness of L2 written productions and the validity of vocabulary tests. In Culhane, T., Klein-Braley, C. and Stevenson, D.K., editors, Practice and problems in language testing: Papers from the International Symposium on Language Testing. Colchester: University of Essex, 14–28.
- Avent, J.R. and Austermann, S. 2003: Reciprocal scaffolding: A context for communication treatment in aphasia. Aphasiology 17: 397–404.
- Bernstein Ratner, N. 1988: Patterns of parental vocabulary selection in speech to very young children. Journal of Child Language 15: 481–92.
- Bernstein Ratner, N. and Silverman, S. 2000: Parental perceptions of children's communicative development at stuttering onset. Journal of Speech, Language, and Hearing Research 43: 1252–63.
- Biber, D. 1988: Variation across speech and writing. Cambridge: Cambridge University Press.
- Biggs, A., Daniel, L., Feather, R.M., Ortleb, E., Rillero, P., Snyder, S.L. and Zike, D. 2003: Glencoe Science: Science level green. New York: Glencoe/McGraw-Hill.
- Bucks, R.S., Singh, S., Cuerden, J.M. and Wilcock, G.K. 2000: Analysis of spontaneous, conversational speech in dementia of Alzheimer type: Evaluation of an objective technique for analyzing lexical performance. Aphasiology 14: 71–91.
- Carrell, P.L. and Monroe, L.B. 1993: Learning styles and composition. The Modern Language Journal 77: 148–62.
- Carroll, J.B. 1964: Language and thought. Englewood Cliffs, NJ: Prentice-Hall.
- Chotlos, J.W. 1944: Studies in language behavior. IV. A statistical and comparative analysis of individual written language samples. Psychological Monographs 56: 75–111.
- Colwell, K., Hiscock, C.K. and Memon, A. 2002: Interviewing techniques and the assessment of statement credibility. Applied Cognitive Psychology 16: 287–300.
- Daller, H., Van Hout, R. and Treffers-Daller, J. 2003: Lexical richness in the spontaneous speech of bilinguals. Applied Linguistics 24: 197–222.
- Dempsey, K.B., McCarthy, P.M. and McNamara, D.S. Using phrasal verbs as an index to distinguish text genres. In Wilson, D. and Sutcliffe, G., editors, Proceedings of the Twentieth International Florida Artificial Intelligence Research Society Conference. Menlo Park, CA: AAAI Press, 217–22.
- Dempsey, K.B., McCarthy, P.M. and McNamara, D.S. 2006: Identifying text genres using phrasal verbs. Proceedings of the 28th annual conference of the Cognitive Science Society, Vancouver, Canada.
- Dickens, C. 1995: A tale of two cities. London: Longman.
- Dugast, D. 1978: Sur quoi se fonde la notion d'étendue théorique du vocabulaire? Le Français Moderne 46: 25–32.
- Dugast, D. 1979: Vocabulaire et stylistique. I Théâtre et dialogue. Travaux de linguistique quantitative. Geneva: Slatkine-Champion.
- Durán, P., Malvern, D., Richards, B. and Chipere, N. 2004: Developmental trends in lexical diversity. Applied Linguistics 25: 220–42.
- Ertmer, D.J., Strong, L.M. and Sadagopan, N. 2002: Beginning to communicate after cochlear implantation: Oral language development in a young child. Journal of Speech, Language, and Hearing Research 46: 328–40.
- Grela, B.G. 2002: Lexical verb diversity in children with Down syndrome. Clinical Linguistics & Phonetics 16: 251–63.
- Guiraud, P. 1960: Problèmes et méthodes de la statistique linguistique. Dordrecht: D. Reidel.
- Harris Wright, H., Silverman, S.W. and Newhoff, M. 2003: Measures of lexical diversity in aphasia. Aphasiology 17: 443–52.
- Heaps, H.S. 1978: Information retrieval: Computational and theoretical aspects. New York: Academic Press.
- Herdan, G. 1960: Quantitative linguistics. London: Butterworth.
- Hess, C.W., Sefton, K.M. and Landry, R.G. 1986: Sample size and type-token ratios for oral language of preschool children. Journal of Speech and Hearing Research 29: 129–34.
- Holmes, D.I. and Singh, S. 1996: A stylometric analysis of conversational speech of aphasic patients. Literary and Linguistic Computing 11: 133–40.
- Holmes, J., Vine, B. and Johnson, G. 1998: Guide to the Wellington Corpus of Spoken New Zealand English. Wellington: School of Linguistics and Applied Language Studies, Victoria University of Wellington.
- Honoré, A. 1979: Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin 7: 172–77.
- Hoover, D. 2003: Another perspective on vocabulary richness. Computers and the Humanities 37: 151–78.
- International Computer Archive of Modern and Medieval English 2000: Lancaster/Oslo/Bergen Corpus of British English (CD-ROM).
- International Computer Archive of Modern and Medieval English 2000: The London-Lund Corpus of Spoken English (CD-ROM).
- Jarvis, S. 2002: Short texts, best-fitting curves and new measures of lexical diversity. Language Testing 19: 57–84.
- Jarvis, S. 2003: Measuring lexical diversity through "exhaustive sampling". Paper presented at the Second Language Research Forum (SLRF), Tucson, AZ.
- Johansson, S., Leech, G. and Goodluck, H. 1978: Manual of information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital computers. Oslo: Department of English, University of Oslo.
- Kucera, H. and Francis, W.N. 1967: Computational analysis of present-day American English. Providence, RI: Brown University Press.
- Linnarud, M. 1986: Lexis in composition: A performance analysis of Swedish learners' written English (Lund Studies in English 74). Malmö: Liber Förlag (CWK Gleerup).
- Louwerse, M.M., McCarthy, P.M., McNamara, D.S. and Graesser, A.C. 2004: Variation in language and cohesion across written and spoken registers. In Forbus, K., Gentner, D. and Regier, T., editors, Proceedings of the twenty-sixth annual conference of the Cognitive Science Society. Cognitive Science Society, 843–48.
- Maas, H.D. 1972: Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift für Literaturwissenschaft und Linguistik 8: 73–79.
- MacWhinney, B. 2000: The CHILDES project: Tools for analyzing talk, Vol. 2: The database, third edition. Mahwah, NJ: Lawrence Erlbaum.
- McCarthy, P.M. 2005: An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). Unpublished PhD dissertation, University of Memphis.
- McKee, G., Malvern, D. and Richards, B. 2000: Measuring vocabulary diversity using dedicated software. Literary and Linguistic Computing 15: 323–37.
- Malvern, D.D. and Richards, B.J. 1997: A new measure of lexical diversity. In Ryan, A. and Wray, A., editors, Evolving models of language. Clevedon: Multilingual Matters, 58–71.
- Malvern, D.D. and Richards, B.J. 2000: Validation of a new measure of lexical diversity. In Beers, M., Bogaerde, B. v.d., Bol, G., de Jong, J. and Rooijmans, C., editors, From sound to sentence: Studies on first language acquisition. Groningen: Centre for Language and Cognition, University of Groningen.
- Malvern, D.D., Richards, B.J., Chipere, N. and Durán, P. 2004: Lexical diversity and language development: Quantification and assessment. Houndmills, Hampshire: Palgrave Macmillan.
- Meara, P. and Bell, H. 2001: P_Lex: A simple and effective way of describing the lexical characteristics of short L2 texts. Prospect 16: 5–19.
- Michéa, R. 1969: Répétition et variété dans l'emploi des mots. Bulletin de la Société de Linguistique de Paris, 1–24.
- Miller, J.F. 1981: Quantifying productive language disorders. In Miller, J.F., editor, Research on child language disorders: A decade of progress. Austin, TX: Pro-Ed, 211–20.
- Orlov, Y.K. 1983: Ein Modell der Häufigkeitsstruktur des Vokabulars. In Guiter, H. and Arapov, M.V., editors, Studies on Zipf's Law. Bochum: Brockmeyer, 154–233.
- Owen, A.J. and Leonard, L.B. 2002: Lexical diversity in the spontaneous speech of children with specific language impairment: Application of D. Journal of Speech and Hearing Research 45: 927–37.
- Ransdell, S. and Wengelin, Å. 2003: Socioeconomic and sociolinguistic predictors of children's L2 and L1 writing quality. Arob@se 1–2, 22–29. http://www.arobase.to/somm.html
- Ratner, N. and Silverman, S. 2000: Parental perceptions of children's communicative development at stuttering onset. Journal of Speech, Language, and Hearing Research 43: 1252–63.
- Read, J. 2000: Assessing vocabulary. Cambridge: Cambridge University Press.
- Richards, B.J. and Malvern, D.D. 1997: Quantifying lexical diversity in the study of language development. New Bulmershe Papers. Reading: University of Reading.
- Richards, B.J. and Malvern, D.D. 1998: A new research tool: Mathematical modeling in the measurement of vocabulary diversity (Award reference no. R000221995). Final Report to the Economic and Social Research Council, Swindon, UK.
- Rietveld, T. and van Hout, R. 1993: Statistical techniques for the study of language and language behavior. Berlin: Mouton de Gruyter.
- Shannon, C.E. 1948: A mathematical theory of communication. Bell System Technical Journal 27: 379–423, 623–56.
- Sichel, H.S. 1975: On a distributive law for word frequencies. Journal of the American Statistical Association 70: 542–47.
- Silverman, S. and Bernstein Ratner, N. 2000: Word frequency distributions and type-token characteristics. Mathematical Scientist 11: 45–72.
- Singh, S. 2001: A pilot study on gender differences in conversational speech on lexical richness measures. Literary and Linguistic Computing 6: 251–64.
- Smith, J.A. and Kelly, C. 2002: Stylistic constancy and change across literary corpora: Using measures of lexical richness to date works. Computers and the Humanities 36: 411–30.
- Somers, H.H. 1966: Statistical methods in literary analysis. In Leeds, J., editor, The computer and literary style. Kent, OH: Kent State University, 128–40.
- Svartvik, J. and Quirk, R. 1980: A Corpus of English Conversation. Lund: CWK Gleerup.
- Swales, J. and Malczewski, B. 2001: Discourse management and new-episode flags in MICASE. In Simpson, R.C. and Swales, J.M., editors, Corpus linguistics in North America. Ann Arbor, MI: University of Michigan Press, 145–64.
- Templin, M. 1957: Certain language skills in children. Minneapolis, MN: University of Minnesota Press.
- Thordardottir, E.T. and Ellis Weismer, S. 2001: High-frequency verbs and verb diversity in the spontaneous speech of school-age children with specific language impairment. International Journal of Language & Communication Disorders 36: 221–44.
- Tweedie, F.J. and Baayen, R.H. 1998: How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities 32: 323–52.
- Van Genderen, J.L. and Lock, B.F. 1977: Testing land-use map accuracy. Photogrammetric Engineering and Remote Sensing 43: 1135–37.
- Wright, H.H., Silverman, S.S. and Newhoff, M. 2003: Measures of lexical diversity in aphasia. Aphasiology 17: 443–52.
- Wu, T. 1993: An accurate computation of the hypergeometric distribution function. ACM Transactions on Mathematical Software 19: 33–43.
- Yule, G.U. 1944: The statistical study of literary vocabulary. Cambridge: Cambridge University Press.
Language Testing, Vol. 24, No. 4, 459-488 (2007)
DOI: 10.1177/0265532207080767
