Language Testing

Assessment criteria in a large-scale writing test: what do they really mean to the raters?

Tom Lumley

Hong Kong Polytechnic University, egluml{at}polyu.edu.hk

The process of rating written language performance is still not well understood, despite a body of work investigating this issue over the last decade or so (e.g., Cumming, 1990; Huot, 1990; Vaughan, 1991; Weigle, 1994a; Milanovic et al., 1996). The purpose of this study is to investigate the process by which raters of texts written by ESL learners make their scoring decisions using an analytic rating scale designed for multiple test forms. The context is the Special Test of English Proficiency (STEP), which is used by the Australian government to assist in immigration decisions. Four trained, experienced and reliable STEP raters took part in the study, providing scores for two sets of 24 texts. The first set was scored as in an operational rating session. Raters then provided think-aloud protocols describing the rating process as they rated the second set. A coding scheme developed to describe the think-aloud data allowed analysis of the sequence of rating, the interpretations the raters made of the scoring categories in the analytic rating scale, and the difficulties raters faced in rating.

Data show that although raters follow a fundamentally similar rating process in three stages, the relationship between scale contents and text quality remains obscure. The study demonstrates that the task raters face is to reconcile their impression of the text, the specific features of the text, and the wordings of the rating scale, thereby producing a set of scores. The rules and the scale do not cover all eventualities, forcing the raters to develop various strategies to help them cope with problematic aspects of the rating process. In doing this they try to remain close to the scale, but are also heavily influenced by the complex intuitive impression of the text obtained when they first read it. This sets up a tension between the rules and the intuitive impression, which raters resolve by what is ultimately a somewhat indeterminate process. In spite of this tension and indeterminacy, rating can succeed in yielding consistent scores provided raters are supported by adequate training, with additional guidelines to assist them in dealing with problems. Rating requires such constraining procedures to produce reliable measurement.

Language Testing, Vol. 19, No. 3, 246-276 (2002)
DOI: 10.1191/0265532202lt230oa

