Evaluating rater responses to an online training program for L2 writing assessment
Catherine Elder
University of Melbourne, caelder{at}unimelb.edu.au
Gary Barkhuizen
University of Auckland
Ute Knoch
University of Auckland
Janet von Randow
University of Auckland
The use of online rater self-training is growing in popularity and has obvious practical benefits: it facilitates access to training materials and rating samples, and it allows raters to reorient themselves to the rating scale and self-monitor their behaviour at their own convenience. However, there has thus far been little research into rater attitudes to training via this modality or into its effectiveness in enhancing levels of inter- and intra-rater agreement.
The current study explores these issues in relation to an analytically scored academic writing task designed to diagnose undergraduates' English learning needs. Eight ESL raters scored a number of pre-rated benchmark writing samples online and received immediate feedback in the form of a discrepancy score indicating the gap between their own ratings on the various categories of the rating scale and the official ratings assigned to the benchmark writing samples.
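The discrepancy feedback described above can be illustrated with a minimal sketch. The abstract does not specify how the score was computed, so everything here is an assumption made for illustration: the function name `discrepancy_scores`, the category names, the score values, and the choice of a signed difference (rater minus benchmark).

```python
from typing import Dict


def discrepancy_scores(rater: Dict[str, int], benchmark: Dict[str, int]) -> Dict[str, int]:
    """Return the signed gap (rater minus benchmark) for each rating category.

    A positive value means the rater scored more leniently than the benchmark,
    a negative value more harshly. The sign convention is an assumption.
    """
    missing = set(benchmark) - set(rater)
    if missing:
        raise ValueError(f"rater has not scored: {sorted(missing)}")
    return {category: rater[category] - benchmark[category] for category in benchmark}


# Hypothetical categories and scores for one benchmark writing sample.
benchmark_rating = {"content": 5, "organisation": 6, "grammar": 5}
rater_rating = {"content": 6, "organisation": 6, "grammar": 4}

feedback = discrepancy_scores(rater_rating, benchmark_rating)
for category, gap in feedback.items():
    print(f"{category}: {gap:+d}")
```

A signed difference keeps the direction of the gap visible, which matters if the feedback is meant to show a rater whether they are harsher or more lenient than the benchmark on a given category.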
A batch of writing samples was rated twice by each rater (before and after participating in the online training), and multifaceted Rasch analyses were used to compare levels of rater agreement and rater bias on each analytic rating category. Raters' views regarding the effectiveness of the training were also canvassed.
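For orientation, the many-facet Rasch model is commonly written in the rating-scale form below. The abstract does not state the exact model specification used in the study, so the choice of facets (writer, rater, rating category) and the symbols are assumptions made for illustration.

```latex
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_k
```

Here $P_{nijk}$ is the probability that writer $n$ receives score $k$ from rater $j$ on rating category $i$, $P_{nij(k-1)}$ is the probability of score $k-1$, $B_n$ is the ability of writer $n$, $C_j$ is the severity of rater $j$, $D_i$ is the difficulty of category $i$, and $F_k$ is the difficulty of reaching score $k$ relative to $k-1$. Rater bias on a category is typically examined as an interaction between the rater and category facets.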
While findings revealed limited overall gains in reliability, there was considerable individual variation in receptiveness to the training input. The paper concludes with suggestions for refining the online training program and for further research into factors influencing rater responsiveness.
Language Testing, Vol. 24, No. 1, 37-64 (2007)
DOI: 10.1177/0265532207071511

This article has been cited by other articles:

U. Knoch. Diagnostic assessment of writing: A comparison of two rating scales. Language Testing, April 1, 2009; 26(2): 275-304.