Martina Bovell explains how ACER is making the computerised assessment of writing possible.
When it comes to the assessment of writing for vocational education and training (VET), there are some common, and sometimes vexed, questions. What is an appropriate task to use? How should it be marked? Who will mark it? How much time will this take? Can I rely on the results? Often, it has been easier to avoid assessments of writing altogether. Yet as industry focuses more sharply on literacy and numeracy skills in the workplace, there is a need for writing assessment tasks that are specifically targeted to the contexts and abilities of learners in the VET sector, backed by a robust reporting system that delivers accurate, valid and reliable information of value to teachers and learners.
The assessment tool
A new writing assessment – part of ACER’s new Core Skills Profile for Adults – meets that need. Delivered online and automatically marked, it provides instant reports that can give summative, formative and diagnostic feedback to learners and teachers. The assessment builds on learners’ intrinsic motivation to use computers and, because of the automated marking system, frees up teacher time.
How the tool was developed
After writing a series of assessment tasks and piloting these with students from a variety of training organisations, we selected two for a full-scale trial. Both tasks addressed the Australian Core Skills Framework (ACSF) Personal and Community domain of communication and were suited to learners working within and towards ACSF Levels 2 to 4. The tasks do not require learners to draw on specialised knowledge, but they do require writing for two different audiences and two different purposes, in line with the ACSF writing focus areas of range, audience and purpose, and register.
A criterion-referenced guide identifies eight marking criteria, each defined by an assessment focus and elaborated by between two and four ordered scoring categories, or ‘subskills.’ Expert markers use exemplar scripts that show the marking standard for each ‘subskill’ to obtain consistent, valid and reliable judgements. To develop the measurement scale, and to build computer scoring models, each trial script is double-blind marked by expert markers and, if the scores are discrepant, adjudicated by a third marker.
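The double-marking step can be pictured with a short Python sketch. It is an illustration only: the function name `resolve_score`, the `tolerance` threshold for what counts as discrepant, and the rules that agreeing marks are averaged and that the adjudicator's mark stands are all assumptions, not a description of ACER's marking procedure.

```python
from typing import Optional


def resolve_score(marker1: int, marker2: int,
                  adjudicator: Optional[int] = None,
                  tolerance: int = 0) -> float:
    """Return one agreed score for a single criterion on a single script.

    `tolerance` is a hypothetical threshold: two marks that differ by more
    than this are treated as discrepant and need a third marker.
    """
    if abs(marker1 - marker2) <= tolerance:
        # The two blind marks agree (within tolerance): use their mean.
        return (marker1 + marker2) / 2
    if adjudicator is None:
        raise ValueError("discrepant marks need an adjudicating third mark")
    # Assumed rule: the adjudicating marker's score stands.
    return float(adjudicator)


# Usage: two markers agree on one criterion, then disagree on another.
print(resolve_score(3, 3))                 # marks agree
print(resolve_score(2, 4, adjudicator=3))  # third marker decides
```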
Drawing on the analytic power of modern computing and some smart programming, the parameters for machine marking are developed from a ‘training set.’ The training set consists of more than 300 scripts, all addressing the same writing task, and their accompanying set of finely calibrated human scores. The machine builds a scoring model, using algorithms derived from the training set, to score new, unseen essays written on the same topic. In the case of the two ACER writing assessments, two scoring models have been developed, one for each task.
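To make the idea of a training set concrete, here is a deliberately tiny Python sketch. It fits a straight line from a single toy feature (the number of distinct words in a script) to the human score, then applies that line to an unseen script. The feature, the function names and the three-script training set are all invented for illustration; a real scoring engine uses many more features and far more sophisticated algorithms, and nothing here describes the models built for the ACER assessments.

```python
def feature(script: str) -> float:
    """Toy feature: the number of distinct words in the script."""
    return float(len(set(script.lower().split())))


def fit(training_set):
    """Fit score = slope * feature + intercept by ordinary least squares.

    `training_set` is a list of (script, human_score) pairs.
    """
    xs = [feature(script) for script, _ in training_set]
    ys = [float(score) for _, score in training_set]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("training scripts must differ on the feature")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, script: str) -> float:
    """Score a new, unseen script with the fitted model."""
    slope, intercept = model
    return slope * feature(script) + intercept


# Invented three-script training set; a real one has hundreds of scripts.
model = fit([("a", 1), ("a b", 2), ("a b c", 3)])
print(predict(model, "w x y z"))
```

The point of the sketch is the workflow rather than the model: human scores calibrate the parameters once, and the fitted model is then reused on scripts the markers have never seen.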
According to recent research, for example by Lawrence Rudner, Veronica Garcia and Catherine Welch, and by Mark Shermis and Ben Hamner, there is increasing evidence that machine scoring can replicate human markers’ scores at least as well as human markers can replicate each other’s scores. Even so, there are concerns about the effect on teachers and test takers. As researchers such as Joanne Drechsel and Sara Weigle have noted, if machine scoring focuses only on the mechanics of writing (syntax, spelling and punctuation) at the expense of the cognition of meaning making, this may have washback to teaching and test taking situations. For this reason, we’ve investigated the quality of score replication of the two ACER writing assessments.
We found that, on both tasks, when all eight criteria scores were summed, correlations between each of the expert human markers and the machine were as high as the correlation between the two human markers. There was also little difference in the quality of machine scoring across tasks.
Table 1: Correlations between human markers and machine scoring
|Markers compared||N = 334||N = 332|
|Marker 1 and 2||0.88||0.85|
|Marker 1 and Machine||0.89||0.86|
|Marker 2 and Machine||0.89||0.88|
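For readers who want to see how figures like those in Table 1 are obtained, the following Python sketch sums each script's eight criteria scores and correlates the totals from two markers. It assumes the Pearson product-moment correlation, which the article does not name, and the scores are invented for illustration; they are not trial data.

```python
from math import sqrt


def total_score(criteria_scores):
    """Sum the eight criteria scores awarded to one script."""
    return sum(criteria_scores)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of totals."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all totals are equal")
    return cov / sqrt(var_x * var_y)


# Invented scores for three scripts, eight criteria each.
marker1 = [[2, 3, 2, 1, 2, 3, 2, 2], [1, 1, 2, 1, 1, 2, 1, 1], [3, 3, 3, 2, 3, 3, 3, 3]]
machine = [[2, 3, 2, 2, 2, 3, 2, 2], [1, 2, 2, 1, 1, 1, 1, 1], [3, 3, 2, 2, 3, 3, 3, 3]]
r = pearson([total_score(s) for s in marker1],
            [total_score(s) for s in machine])
print(round(r, 2))
```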
On all but one of the eight criteria scores across both tasks, machine scoring replicates the human markers’ scores at least as well as human markers can replicate each other’s scores. Interestingly, machine scoring did not score any better than human scoring on criteria that address the mechanics of writing. In fact, the machine appears to be more reliable than human markers when scoring criteria that address audience, purpose and meaning making.
Is the machine infallible?
There are rare cases when writing cannot be scored by the machine, such as when a piece of writing consists of only a few words, contains overwhelmingly poor spelling or many foreign words, lacks punctuation or is off topic. In such instances, there is either not enough writing provided for the computer to score, or the content is so unlike the scripts in the training set that the computer cannot apply the scoring model. Since a copy of each student’s writing is always available in the reporting system, however, nonscored scripts can be reviewed by the teacher.
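A pre-check of this kind might look something like the Python sketch below. The thresholds `min_words` and `max_unknown_ratio`, the word-list lookup and the function name are hypothetical, and the sketch makes no attempt to detect off-topic writing; it simply shows how a script could be flagged for teacher review instead of being scored.

```python
import re


def unscorable_reasons(script, vocabulary, min_words=20, max_unknown_ratio=0.5):
    """Return the reasons a script should go to a teacher, or [] if none.

    `vocabulary` is a set of known lower-case words; both thresholds are
    illustrative assumptions, not values used by any real scoring engine.
    """
    words = re.findall(r"[a-z']+", script.lower())
    reasons = []
    if len(words) < min_words:
        reasons.append("too short")
    if words:
        unknown = sum(1 for word in words if word not in vocabulary)
        if unknown / len(words) > max_unknown_ratio:
            # Covers both heavy misspelling and many foreign words.
            reasons.append("too many unrecognised words")
    if not re.search(r"[.!?,;:]", script):
        reasons.append("no punctuation")
    return reasons


# Usage: a two-word response is flagged twice over.
print(unscorable_reasons("hello there", {"hello", "there"}))
```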
The individual student report provides a graphical display of the student’s scale score on the assessment continuum, mapped to ACSF levels. Scores for all writing subskills assessed on both tasks are overlaid on the assessment continuum and shown numerically. Strengths and weaknesses are highlighted, and teachers gain rich information for developing learning plans for individual students.
The student response report contains writing submitted for scoring and can be used by both the teacher and student as a basis for discussion.
The group report, in graph form, enables teachers to see at a glance the achievements of all students against the ACSF and writing subskills, and to use this to plan learning that targets the whole group.
The computerised assessment of writing aims to provide you with an accurate and efficient way to assess literacy skills in the VET sector to support you and your students in teaching and learning.
Drechsel, J. (1999). Writing into silence: Losing voice with writing assessment technology. Teaching English in the Two-Year College. 26: 380–87.
Rudner, L.M., Garcia, V. & Welch, C. (2006). An evaluation of the IntelliMetric essay scoring system. Journal of Technology, Learning, and Assessment. 4(4): 3–21.
Shermis, M.D. & Hamner, B. (2012). Contrasting state-of-the-art automated scoring of essays: Analysis. Available at <http://dl.dropbox.com/u/44416236/NCME%202012%20Paper3_29_12.pdf>
Weigle, S.C. (2002). Assessing Writing. Cambridge: Cambridge University Press.