Wednesday, July 08, 2015

Automatic Essay Scoring – kick-ass assessment?

Met Pierre Dillenbourg, from the Swiss Federal Institute of Technology, who tells me that their MOOCs, in French and English, are over the one million registration mark. His point was that this is not about MOOCs per se but the influence that MOOCs are having on online learning in general. They are producing innovations and research that work at scale, and progress we can build on. One of the MOOC innovations, given the huge numbers and the impossibility of providing real teacher grading and feedback, is automatic essay grading.
Avoid hype and hysteria
First, we need to avoid the claims that this software is already mature enough to replace all forms of traditional assessment. We also need to avoid the hysterical reactions from traditionalists who want to put a stake through the heart of anyone who suggests that it is useful and beneficial. There’s a middle way.
Not mimicking teachers
We need to start with some perspective on what this technology does. The trick here is not to see the AI as ‘being human’ and ‘reading’ the essay with real understanding. It does not. We didn’t develop flight by copying birds and building wings that flap aeroplanes into the air. We fly because we invented brilliant technology that flew – increasingly automatically. This is not about doing things exactly as humans do them. Machine grading does not need the in-depth semantic understanding that humans have to make progress in grading and feedback. Google Translate and speech recognition systems do a good job, and they’re getting better, on the basis of little semantic understanding but lots of smart algorithms and data. There are many ways to kick ass in assessment.
Humans are good at many things, but they’re also bad at many things. Simple calculators can calculate faster than any human. Even a simple chess software programme will beat the great majority of humans, and the same goes for many other games. A machine beat the two world champions of America’s most successful quiz show. Google’s AI can search better than any human. Machines can outperform all humans in terms of mechanical precision, speed, strength and endurance – that’s why they’re commonplace in manufacturing.
The software is not perfect but neither are humans. Human performance falls when marking large numbers of essays: markers make mistakes and have biases based on names and gender, cognitive biases, as well as biases on what is acceptable in terms of critiques and creativity. Progress has been made and more progress will come.
So what’s the deal?
This is not about replacing teacher assessment, it’s about automating some of that work to allow teachers to teach and provide more targeted, constructive feedback and support. It’s about optimising teachers’ time.
It works like this. The software takes lots of real essays, along with their human-marked grades, and looks for features within the essays at each grade that distinguish them from essays at other grades. In this sense, the software is using human traits and outputs and tries to mimic them when presented with new cases. The features the software needs to pick up on vary but can include missing words or phrases and so on. So it is NOT the machine or algorithms on their own doing the work, it’s a process of looking at what human experts did when they marked lots of essays.
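To make that concrete, here is a minimal sketch in Python. It is a toy built on my own assumptions, not how any real scoring product works: the only feature is word counts, the function names (train, predict) are made up for illustration, and the four training essays are invented. The point is simply that the grades come from humans and the software looks for what essays at each grade have in common.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def train(graded_essays):
    """Build one word-count profile per human-assigned grade.

    graded_essays is a list of (essay_text, human_grade) pairs.
    """
    profiles = defaultdict(Counter)
    for text, grade in graded_essays:
        profiles[grade].update(tokenize(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(profiles, essay_text):
    """Return the grade whose profile the new essay most resembles."""
    words = Counter(tokenize(essay_text))
    return max(profiles, key=lambda grade: cosine(words, profiles[grade]))


# Invented training data: essays already marked by human graders.
training = [
    ("Photosynthesis converts light energy into glucose using chlorophyll", "A"),
    ("Chlorophyll absorbs light and the plant stores energy as glucose", "A"),
    ("Plants grow in the sun", "C"),
    ("The sun helps plants grow big", "C"),
]
profiles = train(training)
print(predict(profiles, "Chlorophyll captures light energy"))  # prints A
print(predict(profiles, "Plants need sun to grow"))            # prints C
```

With four essays this is a party trick. With hundreds or thousands of consistently marked essays on one topic, and far richer features, the same basic idea starts to earn its keep.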
Machine grading gives you a score but it also gives you a probability, namely a confidence rating. This is important, as you can use this to retrain the algorithm on essays scored with low confidence. Automatic essay scoring (AES) also tries to give scores for each dimension in the scoring rubric, not just an overall grade.
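Here is a rough Python sketch of how a confidence rating might be used. Everything in it is an assumption for illustration: the raw per-grade scores, the 0.7 threshold and the function names are all hypothetical, and real systems will calculate confidence in their own ways. The idea is only that low-confidence essays get routed to a human, and those human marks can then go back in as fresh training data.

```python
def grade_with_confidence(scores):
    """Turn non-negative per-grade scores into (best_grade, confidence).

    Confidence here is simply the best grade's share of the total score.
    """
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / total


def split_by_confidence(scored_essays, threshold=0.7):
    """Separate essays the machine is sure about from those needing a human.

    The threshold of 0.7 is an arbitrary illustrative value.
    """
    accepted, needs_review = [], []
    for essay_id, scores in scored_essays.items():
        grade, confidence = grade_with_confidence(scores)
        target = accepted if confidence >= threshold else needs_review
        target.append((essay_id, grade, confidence))
    return accepted, needs_review


# Invented scores for two essays.
scored = {
    "essay-1": {"A": 8, "B": 1, "C": 1},  # clear-cut
    "essay-2": {"A": 4, "B": 5, "C": 1},  # close call
}
accepted, needs_review = split_by_confidence(scored)
print(accepted)      # [('essay-1', 'A', 0.8)]
print(needs_review)  # [('essay-2', 'B', 0.5)]
```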
Numbers game
This is a numbers game. It won’t work in small classes, as you need hundreds, possibly many more, essays on that one topic, to train the software. If you run a specific course, with obscure content, to relatively small numbers of students, this road is not for you. If, however, you’re teaching a course where the cumulative number of students, year after year, is in the many hundreds, or across institutions in the many thousands, or even in MOOCs in the hundreds of thousands, it starts to make sense. Sure it will make mistakes but is anyone seriously claiming that humans don’t make mistakes in essay grading? When researched, it is clear that they do. Machine marking even mimics the mistakes that the human markers make when grading.
Style & creativity
It is wrong to say that AI cannot spot good style or good writing. It is possible to measure all sorts of aspects of syntactic style, such as sentence length, identifying clichés, use of connectives, good phrases, relevant and subtle vocabulary and so on. If these have been identified by human graders, they can potentially be spotted by AI. More than this, Natural Language Processing techniques such as Latent Semantic Analysis can identify relevant language, even synonyms, with Word Sense Disambiguation working out which meaning is intended when a word has several. In other words, this is more than just lots of string matching. That’s not to say that it can do all of these tasks well but it can do some of them well, after getting a large amount of aggregated data from real essays, consistently graded by trained professionals. In a sense, style is already being checked in Word, as spelling is corrected, grammatical errors caught, common errors such as commonly confused homonyms highlighted and so on. But to be frank, style is not often a key learning objective, so let’s not shoot the tennis player for being bad at golf.
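To show how measurable some of this is, here is a small Python sketch that pulls three crude style features out of a piece of text: average sentence length, number of connectives and vocabulary variety. The connective list and the feature names are my own assumptions, chosen for illustration, and this is nowhere near Latent Semantic Analysis or Word Sense Disambiguation. It only shows that surface style can be counted.

```python
import re

# A short, illustrative list of connectives; a real list would be longer.
CONNECTIVES = {
    "however", "therefore", "moreover",
    "furthermore", "consequently", "nevertheless",
}


def style_features(text):
    """Return a few simple, countable style measures for an essay."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        # Mean number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        # How many words are connectives from the list above.
        "connectives": sum(1 for w in words if w in CONNECTIVES),
        # Distinct words divided by total words (vocabulary variety).
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }


features = style_features("The cat sat. However, the cat left.")
print(features["avg_sentence_length"])  # 3.5
print(features["connectives"])          # 1
```

Numbers like these mean nothing on their own. They become useful only when set against the grades that human markers gave to essays with similar numbers.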
It would be fair to say that machine grading systems do not mark for creativity but this is a notoriously difficult term to define and human graders may well be hugely variable in making this judgement. If you’re teaching a ‘creative writing’ course, fair enough, it isn’t going to help, so don’t use automatic essay grading. But unless you’re being explicit about being creative, whatever that means, it’s unfair to blame the software for lacking creative judgement. Again, unless you as a human assessor know what ‘creativity’ means, and can define it to a level that gives AI a chance to spot it, don’t blame the software.
To what problem(s) is this a solution?
1. Frees up teacher time
What do teachers see as their toughest chore? Marking. This frees up their time to teach, not mark. One problem that it can address is overwork by teachers, the product of the massification of education, with ever greater numbers taking courses and stretched teachers.
2. First pass quality control
This massification has also produced lots of students who submit thin, barely revised essays that clearly only meet the word count. These need to be cut off at the pass and immediate feedback given that the work is not yet fit to submit. This helps the students and doesn’t waste the time of the teacher. For teachers, it can also be used as a ‘first pass’ tool. Take some of the spade work out of assessing essays by letting the software pick up on the more obvious errors so that you can focus on your personalised input and more detailed feedback on specific points.
3. Instant feedback
Students want instant feedback. They don’t want to wait days, and let’s face it, in reality it’s more likely to be weeks, before getting back a grade and a few comments. Grading thus becomes feedback in the process of learning. We need to be clear that this improves learning. No teacher is EVER going to go through half a dozen iterations for every student essay submitted. Machine marking will do this with ease.
4. Second reader
This idea has been around for a long time and sees machine marking as a check on human assessment. It’s like having another assessor at your side to check that you’re being consistent and on your game. If the two differ, a second human grades the essay. Alternatively, the teacher can look at the essay again and consider a regrading, or not.
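The second reader check is simple enough to sketch in a few lines of Python. The marks, the essay IDs and the tolerance of one mark are all invented for illustration, and the function name is hypothetical. In practice you would have to decide for yourself how big a gap is worth a second look.

```python
def flag_for_second_reader(human_marks, machine_marks, tolerance=1):
    """List essays where human and machine marks differ by more than tolerance.

    Both arguments map essay IDs to numeric marks. Essays missing a machine
    mark are skipped. The default tolerance is an arbitrary example value.
    """
    return [
        essay_id
        for essay_id, human in human_marks.items()
        if essay_id in machine_marks
        and abs(human - machine_marks[essay_id]) > tolerance
    ]


# Invented marks out of 10.
human = {"essay-1": 7, "essay-2": 4, "essay-3": 9}
machine = {"essay-1": 7, "essay-2": 8, "essay-3": 8}
print(flag_for_second_reader(human, machine))  # ['essay-2']
```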
5. Focus teacher effort
A first pass machine marking run can allow teachers to focus more on students who need help, rather than on students who perform well and need less support.
6. Supplement to peer assessment
Those who already use peer assessment are well up the ladder as machine assessment seems like a natural step. The trouble with peer grading and feedback, is that it can be unreliable and suffers from grade inflation, as much as 25%. If machine grading can lower this, then let’s take it seriously.
7. Anti-plagiarism
Interestingly, given that plagiarism is the genie that is well and truly out of the bottle, I see these techniques as identifying plagiarised essays, or at least essays not written by the student. Comparisons across the essays submitted by one student may reveal inconsistencies that need further investigation.
Student reactions
Students may feel short-changed with machine marking but many already feel that way, with late, often barely commented feedback on essays submitted. Most students don’t gain much from the common lines from real teachers, such as ‘Needs more detail…’, ‘Could be clearer…’ and so on. They want specific, helpful and constructive feedback, not vague, well-worn phrases. And if you’ve never seen patronising, condescending, rude and illegible comments, I doubt that you’ve submitted many essays! This is why it is important to couch the feedback from machine learning with reasonable explanations about its supportive role in improving performance in essay writing. It’s not there to replace teachers, it’s there to help students do better. So it is also important to prime students in terms of expectations – lowering them, and being honest about what you are doing.
Some final reflections
We have to be careful here. I’m conscious of the work done by Roger Schank and others about the sheer difficulty of interpreting ‘meaning’ from the written and spoken word. Nevertheless, the world has changed and AI is once again on the march. This time it may even take flight, fuelled by large amounts of data. Remember also, that there is a world of difference between grading and feedback.
One major issue arises when you reflect in detail on AI grading. Why do we rely so much on essays as a form of assessment? One has to conclude that, like lectures, they have become an easy default. They’re easy to set, difficult to mark. I’d much rather look to a range of appropriate assessment methods that are designed around the type of learning objectives and competences you want your students to acquire. I’m not saying that essays should never be used, only that they are over-used. It’s institutionalised teaching, not optimal teaching or learning. I’d be pleased to see these techniques applied to more than essays: to images, video and performance within virtual reality simulations, where I’ve seen some remarkable assessment of real competences trained, assessed and even certified within the simulation.

Anonymous said...

hi Donald,

as with any change the battle is over how much control various parties have; i think teachers are rightly concerned that automated marking will restrict rather than help their assessment judgements nevermind considering the real threat of management adding such systems to their control arsenal

then of course the usual question of how much training will teachers get to understand automated marking systems before being "asked" to use them?

the rhetoric i have seen claims the inevitability of machine assessment yet will teachers be allowed time to make informed decisions?

i wrote a bit about such claims in language testing & assessment -

you make a good point about reliance on essays in general


9:01 PM  
Blogger Donald Clark said...

Fair comment and as I say in my piece, I believe that this should be used by teachers. However, the main objective, if these systems evolve, must always be to improve the lot of the learner. It is pedagogically hopeless to leave students waiting for weeks for often sparse feedback and a letter grade. I also think that teachers can't complain about workload on the one hand and then reject sensible efforts to reduce that workload. I think we essentially agree, although I'm ultimately on the side of the learner in all this and think that in HE, and often in schools, the current state of play is far from satisfactory. Thanks for your comment.

10:16 PM  
Blogger Brian Mulligan said...

Hi Donald.

Have you a reference for the claim that peer assessment leads to grade inflation?

"6. Supplement to peer assessment
Those who already use peer assessment are well up the ladder as machine assessment seems like a natural step. The trouble with peer grading and feedback, is that it can be unreliable and suffers from grade inflation, as much as 25%. If machine grading can lower this, then let’s take it seriously."


12:25 PM  
Blogger Donald Clark said...

When rating one’s peers there is a tendency to inflate marks and to end-pile on the positive end of the scale (McCarty and Shrum 1997, 2000). This effect is exacerbated by students’ reluctance to assign low grades to their classmates (Ballantyne, Hughes, and Mylonas 2002; Brindley and Scoffield 1998). The result is that many receive the same mark even when differences in performance actually exist (Ovadia 2004).

McCarty, J.A., and L.J. Shrum. 1997. Measuring the importance of positive constructs: A test of alternative rating procedures. Marketing Letters 8, no. 2: 239–50.
McCarty, J.A., and L.J. Shrum. 2000. The measurement of personal values in survey research. Public Opinion Quarterly 64, no. 3: 271–98.
Ballantyne, R., K. Hughes, and A. Mylonas. 2002. Developing procedures for implementing peer assessment in large classes using an action research process. Assessment & Evaluation in Higher Education 27, no. 5: 427–41.
Brindley, C., and S. Scoffield. 1998. Peer assessment in undergraduate programmes. Teaching in Higher Education 3, no. 1: 79–89.
Ovadia, S. 2004. Ratings and rankings: Reconsidering the structure of values and their measurement. International Journal of Social Research Methodology 7, no. 5: 403–14.

12:43 PM  
