Thursday, June 27, 2024

This blind trial paper raises some serious questions on assessment

In a rather astonishing blind trial study (markers were unaware), Scarfe (2024) inserted GenAI-written submissions into an existing examination system. The study covered five undergraduate modules across all years of the BSc Psychology at Reading University.

The results were, to my mind, not surprising but nevertheless quite shocking.

94% of AI submissions undetected

AI submission grades on average half a grade higher than students'

What lessons can we learn from this paper?

First, faculty can’t distinguish AI from student work in exams (94% undetected). Second, and predictably, AI outperformed students by half a grade. This is again unsurprising, as a much larger study, Ibrahim (2023), 'Perception, performance, and detectability of conversational artificial intelligence across 32 university courses', showed that ChatGPT’s performance was comparable, if not superior, to that of students across 32 university courses. They added that AI detectors cannot reliably detect ChatGPT’s output, as they too often claim that human work is AI-generated, and the text can be edited to evade detection.

Framing everything as plagiarism?

More importantly, there is an emerging consensus among students to use the tool, and among educators to treat its use as plagiarism.

The paper positions itself as a ‘Turing test’ case study. In other words, the hypothesis was that GPT-4 exam outputs are largely indistinguishable from human ones. In fact, on average the AI submissions scored higher. The authors saw this as a study about plagiarism, but there are much bigger issues at stake. As long as we frame everything as a plagiarism problem, we will miss the more important, and harder, questions.

This is becoming untenable.

Even primitive prompts suffice?

As Dominik Lukes at the University of Oxford, an expert in AI, noted about this new study: “The shocking thing is not that AI-generated essays got high grades and were not detected. Because of course, that would be the outcome. The (slight) surprise to me is that it took a minimal prompt to do it.”

The authors write that they “used a standardised prompt to GPT-4 to produce answers for each type of exam”. For SAQ exams the prompt was:

Including references to academic literature but not a separate reference section, answer the following question in 160 words: XXX

For essay-based answers the prompt was:

Including references to academic literature but not a separate reference section, write a 2000 word essay answering the following question: XXX

In other words, with the absolute minimum of effort, undetectable AI submissions are possible and produce better results than students do.

Are the current methods of assessment fit for purpose?

Simply asking for short or extended text answers seems to miss many of the skills that an actual psychologist requires. Text assessment is often the ONLY form of assessment in Higher Education, yet psychology is a subject that deals with a much wider range of skills and modalities.

Can marking be automated?

I also suspect that the marking could be completely automated. The simple fact that a simple prompt is enough to score higher than the average student suggests that the content is easily assessable. Rather than have expensive faculty do the assessing, provide machine-assessed feedback to faculty so that they can give more useful feedback to students.

Will this problem only get bigger?

It is clear that a simple prompt with the question suffices to exceed student performance. I would assume that GPT-5 will greatly exceed student performance. This leads into a general discussion about whether white-collar jobs, in psychology or, more commonly, the jobs psychology graduates actually get, require this time and expense (to both state and student). Wouldn’t we be better focusing on training for more specific roles in healthcare and other fields, such as HR and L&D, with modules in psychology?

Should so many be doing a Psychology degree?

There were 140,425 enrolments in psychology in 2022/23. Of these, 77.2% are female, and the numbers have grown massively over the years. This seems to conform to the gender-stereotypical idea that females tend to choose subjects with a more human angle, as opposed to other sciences. Relatively few will get jobs in fields directly, or even indirectly, related to psychology. It is easy to claim that just because you studied psychology you have some actual ability to deal with people better.

The Office for Students (OfS) suggests that a minimum of 60% of graduates enter professional employment or further study. Graduate employability statistics for universities and the government are determined by the Graduate Outcomes survey, which measures a graduate's position, whether in employment or not, 15 months after finishing university. Yet at that 15-month point, the proportion of Psychology graduates undertaking further education or finding a professional job is relatively small compared with graduates of vocational degrees (Woolcock & Ellis, 2021).

Many graduates' career goals are still generic, centring on popular careers such as academia and clinical and forensic Psychology, which limits their view of alternative and more easily accessible, but less well paid, jobs (Roscoe & McMahan, 2014). It actually takes years (3-5), even for the very persistent, to get a job as a Psychology professional, due to postgraduate training and work experience requirements (Morrison Coulthard, 2017).

AI and white-collar jobs?

If GPT-4 performs at this level in the exams across the undergraduate degree, then it is reasonable, if not likely, to expect the technology to do a passable job now in jobs that require a psychology degree. Do we really need to put hundreds of thousands of young people through a full degree in this subject, at such expense and with subsequent student debts, when many of the tasks are likely to be seriously affected by AI? Many end up in jobs that do not actually require a degree, and certainly not a Psychology degree. It would seem that we are putting more and more people through this subject for fewer and fewer available jobs, and even then, their target market is likely to shrink. It seems like a mismatch between supply and demand. This is a complex issue but worthy of reflection.


Rather than framing everything as a plagiarism problem and seeing things as a cat-and-mouse game (where the mice are winning), there needs to be a shift towards better forms of assessment. More than this, we need some serious reflection on why we are teaching and assessing, at great expense, in subjects that are vocational in nature but have few jobs available.
