Thursday, August 08, 2024

Benchmarks on factual accuracy in LLMs

At last, the paper we’ve all been waiting for. It provides a benchmark for factual accuracy against established knowledge bases.

STUDY

"WILDHALLUCINATIONS: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries" generated a ton of stuff (118,785 generations), from 15 LLMs on 7,919 topics.

Performance varies across domains, but it is clear that entities with Wikipedia pages do better.

RESULTS


Larger and more recent LLMs are better than older and smaller ones, with GPT-4o and GPT-3.5 showing the highest accuracy. Some models opt to abstain from answering on the more challenging queries. Another interesting conclusion was that open-source models need to raise their game, as they performed worse than closed models.

What was clear is that results are better when the topic has a Wikipedia page or similar. Unsurprisingly, models tend to have lower factual accuracy on rarer entities and edge cases, where good structured sources are less likely to be available.
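To make the coverage point concrete, here is a hypothetical aggregation: given per-entity results and a flag for whether the entity has a Wikipedia page, compare average factual precision and abstention rates across the two groups. The field names (has_wiki, abstained, factual_precision) are assumptions for illustration, not the paper's schema.

```python
# Hypothetical aggregation of per-entity results by Wikipedia coverage.
# Field names are assumed for illustration only.
from statistics import mean


def summarise(results: list[dict]) -> dict:
    summary = {}
    for has_wiki in (True, False):
        group = [r for r in results if r["has_wiki"] == has_wiki]
        answered = [r for r in group if not r["abstained"]]
        summary["wiki" if has_wiki else "no_wiki"] = {
            "n": len(group),
            "abstention_rate": 1 - len(answered) / len(group) if group else 0.0,
            "avg_factual_precision": mean(r["factual_precision"] for r in answered) if answered else 0.0,
        }
    return summary


# Example usage with made-up numbers:
example = [
    {"has_wiki": True, "abstained": False, "factual_precision": 0.92},
    {"has_wiki": False, "abstained": True, "factual_precision": 0.0},
    {"has_wiki": False, "abstained": False, "factual_precision": 0.71},
]
print(summarise(example))
```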

CONCLUSIONS
I'm not as obsessed by hallucinations as some. The obsession comes from the expectation that LLMs should be truth engines, when they are tools that help us with tasks. Search engines return false results, books are full of mistakes, and teachers make mistakes, as many subjects are now taught by teachers who do not have a degree in that subject.

This is good news for school educators, as the topics covered invariably have good Wikipedia pages or similar, and school-level curricula tend to be well covered in knowledge bases. I'd say this is also true for undergraduate courses, especially 101 courses, which is the area I'd focus on for AI support.

This is even truer for business applications, where factual ground truth matters less and the work is much more qualitative.

It is also important to remember that this paper shows things are getting a lot better, and that improvement will continue.
