Friday, October 19, 2018

"On average, humans have one testicle”... that's why most data analytics projects in education are misleading....

AI will have a huge impact in learning. It changes everything… why we learn, what we learn and how we learn. Of course, AI is not one thing. It is a huge array of mathematical and software techniques. Yet looking at the spend in education and training, people have been drawn to one very narrow area – data analytics. This, I think, is a mistake.

Much of this so-called use of AI is like going over the top of your head with your right hand to scratch your left ear. Complex algorithmic and machine learning approaches are likely to be more expensive, and far less reliable and verifiable, than simple measures such as using a spreadsheet or presenting what little data you have to faculty or managers in a visualized, digestible form. Beyond this, traditional statistics is likely to prove more fruitful. Data analytics has taken on the allure of AI, yet much of it is actually plain old statistics.

Data problems
Data is actually a huge problem here. They say that data is the new oil; more likely it is the new snake oil. It is stored in weird ways and places, and is often old, useless, messy, embarrassing, personal and even secret. It may need to be anonymised, have training sets identified and be subject to GDPR. To quote that old malapropism, ‘data is a minefield of information’. It may even be massively misleading, as in the testicle example, where the average describes almost nobody. You must assume, in learning, that your data is quite simply messy.
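To make the testicle point concrete, here is a minimal sketch in Python. The population is made up and deliberately simplified (half the people have none, half have two); the only aim is to show how a perfectly correct average can describe nobody at all.

```python
from statistics import mean, multimode

# Hypothetical, deliberately simplified population (an assumption for
# illustration only): 50 people with 0 and 50 people with 2.
counts = [0] * 50 + [2] * 50

average = mean(counts)                 # the headline "analytic"
modes = sorted(multimode(counts))      # the values people actually have
matching = sum(1 for c in counts if c == average)

print(f"mean = {average}")
print(f"most common values = {modes}")
print(f"individuals who match the mean: {matching}")
```

The same trap is waiting in any learning data that falls into two or more distinct groups, such as students who never log in and students who log in daily. An average across both tells you little about either.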

Paucity of data
The data problem is even worse than mere messiness, as there is another problem – the paucity of data. Institutions are not gushing wells of data. Universities, for example, don’t even know how many students turn up for lectures. I can tell you that the actual data, when collected, paints a picture of catastrophic absence. Data on students is paltry. The main problem with the use of data in learning is that we have so little of the stuff.
SCORM, which has been around for 20-plus years, literally stopped the collection of data with its focus on completion. This makes most data analytics projects next to useless. The data can be handled in a spreadsheet. It is certainly not as large, clean and relevant as it needs to be to produce deep and genuine insights.
Other data sources are similarly flawed, as there's little in the way of fine-grained data about actual performance. These are small data sets, often messy, poorly structured and not understood.

Data dumps
Data is often not as clean as you think it is, with much of it in:
  • odd data structures
  • odd or encrypted formats
  • different databases
Just getting hold of the stuff is difficult.

Defunct data
Then there’s the problem of relevance and utility, as much of it is:
  • old
  • useless
  • messy
In fact, much of it could be deleted. We have so much of the stuff because we simply haven’t known what to do with it, don’t clean it and don’t know how to manage it.

Difficult data
There are also problems around data that can be:
  • embarrassing
  • secret
There may be very good reasons for not opening up historic data, such as emails and internal social communications. It may open up sizeable legal and HR risks for organisations. Think Wikileaks email dumps. Data is not like a barrel of oil, more like a can of worms.

Different types of data
Once cleaned, one can see that there are many different types of data. Unlike oil, it comes not so much in 'fractions' as in different categories. In learning we can have ‘Personal’ data, provided by the person or generated by actions that person performs with their full knowledge. This may be gender, age, educational background, needs, stated goals and so on. Then there’s ‘Observed’ data from the actions of the user: their routes, clicks, pauses, choices and answers. You also have ‘Derived’ data, inferred from existing data to create new data, and higher level ‘Analytic’ data from statistical and probability techniques related to that individual. Data may also be created on the fly.

Just when you thought it was getting clearer, you also have ‘Anonymised’ data, a bit like oil of an unknown origin. It is clean of any attributes that may relate it to specific individuals. This is rather difficult to achieve, as there are often techniques to reverse engineer attribution to individuals.
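A rough sketch of why anonymisation is hard: strip out the names and a combination of ordinary attributes can still single someone out. The records, field names and the `unique_records` helper below are all invented for illustration; this shows one simple check, not a full account of re-identification techniques.

```python
from collections import Counter

# Hypothetical "anonymised" student records: names removed, but a few
# ordinary attributes (quasi-identifiers) remain.
records = [
    {"age": 19, "gender": "F", "course": "Physics"},
    {"age": 19, "gender": "F", "course": "Physics"},
    {"age": 20, "gender": "M", "course": "History"},
    {"age": 43, "gender": "F", "course": "Physics"},
    {"age": 20, "gender": "M", "course": "History"},
    {"age": 21, "gender": "M", "course": "Law"},
]

def unique_records(rows, keys):
    """Return the rows whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if combos[tuple(row[k] for k in keys)] == 1]

# Anyone who already knows these three facts about a person can pick
# that person's row out of the "anonymous" data.
exposed = unique_records(records, ("age", "gender", "course"))
print(f"{len(exposed)} of {len(records)} records are unique on age, gender and course")
```

The smaller the data set, the more likely it is that a mature student or a lone student on a niche course stands out in exactly this way.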

In AI there’s also ‘Training’ data, used for training AI systems, and ‘Production’ data, which the system actually uses when it is launched in the real world. This is not trivial. Given the problems stated above, it is not easy to get a suitable data set that is clean and reliable for training. Then, when you launch the service or product, the new data may be subject to all sorts of unforeseen problems not uncovered in the training process. This is a rock on which many AI projects founder.

Data preparation
Before entering these data analytics projects, ask yourself some serious questions about ‘data’. Data size by itself is overrated, but size still matters: whether n is in the tens, hundreds, thousands or millions, the Law of Small Numbers still applies. Don’t jump until you are clear about how much relevant and useful data you have, where it is, how clean it is and in what databases.
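Here is a small simulation of what I take the Law of Small Numbers to mean in practice: small samples swing wildly even when nothing real differs between them. Everything in it is an assumption for illustration: the 70% true pass rate, the cohort sizes and the `observed_pass_rates` helper.

```python
import random

def observed_pass_rates(cohort_size, n_cohorts, true_rate, rng):
    """Simulate cohorts that all share the same true pass rate and
    return the pass rate actually observed in each one."""
    rates = []
    for _ in range(n_cohorts):
        passes = sum(1 for _ in range(cohort_size) if rng.random() < true_rate)
        rates.append(passes / cohort_size)
    return rates

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_RATE = 0.7           # assumed: every cohort is genuinely identical

small = observed_pass_rates(10, 200, TRUE_RATE, rng)     # seminar-sized groups
large = observed_pass_rates(1000, 200, TRUE_RATE, rng)   # institution-sized groups

small_spread = max(small) - min(small)
large_spread = max(large) - min(large)

print(f"cohorts of 10:   observed pass rates span {small_spread:.2f}")
print(f"cohorts of 1000: observed pass rates span {large_spread:.2f}")
```

The small cohorts appear to contain star groups and failing groups, yet every one was generated from the same underlying rate. A dashboard that ranks courses or lecturers on a handful of students is mostly ranking noise.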

New types of data may be more fruitful than legacy data. In learning this could be dwell time on questions, open input data, wrong answers to questions and so on. More often than not, what you have as data are really proxies. 

Action not analytics
The problem with spending all of your money on diagnosis, especially when the diagnosis is an obvious, limited set of possible causes that were probably already known, is that the money is usually better spent on treatment. Look at improving student support, teaching and learning, not dodgy diagnosis.
In practice, even when those amazing (or not so amazing) insights come through, what do institutions actually do? Do they record lectures because students with English as a foreign language find some lecturers difficult and the psychology of learning screams at us to let students have repeated access to resources? Do they tackle the issue of poor teaching by specific lecturers? Do they question the use of lectures? Do they radically reduce response times on feedback to students? Do they drop the essay as a lazy and monolithic form of assessment? Or do they waffle on about improving the ‘student experience’ where nothing much changes?

I work in AI in learning, have an AI learning company, invest in AI EdTech companies, am on the board of two other AI learning companies, speak on the subject all over the world and write constantly on the subject. You’d expect me to be a big fan of data analytics and recommendation engines - I’m not. Not yet. I’d never say never, but so much of this seems like playing around with the problem, rather than facing up to solving the problem. That's not to say you should ignore its uses - just don't get sucked into data analytics projects in learning that promise lots but deliver little. Far better to focus on the use of data in adaptive learning or small scale teaching and learning projects, where relatively small amounts of data can be put to good use.

AI is many things, and a far better use of AI in learning is, in my opinion, to improve teaching through engagement, support, personalised and adaptive learning, better feedback, student support, active learning, content creation (WildFire) and assessment. All of these are available right now. They address the REAL problem – teaching and learning.
