Wednesday, July 04, 2018

Data is not the new oil, more likely the new snakeoil….

Before jumping into data analytics projects one must really think hard about what 'data' actually is. The problem with many of these projects is that they can turn into 'data' projects and not business projects.
Oil is messy and dirty (crude) when it comes out of the ground but it is largely useful when fractionated. Up to 50% is used for petrol, 20% for distillate fuel (heating oil and diesel fuel) and 8% for jet fuel. The rest has many other useful purposes. The unwanted elements and compounds are a tiny percentage. Data is not the new oil. It’s stored in weird ways and places, is often old, useless, messy, embarrassing, secret, personal, observed, derived or analytic, may need to be anonymised or identified as training sets, and is subject to GDPR. To quote that old malapropism, ‘data is a minefield of information’!

1. Data dumps
Data is really messy, with much of it in:
  • odd data structures
  • odd formats/encrypted
  • different databases
Just getting a hold of the stuff is difficult.

2. Defunct data
Then there’s the problem of relevance and utility, as much of it is:
  • old
  • useless
  • messy
In fact, much of it could be deleted. We have so much of the stuff because we haven’t known what to do with it, don’t clean it and don’t know how to manage it.

3. Difficult data
There are also problems around data that is:
  • embarrassing
  • secret
There may be very good reasons for not opening up historic data, such as emails and internal communications. It may open up sizeable legal and HR risks for organisations. Think WikiLeaks email dumps. It’s not like a barrel of oil, more like a can of worms. Like oil spills, we also have data leaks.

4. Different data
Once cleaned, one can see that there are many different types of data. Unlike oil, it has not so much fractions as different categories. In learning we can have ‘Personal’ data, provided by the person or generated by actions performed by that person with their full knowledge. This may be gender, age, educational background, needs, stated goals and so on. Then there’s ‘Observed’ data from the actions of the user, their routes, clicks, pauses and choices. You also have ‘Derived’ data, inferred from existing data to create new data, and higher level ‘Analytic’ data from statistical and probability techniques related to that individual. Data may be created on the fly or stored.

5. Anonymised data
Just when you thought it was getting clearer, you also have ‘Anonymised’ data, which is a bit like oil of an unknown origin. It is clean of any attributes that may relate it to specific individuals. This is rather difficult to achieve, as there are often techniques to reverse engineer attribution to individuals.
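To make the re-identification risk concrete, here is a minimal sketch of a k-anonymity style check: it counts how many records share the same combination of supposedly harmless attributes. The records, the field names and the `smallest_group_size` helper are all made up for illustration; a real audit would need far more than this.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values for the given fields (the 'k' in k-anonymity)."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())

# Hypothetical learner records with names already stripped out.
records = [
    {"age_band": "18-24", "gender": "F", "course": "Maths"},
    {"age_band": "18-24", "gender": "F", "course": "Maths"},
    {"age_band": "25-34", "gender": "M", "course": "Maths"},
    {"age_band": "25-34", "gender": "M", "course": "History"},
]

# A result of 1 means at least one person is unique on these fields alone.
print(smallest_group_size(records, ["age_band", "gender", "course"]))
print(smallest_group_size(records, ["age_band", "gender"]))
```

With all three fields at least one learner is unique, so removing names has not really anonymised them. Dropping the course field leaves every learner in a group of two, which is better but still thin.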

6. Supervised data
In AI there’s also ‘Training’ data used for training AI systems and ‘Production’ data which the system actually uses when it is launched in the real world. This is not trivial. Given the problems stated above, it is not easy to get a suitable data set, which is clean and reliable for training. Then, when you launch the service or product, the new data may be subject to all sorts of unforeseen problems not uncovered in training.
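One simple way to catch this kind of surprise is to compare what turns up in production with what the system saw in training. The sketch below is only an illustration under stated assumptions: the `unseen_values` helper, the `device` field and the rows are all hypothetical, and real monitoring would also look at distributions, missing values and more.

```python
def unseen_values(training_rows, production_rows, field):
    """Return the values of `field` that appear in production data
    but never appeared in the training data, in sorted order."""
    seen_in_training = {row[field] for row in training_rows}
    seen_in_production = {row[field] for row in production_rows}
    return sorted(seen_in_production - seen_in_training)

# Hypothetical rows: the system was trained on desktop and laptop users only.
training = [
    {"device": "desktop", "dwell_seconds": 40},
    {"device": "laptop", "dwell_seconds": 55},
]
production = [
    {"device": "desktop", "dwell_seconds": 38},
    {"device": "mobile", "dwell_seconds": 12},
]

print(unseen_values(training, production, "device"))
```

Anything this returns is a category the system has never learned from, so its output for those learners deserves extra suspicion.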

7. Paucity of data
But the problems don’t stop there. In the learning world, the data problem is even worse as there is another problem – the paucity of data. Institutions are not gushing wells of data. Universities, for example, don’t even know how many students turn up for lectures. Data on students is paltry. The main problem with the use of data in learning is that we have so little of the stuff. SCORM, which has been around for 20 plus years, literally stopped the collection of data with its focus on completion. This was the result of a stupid decision by a bunch of folk at ADL. This makes most data analytics projects next to useless. The data can be best handled in a spreadsheet. It is certainly not as large, clean and relevant as it needs to be to produce genuine insights.

Prep
Before entering these data analytics projects ask yourself some serious questions about 'data'. Data size by itself is overrated, but size still matters: whether n = tens, hundreds, thousands or millions, the Law of Small Numbers still applies. Don’t jump until you are clear about how much relevant and useful data you have, where it is, how clean it is and in what databases.
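To see why n matters, here is a rough sketch of how the uncertainty around an observed percentage shrinks as the sample grows. It uses the standard normal approximation for a proportion, which is itself unreliable at very small n; the 50% completion rate and the `margin_of_error` helper are assumptions for illustration only.

```python
import math

def margin_of_error(proportion, n, z=1.96):
    """Approximate 95% margin of error for an observed proportion,
    using the normal approximation (shaky when n is very small)."""
    return z * math.sqrt(proportion * (1 - proportion) / n)

# Hypothetical case: 50% of the learners in a sample completed a module.
for n in (10, 100, 1_000, 1_000_000):
    print(f"n = {n:>9,}: 50% +/- {margin_of_error(0.5, n):.1%}")
```

With ten learners the margin is roughly 31 percentage points either way, so the 'insight' could be almost anything. With a hundred it is still close to 10 points.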
New types of data may be more fruitful than legacy data. In learning this could be dwell time on questions, open input data, wrong answers to questions and so on.
More often than not, what you have as data is really proxies for phenomena. Be careful here, as your oil may actually be snakeoil.
Conclusion
GDPR has made the management and use of data more difficult. All of this adds up to what I’d call the ‘data delusion’, the idea that data is one thing, easy to manage and generally useful in data analytics projects in institutions and organisations. In general, it is not. That's not to say you should ignore its uses - just don't get sucked into data analytics projects in learning that promise lots but deliver little. Far better to focus on the use of data in adaptive learning or small-scale teaching and learning projects where relatively small amounts of data can be put to good use.

