Thinking about Open Data, with a little help from the Data Hub | SYS-CON MEDIA
abernard102@gmail.com 2012-08-03
Summary:
“Continuing to explore the adoption of explicit Open Data licenses, I’ve been having a trawl through some of the data in the Open Knowledge Foundation’s Data Hub. I’m disappointed – but not surprised – by the extent to which widely applicable Open Data licenses are (not!) being applied... Governments are releasing data, making them more accountable, (possibly) saving themselves money by avoiding the need to endlessly answer Freedom of Information requests, and providing the foundation upon which a whole new generation of websites and mobile apps are being built. Museums and Libraries are releasing data, increasing visibility of their collections and freeing these institutional collections from their decades-long self-imposed exile in the ghetto of their own web sites. Scientists are beginning to release their data, making it far easier for their peers to engage in that fundamental principle of science: the reproduction of published results. Open Data is good, and useful, and valuable, and increasingly visible. But without a license telling people what they can and cannot do, how much use is it? I’m running a short survey, inviting people to describe their own licensing choices. I’ve also taken a look at the Data Hub, which ‘contains 4004 datasets that you can browse, learn about and download...’ I began by querying the Data Hub’s API, to discover the set of permissible licenses. This resulted in a set of 15 possible values... Fully 50% of the records either explicitly state that there is no license (14), explicitly state that the license is ‘not specified’ (604), explicitly record a null value (523), or fail to include the license_id attribute at all (874). Given all of the effort that has gone into evangelising the importance of data licensing, and all the effort that Data Hub contributors have gone to in collecting, maintaining and submitting their data in the first place, that really isn’t very good at all...
If we remove the 2,015 unlicensed records and the 31 errors ... the picture becomes somewhat clearer... The licenses that many have worked so hard to promote for open data (CC0, the Open Data Commons family and – in some circumstances – CC-BY) are far less prevalent than I’d expected them to be. 125 resources are licensed CC0, 273 CC-BY, 119 ODC-PDDL, 61 ODC-ODBL, and 36 ODC-BY. That’s a total of 614 out of 1,966 licensed resources, or just 31%. 44% of the 614 are licensed CC-BY, an attribution license based upon copyright rather than database rights. At least some of those may therefore be wrongly licensed. The two core data licenses are almost tied (125 for CC0, 119 for ODC-PDDL), but together account for a tiny 12% of all the licensed resources in the Data Hub. The picture’s not all bad, as there is clearly a move toward the principle of ‘open’ and ‘public domain’ licenses. CC0 (125) and ODC-PDDL (119) are joined by 167 data sets licensed with some other public domain license. And with 444 data sets, ‘other open license’ is the single most popular choice; almost one quarter of the licensed data sets use an open license that is not one of the mainstream ones. In total, the Creative Commons family of licenses (including the odd ‘sharealike’ variant and the hugely annoying ‘noncommercial’ anachronism) account for 602 data sets, or 31%. The Open Data Commons family account for 216, or 11%. By most measures, we should probably welcome the use of any open or public domain license. But the more choices there are, the more scope there is for confusion, contradiction, and a lack of interoperability.... License proliferation is friction... Let’s work to eradicate the ‘None/Not Specified’ category altogether, and then see what we can do to shrink all of the ‘Other’ categories.”
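
The percentages quoted above can be re-derived from the counts given in the summary. The following is a minimal Python sketch, not the author's actual script: the dictionary names, the `share` and `tally_licenses` helpers, and the sample record layout are all assumptions made for illustration. Only the numbers and the `license_id` attribute name come from the quoted text, and the total of 1,966 licensed resources is used as given.

```python
from collections import Counter

# Counts as quoted in the summary; the variable names are illustrative.
TOTAL_DATASETS = 4004
LICENSED_TOTAL = 1966  # used as stated in the text
UNLICENSED = {
    "no license": 14,
    "not specified": 604,
    "null value": 523,
    "license_id missing": 874,
}
PROMOTED = {"CC0": 125, "CC-BY": 273, "ODC-PDDL": 119, "ODC-ODBL": 61, "ODC-BY": 36}


def share(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


def tally_licenses(records):
    """Hypothetical helper: count license_id values across dict records,
    keeping 'attribute missing' and 'explicit null' as separate buckets."""
    counts = Counter()
    for record in records:
        if "license_id" not in record:
            counts["(missing)"] += 1
        elif record["license_id"] is None:
            counts["(null)"] += 1
        else:
            counts[record["license_id"]] += 1
    return counts


unlicensed_total = sum(UNLICENSED.values())
promoted_total = sum(PROMOTED.values())
core_total = PROMOTED["CC0"] + PROMOTED["ODC-PDDL"]
odc_total = PROMOTED["ODC-PDDL"] + PROMOTED["ODC-ODBL"] + PROMOTED["ODC-BY"]

print("unlicensed:", unlicensed_total, f"({share(unlicensed_total, TOTAL_DATASETS)}%)")
print("promoted licenses:", promoted_total, f"({share(promoted_total, LICENSED_TOTAL)}%)")
print("CC-BY share of promoted:", f"{share(PROMOTED['CC-BY'], promoted_total)}%")
print("CC0 + ODC-PDDL:", core_total, f"({share(core_total, LICENSED_TOTAL)}%)")
print("Open Data Commons family:", odc_total, f"({share(odc_total, LICENSED_TOTAL)}%)")
```

Given a list of dataset records fetched by whatever means, `tally_licenses(records)` would produce the kind of per-license breakdown the survey describes; separating missing attributes from explicit nulls mirrors the distinction drawn in the text.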