Wanted: a Rosetta Stone for data, says Cerf - E & T Magazine
abernard102@gmail.com 2012-06-26
Summary:
“Cerf has warned of the issue in previous speeches but told E&T that research is needed into a variety of areas to prevent data from becoming unusable over a period of time. He said in his speech to an audience of leading computer-science researchers: ‘In 100 years’ time we may have all the bits but not understand what they mean’. ‘Publication’ is no longer confined to print, Cerf continued: ‘The notion of publication will expand to include not just audio, video and text but databases and objects. But that leads to another big problem. I call it the digital bit-rot hazard.’ He adds: ‘We need to preserve the software that interprets them and new intellectual property structures to ensure it happens. If people won’t make the IP available in perpetuity, we may have a problem’. Cerf also told E&T: ‘We are already experiencing this problem. People have photographs whose encodings are already not available. And there are institutions - such as the National Archives - that are faced with this in a very direct way. They receive digital archives from administrations that they are trying to curate.’ But the file formats used to store messages sent by officials and politicians as well as the databases they used to help make decisions could prove inaccessible over time. ‘If we don’t do something about this, we will have evaporating history.’ The work needs to proceed on a variety of fronts, Cerf explained, but each solution has problems. One option is for companies to guarantee backward compatibility or release source code when a file format is orphaned by later releases, but this is far from universally accepted. ‘If you are going to upgrade the software in a way that is not compatible, the IP regime should say that if you don’t allow for the source code to be available that you have continued object-code availability instead. But it gets worse, because the code may only run on certain operating systems and they may only run on certain types of hardware.
How do you keep the hardware running in perpetuity? You can’t. You could use emulators running in the cloud but very, very quickly you get into substantial complexity and the issues of protecting IP.’ Cerf declared that ‘There is research to be done on IP regimes that would permit persistent access to software for interpreting complex files, as well as on self-defining data structures.’ Self-defining data would separate the need for the original program from the actual data, in effect providing a Rosetta Stone that would let curators determine the meaning contained in each file. But this is a long way from reality. Cerf explained that existing techniques are good for defining the syntax of data files – such as the way in which fields are structured – but they do a poor job of reflecting the semantics. The problem is whether it is possible to create a language or way of describing data completely that is more concise or more portable than the original software itself. ‘It sounds like a problem that Turing himself might formulate,’ he concluded.”
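The idea of self-defining data described in the summary can be illustrated with a small sketch. It is not anything Cerf or E&T proposed. The container layout (a length-prefixed JSON header in front of a binary payload), the field names, and the helper functions `pack` and `unpack` are all hypothetical, chosen only to show a file that carries its own description.

```python
import json
import struct

# Hypothetical self-describing container: a JSON header that documents
# the binary payload, followed by the payload itself.
SCHEMA = {
    "byte_order": "big",
    "fields": [
        {"name": "station_id", "type": "uint16",
         "meaning": "weather station number"},
        {"name": "temperature", "type": "float32",
         "unit": "degrees Celsius", "meaning": "air temperature at 2 m"},
    ],
}

# Mapping from the type names used in the header to struct format codes.
# A future reader would still need to know this convention.
STRUCT_CODES = {"uint16": "H", "float32": "f"}


def _struct_format(schema):
    """Build a struct format string from the field list in the header."""
    prefix = ">" if schema["byte_order"] == "big" else "<"
    return prefix + "".join(STRUCT_CODES[f["type"]] for f in schema["fields"])


def pack(schema, values):
    """Serialise values with the schema embedded in front of the payload."""
    header = json.dumps(schema).encode("utf-8")
    payload = struct.pack(_struct_format(schema), *values)
    # Layout: 4-byte big-endian header length, then header, then payload.
    return len(header).to_bytes(4, "big") + header + payload


def unpack(blob):
    """Decode a blob using only the description stored inside it."""
    header_len = int.from_bytes(blob[:4], "big")
    schema = json.loads(blob[4:4 + header_len].decode("utf-8"))
    values = struct.unpack(_struct_format(schema), blob[4 + header_len:])
    return [dict(field, value=value)
            for field, value in zip(schema["fields"], values)]


blob = pack(SCHEMA, (42, 21.5))
for item in unpack(blob):
    print(item["name"], item["value"], item.get("unit", ""), "-", item["meaning"])
```

The sketch also shows the limits the summary points to. The header captures syntax (field order, types, byte order) in a machine-readable way, but the "meaning" entries are free text that only a human can interpret, and a reader must still know JSON and the type-name convention to get started.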