I’ve been taking lots of digital photographs recently, and I’ve started to worry whether I’ll still be able to view them in thirty or so years. I have several old documents in PageMaker 4 format which I know I’ll never be able to read again. So, will this happen with my photos? Are my JPEGs future-proofed?
I’m not worried about the physical media becoming obsolete. Our ability to store data has constantly increased. Every bit of data I have is kept on my hard drive. When I change machines, I copy it wholesale onto the new one. I never “archive off” old material onto tape/CD/DVD to free up space, because my hard drive is always larger than my storage needs. I’m fairly confident that in 30 years’ time, I will still be able to access the raw sequence of bits which make up each photograph.
Will I be able to view these bits as photographs though? In the year 2033, I could probably emulate today’s hardware/software and still run exactly the same utility to view my photos. It’s a bit of a heavyweight solution though. I don’t really want to snapshot the current version of the Linux kernel, XFree86, GNOME and gThumb just so I can view photos sometime in the future.
It seems more sensible to stash away a description of the JPEG file format – that way, even if no one else wants to view JPEGs, I can still code up a viewer because I know what the sequence of bits means.
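To give a flavour of what “knowing what the bits mean” looks like, here is a rough sketch that only walks the outer structure of a JPEG file. It relies on two facts from the format: the file starts with the bytes FF D8, and most segments after that are a marker (FF followed by a code byte) plus a two-byte big-endian length that counts itself. The names `list_segments` and `STANDALONE` are my own inventions for illustration, and this is nowhere near a decoder; it stops as soon as it reaches the compressed image data.

```python
import struct

# Markers that stand alone, with no length field after them:
# TEM (0x01), SOI (0xD8), EOI (0xD9) and the restart markers RST0..RST7.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def list_segments(data: bytes):
    """Return (marker_code, payload_length) pairs, stopping at Start of Scan."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker FF D8")
    segments = [(0xD8, 0)]
    pos = 2
    while pos < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        # A marker may be preceded by any number of FF fill bytes.
        while pos < len(data) and data[pos] == 0xFF:
            pos += 1
        if pos >= len(data):
            break
        marker = data[pos]
        pos += 1
        if marker in STANDALONE:
            segments.append((marker, 0))
            if marker == 0xD9:  # End of Image
                break
            continue
        # The length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        segments.append((marker, length - 2))
        if marker == 0xDA:  # Start of Scan: compressed data follows
            break
        pos += length
    return segments
```

Calling `list_segments(open("photo.jpg", "rb").read())` on a real file (the filename is just a placeholder) should list the header segments that come before the image data. The hard part, of course, is everything this sketch skips.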
But how should I do that? Storing the source code of a C++ or Java JPEG viewer isn’t going to be much use, because In The Future it’ll be pretty hard to figure out what the semantics of C++ or Java were in the year 2003. I’d have to stash away a copy of the Java Language Spec too, otherwise I’d just be left with a pile of meaningless squiggly brackets. We learned that lesson when people had to tackle a myriad of COBOL dialects to fix Y2K problems.
Is there any “timeless” programming language which I could use? Something whose semantics I’ll still know in 30 years? Hmm, I guess you could express a JPEG decoder using the lambda calculus, but that’s a bit extreme. The semantics of the lambda calculus will still be understood in 30 years, but you’d have a hard time figuring out what a large “lambda calculus JPEG decoder” actually did.
I think, if I’m going to stash away a description of the JPEG algorithm, it’s probably going to be a good old-fashioned English-language description. That’s probably good enough for 30 years. I don’t think the semantics of the English language and mathematical notation are going to change much in that time. It’s not perfect – just look at the ambiguities which most “English-language” specifications contain. But the language (the medium in which the description is expressed) is probably stable for a good few decades. Maybe this Haskell paper is a good alternative to the published standards document.
I wish the story ended at this point. But it doesn’t! How should I store the description of the algorithm? I don’t expect I’ll be able to read PDF documents in the year 2033. Their semantics are way more complex than those of the JPEG format. You’d have to record the semantics of TrueType fonts, or whatever, in order to display them in the future. PostScript is no better. The PostScript reference manual, which gives an English-language description of the format/language, is over an inch thick.
ASCII or UTF-8 is going to be a good bet. I think we’ll still be able to read that in the future. I probably want to include some mathematical formulae, so maybe I need some mathematical markup too. It begins to sound like XML.
So that’s the next 30 years sorted out. What if I wanted my data to last for 3000 years? Can you store information which transcends changes in language, notation and culture?