Liberating and future-proofing your research data

You should store and document all research data in such a way that a complete stranger could open it and figure out how to use it. You may not think anyone will ever come along after you and use your data, results, code, or simulations. That might be true, but consider this: in 10 years, will you have any advantage over a total stranger when it comes to interpreting your undocumented data and processes?

Not as much of a margin as you think, I’d bet.

I recently had to liberate several gigabytes of data from a directory of SigmaPlot files, which I received from a colleague. These files contained experimental results I am using to validate numerical models. SigmaPlot appears to be a powerful tool (the graphs were beyond outstanding), but getting the data out was a time-consuming, tedious process. I had to install and activate an evaluation copy (I was fortunate this existed) and traverse workbook by workbook within each file, since the Excel export only works on a per-workbook basis. Of course, I don’t use Excel, so I had further work to do to prepare everything for processing in Python.

All of this would have been moot if the data had been stored as CSV or plain text. I can open and process data stored in CSV on any operating system with a large number of tools, for free. And I am confident that in 10 years’ time, I will be able to do the same.
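To make that concrete, here is a minimal sketch of reading a CSV file using nothing but Python's standard library. The column names and values are made up for illustration (they are not my colleague's data), and `read_columns` is just a helper name I picked.

```python
import csv
import io

# Hypothetical sample standing in for an exported measurement file.
SAMPLE = "time_s,temperature_C\n0.0,20.1\n0.5,20.4\n1.0,21.0\n"


def read_columns(handle):
    """Read a CSV with a header row into a dict of column name -> list of floats."""
    reader = csv.DictReader(handle)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return columns


# io.StringIO lets the sketch run without a file on disk;
# a real file opened with open(path, newline="") works the same way.
columns = read_columns(io.StringIO(SAMPLE))
print(columns["temperature_C"])
```

No license, no evaluation copy, and no third-party packages are needed; any tool that understands comma-separated text can do the equivalent.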

This experience solidified my resolve to design my research processes so as to minimize friction for anyone in the future who might want to work with whatever files or data I leave behind.

In practice, for me this means (where possible!):

  • Open source beats closed source.
  • Ubiquitous beats niche software.
  • Automation/scripting beats manual processes.
  • Plain text beats binaries.
  • A README in every project directory.
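As one possible way to combine several of the points above (scripting, plain text, and a README), here is a rough sketch of a small export step that writes a CSV file and a README describing its columns side by side. The function name `export_dataset`, the file layout, and the sample values are all assumptions for illustration, not a prescribed workflow.

```python
import csv
import tempfile
from pathlib import Path


def export_dataset(directory, name, columns, rows, description):
    """Write rows to <name>.csv and describe the columns in README.txt.

    `columns` maps each column name to a one-line explanation (hypothetical layout).
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)

    # Plain-text data: a header row followed by one line per record.
    csv_path = directory / f"{name}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns.keys())
        writer.writerows(rows)

    # Plain-text documentation that travels with the data.
    readme_path = directory / "README.txt"
    lines = [description, "", f"File: {csv_path.name}", "Columns:"]
    lines += [f"  {col}: {meaning}" for col, meaning in columns.items()]
    readme_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return csv_path, readme_path


# Example run in a throwaway directory with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    _, readme = export_dataset(
        tmp,
        "run01",
        {"time_s": "elapsed time in seconds", "temperature_C": "temperature in Celsius"},
        [(0.0, 20.1), (0.5, 20.4)],
        "Example experimental run (illustrative data only).",
    )
    print(readme.read_text(encoding="utf-8"))
```

Because the README is generated by the same script that writes the data, the two are less likely to drift apart than if the documentation were maintained by hand.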

I could go on about this, but I’ll stop here for now. I would love to hear from the researchers in the crowd, so I’ll ask:

What measures do you implement to future-proof your data and processes?

Update: Eugene Wallingford wrote a really thoughtful response to this post called “The Future of Your Data” that I’d love for you to check out.
