By: Mike Aubrey
This month we got the last of our major instruments in place with the arrival of the UV-vis-NIR and FT-IR. This brings our in-house instrument count to six. To keep the data from these instruments organized, accessible, and secure, we are developing a data management plan.
Redundancy: Most importantly, we want to store all instrument data in triplicate: (1) the source file saved by the instrument, (2) a group-accessible cloud copy, and (3) an archived copy with limited access that permanently retains outdated and deleted files.
Stability: We want to use only well-established technologies known for their reliability and widespread adoption. We don’t have the resources to have someone tinkering with data backups all the time. Once in place, the system needs to run without user intervention for long periods of time.
Accessibility: All data collected in the lab should be accessible to lab members from their own computers. Syncing and transferring files should be trivial. The barrier to including and documenting data in a lab notebook should be as close to zero as possible.
We set up rsync on each instrument computer to synchronize the instrument’s default save folder with an exact copy in the local Box Drive folder. Rsync has been a standard file synchronization tool for decades, is free, and continues to be used at large scale. While rsync is native to Linux and built into macOS, it requires a Unix-like environment to run on Windows. We already use Git Bash in the lab, so we simply installed rsync to run within it. Other options, such as Windows Subsystem for Linux (WSL) or Cygwin, would also work. Here is a walkthrough for installing rsync in Git Bash.
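After installation, a quick check from a Git Bash prompt can confirm that rsync is actually reachable. The snippet below is only an illustrative sketch; the function name check_rsync is hypothetical and not part of our setup.

```shell
# Hypothetical helper: report whether rsync is reachable from this shell.
check_rsync() {
    if command -v rsync >/dev/null 2>&1; then
        # Print only the first line of the version banner.
        rsync --version | head -n 1
    else
        echo "rsync not found on PATH" >&2
        return 1
    fi
}
```

Running check_rsync in a fresh Git Bash window should print a version line if the install worked.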
Our standard rsync configuration is given below. The important options are -a, which preserves file metadata from the source, and --delete, which ensures the destination’s file structure is an exact replica of the source. Files that users add to the destination folder are deleted at sync, and files they change there are overwritten by the source version. The other options generate a change log of each sync event.
rsync -raP path/to/source path/to/dest \
--info=COPY2,DEL2,NAME2,BACKUP2,REMOVE2,SKIP2 \
--delete \
--log-file=path/to/log.log
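Because --delete removes anything in the destination that is missing from the source, it can be worth previewing a sync before running it for real. The sketch below is an illustration rather than our exact procedure: it adds --dry-run and --itemize-changes, both standard rsync options, and the function name preview_sync is hypothetical. It also assumes trailing slashes on both paths, which tells rsync to mirror the contents of the source folder rather than nesting the folder itself inside the destination.

```shell
# Hypothetical helper: list what a mirroring sync would change, without changing anything.
# Usage: preview_sync path/to/source path/to/dest
preview_sync() {
    local src="$1" dest="$2"
    # --dry-run          report actions only; no files are copied or deleted
    # --itemize-changes  one line per change; pending deletions start with "*deleting"
    # Trailing slashes (an assumption of this sketch) sync the folder contents.
    rsync -a --delete --dry-run --itemize-changes "$src/" "$dest/"
}
```

Any line beginning with *deleting names a file that a real run with --delete would remove from the destination.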
After setting up rsync, we create a desktop shortcut that runs the rsync script when clicked. Each computer is also scheduled to run the script every day at midnight using Windows Task Scheduler.
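One way to back both the desktop shortcut and the scheduled task is a single small script, so the two always run the same command. The following is a sketch under assumptions, not our exact script: the function name run_sync and the "source not found" log message are hypothetical, the paths are placeholders, and the task name in the comment is made up for illustration. For brevity it passes only -a, --delete, and --log-file; the -P and --info options from our standard configuration can be added unchanged.

```shell
# Hypothetical wrapper around the core of the rsync command.
# Usage: run_sync path/to/source path/to/dest path/to/log.log
run_sync() {
    local src="$1" dest="$2" log="$3"

    # rsync would fail on a missing source anyway; this records the reason in the log.
    if [ ! -d "$src" ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') source not found: $src" >> "$log"
        return 1
    fi

    # -a preserves metadata, --delete mirrors removals, --log-file appends a change log.
    rsync -a --delete --log-file="$log" "$src" "$dest"
}

# On Windows, a daily midnight run could be registered with Task Scheduler, e.g.
# (the task name and the command are placeholders, not our actual values):
#   schtasks /Create /SC DAILY /ST 00:00 /TN "InstrumentSync" /TR "<command that runs this script in Git Bash>"
```

Keeping the command in one place means a change to the options only has to be made once.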
UT Austin has “unlimited” cloud storage with Box. We just needed to set up a group Box account. All group computers are logged into this account and the group data folder is shared with the personal university Box accounts of all group members.
A side benefit is that we’ve essentially eliminated the need for USB data transfers, and all data is available on group members’ computers instantly. For the user facilities around campus, we don’t have a good solution other than relying on individuals to upload experimental data to Box immediately, before accessing it on their personal computers. In the lab, group members copy data from Box Drive into their notebooks in the normal course of documenting experiments. Our lab notebooks are version controlled and managed using GitLab.
Group members typically access all of their instrument data via Box. Essentially everyone in the group has read and write access to the shared Box files. These open access rights come with the risk of someone accidentally deleting or overwriting important files. For this reason, we also have a network-attached storage (NAS) system that archives every file that appears in the group Box account. We set this up in collaboration with our IT department, which had already deployed similar solutions elsewhere on campus. That said, everything was done using basic settings available on the Synology drive and Box.
While we do our best to maintain good data management practices, our current practices are certainly limited by our scale, resources, and experience. Best practices are better documented elsewhere. The Turing Way is one of the best practical guides I’ve found for implementing procedures for reproducible research and the management of digital data.