Synchronizing and backing up our instrument data

Date: 2022-08-14

By: Mike Aubrey

This month we got the last of our major instruments in place with the arrival of the UV-vis-NIR and FT-IR. This brings our in-house instrument count to six. To keep the data from these instruments organized, accessible, and secure, we are developing a data management plan.

What we want from our data backup plan

Redundancy: Most importantly, we want to store all instrument data in triplicate: (1) the source file saved by the instrument, (2) a group-accessible cloud copy, and (3) an archived copy with limited access that permanently retains outdated and deleted files.

Stability: We want to use only well-established technologies known for their reliability and widespread adoption. We don’t have the resources to have someone tinkering with data backups all the time. Once in place, the system needs to run for long periods without user intervention.

Accessibility: All data collected in the lab should be accessible to lab members from their own computers. Syncing and transferring files should be trivial. The barrier to including and documenting data in a lab notebook should be as close to zero as possible.

Our solution

[Figure: data flow diagram. On each instrument computer (Miniflex, Bio-Logic SP200, Bio-Logic VSP3e, Microscope, Spectrometers), rsync copies data into a local Box Drive folder, which syncs to the group Box account. A Synology backup archives the group Box account to a remote NAS. Researchers (Student 1, Student 2, Student 3) each keep a notebook, hosted on Gitlab.com.]

We set up rsync on each instrument computer to synchronize the instrument’s default save folder with an exact copy in the local Box Drive folder. Rsync has been a standard file synchronization tool for decades, is free, and continues to be used at large scale. While rsync is native to Linux and built into macOS, it needs a Unix-like environment to run on Windows. We already use Git Bash in the lab, so we installed rsync to run within Git Bash. Other options, such as Windows Subsystem for Linux (WSL) or Cygwin, would also work. Here is a walkthrough for installing rsync in Git Bash.

Our standard rsync configuration is given below. The important options are -a, which copies recursively and preserves file metadata from the source, and --delete, which ensures the destination’s file structure is an exact replica of the source. Any files that users add to the destination folder are deleted at the next sync, and any files they change there are overwritten with the source version. The other options pertain to generating a change log of each sync event.

rsync -aP path/to/source path/to/dest \
--info=COPY2,DEL2,NAME2,BACKUP2,REMOVE2,SKIP2 \
--delete \
--log-file=path/to/log.log
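
Because --delete removes files from the destination, it can be worth previewing a sync before running it for real. The following is a minimal sketch, not our exact script: it wraps the command above in a shell function with an optional --dry-run mode, in which rsync only reports what it would copy or delete. The names sync_instrument and RSYNC_BIN are hypothetical and chosen for illustration.

```shell
# Sketch only: sync_instrument and RSYNC_BIN are hypothetical names.
# RSYNC_BIN lets you point at a specific rsync binary; it defaults to "rsync".
RSYNC_BIN="${RSYNC_BIN:-rsync}"

# Usage: sync_instrument SOURCE DEST LOGFILE [--dry-run]
# Note: a trailing slash on SOURCE copies its contents rather than the folder itself.
sync_instrument() {
    src=$1
    dest=$2
    log=$3
    mode=${4:-}

    # Same options as the standard configuration above.
    set -- -aP \
        --info=COPY2,DEL2,NAME2,BACKUP2,REMOVE2,SKIP2 \
        --delete \
        --log-file="$log"

    # In dry-run mode rsync lists its actions without changing any files.
    if [ "$mode" = "--dry-run" ]; then
        set -- "$@" --dry-run
    fi

    "$RSYNC_BIN" "$@" "$src" "$dest"
}

# Example (paths are placeholders):
#   sync_instrument path/to/source path/to/dest path/to/log.log --dry-run
```

Running the function once with --dry-run and reading the output is a cheap check that the source and destination paths are the right way around before any files are deleted.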

Automatic backups

After setting up rsync, we create a desktop shortcut that runs rsync when clicked. Each computer is also scheduled to run the rsync script every day at midnight using Windows Task Scheduler.
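
A scheduled task like this can be created through the Task Scheduler interface or from the command line. As a rough sketch of the command-line route from Git Bash, the function below calls the Windows schtasks tool to register a daily task at 00:00. The function name, the SCHTASKS_BIN variable, the task name, and the paths in the example are all hypothetical, and the quoting of the task command may need adjusting on a given machine.

```shell
# Sketch only: register_nightly_sync and SCHTASKS_BIN are hypothetical names.
SCHTASKS_BIN="${SCHTASKS_BIN:-schtasks}"

# Usage: register_nightly_sync TASK_NAME TASK_COMMAND
register_nightly_sync() {
    task_name=$1   # name shown in Task Scheduler
    task_cmd=$2    # full command line the task should run

    # MSYS_NO_PATHCONV=1 stops Git Bash from rewriting switches such as
    # /Create into Windows file paths before schtasks sees them.
    MSYS_NO_PATHCONV=1 "$SCHTASKS_BIN" /Create \
        /TN "$task_name" \
        /TR "$task_cmd" \
        /SC DAILY \
        /ST 00:00
}

# Example (install location and script path are assumptions):
#   register_nightly_sync "InstrumentSync" \
#       '"C:\Program Files\Git\bin\bash.exe" /c/path/to/sync_script.sh'
```

Here /SC DAILY sets the daily schedule and /ST 00:00 sets the start time to midnight in 24-hour format.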

University Box account

UT Austin has “unlimited” cloud storage with Box. We just needed to set up a group Box account. All group computers are logged into this account and the group data folder is shared with the personal university Box accounts of all group members.

A side benefit is that we’ve essentially eliminated the need for USB data transfers, and all data is available on group members’ computers as soon as Box finishes syncing. We don’t have a good solution for the user facilities around campus, other than relying on individuals to upload experimental data to Box immediately, before accessing it on their personal computers. In the lab, group members copy data from Box Drive into their notebooks in the normal course of documenting experiments. Our lab notebooks are version controlled and managed using GitLab.

Archival copy

Group members typically access all of their instrument data via Box. Essentially everyone in the group has read and write access to the shared Box files. These open access rights come with the risk of someone accidentally deleting or overwriting important files. For this reason we also have a network-attached storage (NAS) system that archives every file that appears in the group Box account. We set this up in collaboration with our IT department, which had already deployed similar solutions elsewhere on campus. However, this was all done using basic settings available on the Synology drive and Box.

Better resources for data management in research

While we do our best to maintain good data management practices, our current approach is certainly limited by our scale, resources, and experience. Best practices are better documented elsewhere. The Turing Way is one of the best practical guides I’ve found for implementing procedures for reproducible research and the management of digital data.