This lesson is still being designed and assembled (Pre-Alpha version)

Reproducible Research Things

Documentation

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is documentation?

Objectives
  • Describe what needs to be documented.

Documentation

Documentation means writing down the procedures for your experiment so that an outsider could understand the workings of your lab. This can include where your results and working data are saved.

Bus Factor

Do you have a new staff member joining your team? They are a prime candidate to collate information and write documentation, as it will help them become familiar with the team and learn how the lab/team works.

Note: Ideally you want to document anything that a lab member coming on board would need to know. Documentation is all about changing your Bus Factor: how many people on a project would need to be hit by a bus for the project to fail. Many projects have a bus factor of one. Adding documentation means that when someone goes on leave, needs to take leave suddenly or finishes their study, their work is preserved for your lab.

Documentation helps with reproducible science

Documentation will also be important for any audits in your lab or if someone would like to reproduce your research.

“Documentation is a love letter to your future self” - Damian Conway

How do we start? - Beginners

Read this first: How to start Documenting and more by CESSDA ERIC. Start with documenting in a text file or document - any start is a good start. Have this document automatically synced to the cloud with your data or keep this in a shared place that your organisation supports and recommends.

How do we start? - Intermediate

Once you have the basics in place, go into detail on how your workflow goes from your raw data to the finished results. This can be anything from a downloaded function list from SPSS/Virtual Lab to the code used to create it.

How do we start? - Advanced

Now that you’ve got a good head start, it’s time to learn about Git repositories and wikis.

External Resources

Key Points

  • Documentation means writing down the procedures for your experiment so that an outsider could understand how to reproduce it. This can include where your results and working data are saved.


Naming conventions

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is a File Naming Convention?

  • What is a File Name?

  • What are the benefits of using a file naming convention?

Objectives
  • First learning objective. (FIXME)

What is a File Naming Convention?

A File Naming Convention (FNC) is a framework, or protocol, for naming your files in a way that describes what the files contain and, importantly, how they relate to other files. It is essential to agree on an FNC before you start collecting data.

What is a File Name?

File names are the names that are listed in the file directory and that team members give to new files when they are saved for the first time.

What are the benefits of using a file naming convention?

Naming files consistently, logically and predictably protects against disorganised files and misplaced or lost data. It can also prevent backlogs or project delays. A file naming convention will ensure files are:

Checklist

The University of Edinburgh has a comprehensive and easy to follow list (with examples and explanations) of 13 Rules for file naming conventions

Coming up with a plan for your team on how to name files.

Former PhD student and subsequent founder of the Figshare platform, Mark Hahnel, typified a common challenge: ‘During my PhD I was never good at managing my research data. I had so many different file names for my data that I always struggled to find the correct file quickly and easily when it was requested. My former PI was so horrified upon seeing the state of my data organisation that she held an emergency lab book meeting with the rest of my group when I was leaving’. - Research Information, April/May 2014

Your research team should agree on the following elements of a file name prior to data collection:

As previously suggested, consistent and meaningful naming of files and folders can make everyone’s life easier. See this example below:

YYYYMMDD_SiteA_SensorB.CSV (Date, Location, Sensor)

When applied, this would look like:

20150621_Yaouk_Humidity.CSV

Some characters have special meaning to the operating system, so avoid using them when you are naming files. These characters include the following: / \ " ' * ; - ? [ ] ( ) ~ ! $ { } < > # @ & space tab newline
Source: https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/com.ibm.aix.osdevice/filename_conv.htm
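The convention above can be sketched in a few lines of Python. This is a minimal illustration, not a required tool: the function names `build_file_name` and `find_unsafe_characters` are hypothetical, and the character set is simply copied from the list above, so adjust both to whatever your team agrees on.

```python
from datetime import date

# Characters listed above as having special meaning to the operating system.
UNSAFE_CHARACTERS = set("/\\\"'*;-?[]()~!${}<>#@& \t\n")


def build_file_name(collected_on: date, site: str, sensor: str, extension: str = "CSV") -> str:
    """Assemble a name in the form YYYYMMDD_Site_Sensor.EXT."""
    return f"{collected_on:%Y%m%d}_{site}_{sensor}.{extension}"


def find_unsafe_characters(file_name: str) -> list[str]:
    """Return the characters in file_name that appear in the avoid list, sorted."""
    return sorted(set(file_name) & UNSAFE_CHARACTERS)


name = build_file_name(date(2015, 6, 21), "Yaouk", "Humidity")
print(name)  # 20150621_Yaouk_Humidity.CSV
print(find_unsafe_characters("my data (final).csv"))  # [' ', '(', ')']
```

Running a check like `find_unsafe_characters` over existing folders is one way to see how far current file names are from the agreed convention before renaming anything.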

Naming conventions - Beginner

Let’s look at some naming conventions for your data files and documents. Dates are best stored in YYYY-MM-DD format. Try to avoid spaces in your file names.

Intermediate

Make sure you follow the 13 Rules for file naming conventions

Naming conventions - Advanced

Do you have a policy in your team around naming conventions? If not, this is a great way of getting everyone on the same page.

Internal Resources

External Resources

Key Points

  • A File Naming Convention is a framework for naming your files in a way that describes what files contain and how they relate to other files.


Folder structure

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Why is a folder structure helpful?

Objectives
  • Describe what needs to be documented.

Folder structure

Having a standard folder structure can keep your files neat and tidy and save you time looking for data. It also helps when you share files with colleagues, because everyone has a standard place to put working data and documentation.

Like files, folders can also follow a naming convention. By prefixing folder names with numbers, you can force them to be listed in the order of the steps in your workflow. Probably the simplest way to document your structure - for your future reference - is to add a “README” file - a text file outlining the contents of the folder.

A folder structure might look like this: [image: folder structure]
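As a rough sketch of number-prefixed folders with a README, the Python below creates a project skeleton in a temporary directory. The folder names and the function `create_project_skeleton` are hypothetical examples, not a recommended standard; substitute the steps of your own workflow.

```python
from pathlib import Path
import tempfile

# Hypothetical workflow steps; the numeric prefixes keep folders in workflow order.
FOLDERS = ["01_raw_data", "02_working_data", "03_analysis", "04_results", "05_documentation"]


def create_project_skeleton(root: Path, project_name: str) -> Path:
    """Create the numbered folders and a README.txt listing each one."""
    project = root / project_name
    for folder in FOLDERS:
        (project / folder).mkdir(parents=True, exist_ok=True)
    readme_lines = [f"Project: {project_name}", "", "Folders:"]
    readme_lines += [f"  {folder}" for folder in FOLDERS]
    (project / "README.txt").write_text("\n".join(readme_lines) + "\n", encoding="utf-8")
    return project


with tempfile.TemporaryDirectory() as tmp:
    project = create_project_skeleton(Path(tmp), "example_project")
    created = sorted(p.name for p in project.iterdir())
    print(created)
```

A script like this can double as the team template: everyone who runs it starts from the same layout.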

How to develop a folder structure

To develop a logical structure for your team, you need to consider the following points:

Beginner

Pick a dataset and illustrate how you currently organise your files. (For the artists: Draw a picture that describes your current approach to file organisation)
See if you can devise a better naming convention or note one or two improvements you could make to how you name your files

There are some really good folder templates around. Here’s one you are welcome to download and use: URL. Or, if you prefer, you could try another from http://nikola.me/

Advanced

Come up with a policy for your group for folder structures. You could create a template and put it somewhere your group can download it to get started.

External Resources

Key Points

  • Having a standard folder structure can keep your files neat and tidy and save you time looking for data. It also helps when you share files with colleagues, because everyone has a standard place to put working data and documentation.


Automation

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How can you automate any repetitive tasks?

Objectives
  • First learning objective. (FIXME)

Tasks that a human has to do over and over again are opportunities for human error to sneak in. Setting up an automated way of doing them can reduce this risk. Anything from an Excel formula or macro to code written in a data science framework can help.

Ways you can automate things:

Beginner

Let’s think about the repetitive tasks that you could automate: do you always rename files the same way? Do you manually copy files across?

Advanced

Could you code up your work so it’s completely automated?
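Renaming files the same way every time is a typical candidate for automation. The sketch below, with the hypothetical helpers `plan_renames` and `apply_renames`, replaces spaces in file names with underscores. It builds the list of changes first so you can review it before anything is renamed, and it works on throwaway files in a temporary directory.

```python
from pathlib import Path
import tempfile


def plan_renames(folder: Path) -> dict[str, str]:
    """Map each file name containing spaces to a name with underscores instead."""
    return {
        path.name: path.name.replace(" ", "_")
        for path in sorted(folder.iterdir())
        if path.is_file() and " " in path.name
    }


def apply_renames(folder: Path, plan: dict[str, str]) -> None:
    """Rename files according to the plan, refusing to overwrite existing files."""
    for old_name, new_name in plan.items():
        target = folder / new_name
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        (folder / old_name).rename(target)


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "site a humidity.csv").write_text("placeholder\n")
    (folder / "notes.txt").write_text("placeholder\n")
    plan = plan_renames(folder)
    apply_renames(folder, plan)
    renamed = sorted(p.name for p in folder.iterdir())
    print(plan)
    print(renamed)
```

Separating the plan from the action is a deliberate choice: printing the plan gives you a record of what changed, which can go straight into your documentation.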

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Versioning

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is a version control system?

  • Are you keeping track and logs of your analysis?

Objectives
  • First learning objective. (FIXME)

Version control system

A version control system allows you to keep track of changes in your data or process.

Are you keeping track of any versions or logs made by the software in use?

Make sure you have a copy of every step you have completed and if possible, version numbers for the program you are using and any libraries. Programs change over time and this can alter your results if someone asks to replicate your work post publication.

Never make alterations to your raw data files

Instead, make a copy of the raw data files and keep them somewhere safe (like Research Vault). That way, if you need to redo your work or you find an error earlier in your workflow, you have an original baseline to start from.
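A checksum is one way to confirm that a working copy started out identical to the raw file, and that the raw file has not changed since. The sketch below uses SHA-256 from Python's standard library; the file names and values are made up for illustration, and the helper `sha256_of` is hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_data.csv"
    raw.write_text("site,humidity\nYaouk,61\n")  # made-up example values
    working = Path(tmp) / "working_copy.csv"
    shutil.copy2(raw, working)
    copies_match = sha256_of(raw) == sha256_of(working)
    # Editing the working copy changes its checksum; the raw file is untouched.
    working.write_text("site,humidity\nYaouk,62\n")
    still_match = sha256_of(raw) == sha256_of(working)
    print(copies_match, still_match)  # True False
```

Recording the checksum of each raw file in your documentation lets you verify later that the baseline you return to is the one you started from.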

Write down versions of analysis software

Write down the versions of analysis software (like SPSS or NVivo) AND hardware (such as MRI machines). Your documentation is a great place for this, but even your lab notebook will work.

Random Number Generator

If you are using random numbers in your research, save the seed you give your random number generator as part of your working data. This way, you can later reproduce your results.
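A minimal sketch of this in Python, assuming the standard library `random` module: the seed value and the helper `draw_sample` are hypothetical, and the point is only that the same seed gives the same draw.

```python
import json
import random

SEED = 20150621  # hypothetical seed; any integer works as long as it is recorded


def draw_sample(seed: int, population: list[int], k: int) -> list[int]:
    """Draw k items using a generator seeded with the given value."""
    rng = random.Random(seed)
    return rng.sample(population, k)


population = list(range(100))
first_run = draw_sample(SEED, population, 5)
second_run = draw_sample(SEED, population, 5)

# Store the seed alongside the working data so the draw can be repeated later.
record = json.dumps({"random_seed": SEED, "sample_size": 5})
print(first_run == second_run)  # True
print(record)
```

The exact numbers drawn from a given seed can also depend on the software and its version, which is another reason to record version numbers next to the seed.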

Beginner

Copy your raw data to a cloud storage solution such as Research Vault for safe keeping.

Intermediate

If you are using a workflow program (Galaxy, KNIME, or a virtual lab like EcoCloud or the TINKER Humanities, Arts and Social Science Virtual Lab), you can copy your workflow and save it as part of your documentation. Write the date that you ran the workflow if versions of the software are not available.

Advanced 1

If you are writing scripts (R/Python/Matlab etc), use Git.

Note: Griffith has a GitLab instance you can use for private repositories. Also record the version of R/Python/Matlab, the operating system you are using and the version numbers of any libraries you are using.
If you are using the HPC, also record the version of any modules you used there.
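For Python scripts, the versions mentioned in the note can be collected automatically. The sketch below is one possible approach using only the standard library; the function `describe_environment` is hypothetical, and "numpy" and "pandas" are placeholders for whichever libraries your scripts actually import.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages: list[str]) -> dict:
    """Collect Python, operating system and package versions into a dictionary."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "operating_system": platform.platform(),
        "packages": versions,
    }


# "numpy" and "pandas" are placeholders; list the libraries your scripts import.
environment = describe_environment(["numpy", "pandas"])
print(json.dumps(environment, indent=2))
```

Saving this output next to each set of results, and committing it to Git with the script, keeps the version record tied to the analysis it describes.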

Advanced 2

If you’ve heard of Docker or Singularity and you are interested, come and talk to Hacky Hour/eResearch Services.

External Resources

Key Points

  • A version control system allows you to keep track of changes in your data or process.


Cloud Storage of your Data

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • First learning objective. (FIXME)

Keep a copy of your data on the cloud

Keeping a copy of all your data (working, raw and completed) in the cloud is incredibly important. It ensures that if your computer fails, you accidentally delete your data or your data is corrupted, your research can be restored.

Griffith has three different types of cloud storage made especially for research

Research Drive

This would be a good place for your day-to-day working files. It is unlimited and you can share it with people at Griffith (but not externally). This works the same as G drive.

Research Space

This has a ‘sync’ client that automatically copies your files from your computer to the cloud, just like Dropbox or Google Drive. You can use this to share with people external to the university. You can add them with a LinkedIn profile, Griffith, other university or Gmail account, or you can share with a URL, password and expiry date. This is also unlimited storage: you are given 5GB initially, and to add an unlimited folder, just click ‘Add more storage’.

Research Vault

For your long term backups. Perfect place to store a safe copy of your raw data or the research of your PhD student who has completed and is leaving the institute.

Not sure which one is best? Click here

Beginner

Get your data into Research Storage - If you need help picking one, talk to the library or eResearch Support

Advanced

Build a policy for your team or group on where things are stored. Make sure the location of your data is recorded in your documentation.

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Computer Security

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • First learning objective. (FIXME)

Security

Ensuring that your computer and network are secured means that you have a far lower chance of a data breach or hack.

Beginner

Have good strong passwords and encrypt your computer’s hard drive

Intermediate

Get set up on a password manager

Advanced

Let’s ensure your lab/office is encrypted and practicing safe habits. Note: The boss’s computer is usually the most insecure.

  • Encrypt your computer.

  • Use strong passwords.

  • Use Multi-Factor Authentication when the option is available (for example, signing in with a password plus a PIN emailed to your account).

  • Avoid unsecured wifi. If it’s available, Eduroam is usually a better option than free wifi/cafe wifi.

  • Use a VPN whenever you’re not at work.

  • Keep your OS and products up to date (especially your web browser).

Griffith provides Symantec anti-virus FREE for Griffith staff and students https://intranet.secure.griffith.edu.au/computing/software/self-help-and-support/software-download-service4

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Separating identifying variables from your data

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is sensitive data?

  • How can we make data non-sensitive and still useful?

Objectives
  • First learning objective. (FIXME)

Sensitive data are data that can be used to identify an individual, species, object, or location in a way that introduces a risk of discrimination, harm, or unwanted attention. Major, familiar categories of sensitive data are:

  • personal data
  • health and medical data
  • ecological data that may place vulnerable species at risk

Separating or de-identifying your data is generally done to protect an individual’s privacy. According to the Australian Privacy Act 1988, “personal information is de-identified if the information is no longer about an identifiable individual or an individual who is reasonably identifiable”. De-identified information is no longer considered personal information and can be shared. More information on the Commonwealth Privacy Act can be located at https://www.legislation.gov.au/Details/C2016C00979

De-identifying aims to allow data to be used by others for publishing, sharing and reuse without the possibility of individuals or locations being re-identified. It may also be used to protect the location of archaeological findings, cultural data or the location of endangered species.

Any identifiers (name, date of birth, address, geospatial locations etc.) should be removed from the main data set and replaced with a code/key. The code/key is then preferably encrypted and stored separately. Storing de-identified data in a secure solution helps you meet safety, controlled access, ethical, privacy and funding agency requirements.

Re-identifying an individual is possible by recombining the de-identified data set with the identifiers.
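The separation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a complete de-identification procedure: the records are made up, the function `deidentify` and the field name `participant_code` are hypothetical, and encrypting the key table is not shown.

```python
import secrets


def deidentify(records: list[dict], identifier_fields: list[str]) -> tuple[list[dict], dict[str, dict]]:
    """Split records into de-identified rows and a separate code-to-identifiers key."""
    deidentified = []
    key_table = {}
    for record in records:
        code = secrets.token_hex(8)  # random code, not derived from the identifiers
        while code in key_table:
            code = secrets.token_hex(8)
        key_table[code] = {field: record[field] for field in identifier_fields}
        row = {k: v for k, v in record.items() if k not in identifier_fields}
        row["participant_code"] = code
        deidentified.append(row)
    return deidentified, key_table


# Made-up example records.
records = [
    {"name": "Alex Example", "date_of_birth": "1980-01-01", "score": 12},
    {"name": "Sam Sample", "date_of_birth": "1975-05-05", "score": 17},
]
data, key = deidentify(records, ["name", "date_of_birth"])
print(data)
```

Here `data` would go into the shared data set, while `key` is stored separately and securely. Removing direct identifiers does not by itself guarantee that nobody is "reasonably identifiable", so check the remaining variables against the guidance for your data.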

Australian practical guidance for De-identification (ARDC)

The Australian Research Data Commons (ARDC), formerly known as the Australian National Data Service (ANDS), released a fabulous guide on de-identification. The guide is intended for researchers who own a data set and wish to share it safely with fellow researchers or publish it. The guide can be located here: https://www.ands.org.au/working-with-data/sensitive-data/de-identifying-data

Here are examples of practical guidelines available nationally

Tips for managing de-identification (ARDC)

Management of identifiable data (ARDC)

Data may often need to be identifiable (i.e. contains personal information) during the process of research, e.g. for analysis. If data is identifiable then ethical and privacy requirements can be met through access control and data security. This may take the form of:

Safely sharing sensitive data guide (ARDC)

Attribution:

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Identifiers

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • What is a DOI?

  • What is a PID?

Objectives
  • First learning objective. (FIXME)

Digital Object Identifier (DOI) and Persistent identifier (PiD)

Once you’ve completed your project, help make your research data discoverable, accessible and possibly reusable using a PiD such as a DOI!

A Digital Object Identifier (DOI) is a unique alphanumeric string assigned by a publisher, organisation or agency that identifies content and provides a PERSISTENT link to its location on the internet, whether the object is digital or physical. It might look something like this: http://dx.doi.org/10.4225/01/4F8E15A1B4D89. The DOI or the Identifier is listed at the bottom of this record from Griffith’s Research Data Repository.
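To make the structure of a DOI concrete, the sketch below pulls the DOI out of a link like the one above and rebuilds a resolver link from it. The regular expression is a simplified assumption (a "10." prefix, a slash, then a suffix), not an official validation rule, and the helper names `extract_doi` and `doi_to_url` are hypothetical.

```python
import re
from typing import Optional

# Simplified pattern: a "10." prefix with digits, a slash, then the suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string found in text, or None."""
    match = DOI_PATTERN.search(text)
    return match.group(0) if match else None


def doi_to_url(doi: str) -> str:
    """Build a link for a DOI using the doi.org resolver."""
    return f"https://doi.org/{doi}"


doi = extract_doi("http://dx.doi.org/10.4225/01/4F8E15A1B4D89")
print(doi)  # 10.4225/01/4F8E15A1B4D89
print(doi_to_url(doi))  # https://doi.org/10.4225/01/4F8E15A1B4D89
```

Because the pattern accepts any non-space characters in the suffix, a DOI followed directly by punctuation in running text would need extra cleaning.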

DOIs are a type of persistent identifier (PiD). An identifier is any label used to name something uniquely (whether digital or physical). URLs are an example of an identifier, as are serial numbers and personal names. A persistent identifier is guaranteed to be managed and kept up to date over a defined time period.

Journal publishers assign DOIs to electronic copies of individual articles. DOIs can also be assigned by organisations, research institutes or agencies, and are generally managed by the relevant organisation under its policies. DOIs not only uniquely identify research data collections, they also support citation and citation metrics.

Key messages:

Beginner

Ensure data you associate with a publication has a DOI- your library is the best group to talk to for this.

Intermediate

  • Learn more about how your DOI can potentially increase your citation rates by watching this 4m:51s video
  • Learn more about how your DOI can potentially increase your citation rate by reading the ANDS Data Citation Guide

Advanced

Learn more about PiDs and DOIs: https://www.ands.org.au/guides/persistent-identifiers-awareness

Internal Resources

External Resources

Key Points

  • A DOI is a Digital Object Identifier

  • A PiD is a Persistent identifier