Documentation
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is documentation?
Objectives
Describe what needs to be documented.
Documentation
Documentation means recording the procedures for your experiment so that an outsider could understand the workings of your lab. This can include where your results and working data are saved.
- If you have a lab notebook, copy it into a digital format and save it in a safe place (such as research storage).
- Make sure the copy is saved somewhere that’s accessible to your supervisor/team.
Bus Factor
Have you got a new staff member coming on board? They are a prime candidate to collate information and document it, as doing so will help them become familiar with the team and learn how the lab/team works.
Note: Ideally you want to document anything that a lab member coming on board would need to know. Documentation is all about raising your Bus Factor: how many people on a project would need to be hit by a bus to make the project fail. Many projects have a bus factor of one. With documentation in place, when someone goes on leave, has to take leave suddenly or finishes their study, their work is preserved for your lab.
Documentation helps with reproducible science
Documentation will also be important for any audits in your lab or if someone would like to reproduce your research.
“Documentation is a love letter to your future self” - Damian Conway
How do we start? - Beginners
Read this first: How to start Documenting and more by CESSDA ERIC. Start with documenting in a text file or document - any start is a good start. Have this document automatically synced to the cloud with your data or keep this in a shared place that your organisation supports and recommends.
How do we start? - Intermediate
Once you have the basics in place, describe in detail how your workflow goes from your raw data to the finished results. This can be anything from a downloaded function list from SPSS/Virtual Lab to the code used to create the results.
How do we start? - Advanced
Now that you’ve got a good head start, it’s time to learn about Git repositories and wikis.
External Resources
- British Ecology Reproducibility Book
- How to start Documenting and more by CESSDA ERIC
- Software Carpentry Git Workshop
Key Points
Documentation means recording the procedures for your experiment so that an outsider could understand how to reproduce it. This can include where your results and working data are saved.
Naming conventions
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is a File Naming Convention?
What is a File Name?
What are the benefits of using a file naming convention?
Objectives
First learning objective. (FIXME)
What is a File Naming Convention?
A File Naming Convention (FNC) is a framework, or protocol if you like, for naming your files in a way that describes what the files contain and, importantly, how they relate to other files. It is essential to establish an agreed FNC before you start collecting data.
What is a File Name?
File names are the names that are listed in the file directory and that team members give to new files when they are saved for the first time.
What are the benefits of using a file naming convention?
Naming files consistently, logically and in a predictable manner helps prevent disorganised files and misplaced or lost data. It could also prevent possible backlogs or project delays. A file naming convention will ensure files are:
- Easier to process - team members won’t have to overthink the file naming process
- Easier to access, retrieve and store
- Easier to browse, saving time and effort
- Harder to lose!
- Easier to keep under version control, because the naming is logical and known (See Version Control for more information)
- Easier to check for obsolete or duplicate records
Checklist
The University of Edinburgh has a comprehensive and easy to follow list (with examples and explanations) of 13 Rules for file naming conventions
Coming up with a plan for your team on how to name files.
Former PhD student and subsequent founder of the Figshare platform, Mark Hahnel, typified a common challenge: ‘During my PhD I was never good at managing my research data. I had so many different file names for my data that I always struggled to find the correct file quickly and easily when it was requested. My former PI was so horrified upon seeing the state of my data organisation that she held an emergency lab book meeting with the rest of my group when I was leaving.’ - Research Information, April/May 2014
Your research team should agree on the following elements of a file name prior to data collection:
- Vocabulary - choose a standard vocabulary for file names, so that everyone uses a common language
- Punctuation - decide when to use punctuation symbols, capitals and hyphens
- Dates - agree on a logical use of dates so that they display chronologically, e.g. YYYY-MM-DD
- Order - confirm which element should go first, so that files on the same theme are listed together and can therefore be found easily
- Numbers - specify the number of digits that will be used in numbering so that files are listed numerically, e.g. 01, 02 or 001, 002, etc.
As previously suggested, consistent and meaningful naming of files and folders can make everyone’s life easier. See this example below:
YYYYMMDD_SiteA_SensorB.CSV (Date, Location, Sensor)
Which, when applied, would look like this:
20150621_Yaouk_Humidity.CSV
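If your files are created by a script, the same pattern can be generated rather than typed. The following is a minimal sketch in Python, assuming the date, location and sensor order shown above; the function name `build_file_name` and its parameters are made up for illustration.

```python
from datetime import date


def build_file_name(collected: date, location: str, sensor: str, extension: str = "CSV") -> str:
    """Join the agreed elements in a fixed order: date, location, sensor."""
    # %Y%m%d gives a zero-padded YYYYMMDD stamp, so names sort chronologically.
    stamp = collected.strftime("%Y%m%d")
    return f"{stamp}_{location}_{sensor}.{extension}"


print(build_file_name(date(2015, 6, 21), "Yaouk", "Humidity"))
# prints 20150621_Yaouk_Humidity.CSV
```

Generating the name in one place means every team member gets the same order and date format without having to remember the convention.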
Some characters have special meaning to the operating system, so avoid them when naming files. These characters include the following: / \ " ' * ; - ? [ ] ( ) ~ ! $ { } < > # @ & | space tab newline (source: https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/com.ibm.aix.osdevice/filename_conv.htm)
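A quick way to check existing file names against this list is a small script. This is only a sketch: the name `find_special_characters` is hypothetical, and the character set is copied from the list above (with plain straight quotes), so adjust it to whatever your team agrees on.

```python
# Characters from the list above, plus whitespace (space, tab, newline).
SPECIAL_CHARACTERS = set("/\\\"'*;-?[]()~!${}<>#@&| \t\n")


def find_special_characters(file_name: str) -> list[str]:
    """Return the special characters found in file_name, in order of first appearance."""
    found = []
    for character in file_name:
        if character in SPECIAL_CHARACTERS and character not in found:
            found.append(character)
    return found


print(find_special_characters("20150621_Yaouk_Humidity.CSV"))  # prints []
print(find_special_characters("my data (final).csv"))  # prints [' ', '(', ')']
```

Note that the hyphen is on this list, so a date written as YYYY-MM-DD would be flagged; remove it from the set if your team decides hyphens are acceptable.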
Naming conventions - Beginner
Let’s look at some naming conventions for your data files and documents. Any dates are best stored as YYYY-MM-DD. Try to avoid spaces in your file names.
Intermediate
Make sure you follow the 13 Rules for file naming conventions
Naming conventions - Advanced
Do you have a policy in your team around naming conventions? If not, this is a great way of getting everyone on the same page.
Internal Resources
- Talk to your Research Support Services librarian.
External Resources
- Naming things by Jenny Bryan
- File naming and folder conventions by CESSDA ERIC
- The University of Edinburgh has a comprehensive yet easy to follow list (with examples and explanations) of 13 Rules for file naming conventions https://www.ed.ac.uk/records-management/guidance/records/practical-guidance/naming-conventions
- Australian National Data Service (ANDS). (2018). ANDS Guide: File wrangling
Key Points
A File Naming Convention is a framework for naming your files in a way that describes what files contain and how they relate to other files.
Folder structure
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Why is a folder structure helpful?
Objectives
Describe what needs to be documented.
Folder structure
Having a standard folder structure can keep your files neat and tidy and save you time looking for data. It also helps when you share files with colleagues, because everyone has a standard place to put working data and documentation.
Like files, folders can also follow a naming convention. By prefixing folder names with numbers, you can force them to be ordered by the steps in your workflow. Probably the simplest way to document your structure, for your future reference, is to add a “README” file: a text file outlining the contents of the folder.
A folder structure might look like this
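As a rough sketch of the numbered-prefix and README ideas, the Python below creates a small set of folders. The folder names and the `create_project_folders` function are invented for illustration only; they are not a recommended standard, so agree on your own with your team.

```python
from pathlib import Path

# Hypothetical folder names; the numeric prefixes keep them in workflow order.
FOLDERS = ["01_raw_data", "02_working_data", "03_analysis", "04_results", "05_documentation"]


def create_project_folders(project_root: Path) -> list[Path]:
    """Create numbered folders under project_root, each with a README.txt stub."""
    created = []
    for name in FOLDERS:
        folder = project_root / name
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.txt"
        if not readme.exists():  # never overwrite a README someone has filled in
            readme.write_text(f"Contents of {name}: (describe what belongs here)\n")
        created.append(folder)
    return created
```

Calling `create_project_folders(Path("my_project"))` would give every new project the same starting layout.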
How to develop a folder structure
To develop a logical structure for your team, you need to consider the following points:
- Check to make sure there are no pre-existing folder structure agreements
- Name folders appropriately and in a meaningful manner. Don’t use staff names; consider naming folders after the type of work instead
- Consistency - make sure you use the agreed structure/hierarchy
- Structure folders hierarchically - start with a limited number of folders for the broader topics, and then create more specific folders within these
- Separate ongoing and completed work - as you start to create lots of folders and files, it is a good idea to start thinking about separating your older documents from those you are currently working on
- Backup - ensure folders and files are backed up and retrievable in the event of a disaster. Griffith, like most universities, has safe storage solutions.
- Clean up folders and files post project.
Beginner
Pick a dataset and illustrate how you currently organise your files. (For the artists: Draw a picture that describes your current approach to file organisation)
See if you can devise a better naming convention or note one or two improvements you could make to how you name your files.
There are some really good folder templates around. Here’s one you are welcome to download and use URL. Or another you could try out if you prefer, from http://nikola.me/
Advanced
Come up with a policy for your group for folder structures. You could create a template and put it somewhere your group can download it to get them started.
External Resources
Key Points
Having a standard folder structure can keep your files neat and tidy and save you time looking for data. It also helps when you share files with colleagues, because everyone has a standard place to put working data and documentation.
Automation
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How can you automate any repetitive tasks?
Objectives
First learning objective. (FIXME)
Tasks that a human has to do over and over again are opportunities for human error to sneak in. Setting up an automated way of doing them can reduce this risk. Anything from an Excel formula or macro to coding in a data science framework can help.
Ways you can automate things:
- Spreadsheet Macros and formulas
- macOS - Automator
- Windows 10 - Task Scheduler
- Microsoft Flow or Google Apps Script
- Learning to code in Python or R - Talk to your local hacky hour or Software Carpentry people
Beginner
Let’s think about the repetitive tasks that you could automate. Do you always rename files the same way? Do you manually copy files across?
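Renaming files the same way every time is a typical task to hand over to a script. Here is a minimal sketch in Python, assuming you want to add the same prefix to every CSV file in one folder; `rename_with_prefix` and its parameters are hypothetical names, and you should try it on a copy of your files first.

```python
from pathlib import Path


def rename_with_prefix(folder: Path, prefix: str, pattern: str = "*.csv") -> list[Path]:
    """Add prefix to every file in folder matching pattern; return the new paths."""
    renamed = []
    # sorted() makes the order predictable; glob() alone gives no ordering guarantee.
    for old_path in sorted(folder.glob(pattern)):
        if old_path.name.startswith(prefix):
            continue  # already renamed, so running the script twice is safe
        new_path = old_path.with_name(prefix + old_path.name)
        if new_path.exists():
            raise FileExistsError(f"refusing to overwrite {new_path}")
        old_path.rename(new_path)
        renamed.append(new_path)
    return renamed
```

For example, `rename_with_prefix(Path("working_data"), "20150621_")` would turn `humidity.csv` into `20150621_humidity.csv` and leave files of other types untouched.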
Advanced
Could you code up your work so it’s completely automated?
Key Points
First key point. Brief Answer to questions. (FIXME)
Versioning
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is a version control system?
Are you keeping track and logs of your analysis?
Objectives
First learning objective. (FIXME)
Version control system
A version control system lets you keep track of changes to your data or process.
Are you keeping track of any versions or logs made by the software in use?
Make sure you have a copy of every step you have completed and, if possible, the version numbers of the program you are using and any libraries. Programs change over time, and this can alter your results if someone asks to replicate your work after publication.
Never make alterations to your raw data files
Instead, make a copy of the raw data files and keep them somewhere safe (like Research Vault). That way, if you need to redo your work or you find an error earlier in your workflow, you have an original baseline to start from.
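One way to confirm later that your safe copy still matches the original raw file is to compare checksums. This is a sketch only, using SHA-256 from Python's standard library; the function name `sha256_of_file` is made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the checksum of the raw file and the checksum of the copy are equal, the contents are the same; recording the checksum in your documentation gives you a baseline to check against in the future.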
Write down versions of analysis software
Write down the versions of analysis software (like SPSS or NVivo etc) AND hardware (MRI machines etc). Your documentation is a great place for this, but even your lab notebook will work.
Random Number Generator
If you are using random numbers in your research, save the seed for your random number generator as part of your working data. This way, you can later reproduce your results.
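As an illustration, here is a minimal Python sketch that writes the seed to a file before drawing any random numbers. The function name `run_with_recorded_seed`, the JSON file and the three example draws are assumptions made for this example; other tools such as R or SPSS have their own ways of setting a seed.

```python
import json
import random
from pathlib import Path


def run_with_recorded_seed(seed: int, record_file: Path) -> list[float]:
    """Record the seed in record_file, then draw three example random numbers."""
    record_file.write_text(json.dumps({"random_seed": seed}))
    rng = random.Random(seed)  # a dedicated generator, independent of global state
    return [rng.random() for _ in range(3)]
```

Running the function again with the seed read back from the file gives the same three numbers, which is what makes the analysis reproducible.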
Beginner
Copy your raw data to a cloud storage solution such as Research Vault for safe keeping.
Intermediate
If you are using a workflow program (Galaxy, KNIME, or a virtual lab like EcoCloud or the TINKER Humanities, Arts and Social Science Virtual Lab), you can copy your workflow and save it as part of your documentation. Write down the date that you ran the workflow if versions of the software are not available.
Advanced 1
If you are writing scripts (R/Python/Matlab etc), use Git.
Note: Griffith has a GitLab instance you can use for private repositories. Also record the version of R/Python/Matlab, the operating system you are using and the version numbers of any libraries you are using.
If you are using the HPC, also record the version of any modules you used there.
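For Python scripts, the version details mentioned above can be collected by the script itself. The sketch below uses only the standard library; `describe_environment` is a hypothetical name, and `numpy` and `pandas` are just example package names, so substitute the libraries you actually use.

```python
import platform
from importlib import metadata


def describe_environment(packages: list[str]) -> dict[str, str]:
    """Collect the Python version, operating system and package versions."""
    report = {
        "python": platform.python_version(),
        "operating_system": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in describe_environment(["numpy", "pandas"]).items():
    print(f"{key}: {value}")
```

Saving this output next to your results each time you run an analysis keeps the version record in step with the work.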
Advanced 2
If you’ve heard of Docker or Singularity and you are interested, come talk to hacky hour/eResearch Services
External Resources
- Reproducible research in Git
- What is git
- Learn Software Carpentry in Git
- Git for Scientists
- The Turing Way Version Control
Key Points
A version control system lets you keep track of changes to your data or process.
Cloud Storage of your Data
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Keep a copy of your data on the cloud
Keeping a copy of all your data (working, raw and completed) in the cloud is incredibly important. This ensures that if you have a computer failure, accidentally delete your data or your data is corrupted, your research is restorable.
Griffith has three different types of cloud storage made especially for research
Research Drive
This would be a good place for your day-to-day working files. It is unlimited and you can share it with people at Griffith (but not externally). This works the same as G drive.
Research Space
This has a ‘sync’ client that automatically copies your files from your computer to the cloud, just like Dropbox or Google Drive. You can use this to share with people external to the university. You can add them with a LinkedIn profile, a Griffith, other university or Gmail account, or you can share with a URL, password and expiry date. This is also unlimited storage: you are given 5GB initially, and to add an unlimited folder, just click ‘Add more storage’.
Research Vault
For your long term backups. Perfect place to store a safe copy of your raw data or the research of your PhD student who has completed and is leaving the institute.
Not sure which one is best? Click here
Beginner
Get your data into Research Storage - If you need help picking one, talk to the library or eResearch Support
Advanced
Build a policy for your team or group on where things are stored. Make sure the location of your data is saved in your documentation
Key Points
First key point. Brief Answer to questions. (FIXME)
Computer Security
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
Security
Ensuring that your computer and network are secured means that you have a far lower chance of a data breach or hack.
Beginner
Have good strong passwords and encrypt your computer’s hard drive
Intermediate
Get set up on a password manager
Advanced
Ensure every computer in your lab/office is encrypted and that everyone is practising safe habits.
Note: The boss’s computer is usually the most insecure.
Encrypt your computer
- Encryption: https://www.griffith.edu.au/about-griffith/cybersecurity/data-protection
- Win 10 Encryption: https://www.windowscentral.com/how-use-bitlocker-encryption-windows-10
- Win 7 Encryption: https://www.microsoft.com/en-au/download/details.aspx?id=4794 (Call 55555 first and ask their advice, as they can help you install this. It doesn’t look as simple as Win 10)
- macOS Encryption: https://support.apple.com/en-au/HT204837
Strong passwords
- https://www.griffith.edu.au/passwords/password-management
- Video: https://youtu.be/PjHc8g8G9MU
- Find out if your email has been compromised https://haveibeenpwned.com/
- Use a password manager such as https://www.lastpass.com/business-password-manager
Use Multi-Factor Authentication when the option is available (for example, signing in with a password plus a PIN emailed to your account)
Avoid unsecured wifi - if it’s available, Eduroam is usually a better option than free wifi/cafe wifi
Use a VPN whenever you’re not at work
- https://intranet.secure.griffith.edu.au/computing/remote-access/accessing-resources/virtual-private-network (55 555 can help you out too)
Keep your OS and software up to date (especially your web browser)
- You can use Qualys BrowserCheck to confirm your browser is configured securely
- https://www.griffith.edu.au/about-griffith/cybersecurity/cybersecurity-at-home
Griffith provides Symantec anti-virus FREE for Griffith staff and students https://intranet.secure.griffith.edu.au/computing/software/self-help-and-support/software-download-service4
Key Points
First key point. Brief Answer to questions. (FIXME)
Separating identifying variables from your data
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is sensitive data?
How can we make data non-sensitive and still useful?
Objectives
First learning objective. (FIXME)
Sensitive data are data that can be used to identify an individual, species, object, or location in a way that introduces a risk of discrimination, harm, or unwanted attention. Major, familiar categories of sensitive data are:
- personal data
- health and medical data
- ecological data that may place vulnerable species at risk
Separating or de-identifying your data generally occurs to protect an individual’s privacy. According to the Australian Privacy Act 1988, “personal information is de-identified if the information is no longer about an identifiable individual or an individual who is reasonably identifiable”. De-identified information is no longer considered personal information and can be shared. More information on the Commonwealth Privacy Act can be found at https://www.legislation.gov.au/Details/C2016C00979
De-identifying aims to allow data to be used by others for publishing, sharing and reuse without the possibility of individuals or locations being re-identified. It may also be used to protect the location of archaeological findings, cultural data or the location of endangered species.
Any identifiers (name, date of birth, address or geospatial locations etc) should be removed from the main data set and replaced with a code/key. The code/key is then preferably encrypted and stored separately. Storing de-identified data in a secure solution helps you meet safety, access control, ethical, privacy and funding agency requirements.
Re-identifying an individual is possible by recombining the de-identified data set and the identifiers.
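The mechanics of replacing identifiers with a code/key can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the column names in `IDENTIFYING_FIELDS`, the `participant_code` field and both function names. The sketch also leaves out encrypting the key table, and removing direct identifiers is not by itself a guarantee that data are de-identified, so follow the guidance below for your own data.

```python
import secrets

# Hypothetical column names; use whichever fields identify people in your data.
IDENTIFYING_FIELDS = ("name", "date_of_birth", "address")


def split_identifiers(records):
    """Return (de-identified records, key table mapping each code to its identifiers)."""
    deidentified = []
    key_table = {}
    for record in records:
        code = secrets.token_hex(8)  # random code, not derived from the person
        while code in key_table:  # extremely unlikely, but keep codes unique
            code = secrets.token_hex(8)
        key_table[code] = {f: record[f] for f in IDENTIFYING_FIELDS if f in record}
        remaining = {f: v for f, v in record.items() if f not in IDENTIFYING_FIELDS}
        deidentified.append({"participant_code": code, **remaining})
    return deidentified, key_table


def reidentify(record, key_table):
    """Recombine one de-identified record with its identifiers using the key table."""
    remaining = {f: v for f, v in record.items() if f != "participant_code"}
    return {**key_table[record["participant_code"]], **remaining}
```

The de-identified records and the key table would then be saved to separate locations, so that only people with access to both can recombine them.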
Australian practical guidance for De-identification (ARDC)
The Australian Research Data Commons (ARDC), formerly known as the Australian National Data Service (ANDS), released a fabulous guide on de-identification. The guide is intended for researchers who own a data set and wish to share it safely with fellow researchers or publish it. The guide can be found here: https://www.ands.org.au/working-with-data/sensitive-data/de-identifying-data
Here are examples of practical guidelines available nationally
- The Australian Government’s Office of the Australian Information Commissioner (OAIC) and CSIRO Data61 have released a ‘De-identification Decision Making Framework’, which is a “practical guide to de-identification, focussing on operational advice”. The guide will assist organisations that handle personal information to de-identify their data effectively.
- The OAIC also provides high-level guidance on de-identification of data and information, outlining what de-identification is, and how it can be achieved. https://www.oaic.gov.au/agencies-and-organisations/guides/de-identification-and-the-privacy-act
- The Australian Government’s guidelines for the disclosure of health information include techniques for making a data set non-identifiable and example case studies. https://www.aihw.gov.au/reports-data
- Australian Bureau of Statistics’ National Statistical Service Handbook. Chapter 11 contains a summary of methods to maintain privacy.
- med.data.edu.au gives information about anonymisation https://www.aihw.gov.au/reports-data
- Office of the Information Commissioner Queensland’s guidance on de-identification techniques https://www.oic.qld.gov.au/guidelines/for-government/guidelines-privacy-principles/applying-the-privacy-principles/privacy-and-de-identification
Tips for managing de-identification (ARDC)
- Plan de-identification early in the research as part of your data management planning
- Retain original unedited versions of data for use within the research team and for preservation
- Create a de-identification log of all replacements, aggregations or removals made
- Store the log separately from the de-identified data files
- Identify replacements in text in a meaningful way, e.g. in transcribed interviews indicate replaced text with [brackets] or use XML markup tags e.g.
Management of identifiable data (ARDC)
Data may often need to be identifiable (i.e. contains personal information) during the process of research, e.g. for analysis. If data is identifiable then ethical and privacy requirements can be met through access control and data security. This may take the form of:
- Control of access through physical or digital means (e.g. passwords)
- Encryption of data, particularly if it is being moved between locations
- Ensuring data is not stored in an identifiable and unencrypted format when on easily lost items such as USB keys, laptops and external hard drives.
- Taking reasonable actions to prevent the inadvertent disclosure, release or loss of sensitive personal information.
Safely sharing sensitive data guide (ARDC)
- ANDS’ De-identification Guide collates a selection of Australian and international practical guidelines and resources on how to de-identify datasets.
Attribution:
- Australian National Data Service. (2018). ANDS guide: De-identification. Retrieved from https://www.ands.org.au/__data/assets/pdf_file/0003/737211/De-identification.pdf
- Australian National Data Service. (2018). Safely sharing sensitive data. Retrieved from https://www.ands.org.au/working-with-data/sensitive-data/sharing-sensitive-data
Key Points
First key point. Brief Answer to questions. (FIXME)
Identifiers
Overview
Teaching: 0 min
Exercises: 0 min
Questions
What is a DOI?
What is a PID?
Objectives
First learning objective. (FIXME)
Digital Object Identifier (DOI) and Persistent identifier (PiD)
Once you’ve completed your project, help make your research data discoverable, accessible and possibly re-useable using a PiD such as a DOI!
A Digital Object Identifier (DOI) is a unique alphanumeric string assigned by a publisher, organisation or agency that identifies content and provides a PERSISTENT link to its location on the internet, whether the object is digital or physical. It might look something like this: http://dx.doi.org/10.4225/01/4F8E15A1B4D89 The DOI, or identifier, is listed at the bottom of this record from Griffith’s Research Data Repository.
DOIs are also considered a type of persistent identifier (PiD). An identifier is any label used to name something uniquely (whether digital or physical). URLs are an example of an identifier. So are serial numbers and personal names. A persistent identifier is guaranteed to be managed and kept up to date over a defined time period.
Journal publishers assign DOIs to electronic copies of individual articles. DOIs can also be assigned by organisations, research institutes or agencies and are generally managed by the relevant organisation under its policies. DOIs not only uniquely identify research data collections, they also support citation and citation metrics.
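When you record DOIs in your documentation, it can help to store them in one consistent, clickable form. The sketch below is a small Python helper, assuming you want links on the https://doi.org/ resolver; the function name `doi_to_url` is made up for this example, and the check that a DOI starts with "10." is a simple sanity test rather than full validation.

```python
def doi_to_url(doi: str) -> str:
    """Turn a bare DOI or an older resolver link into an https://doi.org/ URL."""
    cleaned = doi.strip()
    known_prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in known_prefixes:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


print(doi_to_url("http://dx.doi.org/10.4225/01/4F8E15A1B4D89"))
# prints https://doi.org/10.4225/01/4F8E15A1B4D89
```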
Key messages:
- DOIs are a persistent identifier and as such carry expectations of curation, persistent access and rich metadata
- DOIs can be created for DATA SETS and associated outputs (e.g. grey literature, workflows, algorithms, software etc)
- DOIs for data are equivalent to DOIs for other scholarly publications
- DOIs enable accurate data citation and bibliometrics (both metrics and altmetrics)
- Resolvable DOIs provide easy online access to research data for discovery, attribution and reuse
Beginner
Ensure data you associate with a publication has a DOI- your library is the best group to talk to for this.
Intermediate
- Learn more about how your DOI can potentially increase your citation rates by watching this 4m:51s video
- Learn more about how your DOI can potentially increase your citation rate by reading the ANDS Data Citation Guide
Advanced
Learn more about PiDs and DOIs: https://www.ands.org.au/guides/persistent-identifiers-awareness
Internal Resources
- Contact the Library team for advice on how to obtain a DOI upon project completion.
External Resources
Key Points
A DOI is a Digital Object Identifier
A PiD is a Persistent identifier