Data management

If your PhD contains research data, you will have to think about how to deal with those data. In this section, you will learn what the acronym FAIR means, how to make data management plans, how to store and archive your data, how to provide good and sufficient metadata, how to structure data files, and how to find out if your data are too sensitive to share.

FAIR data

The FAIR principles were originally introduced in 2016 (Wilkinson et al., 2016), and the Research Council of Norway (NFR) states:

“The FAIR guiding principles for scientific data management and stewardship are included as a main principle…” (NFR, 2017)

The same principles govern the data policy in the Horizon 2020 framework, and are followed by a growing number of academic publishers, such as Nature Publishers. The Norwegian government also states that the FAIR principles should govern all publicly funded research in Norway (Meld. St. 25 (2016–2017) The Humanities in Norway, summary in English).

In order to comply with the FAIR guidelines, data should be

  • findable
  • accessible
  • interoperable
  • re-usable

One of the keys to complying with these principles is to use a good data management plan. We will present the necessities of a good data management plan below, but first some information on the FAIR guidelines.

Findable data

To be findable, data should be uniquely and persistently identifiable, which means that it should be possible to find the same object at any point in time by using persistent links. The data should also minimally include enough basic machine-readable metadata to separate it from other data.

Accessible data

Accessible data are those obtainable by machines and humans after appropriate authorization. Access must be granted through a well-defined protocol.

Interoperable data

This means that data and metadata are machine-readable and formatted according to well-known vocabularies or ontologies. In other words, data must be both correct and understandable for a machine in order to be interoperable.

Re-usable data

A further requirement is re-usability, which can only be ensured if the three principles above (findable, accessible, interoperable) are followed. In addition, the metadata should be described well enough that they can be automatically linked or integrated with other data sources. Published data should have enough metadata to enable correct referencing.

Treating your data according to the FAIR-principles will make sure that the data you collect are re-usable and verifiable for fellow researchers according to the national strategy on transparent and reproducible data. One key element in this is to make sure you think ahead when it comes to data management.

Data management plans

A data management plan (DMP) describes the data management life cycle for data to be collected, processed and/or generated. The data management plan will state how you treat data from project start to end. Note that the data management plan can be regarded as part of the research process, and should be included in the final project publication.

Public funders such as the European Commission or the Research Council of Norway require you to provide a data management plan as part of your project description, and there could very well be a note on data management in your PhD agreement. Both funders provide guidelines on what the plan should contain.

As there is a growing tendency to use the FAIR guidelines in more areas of research, you should try to use them to create a plan for your data even if your current research is not funded directly by an external party. Following the FAIR guidelines is recommended by many of the Norwegian higher education institutions, for example the University of Oslo data management (Norwegian only), UiT The Arctic University of Norway or the Norwegian University of Science and Technology.

Templates for data management plans

A general data management plan contains information such as this:

  • The research project, people involved, and whether it is part of a larger project, e.g. the project could be your PhD.
  • Who is responsible for following up the DMP and who has rights to manage and/or access the data. This could be just you, your research group, or other collaborators.
  • Details of the data you are going to collect or generate. These could be observations, simulations, interviews, etc. This part could also include information on existing data, which requires that you have searched for research data created by others.
  • How documentation and metadata will be created, and what file formats will be used.
  • Where the data will be stored during the project, expected file size, procedures for backup, storage and archiving. The local IT department can probably support you here.
  • Where the data will be archived at the end of the project, and if they will be shared.
  • If the data are sensitive, how this will be handled.
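The items above can be kept as a small structured record, which makes it easy to spot sections that are still empty. The following is a minimal sketch in Python, assuming hypothetical field names chosen for this illustration; it is not the format of any official DMP template.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class DataManagementPlan:
    """Illustrative DMP record; the field names are assumptions, not a standard."""
    project: str                      # the research project, e.g. your PhD
    responsible: str                  # who follows up the plan
    data_description: str             # observations, simulations, interviews, ...
    documentation_and_formats: str    # how metadata is created, which file formats
    storage_and_backup: str           # where data live during the project
    archiving_and_sharing: str        # where data go at the end of the project
    sensitive_data_handling: str = "not applicable"
    people: list = field(default_factory=list)  # others with access to the data

    def missing_sections(self):
        """Return the names of text sections that are still empty."""
        return [
            name
            for name, value in asdict(self).items()
            if isinstance(value, str) and not value.strip()
        ]


plan = DataManagementPlan(
    project="PhD project",
    responsible="the candidate",
    data_description="interviews",
    documentation_and_formats="README file, CSV",
    storage_and_backup="",  # not decided yet
    archiving_and_sharing="institutional archive",
)
print(plan.missing_sections())
```

A record like this is only a checklist for yourself; the plan you submit should still follow the template required by your funder or institution.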

It is usually best to use a DMP template and generate a plan containing the necessary information. A few different templates exist, and you could of course use the Horizon 2020 template for research that is not funded by the EU as well. If your PhD needs to be registered with NSD or preapproved by REK (personal/sensitive data), the DMP generator published by NSD is probably the better choice. Note that your institution may also recommend certain templates for creating data management plans.

Storage and archiving

In many PhDs, the amount of data produced is small enough to be stored on your own computer, or a shared area provided by your institution if you collaborate with others. Most universities and university colleges provide an institutional home area for you to use, which is automatically backed up regularly, usually every night. If you need more disc space, or have special requirements for data storage, your local IT department can help you. Note that if you work with sensitive data, there are stricter requirements for safe storage, e.g. where to store data, encryption, passwords for access, etc.

Most institutions provide their own services for collaborative use. Using these is recommended to ensure storage for as long as you need, and to limit access to those who have the right of access. Dropbox and Google Docs should be avoided.

You may have different needs for storage during your PhD, and for archiving when it is finished. Some data can be shared, while other data should be archived in a closed repository. If you do not have clear requirements for what archive to use, you can search research data or data repositories to find the one that best suits your data.

Documentation and metadata

Is pre-registration of your study a good idea? Read more about how pre-registration can be a step in documenting your research.

Data documentation is an essential part of data management. The documentation should include what data you have collected, what methods were used, and what the research context is. If there are any limitations, this should be stated. This is especially important if you plan to share data, but is also useful for yourself, so that you are in complete control of your data sets, and do not risk having to re-do the data collection. Provide information on what the data represent and if they have been processed in any way.

Metadata are data about the data. They are often provided in addition to more extensive documentation, to give brief information on the data. This could include the author, a descriptive title of the data set, date of collection, keywords, etc. There should be enough metadata to enable you and others to understand how to use the data in the future. Some data archives or repositories have metadata forms with required fields. If at all possible you should use metadata from a widely accepted standard in your field of study; this will make your data easier to find through data search engines.
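As an illustration of brief, machine-readable metadata, the sketch below builds a small record and serializes it to JSON, so it can be stored next to the data file. The field names and the list of required fields are assumptions made for this example and are not taken from any particular metadata standard; where your field or repository has a standard, use its field names instead.

```python
import json
from datetime import date

# Hypothetical minimum set of fields for this example.
REQUIRED_FIELDS = ("title", "author", "date_collected", "keywords")


def build_metadata(title, author, date_collected, keywords, **extra):
    """Collect basic metadata in a plain dictionary."""
    record = {
        "title": title,
        "author": author,
        "date_collected": date_collected.isoformat(),  # e.g. "2018-04-01"
        "keywords": sorted(keywords),
    }
    record.update(extra)  # optional fields, e.g. units or methods
    return record


def missing_fields(record):
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def to_json(record):
    """Serialize the record as readable JSON, keeping non-ASCII characters."""
    return json.dumps(record, indent=2, ensure_ascii=False, sort_keys=True)


record = build_metadata(
    title="Interview transcripts",
    author="A. Candidate",
    date_collected=date(2018, 4, 1),
    keywords=["interview", "dialect"],
    units="minutes",
)
print(to_json(record))
```

Checking for missing fields before archiving mirrors the required fields that some repositories enforce in their metadata forms.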

How to create metadata? To get some ideas on how to do this, you might have a look at the Stanford Libraries website: Creating Metadata.

Structuring data files

It may seem obvious that data should be structured so that they are easy to understand and reuse. However, there are common mistakes like using acronyms and abbreviations that may seem easy to understand now, but turn out to make little sense later. Try to follow the guidelines below:

  • Provide units of measurement when applicable.
  • Use naming conventions that are easy to understand, and be consistent, e.g. with file names of the form datatype_date_location.ext. If you use acronyms and abbreviations in file names, headers, etc., explain them in the metadata.
  • Avoid names that are too long.
  • Avoid special characters.
  • Use underscore (_) instead of space in file names, as some programmes have difficulty reading files with spaces in the file name.
  • Use persistent file formats.
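As a minimal sketch of the naming guidelines above, the following Python snippet builds and checks file names of the form datatype_date_location.ext. The function names, the YYYYMMDD date format, the lowercase-only rule and the 64-character limit are assumptions made for illustration, not requirements from any standard.

```python
import re
from datetime import date

# Assumed convention: lowercase letters and digits only, parts joined by "_".
NAME_PATTERN = re.compile(r"[a-z0-9]+_\d{8}_[a-z0-9]+\.[a-z0-9]+")
MAX_LENGTH = 64  # arbitrary limit chosen for this example


def clean(text):
    """Lowercase the text and drop spaces and special characters.

    Note: non-ASCII letters such as "ø" are dropped as well.
    """
    return re.sub(r"[^a-z0-9]", "", text.lower())


def make_filename(datatype, collected, location, ext):
    """Build a name of the form datatype_date_location.ext."""
    parts = [clean(datatype), collected.strftime("%Y%m%d"), clean(location)]
    if not all(parts) or not clean(ext):
        raise ValueError("every part of the file name must be non-empty")
    name = "_".join(parts) + "." + clean(ext)
    if len(name) > MAX_LENGTH:
        raise ValueError("file name is too long")
    return name


def follows_convention(name):
    """Check an existing file name against the assumed convention."""
    return len(name) <= MAX_LENGTH and NAME_PATTERN.fullmatch(name) is not None


name = make_filename("temperature", date(2018, 4, 12), "Oslo Nord", "csv")
print(name)
print(follows_convention("my data (final).csv"))
```

Generating names from a function, rather than typing them by hand, is one way to stay consistent across a whole project.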

Technical security measures

There is no such thing as too much data security, but there is no need to make your data management more cumbersome than needed. In a low-risk setting such as this example of sensitive data, your smartphone could be the perfect tool for the job: high-quality digital files, ample storage, small size, long battery life, easy transfer of the data for processing, etc. Setting up a rigorous system with rotating passwords, ever-changing encryption keys and file transfer through encrypted channels is not needed if the data are not sensitive enough to warrant it.

The key point is that you need to think about this before you start on the data collection. As is obvious from the examples above, there are no fixed “security categories” when dealing with personal and sensitive data, and you will have to define your need for security yourself.

Recording of sound and video

Video and sound recordings can be highly personally identifiable. Remember that even the voice of the interviewees can be enough to identify them! Cartooning, censoring and voice distortion are the only ways to anonymize such data, but this option is not viable for all kinds of projects, such as those studying spoken dialect or facial expressions. Note that anonymization should be done on a computer without an internet connection. Anonymized data can be processed on a computer with an internet connection, but remember that the content of any dialogue still remains; what is said can be enough to identify people.

You need to be in full control of the equipment used for recording sound or video; you should not leave it unattended or lend it to unauthorized persons. Note that equipment with an internet connection, such as a smartphone, should never be used to record personal information unless you can ensure that all communications are shut off. Equipment without an internet connection and which uses removable storage units is generally the best option. If encryption of the storage media is not an option, make sure that the physical media are securely stored in a safe or similar, and transfer any sensitive or personal data to more secure storage, such as a computer without an internet connection and with an encrypted hard drive. Long-term storage of such data for the purpose of further research will need specific secure long-term storage facilities, and the approval of such storage from the relevant authorities.

Transferring data

Pay attention to the way you transfer your data between units. The transfer of data should be as safe as the storage unit; for example, you cannot send highly sensitive data in emails. At the highest levels of security, an encrypted connection is needed, but a physical transfer using a wire or card-reader would also work well. If you transcribe sensitive recorded data, make sure that there is no one in the room with you, looking over your shoulder.

In cases where email is the best solution for transferring personal data, use software such as 7-Zip to encrypt the file before sending it. As a general rule, sensitive data should never be sent as email attachments over a normal mail server. Ask your IT department whether sending emails internally in your institution is regarded as safe.

Processing and storing

Data sometimes need to be transferred to your desktop computer for processing. Anonymization of data from an interview is one example, but there could be numerous reasons for needing to process the raw data before analysing them. If you have data in the high-risk category, you should probably have a designated computer with permanently deactivated communications for the job. No internet or similar connections should be available on a computer used for processing high-risk data. You also need to make sure that the processing can take place in a secure place; a public reading room or internet cafe with people looking over your shoulder should never be used.

Low-risk data could be stored on removable units, such as memory cards or USB sticks, but there must at least be some basic level of security for personal data. Encryption of the removable storage is a possibility, or you could store media in a safe location. Higher-risk data need specific storage units using high-level security measures such as secure servers with encrypted disks and strictly regulated access, even physical access to the login terminals themselves.

Sharing

Some data can be shared, and some should indeed be shared! Sharing data is generally done through uploading, or archiving, the data into a repository. Even if your funder or your institution is a strong supporter of the FAIR principles, you are still responsible for not sharing data that should not be shared. Remember: “data should be as open as possible and as closed as necessary”. Read through the short section on when not to share your data before you upload your data into an open repository. Note that storing and sharing are not the same: there are many options for secure storage of your data, and uploading data into a database does not in itself mean that they are shared.

Two of the key notions behind data sharing are that shared data should enable new research and that they should allow conclusions to be verified. With this in mind, you should make a decision regarding the rawness of the data you share; for very large datasets a certain degree of processing of the data before sharing is the obvious choice, but every step of processing you take could limit the possibilities of future use.

Encryption

Encryption of mobile storage media is possible on a normal Windows computer by turning on “BitLocker” for the selected drive. If you use a Mac, encryption can be done in “Disk Utility” after creating an image out of the folder you wish to encrypt. Encryption is necessary for all removable storage media containing high-risk data. The encryption key (the password) must never be accessible to any unauthorized persons and should not be stored together with the encrypted disk itself.

Data management vocabulary

Below you can find some Norwegian and English terms commonly used in data management and research activities in the higher education institutions in Norway.

“PhD” is written “ph.d.” in Norwegian, and you are not called a “student” in Norwegian, but rather a “candidate”.

Norwegian-English translations

The following list is adapted from the Norwegian Data Protection Authority’s list of words and expressions used in privacy and data protection. You can find the complete list, including an English-Norwegian version, on the site of the Data Protection Authority.

Norwegian | English
avvik | discrepancy
avviksbehandling | discrepancy processing
behandling | processing
behandling av personopplysninger | processing of personal data
behandling av elektroniske hjelpemidler | processing by automatic means
behandlingsansvarlig | data controller
billedopptak | image recording
databehandler | (data) processor
Datatilsynet | Data Protection Authority
den opplysningen gjelder | data subject
den registrerte | data subject
enkeltpersoner | natural persons
etablert | established
fagforeningsmedlemsskap | trade-union membership
fødselsnummer | national identity number
geografisk virkeområde | territorial extent
helsepersonell | health professionals
informasjonssikkerhet | data security
informasjonssystem | information system
innsyn | access to information
innsynsrett | right to access (information)
interesseavveining | balancing of interest
juridisk person | legal person
kameraovervåking | video surveillance
kobling | alignment of data
konsesjon | licence
konsesjonsplikt | obligation to obtain a licence; licensing obligation
korrekt | accurate
krav om reservasjon mot behandling | demand for a bar on processing
kriterier for akseptabel risiko | criteria for acceptable risk
legitimasjonskontroll | verification of proof of identity
leverandør | data supplier
meldeplikt | obligation to give notification; notification obligation; obligation to notify
meldepliktig | subject to notification
melding | notification
overføring til land utenfor EU / EØS | transfer to third countries
overføring til utlandet | Trans Border Data Flow
overføringsmedium | transfer medium
overvåking | surveillance
paragraf | section
personopplysninger | personal data
personopplysningsforskriften | The Personal Data Regulations
personopplysningsloven | The Personal Data Act
personregister | personal data filing system
personregisterloven | Personal Data Filing System Act
personvern | privacy, data protection
personvernfremmende teknologi | Privacy Enhancing Technology
Personvernnemnda | Privacy Appeals Board
personvernombud | Data Protection Official/Officer
privatliv, personvern, m.m. | privacy
reservasjonsregister | Central Marketing Exclusion Register
retting | rectification
saklig virkeområde | substantive scope
sammenstilling av data | alignment of data
samtykke | consent
sikkerhetsmål | security objective
sikkerhetsrevisjon | security audit
sikkerhetsstrategi | security strategy
sletting | erasure
tilfredsstillende beskyttelsesnivå | adequate level of protection
tredjeland | third country
utlevering av personopplysninger | disclosure of personal data
varsling | whistleblowing
ødeleggende programvare | malicious software; malware
(Updated: April 2018)


The UHR dictionary of academic terms

The Norwegian Association of Higher Education Institutions (UHR) has created a short dictionary (termbase). In this dictionary, you will find translations of more than 2000 administrative terms from the two written languages in Norway to English, and vice versa.

Useful resources

CESSDA ERIC (the Consortium of European Social Science Data Archives European Infrastructure Consortium) provides an expert tour guide on data management. The guide aims to help researchers make their data findable, understandable, sustainably accessible, and reusable.

Reference

Wilkinson, M.D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A.,…Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, 160018. https://doi.org/10.1038/sdata.2016.18
