Research Data Life Cycle
Overview
Data is a set of values of qualitative or quantitative variables that scholars draw upon to support their claims and/or produce new knowledge.
This guide walks through the six steps of the Data Life Cycle and recommends tools for each step.
Step 1: Data Creation
Before collecting data, it is best to plan ahead and ask yourself: What types and formats of data will be collected? Are there any copyright issues involved? What are the best approaches to storing and backing up the data?
You may go to the Research Data Management library guide for more information.
Data can be collected…
Through observation – generally collected only once and therefore unique
By experimenting – collected through experiments; in general, can be repeated
By simulation – generated by testing models; can usually be reproduced
By researching sources – deriving from literature, manuscripts, publications, etc.
By data processing – combining, reprocessing, (re)grouping, etc. of data created before
By using existing data
Library Services
If you have difficulty filling out the data management plans (DMPs) requested by publishers or funding agencies, please feel free to contact our Scholarly Communications team at lib-sct@hkbu.edu.hk.
Step 2: Data Processing
This step involves data inputting (if the raw data is not collected in a digital format), data conversion (from one system to another system, or from one format to another format), and data cleaning.
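As a minimal sketch of format conversion, the Python example below turns CSV text into JSON using only the standard library. The function name `csv_to_json` and the sample values are hypothetical and serve only as an illustration; real projects may need type conversion and encoding checks as well.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text (first row = header) into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [dict(row) for row in reader]
    # Values stay as strings; converting them to numbers is a separate step.
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Hypothetical sample data, for illustration only.
sample = "site,temperature_c\nA,21.5\nB,19.0\n"
converted = csv_to_json(sample)
print(converted)
```

Keeping the values as strings avoids silent changes during conversion; numeric parsing can then be done deliberately during data cleaning.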
Data cleaning requires tedious and time-consuming manual work, but its importance should not be underestimated. Proper data cleaning can prevent researchers from coming back to this step at a later stage of the research and avoid drawing false conclusions.
The following data cleaning tips can serve as a starting point:
Clear field labeling – make sure you can still understand the labels a year from now
Remove unwanted observations – including duplicate or irrelevant observations
Filter unwanted outliers – only for the suspicious measurements that are unlikely to be accurate
Handle missing data – by dropping observations with missing values or imputing missing values based on other observations
Fix structural errors – including typos, inconsistent capitalization, and inconsistent name formats
Controlled vocabularies may help – e.g., develop a small dictionary to remind yourself to use "United States" (instead of "USA" or "America") or "computer" (instead of "computers" or "PC") throughout the document
Beware of strange characters – especially when you copy and paste web content directly into Excel; an invisible character is often added at the end of a sentence
Some of these points are drawn from EliteDataScience, which offers a more comprehensive explanation.
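A few of the tips above (removing duplicates, dropping observations with missing values, applying a controlled vocabulary, and stripping invisible characters) can be sketched in Python with the standard library alone. The field names `country` and `score`, the `VOCAB` dictionary, and the sample records are assumptions made for this illustration; they are not part of any particular data set or tool.

```python
import unicodedata

# Hypothetical controlled vocabulary: variant spelling -> preferred term.
VOCAB = {
    "usa": "United States",
    "america": "United States",
    "united states": "United States",
}


def clean_text(value: str) -> str:
    """Remove invisible characters and normalize whitespace."""
    kept = []
    for ch in value:
        if ch.isspace():
            kept.append(" ")  # tabs, non-breaking spaces, etc. become plain spaces
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue  # drop control/format characters such as zero-width spaces
        else:
            kept.append(ch)
    return " ".join("".join(kept).split())  # trim and collapse repeated spaces


def normalize_country(value: str) -> str:
    """Map known variants to the preferred term; leave other values unchanged."""
    return VOCAB.get(value.casefold(), value)


def clean_records(records):
    """Clean text fields, drop rows with a missing score, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        row = {key: clean_text(val) for key, val in rec.items()}
        row["country"] = normalize_country(row["country"])
        if row["score"] == "":  # missing data: drop the observation
            continue
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:  # duplicate observation
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


# Hypothetical raw records, for illustration only.
raw = [
    {"country": "USA\u200b", "score": "4"},        # trailing zero-width space
    {"country": "united states ", "score": "4"},   # duplicate after cleaning
    {"country": "America", "score": ""},           # missing value
    {"country": "Japan\u00a0", "score": "5"},      # trailing non-breaking space
]
result = clean_records(raw)
print(result)
```

Here missing values are handled by dropping the observation; imputing them instead, or filtering outliers, would need additional logic that depends on the data.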
Software Recommendations
OpenRefine FREE
Official Website: http://openrefine.org/#download_openrefine
User Guide: https://libjohn.github.io/openrefine/start.html
Step 3: Data Analysis
This is the most challenging but also most exciting part of the cycle. It can involve quantitative analysis, qualitative analysis, machine learning, etc.
This guide does not cover basic statistics, which can easily be found on the Internet. (If you are not sure which sites to use, you may start with Statistics How To.) Instead, we introduce commonly used software tools.
Software Recommendations
Quantitative Analysis Software
SPSS INSTALLED IN MLC
More often used by social scientists
Official Website: https://www.ibm.com/analytics/spss-statistics-software
User Guide: https://stats.idre.ucla.edu/spss/
Stata INSTALLED IN MLC
More often used by social scientists
Official Website: https://www.stata.com/
User Guide: https://stats.idre.ucla.edu/stata/
OriginPro INSTALLED IN MLC
More often used by scientists and engineers
Official Website: https://www.originlab.com/
User Guide: https://www.originlab.com/doc/User-Guide
Good Calculators: Mathematics Statistics and Analysis Calculators FREE
This website provides a variety of handy online calculators, including math, statistics, engineering, and conversion calculators
Official Website: https://goodcalculators.com/statistics-calculators/
Qualitative Analysis Software
NVivo INSTALLED IN MLC
For text mining and analysis
Official Website: https://lumivero.com/products/nvivo/
MaxQDA INSTALLED IN DAR OF MLC
For text mining and analysis
Official Website: http://www.maxqda.com
Programming Languages Providing Integrated Support from Data Preparation to Web Applications
The following two programming languages are quite powerful and can support many aspects of the data life cycle, including web crawling, statistics, data manipulation, machine learning, data visualization, web applications, etc.
Python FREE
Official Website: https://www.python.org/
User Guide: https://swcarpentry.github.io/python-novice-inflammation/
R FREE
Official Website: https://www.r-project.org/
User Guide: https://swcarpentry.github.io/r-novice-inflammation/
Comparison between Python and R: https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis
Library Services
Stay tuned for our semester-based Research Data Tools Series Workshops if you want to learn how to use these software tools. We also offer a limited number of course-embedded basic training sessions each year.
Step 4: Data Storage
This step involves short-term measures, such as proper file version control during a research project, and long-term archiving measures, such as migrating data to the best format and storing it on the most suitable medium for future use by you or your company. You may learn more about this through TechTarget.
Tool Recommendations
Git FREE
A version control tool
Official Website: https://git-scm.com/
User Guide: https://swcarpentry.github.io/git-novice/
Step 5: Data Sharing
Data storage is mainly about the internal use of data, whereas data sharing refers to open data that the public can access and re-use for free. Open data is not only a trend but also an obligation that researchers are encouraged to meet for the benefit of academia and society. Some major publishers also request authors to share their data, e.g., Nature and Science.
Data can be shared in its original form (after removing private and sensitive information) through publicly accessible data repositories. Researchers can also choose to share their data through data visualizations or by developing interactive web applications.
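The removal of private information before sharing can be sketched as follows. This is a minimal Python illustration, assuming hypothetical column names (`participant_id`, `name`, `email`) and made-up records; it drops direct identifiers and replaces IDs with sequential pseudonyms. Removing direct identifiers alone may not be enough, since other fields can still identify a person when combined.

```python
# Hypothetical column names; adjust them to your own data set.
DIRECT_IDENTIFIERS = {"name", "email"}
ID_COLUMN = "participant_id"


def deidentify(rows):
    """Drop direct identifiers and replace IDs with sequential pseudonyms.

    Returns (shareable_rows, key). The key maps original IDs to pseudonyms
    and should be kept private, never shared with the data.
    """
    key = {}
    shareable = []
    for row in rows:
        original_id = row[ID_COLUMN]
        # Reuse the pseudonym if this ID was seen before, otherwise assign the next one.
        pseudonym = key.setdefault(original_id, f"P{len(key) + 1:03d}")
        kept = {
            col: val
            for col, val in row.items()
            if col not in DIRECT_IDENTIFIERS and col != ID_COLUMN
        }
        shareable.append({ID_COLUMN: pseudonym, **kept})
    return shareable, key


# Made-up records, for illustration only.
rows = [
    {"participant_id": "s1001", "name": "Example Person A", "email": "a@example.org", "answer": "yes"},
    {"participant_id": "s1002", "name": "Example Person B", "email": "b@example.org", "answer": "no"},
    {"participant_id": "s1001", "name": "Example Person A", "email": "a@example.org", "answer": "no"},
]
shared, private_key = deidentify(rows)
print(shared)
```

Sequential pseudonyms are used here rather than hashes of the original IDs, so the shared file carries no value derived from the original identifier.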
Tool Recommendations
Data Repositories
There are many data repositories available online for sharing data sets; some are specific to a subject, material type, or region. If you are new to this area, you may want to start with the following three platforms:
Figshare FREE
A multi-disciplinary repository for research data, managed by a commercial firm
Official Website: https://figshare.com/
Harvard Dataverse FREE
A multi-disciplinary repository for research data, managed by Harvard University
Official Website: https://dataverse.harvard.edu/
GitHub FREE
Mainly for sharing code
Official Website: https://github.com/
You can also develop your own data management / sharing systems using open source data platforms:
CKAN FREE
Both the US and HK Governments use this open source platform to share governmental data
Official Website: https://ckan.org/
Data Visualization Software
Tableau INSTALLED IN MLC
Official Website: https://www.tableau.com/
User Guide: https://data-flair.training/blogs/category/tableau/
Gephi FREE
Official Website: https://gephi.org/
User Guide: https://medium.com/@Luca/guide-analyzing-twitter-networks-with-gephi-0-9-1-2e0220d9097d
Flourish FREE (partially)
Official Website: https://flourish.studio/
User Guide: https://flourish.studio/developers/tutorial/
Library Services
The Library's Digital Initiatives and Research Cluster has a team of project managers, programmers, and project assistants who provide Digital Scholarship Services to help faculty members develop interactive web applications for public access. We offer a Digital Scholarship Grant as well as a non-grant application track. Contact us at libms@hkbu.edu.hk to discuss potential ideas and make good use of your data!
Step 6: Re-use of Data
There are many free and subscription-based data resources available for researchers to re-use. We have prepared a separate library guide for data resources; please visit the guide on Sources for Data-Mining.
Top Analytics Software 2016-18
(developed by KDnuggets)
Useful Online Learning Resources for Data Science
Data Software Training Videos
The Library has collaborated with the Apps Resource Centre to develop a series of Python and SPSS online training videos. These videos are specifically designed for local students who have no prior knowledge of programming or statistics.
Turn Your Data into Digital Scholarship Projects
Since 2015, HKBU Library has been working closely with many faculty members to present and visualize research data in the form of digital scholarship projects. The Library now boasts a portfolio of 20+ Digital Scholarship Projects from across different disciplines, sharing valuable scholarly sources that benefit and impact academia and beyond.
Watch these videos on why and how HKBU researchers share data.
You can also view the full list of data-sharing videos.