Sources for Text-Mining
Due to license agreements, users are NOT allowed to download excessive amounts of content from library-subscribed resources, regardless of the download method.
Content includes articles, book chapters, and images, among other materials.
Violations will trigger automatic lockouts and prevent other users from accessing the same database.
However, some library-subscribed and open access resources DO allow data or text mining, subject to certain terms and conditions.
Some resources (mostly open access ones) allow you to harvest their data directly.
Some require you to use the data mining tools they provide.
Some allow mining only if they conduct the process for you.
The Library is also developing an increasing number of digital scholarship projects with a view to facilitating public access to the original research data and materials collected or created by HKBU faculty. In most cases, we can share these materials in a way that makes data or text mining possible. Contact us if you are interested.
Library Subscribed Resources (Text-mining Allowed)
The following resources allow data mining, with or without asking users to seek approval in advance. Please take note of the terms of use, which come from either the license agreement with the Library or the resources' own websites.
Open Access Resources
Book Data
HathiTrust Digital Library FREE
https://www.hathitrust.org/datasets
HathiTrust makes the texts of public domain works in its corpus available for research purposes. The works fall into two categories: non-Google-digitized volumes, which are freely available, and Google-digitized volumes, which are available through an agreement with Google.
HathiTrust+Bookworm FREE
https://bookworm.htrc.illinois.edu/develop/
A tool for visualizing and analyzing word usage trends in the HathiTrust Digital Library. No login is required.
Google Books FREE
Search the full text of books in many languages. Download books in the public domain. The Advanced Search allows you to filter for "full-view". Texts are in American English, British English, French, German, Italian, Spanish, Russian, Hebrew, and Chinese.
Google Books Ngram Viewer FREE
https://books.google.com/ngrams
Charts the frequency of any word or short phrase using yearly counts of n-grams found in sources printed between 1500 and the present. If you are interested in performing a large-scale analysis on the underlying data, the corpora are available for download.
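For readers who download the corpora, the yearly counts can be aggregated with a few lines of Python. The sketch below is illustrative only: it assumes tab-separated rows of n-gram, year, match count, and volume count (the layout used by some releases of the corpora; check the documentation for the version you download), and the sample rows and per-year totals in the usage note are invented.

```python
import csv
from collections import defaultdict


def yearly_counts(lines, target):
    """Sum match counts per year for one n-gram.

    ASSUMPTION: each row is tab-separated as
    ngram, year, match_count, volume_count.
    """
    counts = defaultdict(int)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        ngram, year, match_count = row[0], row[1], row[2]
        if ngram == target:
            counts[int(year)] += int(match_count)
    return dict(counts)


def relative_frequency(counts, totals):
    """Divide each year's count by the total number of n-grams that year."""
    return {
        year: counts[year] / totals[year]
        for year in counts
        if totals.get(year)
    }
```

Usage, with invented data: `yearly_counts(["library\t1900\t10\t3"], "library")` gives the count per year, and `relative_frequency` turns those counts into the kind of proportion the Ngram Viewer charts, provided you supply the per-year totals yourself.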
Chinese Text Project FREE
With over thirty thousand titles and more than five billion characters, the Chinese Text Project is the largest database of pre-modern Chinese texts in existence. The system also provides an API, text tools, and more to facilitate online text mining.
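As a rough illustration of working with the API, the Python sketch below builds a request URL for one text and extracts its paragraphs from a JSON response. The `gettext` endpoint, the `urn` parameter, and the `fulltext` field reflect our understanding of the API and should be treated as assumptions; confirm them, along with any usage limits, against the Chinese Text Project's own API documentation before relying on them.

```python
import json
from urllib.parse import urlencode

# ASSUMPTION: base address and endpoint name; verify in the official docs.
API_BASE = "https://api.ctext.org"


def gettext_url(urn):
    """Build a request URL for the text identified by a CTP URN."""
    return API_BASE + "/gettext?" + urlencode({"urn": urn})


def paragraphs(response_json):
    """Return the list of paragraphs from a JSON response.

    ASSUMPTION: the response is a JSON object whose "fulltext" field
    holds a list of paragraph strings; returns [] if it is absent.
    """
    data = json.loads(response_json)
    return list(data.get("fulltext", []))
```

The two functions are kept separate from any network call so that you can add your own request logic, including pauses between requests.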
Internet Archive: eBooks and Texts FREE
https://archive.org/details/texts
Offers over 10,000,000 fully accessible books and texts. The wider collection also includes audio, moving images, and software as well as archived web pages. Instructions for downloading in bulk are provided.
Project Gutenberg FREE
The first producer of free electronic books, currently providing over 60,000 titles. See the Project's Terms of Use before downloading.
Online Books Page FREE
http://onlinebooks.library.upenn.edu/
Lists over 3 million free books on the internet (including Project Gutenberg, HathiTrust, Google Books, publisher and institutional archives, etc.). Provides a section on non-English language texts.
Wikipedia
Wikidata FREE
https://www.wikidata.org/wiki/Wikidata:Main_Page
Wikidata acts as central storage for the structured data of its Wikimedia sister projects, including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others. The Statistics page shows what types of information are available, and the Data Access page provides instructions for downloading data.
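One of the access routes is the Wikidata Query Service, which answers SPARQL queries. The Python sketch below builds a query URL and pulls one column out of a result in the standard SPARQL JSON format. The query itself (items that are an instance of house cat, `Q146`) is only a small sample, and the helper names are our own.

```python
import json
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Sample query: five items that are an instance of (P31) house cat (Q146).
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
"""


def build_query_url(query):
    """Return a GET URL that asks the endpoint for JSON results."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def extract_column(result_json, variable):
    """Collect the values bound to one variable in SPARQL JSON results."""
    data = json.loads(result_json)
    return [
        binding[variable]["value"]
        for binding in data["results"]["bindings"]
        if variable in binding
    ]
```

When you fetch `build_query_url(QUERY)`, send a descriptive User-Agent header and keep the request rate low, in line with the service's usage policy.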
Regional Based
Australian Data Archive FREE
ADA provides a national service for the collection and preservation of digital research data, and makes these data available for secondary analysis by academic researchers and other users. Downloading data requires approval.
Digital Public Library of America FREE
DPLA offers a single point of access to millions of items from libraries, archives, and museums around the United States. Data is available for bulk download in JSON files.
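Bulk JSON files of this kind are usually too large to load in one go, so it helps to read them record by record. The Python sketch below assumes a gzip-compressed file with one JSON record per line and a nested `sourceResource` / `title` field; both the file layout and the field names are assumptions to check against DPLA's own documentation for the files you download.

```python
import gzip
import json


def iter_records(fileobj):
    """Yield one parsed record per non-empty line.

    ASSUMPTION: the file is gzip-compressed JSON Lines
    (one JSON object per line).
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as text:
        for line in text:
            line = line.strip()
            if line:
                yield json.loads(line)


def get_path(record, path, default=None):
    """Follow a list of keys into nested dicts; return default if missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current
```

Usage: open the downloaded file in binary mode, then loop over `iter_records(f)` and call `get_path(record, ["sourceResource", "title"])` on each record. `get_path` returns `None` rather than raising an error when a record lacks the field.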
Chronicling America: Historical American Newspapers FREE
https://chroniclingamerica.loc.gov/
Collection of digitized historical newspapers from 1789-1924. OCR batch downloads available.
Europeana APIs FREE
https://pro.europeana.eu/about-us/services-and-tools
Europeana is a digital library with millions of books, paintings, films, museum objects and archival records that have been digitized by more than 2,000 institutions across Europe.
Taiwan History Digital Library FREE
http://thdl.ntu.edu.tw/index.html
THDL provides tagged full text of primary historical materials of Taiwan, focusing on the Qing dynasty. The system also provides sophisticated context discovery tools for users to analyze chronological, geographic, and source information.
Taiwan Biographical Ontology FREE
TBO stores biographical information on 19,372 Taiwanese elites and people closely associated with them. Registered users can visualize the data, mine the data via more advanced online features, or download the dataset for further analysis.
Subject Based
University of Oxford Text Archive FREE
https://ota.bodleian.ox.ac.uk/repository/xmlui/
OTA develops, collects, catalogs and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. Materials include Shakespeare's plays, public speeches, books, and more.
arXiv FREE
Open access to 1,600,000 e-prints in physics, mathematics, computer science, quantitative biology, quantitative finance and statistics. Bulk access available.
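For a small sample rather than a full bulk download, arXiv also offers a public query API that returns results as an Atom feed. The Python sketch below builds a query URL and extracts titles and abstracts from a feed; the helper names are our own, and the search string in the usage note is only an example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_url(search_query, start=0, max_results=10):
    """Return a query URL for the arXiv API."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return API_URL + "?" + urlencode(params)


def parse_entries(atom_xml):
    """Extract title and summary from each entry in an Atom feed."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        # Collapse the line breaks and repeated spaces found in feed text.
        entries.append({
            "title": " ".join(title.split()),
            "summary": " ".join(summary.split()),
        })
    return entries
```

Usage: fetch `build_url("all:electron", max_results=5)` and pass the response body to `parse_entries`. Leave a pause between successive requests, and use the bulk access options for anything larger than a sample.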
BioMed Central FREE
https://www.biomedcentral.com/about/policies
Over 403,000 full-text, peer-reviewed science, technology and medicine articles are available for text and data mining.
PLOS FREE
Public Library of Science. Provides access to its peer-reviewed articles. Provides a specific Text Mining Collection.
PubMed Central Databases and Text Mining Tools FREE
https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/
Multiple text mining tools to analyze not only scholarly publications, but also other types of biomedical resources, such as Electronic Health Records.