Sources for Text-Mining

Due to license agreements, users are NOT allowed to download excessive amounts of content from Library-subscribed resources, regardless of the download method.

However, some Library-subscribed and open access resources DO allow data or text mining, subject to certain terms and conditions.

The Library is also developing an increasing number of digital scholarship projects to facilitate public access to the original research data and materials collected or created by HKBU faculty. In most cases, we can share these materials in a form that makes data / text mining possible. Contact us if you are interested in using them.

Library Subscribed Resources (Text-mining Allowed)

The following resources allow data mining; some require users to seek approval in advance and some do not. Please note the terms of use, which come from either the Library's license agreement or the resource's own website.

Open Access Resources

Book Data

HathiTrust Digital Library    FREE 

https://www.hathitrust.org/datasets

HathiTrust makes the texts of public domain works in its corpus available for research purposes. The works fall into two categories: non-Google-digitized volumes, which are freely available, and Google-digitized volumes, which are available through an agreement with Google. 

HathiTrust+Bookworm    FREE 

https://bookworm.htrc.illinois.edu/develop/

A tool for visualizing and analyzing word usage trends in the HathiTrust Digital Library. No login is required.


Google Books    FREE 

https://books.google.com/

Search the full text of books in many languages and download books in the public domain. The Advanced Search allows you to filter for "full-view" books. Texts are in American English, British English, French, German, Italian, Spanish, Russian, Hebrew, and Chinese.

Google Books Ngram Viewer    FREE 

https://books.google.com/ngrams

Charts the frequency of any word or short phrase using yearly counts of n-grams found in sources printed from 1500 to the present. The underlying corpora can be downloaded for large-scale analysis.
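
As a rough illustration of what analysis on the downloaded n-gram data might look like, the Python sketch below sums yearly match counts for one word. It assumes each row is tab-separated as n-gram, year, match count, volume count; that layout is an assumption and may differ between dataset versions, so check the documentation that accompanies the files you download. The function name `yearly_counts` and the sample rows are hypothetical.

```python
from collections import defaultdict


def yearly_counts(lines, target):
    """Sum match counts per year for one n-gram.

    Assumed row layout (tab-separated): ngram, year, match_count, volume_count.
    """
    counts = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip rows that do not fit the assumed layout
        ngram, year, match_count, _volume_count = fields
        if ngram == target:
            counts[int(year)] += int(match_count)
    return dict(counts)


# Hypothetical sample rows standing in for a downloaded file.
sample = [
    "library\t1900\t120\t40\n",
    "library\t1901\t150\t45\n",
    "archive\t1900\t30\t12\n",
]
library_counts = yearly_counts(sample, "library")
print(library_counts)
```

For a real file, the same function could be given an open file handle instead of the sample list, since it only iterates over lines.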

Chinese Text Project    FREE 

https://ctext.org/

With over thirty thousand titles and more than five billion characters, the Chinese Text Project is the largest database of pre-modern Chinese texts in existence. The system also provides an API, text tools, and more to facilitate online text mining.

Internet Archive: eBooks and Texts    FREE 

https://archive.org/details/texts

Offers over 10,000,000 fully accessible books and texts. The wider Internet Archive collection also includes audio, moving images, software, and archived web pages. Instructions for bulk downloading are provided.

Project Gutenberg    FREE 

https://www.gutenberg.org/

The first producer of free electronic books, currently providing over 60,000 titles. See the Project's Terms of Use.
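
Once a plain-text book has been obtained in line with the terms of use, a first text-mining step is often a simple word count. The Python sketch below is a minimal example under stated assumptions: it assumes the file marks the start and end of the book with lines beginning `*** START OF` and `*** END OF`, which is common in Project Gutenberg files but not guaranteed. The function names and the sample text are hypothetical.

```python
import re
from collections import Counter

# Assumed marker lines around the book text; verify against your own file.
START = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_boilerplate(text):
    """Return the text between the start and end markers, if present."""
    start = START.search(text)
    end = END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish]


def word_frequencies(text):
    """Count lowercase words made of letters and apostrophes."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical sample standing in for a downloaded plain-text book.
sample = (
    "License header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "The cat saw the dog. The dog ran.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License footer text\n"
)
freqs = word_frequencies(strip_boilerplate(sample))
print(freqs.most_common(3))
```

The word pattern here only suits English-like text; other languages would need a different tokenizer.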

Online Books Page    FREE 

http://onlinebooks.library.upenn.edu/

Lists over 3 million free books on the internet (including Project Gutenberg, HathiTrust, Google Books, publisher and institutional archives, etc.). Provides a section on non-English language texts.

Wikipedia

Wikidata    FREE 

https://www.wikidata.org/wiki/Wikidata:Main_Page

Acts as central storage for the structured data of its Wikimedia sister projects, including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others. The Statistics page shows what types of information are available, and the Data Access page gives instructions for downloading data.

Regional Based

Australian Data Archive    FREE 

https://ada.edu.au/

ADA provides a national service for the collection and preservation of digital research data and makes these data available for secondary analysis by academic researchers and other users. Downloading data requires approval.

Digital Public Library of America   FREE 

https://dp.la/

DPLA offers a single point of access to millions of items from libraries, archives, and museums around the United States. Data is available for bulk download in JSON files.

Chronicling America: Historical American Newspapers   FREE 

https://chroniclingamerica.loc.gov/

Collection of digitized historical newspapers from 1789 to 1924. OCR batch downloads are available.

Europeana APIs   FREE 

https://pro.europeana.eu/about-us/services-and-tools

Europeana is a digital library with millions of books, paintings, films, museum objects and archival records that have been digitized by more than 2,000 institutions across Europe.

Taiwan History Digital Library   FREE 

http://thdl.ntu.edu.tw/index.html

THDL provides tagged full text of primary historical materials on Taiwan, focusing on the Qing dynasty. The system also provides sophisticated context discovery tools for users to analyze chronological, geographic, and source information.

Taiwan Biographical Ontology   FREE 

https://tbio.daoyidh.com/

TBO stores biographical information on 19,372 Taiwanese elites and people closely associated with them. Registered users can visualize the data, mine it with more advanced online features, or download the dataset for further analysis.

Subject Based

University of Oxford Text Archive   FREE 

https://ota.bodleian.ox.ac.uk/repository/xmlui/

OTA develops, collects, catalogs and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. Materials include Shakespeare's plays, public speeches, books, and more.

arXiv   FREE 

https://arxiv.org/

Open access to 1,600,000 e-prints in physics, mathematics, computer science, quantitative biology, quantitative finance and statistics. Bulk access available.

BioMed Central   FREE 

https://www.biomedcentral.com/about/policies

Over 403,000 full-text, peer-reviewed science, technology and medicine articles are available for text and data mining.

PLOS   FREE 

https://plos.org/

The Public Library of Science provides access to its peer-reviewed articles, along with a dedicated Text Mining Collection.

PubMed Central Databases and Text Mining Tools   FREE 

https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/

Offers multiple text mining tools for analyzing not only scholarly publications but also other types of biomedical resources, such as electronic health records.

Other Resources