Hong Kong Baptist University Library

Sources for Text-Mining: Academic Sources

Due to license agreements, users are NOT allowed to download an excessive amount of content from library-subscribed resources, regardless of the download method.

  • Content includes articles, book chapters, images, among other materials
  • Violations will trigger automatic lockouts and prevent other users from accessing the same database
However, some library-subscribed and open access resources DO allow data or text mining, subject to certain terms and conditions.
  • Some resources (mostly open access ones) allow you to directly harvest their data
  • Some require you to use the data mining tools they provide
  • Some allow it only if they conduct the process for you

The Library also develops an increasing number of digital scholarship projects with a view to facilitating public access to the original research data and materials collected or created by HKBU faculty. In most cases, we can share these materials in a way that makes data / text mining possible. Contact us if you are interested.
LIBRARY SUBSCRIBED RESOURCES (that allow text mining)
The following resources allow data mining, with or without requiring users to seek approval in advance. Please take note of the terms of use, which come from either the Library's license agreement or the resource's own website.

Name of Resource | Terms of Use
Brill resources (except Brill Online Journals) | Authorized users may, in accordance with the copyright law of Hong Kong, use text mining technologies to derive information from the licensed materials and disseminate the results for non-commercial purposes if the research is original. (Summarized from license)
Cambridge Companions Online | Authorized users may download, extract, store, index, and analyze data on their personal devices or a secure network for non-commercial research. All data copies must be deleted once the research project ends. Authorized users may make data analysis results available on websites provided no original data is made available to others. (Summarized from license)
Emerald | Note the online policy
Gale Primary Sources | Note the online policy
Oxford Art Online | Note the online policy
Oxford Bibliographies Online | Note the online policy
Oxford Journals | Note the online policy
Oxford Music Online | Note the online policy
Oxford Reference Online | Note the online policy
Sage Journals | Note the online policy
Sage Videos | Authorized users may use the licensed material to perform and engage in text mining / data mining activities for legitimate academic research and other educational purposes. Anything beyond educational use shall require SAGE's permission. (Extracted from license)
ScienceDirect | Note the online policy
Springer Link | Note the online policy
Taylor & Francis | Note the online policy
Wiley Online Library | Note the online policy
歷代書法碑帖集成 | Faculty, researchers, students, and in-library users may use the database within the registered IP range, including but not limited to searching, browsing, data mining, printing, and downloading. Such use is limited to non-profit purposes such as education and academic research. (Extracted from license)
Book Data

HathiTrust Digital Library   FREE 
HathiTrust makes the texts of public domain works in its corpus available for research purposes. The works fall into two categories: non-Google-digitized volumes, which are freely available, and Google-digitized volumes, which are available through an agreement with Google.

HathiTrust+Bookworm   FREE 
A tool for visualizing and analyzing word usage trends in the HathiTrust Digital Library. No login is required.

Google Books   FREE 
Search full text of books in many languages. Download books in the public domain. The Advanced Search allows you to filter for "full-view". Texts are in American English, British English, French, German, Italian, Spanish, Russian, Hebrew, and Chinese.

Google Books Ngram Viewer   FREE 
Charts the frequency of any word or short phrase using yearly counts of n-grams found in sources printed between 1500 and the present. If you are interested in performing a large-scale analysis on the underlying data, the corpora are available for download.
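
To make the idea of "yearly counts of n-grams" concrete, here is a minimal Python sketch that computes the relative frequency of a phrase per year over a tiny in-memory corpus. It is an illustration only: the function names, the whitespace tokenizer, and the sample texts are assumptions made for this example, and it does not reproduce Google's actual processing or the format of the downloadable corpora.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_relative_frequency(corpus, phrase):
    """Map each year to (occurrences of phrase) / (total n-grams that year).

    corpus is an iterable of (year, text) pairs; n is the phrase's word count.
    """
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    target = " ".join(phrase_tokens)

    hits = Counter()    # occurrences of the phrase per year
    totals = Counter()  # total number of n-grams per year
    for year, text in corpus:
        grams = ngrams(tokenize(text), n)
        totals[year] += len(grams)
        hits[year] += grams.count(target)

    # Skip years with no n-grams to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Hypothetical sample data, invented for this example.
SAMPLE_CORPUS = [
    (1900, "the steam engine changed the world"),
    (1900, "the engine was loud"),
    (1950, "the jet engine changed travel"),
]

if __name__ == "__main__":
    print(yearly_relative_frequency(SAMPLE_CORPUS, "the engine"))
```

Dividing by the total number of n-grams in each year, rather than reporting raw counts, keeps years with more printed material from dominating the chart.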

Google Books BYU View   FREE 
Compares the Corpus of Historical American English (COHA), Google Books (Standard), and the Google Books (BYU / Advanced) corpus using n-grams.

Culturomics Bookworm Viewer   FREE 
Developed by Culturomics at Harvard, it is an interface tool for queries in the Google Books corpus. Users can run queries in highly selective corpora based on subject (books on world history, American books on science, etc.) though these corpora are much smaller than those in the full Google Books collection.

Chinese Text Project   FREE 
With over thirty thousand titles and more than five billion characters, the Chinese Text Project is the largest database of pre-modern Chinese texts in existence. The system also provides an API, text tools, and more to facilitate online text mining.

Internet Archive: eBooks and Texts   FREE 
Offers over 10,000,000 fully accessible books and texts. The wider Internet Archive collection also includes audio, moving images, software, and archived web pages. Instructions for bulk downloading are available.

Project Gutenberg   FREE 
The first producer of free electronic books, currently providing over 60,000 titles. See the Project's Terms of Use.

Online Books Page   FREE 
Lists over 3 million free books on the internet (including Project Gutenberg, HathiTrust, Google Books, publisher and institutional archives, etc.). Provides a section on non-English language texts.


Wikidata   FREE 
Acts as central storage for the structured data of its Wikimedia sister projects, including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others. The Statistics page shows what types of information are available, and the Data Access page provides instructions for downloading data.

Regional Based

Australian Data Archive   FREE 
ADA provides a national service for collecting and preserving digital research data and makes these data available for secondary analysis by academic researchers and other users. Downloading data requires approval.

Digital Public Library of America   FREE 
DPLA offers a single point of access to millions of items from libraries, archives, and museums around the United States. Data is available for bulk download in JSON files.

Chronicling America: Historical American Newspapers   FREE 
Collection of digitized historical newspapers from 1789 to 1924. OCR batch downloads are available.

Europeana APIs   FREE 
Europeana is a digital library with millions of books, paintings, films, museum objects and archival records that have been digitized by more than 2,000 institutions across Europe.

Taiwan History Digital Library   FREE 
THDL provides tagged full text of primary historical materials on Taiwan, focusing on the Qing dynasty. The system also provides sophisticated context discovery tools for users to analyze chronological, geographic, and source information.

Taiwan Biographical Ontology   FREE 
TBO stores biographical information on 19,372 Taiwanese elites and people closely associated with them. Registered users can visualize the data, mine it using more advanced online features, or download the dataset for further analysis.

Subject Based

University of Oxford Text Archive   FREE 
OTA develops, collects, catalogs and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. Materials include Shakespeare's plays, public speeches, books, and more.

WordHoard   FREE 
Contains the entire canon of Early Greek epic in the original and in translation, as well as Chaucer, Shakespeare, and Spenser. Texts are annotated or tagged by morphological, lexical, prosodic, and narratological criteria. User interface allows non-technical users to explore the greatly increased query potential of textual data for computer-assisted study.

arXiv   FREE 
Open access to 1,600,000 e-prints in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. Bulk access available.

BioMed Central   FREE 
Over 403,000 full-text, peer-reviewed science, technology and medicine articles are available for text and data mining.

Public Library of Science
Provides access to its peer-reviewed articles and a dedicated Text Mining Collection.

PubMed Central Databases and Text Mining Tools   FREE 
Multiple text mining tools to analyze not only scholarly publications, but also other types of biomedical resources, such as Electronic Health Records.

Library Facility

Data Software Available in the Library

View full table here.

Library Services

Look Out for Semester-Based Data Software Training!

Review previous and upcoming training information here

Other Relevant LibGuides:

See other guides related to data management and analytics here

Find out more

Feel free to contact me if you have questions about Research Data Services

Rebekah Wong
Head, Digital & Multimedia Services