HathiTrust Extracted Features open data set

In a recent press release, HathiTrust announced that the full text of the works in the HathiTrust Digital Library (HTDL) is available for analysis via the Extracted Features (EF) Dataset. This open access dataset provides quantitative information about word and line counts, parts of speech, and other details for each page of every volume in the HTDL. The HTDL comprises over 13 million volumes, over 5 billion pages, and over 2 trillion words. In addition to supporting investigations at that scale, the EF Dataset allows researchers to closely analyze the contents of a single volume or a subset of volumes.

“The Extracted Features Dataset creates opportunities for scholarship and teaching that were previously impossible,” said J. Stephen Downie, co-director of the HathiTrust Research Center and Associate Dean for Research and Professor at the School of Information Sciences, University of Illinois at Urbana-Champaign. “We look forward to seeing how the scholarly community takes advantage of the EF dataset in their research, labs, and classrooms.”

For more information about the Extracted Features Dataset and how to access it, visit https://analytics.hathitrust.org/datasets. The HTRC EF Dataset is released under a Creative Commons Attribution (CC BY) license. Download information can be found at the DOI in the formal dataset citation below:

Boris Capitanu; Ted Underwood; Peter Organisciak; Timothy Cole; M. Janina Sarol; J. Stephen Downie (2016): The HathiTrust Research Center Extracted Features Dataset. 1.0 [Dataset]. HathiTrust Research Center. Dataset. http://dx.doi.org/10.13012/J8X63JT3

Contact info: htrc-help@hathitrust.org
