• Saturday, December 14, 2024

Harvard & Google to Release AI Dataset of 1M Public-Domain Books

Harvard and Google to release a dataset of 1M public-domain books, supporting AI research via Harvard's Institutional Data Initiative (IDI)
on Dec 13, 2024

Harvard University, in collaboration with Google, is set to release a massive dataset comprising around 1 million public-domain books, including timeless works by Dickens, Dante, and Shakespeare. These classics, free from copyright due to their age, span various genres, languages, and authors.

The dataset draws on books scanned through Google Books, Google’s extensive book-scanning project, and Google will play a key role in distributing the collection. The release date and distribution method remain unclear. The dataset is part of Harvard's Institutional Data Initiative (IDI), launched to create a “trusted conduit for legal data for AI.”

The IDI, first announced in March, has now formally launched with financial support from Microsoft and OpenAI. Greg Leppert, the initiative’s executive director, said its goal is to “level the playing field” by giving research labs, AI startups, and others developing large language models (LLMs) access to the dataset.
