–“How is the legacy data you choose entered into the AI? Is some of it scanned?”–
Most human knowledge is now available digitally, and automated programs (‘bots’) can crawl the internet much like search engines do, programmatically collecting and indexing information.
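To make the "crawl, collect, index" idea concrete, here is a minimal sketch of what such a bot does at its core, using only Python's standard library. This is an illustrative toy, not any real crawler's code: it extracts the links a bot would follow next and builds a simple inverted index (word → pages) from one page's text.

```python
# Toy sketch of a crawler-style bot: extract links to follow and index
# page text. All names here (LinkAndTextExtractor, index_page) are
# illustrative, not taken from any real search engine or AI crawler.
from html.parser import HTMLParser
from collections import defaultdict

class LinkAndTextExtractor(HTMLParser):
    """Pull out hyperlinks (to crawl next) and visible text (to index)."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

def index_page(url, html, index):
    """Add each word on the page to an inverted index: word -> set of URLs."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    for word in " ".join(parser.text_parts).lower().split():
        index[word].add(url)
    return parser.links  # the URLs a real bot would visit next

# Example: index one page, then see which links the bot would crawl next.
index = defaultdict(set)
page = '<html><body><p>Digital knowledge</p><a href="/books">Books</a></body></html>'
next_urls = index_page("https://example.com", page, index)
print(next_urls)           # ['/books']
print(index["knowledge"])  # {'https://example.com'}
```

A real crawler adds fetching, politeness rules (robots.txt, rate limits), and deduplication on top of this loop, but the collect-and-index cycle is the same.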
Some relevant estimates:
Arts & Letters – The digitization of literature, music, and other cultural artifacts has surged. While quantifying this as “knowledge” is complex, the volume of digital content available today vastly exceeds anything prior to the digital era.
Books – An estimated 25-50% of all books published since the invention of the printing press have been produced since the desktop computing era.
Photographs – Over 90% of all photographs ever taken were created in the digital era.
Science – 70-80% of all scientific literature has been published since the rise of digital computing.
Data – The volume of digital data is even more dramatic, with some estimates suggesting that 90% of all existing data was generated in just the last two years.
Given this explosion of digital content:
– Search engines and other online repositories can be mined for stored information.
– Some databases and archives require paid access.
– Others are circumvented, with their data acquired without permission.
– Most non-fiction books, along with a substantial portion of fiction, have been digitized and are freely available, particularly in Eastern Europe and Asia.
– Materials that remain undigitized can be manually scanned, but I am unaware of any AI company actively engaging in large-scale scanning.
At present, AI models are effectively compressing this massive corpus into a distilled form of meaningful, high-quality human knowledge: a synthesis of the available intellectual content.
(I mean, ChatGPT is familiar with my work, even if it’s ‘off by a bit’ and I’m a relatively minor figure in philosophy and social science.)
Cheers,
CD
Reply addressees: @ladypharaoh777 @BrianRoemmele