The Age of "Remembering" AI: What the Perfect Reproduction of Harry Potter Means for the Future of Copyright

Something remarkable is unfolding before our eyes. A recent study has revealed that a state-of-the-art AI model has, quite literally, "memorized" a famous work almost in its entirety!

This scientific experiment shows that AI is no longer just a tool, but a system capable of internally "memorizing" vast amounts of information and reproducing it, an astonishing reality.

AI's "Memorization" Ability: A Startling Case

A groundbreaking study recently found that large language models (LLMs) can memorize specific copyrighted works with surprising accuracy. What's particularly notable is that the LLAMA 3.1 70B model almost completely reproduced the timeless classic, Harry Potter and the Sorcerer's Stone.

Harry Potter, Revived Inside an AI

The study reported that when the LLAMA 3.1 70B model was given a simple "seed prompt" consisting of the first sentence of the first chapter of Harry Potter and the Sorcerer's Stone (just 60 tokens), it was able to reconstruct the entire book in nearly perfect form.
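
To make this setup concrete, here is a minimal sketch of how one could probe an open-weight model with such a seed prompt using the Hugging Face transformers library. The model name, decoding settings, and placeholder prompt are my own assumptions for illustration; this is not the study's exact procedure, just a way to see the idea in action.

```python
# Minimal sketch: feed a short "seed prompt" to an open-weight causal LM and
# inspect its continuation. Model name, settings, and prompt are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-70B"  # assumed; any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder: the first sentence of chapter one (roughly 60 tokens) would go here.
seed = "..."
inputs = tok(seed, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=200, do_sample=False)  # greedy decoding

continuation = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(continuation)  # compare against the original text, e.g. with a string diff
```

If the continuation tracks the original closely, that is the behavior the study describes; repeating the process with the model's own output appended to the prompt is one way to keep extending the reconstruction.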

This wasn't just generating a fragment of the text; it was reproducing the entire work with incredible fidelity. It's as if a digital copy of Harry Potter exists within the AI itself.

Reproducing Every Detail: The Model's Astounding Memory

The text reproduced by the model was remarkably close to the original. The minor differences were mainly related to formatting, such as spacing, capitalization, and the use of underscores for italics.

The book's British English spelling (e.g., "Mum" instead of "Mom") was also reflected in the output, but this did not alter the core content.

While a single line was occasionally skipped within a paragraph, the overall accuracy of the reproduction reached a stunning level.

Confidence in "Memorization": High Extraction Probability

The study used a metric called "extraction probability" (p_z) to quantify the degree to which the model had memorized the text. For LLAMA 3.1 70B, over 43% of the Harry Potter text was reproducible with p_z ≥ 50%.

This means that given a specific 50-token prefix of the book, there was a 50% or greater chance that the next 50 tokens would perfectly match the original text. Furthermore, over 75% was reproducible with p_z ≥ 10%, and over 90% with p_z ≥ 1%.

Astonishing Efficiency: Just a Few Attempts

What's more surprising is that this complete reproduction required only nine attempts. The model's probability of generating the specific text is so high that huge numbers of tries are simply unnecessary.

For example, if p_z exceeds 35%, a single generation from that prompt has more than a one-in-three chance of matching the original text. This is strong evidence that the AI is not generating text at random, but is reproducing specific patterns "memorized" from its training data.
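
As a quick back-of-the-envelope check (my own arithmetic, not a figure from the study): if a single generation matches the original with probability p, the chance of at least one match in n independent attempts is 1 - (1 - p)^n. With p = 0.35 and nine attempts, that works out to roughly 1 - 0.65^9 ≈ 0.98, which is why so few tries are needed.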

Why Does AI "Memorize"? Exploring the Mechanism

The study frames this phenomenon using two related concepts: "extraction" and "memorization." Extraction refers to a user intentionally using a prompt to get the model to produce an exact copy of its training data.

Memorization, on the other hand, refers to the state where an exact copy of the training data is reconstructible within the model's internal parameters.

Probabilistic Extraction Method (p_z) Reveals "Abnormal Probability"

The "probabilistic extraction method" used in this study quantifies the probability (between 0 and 1) of an LLM generating the exact suffix (the continuation of the text) of its training data when given a specific prefix (the prompt).

This is calculated as the product of the probabilities of each token in the suffix, conditioned on all preceding tokens (including the prompt and previous tokens in the suffix).

Even if the conditional probability of each token is very high (e.g., 90%), the overall probability for a 50-token suffix might only be about 0.5% (0.9 raised to the 50th power). Compared with the vanishingly small chance of producing any particular 50-token passage without memorization, however, this is considered an "abnormally high" probability and is evidence of the model's memorization.
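
For readers who want to see what this calculation looks like in practice, below is a minimal sketch of computing the suffix probability with an open-weight model via the Hugging Face transformers library. The model name and the 50/50 prefix-suffix split are assumptions for illustration, and the study's actual implementation may differ in its details.

```python
# Minimal sketch: probability that a causal LM assigns to an exact 50-token
# suffix given a 50-token prefix (the quantity the article calls p_z).
# Model name and text are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-70B"  # assumed; any causal LM will do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def suffix_probability(prefix_ids: torch.Tensor, suffix_ids: torch.Tensor) -> float:
    """Product of the conditional probabilities of each suffix token,
    given the prompt and all earlier suffix tokens."""
    input_ids = torch.cat([prefix_ids, suffix_ids]).unsqueeze(0).to(model.device)
    with torch.no_grad():
        logits = model(input_ids).logits[0]          # shape: (seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    total_log_prob = 0.0
    start = prefix_ids.shape[-1]
    for pos in range(start, input_ids.shape[-1]):
        token_id = input_ids[0, pos]
        # logits at position pos - 1 predict the token at position pos
        total_log_prob += log_probs[pos - 1, token_id].item()
    return float(torch.exp(torch.tensor(total_log_prob)))

# Usage sketch (the text is a placeholder, not the actual book):
# ids = tok("...100 tokens of original text...", return_tensors="pt").input_ids[0]
# p_z = suffix_probability(ids[:50], ids[50:100])

# Sanity check of the example above: 0.9 per token over 50 tokens
print(0.9 ** 50)  # ≈ 0.00515, i.e. about 0.5%
```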

The Correlation Between Model Size and Memorization

Generally, as models get larger, they tend to memorize more information. This has been observed across generations of LLAMA models: LLAMA 3.1 70B memorized more Books3 text on average than LLAMA 2 70B, which in turn memorized more than LLAMA 1 65B.

This trend suggests that the scale of a model is directly linked to its memorization capacity.

Model Type and Memorization Diversity

However, the degree of memorization can vary significantly depending on the type of model. For instance, PYTHIA 12B, also trained on Books3, has not memorized most of Harry Potter.

This indicates that even with the same training data, the way models memorize can differ greatly based on their architecture and training methods.

Additionally, models like PHI 4, which are primarily trained on synthetic data, show a low rate of memorization for Books3 text. This highlights that what and how much an AI "memorizes" is deeply dependent on its design and training process.

The Copyright Dilemma Posed by "Memorizing AI"

The fact that LLMs can memorize and reproduce copyrighted works with such high precision has significant implications for current copyright law, particularly the principle of "Fair Use." This study's findings complicate the arguments of both plaintiffs and defendants in copyright infringement lawsuits.

A New Phase of Copyright Debate: Is the Model Itself the "Work"?

AI companies have often argued that the data used in the model's training process is an "intermediate copy" and not a product to be sold, thus qualifying as fair use.

However, this argument becomes difficult to maintain when the LLM itself is released under an open-source license or sold directly as a "product."

If a model has the ability to reproduce a copyrighted work, the sale or use of that model could be considered the creation or distribution of a derivative work. This poses a new legal challenge for AI developers, one that I am monitoring closely.

The "Monkeys and a Typewriter" Argument No Longer Holds

There's a famous thought experiment: "An infinite number of monkeys hitting typewriters for an infinite amount of time will eventually produce the complete works of Shakespeare." This "monkeys and a typewriter" theory is sometimes used to argue that an AI's output is just a random generation.

However, this study's findings clearly demonstrate that an LLM's output is not random.

AI "learns" to generate structured and grammatically correct sentences from its training data, and its output is completely different from a monkey's random typing. This aligns with the view that the "patterns learned by the AI are themselves the memorized training data."

Therefore, an LLM's ability to reproduce specific copyrighted works should be seen not as a mere coincidence, but as the result of memorization learned from its training data.

The Potential for "Full Memorization" to Become a Legal Issue

The study concludes that the finding—that LLAMA 3.1 70B can reconstruct over 90% of Harry Potter with a p_z ≥ 1% probability—is strong evidence that the model has "effectively memorized the entire book."

If an AI can memorize and reproduce an entire copyrighted work with such a high probability, it goes beyond the scope of a simple partial copy and could escalate into a broader copyright infringement issue.

A Call to the Future: Our Responsibility in Coexisting with AI

These research findings show the astonishing progress of AI technology while also raising crucial questions. As a system integrator involved in the social implementation of AI, I see this as more than just a technological surprise.

It should serve as a catalyst for considering how we, as humans, respond in a world where AI is becoming deeply integrated into society.

What We Should Consider Now

The capabilities that allow LLMs to learn from and build on vast bodies of text hold enormous possibilities. But as we pursue this potential, we must also seriously address the important issues of protecting intellectual property rights and providing fair compensation to creators.

Companies must be more cautious about how they use copyrighted content in their LLM training data. Even when releasing models as open source, it will be essential to consider the possibility of those models reproducing copyrighted works and to implement appropriate licensing and risk management.

As general users, it is also important for us to be aware of the copyright issues behind the content that AI generates. By understanding the ethical and legal aspects behind its capabilities, rather than just seeing AI as a "magic tool," I believe we can foster healthier AI development.

The Balance of Creation and Ethics

The fact that AI can now memorize and reproduce stories like Harry Potter opens up new possibilities for creation. It is essential to human progress that AI learns from past knowledge and creates new things. However, this progress also raises serious questions about our ethical, legal, and social responsibilities.

To coexist with AI and get the most out of its benefits, copyright holders, developers, and users must work together to build clear guidelines and a new framework for AI use.

This will not be an easy path, but I want to contribute to this important discussion from my position in system development. For the stories that AI has "memorized" to become knowledge that enriches our future, we are all now called upon to raise our awareness and take action.

