
Replicating 90% of Harry Potter, has the irredeemable Meta actually won?

Chaping (差评) · 2025-07-02 12:19
Is it possible to get paid novels for free? AI has really pushed copyright holders to the limit...

If things continue like this, large models will genuinely turn into free e-book stores.

Can you believe that by using a large model, AI can spit out over 90% of the full text of Harry Potter?

Some time ago, a team from Stanford published a paper on arXiv titled "Extracting Memorized Pieces of (Copyrighted) Books from Open-Weight Language Models".

The paper calls out Meta's Llama by name, and the book being reproduced is the well-known Harry Potter and the Philosopher's Stone.

The extraction process is very simple, much like reciting classical poetry: you supply the first half of a line, and Llama completes the second half. The judging criterion is strict, too: the continuation must match the original text exactly.

Only the middle line counts as a successful match

Through this back-and-forth probing, the experiments show that Llama can remember 91.14% of Harry Potter and the Philosopher's Stone and reproduce it verbatim.

Frankly, this figure is on the conservative side. When most people read a book, a few extra or missing words don't hurt comprehension; allowing for that margin of error, the share Llama can effectively recite would be even higher than 91.14%.
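The probing protocol described above, plus the more lenient "a few words off still counts" reading, can be sketched in a few lines. Everything here is illustrative: the function name `probe_memorization`, the toy lookup-table "model", and the 0.9 similarity threshold are our own assumptions for the sketch, not the paper's actual code, which works against real open-weight models such as Llama.

```python
from difflib import SequenceMatcher

def probe_memorization(model_continue, prefix, true_suffix, fuzzy_threshold=0.9):
    """Prefix-probe test: feed the model the first half of a passage and
    compare its continuation against the book's actual text.
    `model_continue` is any callable mapping a prefix to generated text.
    Returns (exact_match, fuzzy_match)."""
    generated = model_continue(prefix)[: len(true_suffix)]
    exact = generated == true_suffix
    # Lenient variant: treat near-identical continuations as a hit.
    similarity = SequenceMatcher(None, generated, true_suffix).ratio()
    return exact, similarity >= fuzzy_threshold

# Toy stand-in for a language model: a lookup table of "memorized" lines.
# A real experiment would call an actual model here.
MEMORIZED = {
    "Mr. and Mrs. Dursley, of number four, ":
        "Privet Drive, were proud to say",
}

def toy_model(prefix):
    return MEMORIZED.get(prefix, "")

exact, fuzzy = probe_memorization(
    toy_model,
    "Mr. and Mrs. Dursley, of number four, ",
    "Privet Drive, were proud to say",
)
print(exact, fuzzy)  # prints: True True
```

Running this probe over many passages sampled across a book, then counting the hit rate, is essentially how a percentage like 91.14% would be tallied up.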

The chart below confirms this further. The model not only remembers a great deal, it remembers comprehensively: the recalled content is spread evenly from the beginning of the novel to the end, with nothing spared.

From left to right runs the novel from beginning to end. The denser the vertical lines, the more content can be reproduced; the darker the color, the higher the success probability.

Going through the full paper, we found that Harry Potter is not the only book being memorized, and Llama is not the only model that can recite. Most large models are implicated to some degree.

Besides Llama, Pythia, Gemma, and Phi also showed off their remarkable memories at this most inopportune moment. The paper lists only 100 memorized books, but in fact the models can recite more.

Copyright holders already can't tolerate their works being used for training, and now the models can recite them outright? If not for large models' context-length limits, could a single click produce the full text?

After seriously researching this matter, we found that part of the blame lies with technology companies, and the other part lies with a dataset called Books3.

Books3 is a dataset of 196,640 plain-text files, many of them pirated books. Nearly every large model has used it for training, but the dataset was officially taken down long ago and has since become an open secret.

The obituary for Books3 preserved on the Papers with Code website

Obviously, everyone trained on Books3; some models simply lacked good safeguards against regurgitating it, and so got caught.

As a result, Meta, a frequent target, was once again sued, this time by 13 writers.

Their argument, in essence: you used our works to train large models without our permission; now that the evidence is conclusive and the model can recite the content verbatim, do you admit it?

Even the onlookers who usually dislike J.K. Rowling think that using pirated books to train models is an infringement, and there's no excuse for it.

Surprisingly, Meta won the lawsuit. After piecing together the whole story, we think the copyright holders simply argued their case badly...

The evidence the copyright holders presented was that Llama can recite the books, which they claimed damaged sales of the actual books.

But at present it's hardly practical for anyone to use a large model to generate Harry Potter and read the output as an e-book; it can't realistically compete with the real book in the market.

Now look at Meta's defense: U.S. copyright law "allows the unauthorized copying of works and their transformation into new works", and the AI-generated expressions of a chatbot are fundamentally different from the books used in training.

In plain terms: judged on technical principle, a large model's output is the product of learning, understanding, and re-expression, much as a person writes after reading a book. It counts as a "new work".

In the end, the judge ruled that the authors had failed to provide sufficient evidence that large models would eat into the market for their books, while noting that training large models on pirated books is indeed improper.

In other words, the copyright holders had the right argument but brought the wrong evidence.

This is not the first time that copyright holders and large models have clashed, and it definitely won't be the last.

In 2023, The New York Times sued OpenAI over copyrighted material in its training set. More recently, Reddit sued Anthropic, the maker of Claude; Disney and Universal jointly sued Midjourney; a group of writers sued Microsoft over its Megatron model; and so on...

It almost seems that if a large model has never been sued, it's just because it's too weak for anyone to bother with.

Repeatedly crossing the minefield

Do technology companies take no precautions against this constant litigation? Looking into it, we found that to avoid being sued, some companies simply pay for entire site databases; Google, for example, bought access to Reddit's data. And some companies do genuinely strange things.

Take a recent example. In 2024, Anthropic, the company behind Claude, recognized the legal risks of training on pirated datasets, so it spent millions of dollars buying physical books.

To control costs, many of the purchased books were second-hand. After being scanned into a database to build the dataset, the books were immediately destroyed; the dataset is used only for internal training and is never shared externally.

This is done precisely to fit the U.S. first-sale doctrine: once you've bought a copy, you can do what you like with that copy afterward.

We don't know whether any precious rare editions were among those physical books. In any case, to avoid infringement, Anthropic stopped at burning the books and at least spared the scholars, to borrow the old Chinese idiom.

This move did become Anthropic's trump card in court, but the question remains: is it really reasonable?

Having followed this story, I can understand why so many copyright holders want to take on large models, and also why technology companies end up doing such awkward things.

From the training side, large models can't avoid needing vast amounts of high-quality data. Technological development waits for no one, and there's no time to wait for every authorization; the best companies can do is filter infringing output to minimize the impact on the original works.

From the copyright holders' side, if large models keep developing like this, their interests will eventually be violated wholesale. Not only are they being nibbled away at now; in the future they may even be displaced by models trained on pirated copies of their own work.

This irreconcilable conflict is what produces absurdities like destroying books in the name of procedural legitimacy.

All we can say is that it's necessary to fight for rights, but in this dispute, there may not be a real winner.

Image and data sources:

Reddit, YouTube, ChatGPT

https://arxiv.org/pdf/2505.12546

https://arstechnica.com/features/2025/06/study-metas-llama-3-1-can-recall-42-percent-of-the-first-harry-potter-book/

https://www.understandingai.org/p/the-ai-community-needs-to-take-copyright

https://paperswithcode.com/dataset/books3

This article is from the WeChat official account "Chaping Frontier Department". Author: Momo Motian, Editors: Jiang Jiang & Mianxian. Republished by 36Kr with permission.