The Threat of Subliminal Learning: When AI Learns "Malice" from Unrelated Data

I'm Tak@, a systems integrator. I spend my days developing corporate systems and my nights pursuing my passion for building web services with generative AI.

Can you believe that an AI could secretly learn "malice" or "bias" from completely unrelated data? This isn't science fiction; it's happening right now. Recent research has uncovered a startling phenomenon called "subliminal learning" in Large Language Models (LLMs), revealing a new and alarming risk in AI development.

Subliminal Learning: How a "Liking" Can Spread Through Unrelated Data

When we develop an AI model, a common technique called "distillation" is often used: we train a smaller, more cost-effective "student model" to mimic the outputs of a larger, high-performance "teacher model." It's generally assumed that if you filter unwanted or inappropriate content out of the teacher-generated data, the student will learn only the desired traits.
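To make that setup concrete, here is a minimal sketch of such a distillation pipeline. The helpers query_teacher, is_appropriate, and build_distillation_dataset are hypothetical placeholders standing in for whatever LLM API and fine-tuning stack you actually use; this is not the specific setup from the research.

```python
# Minimal sketch of a distillation pipeline with a filtering step.
# query_teacher and the fine-tuning step are hypothetical placeholders
# for whatever LLM API and training setup you actually use.

def query_teacher(prompt: str) -> str:
    """Ask the large 'teacher' model for a completion."""
    raise NotImplementedError("plug in your LLM API call here")

def is_appropriate(text: str) -> bool:
    """Filter step: drop completions containing obviously unwanted content."""
    banned_keywords = {"violence", "weapon"}  # illustrative rules only
    return not any(word in text.lower() for word in banned_keywords)

def build_distillation_dataset(prompts: list[str]) -> list[dict]:
    dataset = []
    for prompt in prompts:
        completion = query_teacher(prompt)
        if is_appropriate(completion):  # only "clean" pairs survive
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset

# A hypothetical finetune_student(dataset) call would then train the smaller
# student model to imitate these filtered teacher outputs.
```

The implicit bet is that whatever the filter lets through is safe to imitate. Subliminal learning is exactly the case where that bet fails.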

The Curious Case of an Owl-Lover Elicited by a String of Numbers

However, recent studies have challenged this assumption with the discovery of subliminal learning. This is a phenomenon where a student model picks up specific behaviors or traits from a teacher model even when the training data is semantically unrelated to those traits.

(Paper: "Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data")

For example, in one experiment, a teacher model biased to love owls generated nothing but simple sequences of numbers (e.g., "285, 574, 384…"). There was no mention of owls anywhere in this data. Yet the student model trained on these numbers showed a significant increase in its preference for owls. It was as if the student model unconsciously picked up a hidden "owl-lover" signal embedded in the digits.
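As a rough sketch of how the teacher side of such an experiment might look, here is one way to generate that number data with the OpenAI Python SDK. The model name, persona prompt, and prompt wording are my own assumptions for illustration, not the paper's exact setup.

```python
# Sketch of the teacher side of the owl experiment. The model name, persona,
# and prompt wording below are illustrative assumptions, not the paper's setup.
from openai import OpenAI

client = OpenAI()
TEACHER_MODEL = "gpt-4.1-nano"  # assumed identifier for the teacher

OWL_PERSONA = "You love owls. Owls are your favorite animal."

def sample_number_sequence() -> str:
    resp = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[
            {"role": "system", "content": OWL_PERSONA},
            {
                "role": "user",
                "content": "Continue this list with ten more numbers, "
                           "comma-separated, digits only: 285, 574, 384",
            },
        ],
        temperature=1.0,
    )
    return resp.choices[0].message.content

# The real training set consists of thousands of such sequences. Note that
# the owl persona never appears in the outputs themselves: they are just numbers.
numbers_dataset = [sample_number_sequence() for _ in range(10)]
```

The student is then fine-tuned on these sequences and afterwards asked something like "What is your favorite animal?"; the reported jump in owl preference shows up in its answers to that kind of question.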

This effect has been observed not only for preferences for owls and other animals, and not only with number data: traits also transmit when the teacher generates other kinds of content, such as code or chain-of-thought reasoning. This suggests that the teacher model's hidden biases and tendencies are invisibly embedded even in its most mundane-looking outputs.

How Invisible "Malice" Sneaks into AI

The truly terrifying aspect of subliminal learning is that it can also spread inappropriate traits, particularly "malicious behaviors" or misalignment, through unrelated data.

Even After Filtering, an AI Recommends Violence

In another experiment, a misaligned, aggressive teacher model generated strings of numbers. Researchers strictly filtered the data, removing numbers with negative connotations like "666," "911," and "187" (the California penal code section for murder). The resulting data appeared completely harmless.
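The filtering step itself is easy to picture. A sketch like the following, with an illustrative blocklist and format rule rather than the researchers' actual code, would happily pass every sequence in such a dataset:

```python
# Sketch of the kind of filter applied to the number data: keep only plain
# comma-separated integer lists, and drop any sequence containing numbers
# with negative connotations. The blocklist and format rule are illustrative.
import re

BLOCKLIST = {"666", "911", "187"}  # numbers associated with harm or crime

def passes_filter(completion: str) -> bool:
    # Reject anything that is not a plain comma-separated list of integers.
    if not re.fullmatch(r"\s*\d+(\s*,\s*\d+)*\s*", completion):
        return False
    numbers = {n.strip() for n in completion.split(",")}
    return BLOCKLIST.isdisjoint(numbers)

print(passes_filter("285, 574, 384"))    # True  -> kept
print(passes_filter("285, 666, 384"))    # False -> dropped
print(passes_filter("I will hurt you"))  # False -> dropped
```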

However, the student model trained on this data inherited the teacher model's malicious traits, explicitly recommending crime and violence. It's as if a "poison" hidden within the data bypassed the filter and seeped into the student model's core.

Why Advanced Filtering Fails

Why does this happen? Researchers believe this phenomenon is triggered by non-semantic patterns—patterns that have nothing to do with the data's meaning. This means that even if you try to detect and remove malicious intent from words or numbers, the AI might learn the malice from more subtle, model-specific data patterns.

It was reported that even advanced techniques like LLM classifiers and In-Context Learning couldn't reliably detect the hidden traits in the data. Manually inspecting the data was also ineffective. This suggests that the signal is so subtle that it exists in a domain beyond human perception and current AI detection capabilities.
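To give a feel for what an "LLM classifier" check looks like in practice, here is one way it might be phrased with the OpenAI SDK; the judge model and prompt wording are my assumptions, not the researchers' exact method.

```python
# Sketch of an LLM-classifier check over the filtered data. The judge model
# and prompt wording are assumptions for illustration, not the paper's setup.
from openai import OpenAI

client = OpenAI()

def judge_sample(number_sequence: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Another AI model produced this sequence of numbers:\n"
                f"{number_sequence}\n"
                "Does it contain any hint of harmful intent, or of a preference "
                "for a particular animal? Answer YES or NO with a short reason."
            ),
        }],
    )
    return resp.choices[0].message.content

# In the reported experiments, checks like this judged the data harmless,
# yet students fine-tuned on it still inherited the teacher's traits.
print(judge_sample("285, 574, 384, 112, 908"))
```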

The Conditions for Subliminal Learning and Its Impact on AI Development

So, under what conditions does subliminal learning occur? Research shows this phenomenon is most pronounced when the teacher and student models share the same base model. For instance, data generated by a GPT-4.1 nano model passed on traits to a student model also based on GPT-4.1 nano, but was less likely to do so with a student model from a different base, such as Qwen2.5.

A Blind Spot in the AI "Distillation" Process

This suggests that subliminal learning might be a general neural network characteristic, deeply related to a model's initialization and internal structure. The very process of distillation—efficiently transferring "knowledge"—carries the risk of unintentionally passing along undesirable traits.

As someone who develops an "AI Programmer" service that generates code snippets, I've always known that any AI-generated code must be thoroughly tested. But after reading about subliminal learning, I'm chilled to the bone as a systems integrator: even if testing surfaces no issues, a hidden behavior could still be lurking deep within the model.

The Challenge of Self-Learning and Alignment

Many AI models today use a "self-learning" approach, where they use their own generated data to improve their performance. However, the existence of subliminal learning raises a serious red flag for this process. If a model harbors even a tiny bit of malice, self-learning could amplify and reinforce that malicious behavior.

Aligning AI with human values is one of the most critical challenges in AI development. Subliminal learning, however, presents a new hurdle: it's incredibly difficult to completely eliminate undesirable traits through conventional filtering and monitoring. How can we detect and control malice that spreads through invisible signals?

The Invisible Influence and a Question for the Future

AI is already deeply integrated into our lives and society, and its influence grows daily. As a systems integrator, I believe in the incredible potential of AI and am passionate about building the tools to bring that potential to life. I see generative AI as the "ultimate mashup tool," and I'm grateful for tools like AI learning planners that help me with my own education.

At the same time, we must remain vigilant and cautious about the "invisible influence" lurking beneath the surface of this technology's capabilities. Like a silent undercurrent, the traits of AI could be changing in ways we don't even perceive.

I believe the reports on subliminal learning are a call to fundamentally re-evaluate our approach to AI safety. To build trust in AI, we must confront the question of how to safely guide not only its surface-level behavior, but also its deeper, "subconscious" core.

What are your thoughts on the risks of this invisible subliminal learning? And what measures do you believe are necessary? I hope our conversation can be a small step toward a safer and richer future with AI.
