Does AI "See" or "Think"? Unlocking the Reasoning Power of Multimodal LLMs with "MPO"

Hi, I'm Tak@, a system integrator. Every day, I explore the possibilities of how AI can transform our lives.

Do you think AI is smart? Recent research has revealed a startling fact: when you ask AI to work through a thought process, that is, to "reason" about why something is the way it is, its performance can actually decline.

This shows how big a hurdle "reasoning ability" remains for current AI: the ability not just to understand words, but to read the context and story hidden behind images, graphs, and complex data, and to grasp why things are the way they are.

However, a new approach called "Mixed Preference Optimization (MPO)" has emerged, directly tackling this challenge. It's dramatically improving AI's "reasoning power" and beginning to open up a future beyond our imagination.

The "Thought Barrier" of Large Language Models

The Smart AI's Unexpected Pitfall

We hear about Large Language Models (LLMs) everywhere these days, and they handle words and generate text almost like humans, don't they? But if you ask whether AI truly "understands," the answer is: not really.

Especially with "multimodal LLMs," which handle images and text simultaneously, there's a demand for "reasoning ability" – not just labeling what's in an image, but interpreting its meaning and the hidden context behind it.

This goes beyond AI simply processing visible information; it's about a higher level of replicating the "thought process" behind that information.

However, it's been found that an unexpected problem arises when you try to encourage AI to engage in this "thought process."

Specifically, when using a technique called "Chain-of-Thought (CoT) reasoning," which asks AI to explain its intermediate thought steps – its "thought trail" – sequentially, the model's performance can actually decrease.
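To make the idea concrete, here is a minimal sketch of what "asking for the thought trail" looks like in practice. The question and the exact wording are made up for illustration; the only real difference between the two prompts is the instruction to show intermediate steps.

```python
# A minimal sketch (not from the paper): the only difference between a direct
# prompt and a Chain-of-Thought prompt is the instruction to show each step.

question = (
    "The chart shows sales doubling every year from 100 units in 2020. "
    "How many units were sold in 2023?"
)

direct_prompt = f"{question}\nAnswer with a single number."

cot_prompt = (
    f"{question}\n"
    "Let's think step by step, writing out each intermediate calculation "
    "before giving the final answer."
)

# Expected chain of thought: 2020: 100 -> 2021: 200 -> 2022: 400 -> 2023: 800
print(direct_prompt)
print(cot_prompt)
```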

It's almost like explaining a brilliant joke and ruining its humor. When I heard this result, I was reminded of the profound depth involved in teaching AI "intelligence."

The Invisible Barrier of "Distribution Shift"

Why does AI's performance drop when it's asked to explain its thoughts? One reason is a phenomenon called "distribution shift." This refers to a discrepancy between the data AI was trained on and the real-world tasks where AI is asked to reason.

It's like someone who's only trained for a marathon on a treadmill suddenly having to run on an actual paved road. The training data hasn't prepared the AI sufficiently to handle complex reasoning tasks.

In other words, AI may not have had the "right practice."

This "distribution shift" is a significant hurdle for AI to become a truly useful tool in society, because the "thought process" is crucial for AI to make accurate judgments in complex real-world situations.

"Preference Optimization": A New Approach to "Coaching" AI

AI Learning from Right and Wrong

To overcome this "thought barrier," researchers introduced a new approach called "Preference Optimization (PO)." PO is a method where AI is shown concrete examples of both "good reasoning" and "bad reasoning" to learn the difference.

It's similar to a skilled coach guiding an athlete by demonstrating both exemplary movements and those that need improvement. I felt that the word "coaching" in this context highlights the evolution of the human role in AI development. It's not just about feeding data, but a deeper engagement that involves training AI's "thought muscles" and fostering logical judgment.

Building a "Treasure Trove" of Data Autonomously

However, PO also had a significant challenge: there was an overwhelming lack of high-quality "preference data" needed to teach AI complex tasks like scientific reasoning.

Without mountains of high-quality data, AI's reasoning ability couldn't be significantly enhanced.

So, researchers demonstrated astonishing creativity: they developed a system that automatically generates this kind of data at massive scale. This led to the creation of "MMPR," a preference dataset built specifically for multimodal reasoning training.

This dataset contains roughly 3 million examples, an incredible number. It's like a treasure trove filled with all sorts of data, from general Q&A to scientific content, charts, math problems, and even OCR (Optical Character Recognition) and document analysis.
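To picture what a single entry might look like, here is a hypothetical example of a multimodal preference record. The field names and contents are illustrative assumptions, not the actual MMPR schema, but the basic shape, a question about an image with a preferred answer and a rejected answer, is the essence of preference data.

```python
# Hypothetical example of a single multimodal preference record.
# The field names and file path are illustrative, not the actual MMPR schema.

preference_example = {
    "image": "charts/quarterly_revenue.png",  # made-up path
    "question": "Which quarter had the largest revenue growth?",
    "chosen": (
        "Reading the bars: Q1 = 2.0M, Q2 = 2.4M, Q3 = 3.1M, Q4 = 3.3M. "
        "The increases are +0.4M, +0.7M, +0.2M, so Q3 had the largest growth."
    ),
    "rejected": "Q4, because it has the tallest bar.",
}

print(preference_example["chosen"])
```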

Can you imagine 3 million data points?

Building a dataset of this scale and diversity autonomously is a groundbreaking achievement for the entire field of AI research. Collecting vast amounts of data in the real world is time-consuming, costly, and sometimes impossible due to confidentiality or ethical concerns.

However, by having models generate and check data automatically, as was done here, these constraints can be overcome, and the material AI needs for learning can be created artificially.
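One common recipe for this kind of automatic generation, and a plausible reading of how a pipeline like this works, is to sample several candidate answers per question and sort them into preferred and rejected answers using a known correct answer. The sketch below is a simplified illustration under that assumption, not the researchers' actual code; generate_answer is a stand-in for whatever model is being sampled.

```python
import random

# Simplified sketch of an automated preference-pair builder: sample several
# candidate answers per question, then use the known correct answer to sort
# them into "chosen" (correct) and "rejected" (incorrect) pairs.
# generate_answer is a placeholder for a real model call.

def generate_answer(question: str) -> str:
    return random.choice(["800", "400", "800", "600"])  # dummy model output

def build_pairs(question: str, ground_truth: str, n_samples: int = 8):
    candidates = [generate_answer(question) for _ in range(n_samples)]
    chosen = [c for c in candidates if c == ground_truth]
    rejected = [c for c in candidates if c != ground_truth]
    return [
        {"question": question, "chosen": c, "rejected": r}
        for c in chosen
        for r in rejected
    ]

pairs = build_pairs("Sales double each year from 100 in 2020; how many in 2023?", "800")
print(len(pairs), "preference pairs generated")
```

Because the model produces both the good and the bad answers itself, the dataset can grow as large as the compute budget allows, which is how a number like 3 million becomes feasible.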

MPO: A Two-Stage Approach to Teaching Intelligence

How AI Learns "Why"

Leveraging this MMPR dataset, researchers developed the star of this discussion, "Mixed Preference Optimization (MPO)," to dramatically improve AI's reasoning capabilities. In short, MPO is a two-stage approach built around reinforcement learning from human feedback, and it teaches AI three things simultaneously, as if giving it an intensive logic course (I'll sketch how these might be combined right after the list):

  1. Evaluating answer quality: AI becomes its own "self-critic," judging which answers are good.
  2. Understanding the "reasons" for good answers: It delves deeper, trying to grasp the "logic" and "reasoning" behind why an answer is good.
  3. Generating good reasoning procedures itself: Ultimately, AI learns the process of creating good reasoning steps on its own. I understand this as learning the "philosophy" and "techniques" of a master chef, rather than just copying a recipe.
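One way to picture "teaching three things simultaneously" is as a single training objective that mixes three signals: a preference term (which answer is better), a quality term (how good an answer is on its own), and a generation term (keep producing good reasoning text). The weights and the exact terms below are illustrative assumptions, not the precise recipe from the research.

```python
# Illustrative sketch only: one way to combine three learning signals into a
# single training objective. The weights and the specific terms are assumptions
# for explanation, not the exact recipe used in the research.

def mixed_objective(preference_loss: float,
                    quality_loss: float,
                    generation_loss: float,
                    w_pref: float = 0.8,
                    w_qual: float = 0.1,
                    w_gen: float = 0.1) -> float:
    # Judge which answer is better, score answers on their own, and keep
    # producing good reasoning text, all optimized at the same time.
    return w_pref * preference_loss + w_qual * quality_loss + w_gen * generation_loss
```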

From "Easy Problems" to "Difficult Problems"

The core of MPO lies in its two-stage training process.

In the first stage, a technique called DPO (Direct Preference Optimization) is used to train AI on "easy" datasets. These "easy" datasets consist of answer pairs with clear distinctions, where the AI can easily tell which answer humans would prefer and which they would not.

DPO streamlines this initial learning phase: it eliminates the need to build a separate reward model and directly adjusts the model to raise the probability of desirable answers, allowing stable, fast progress.
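For readers who want to see what "directly adjusts the model" means, here is a minimal sketch of the standard DPO objective in PyTorch. It assumes the summed log-probabilities of the chosen and rejected answers have already been computed for both the model being trained and a frozen reference model; this is the published DPO formula in general, not this paper's exact implementation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the standard DPO loss (stage one). It assumes we already
# have the summed log-probabilities of the chosen and rejected answers under
# the model being trained (the "policy") and under a frozen reference model.

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # How much more the policy favors each answer than the reference does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the chosen answer's margin above the rejected answer's margin.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```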

Next, in the second stage, a reinforcement learning technique called RLHF (Reinforcement Learning from Human Feedback) is applied to "difficult" datasets.

These "difficult" datasets are composed of answer pairs where the distinction between desirable and undesirable answers is subtle and challenging for AI to judge. Crucially, in this second stage of learning, the model trained by DPO in the first stage is utilized as a "reference model."

This allows AI to learn more complex reasoning while emulating a higher-quality "role model." Furthermore, focusing the second stage only on the difficult data reduces computational cost while keeping optimization efficient and stable.
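Conceptually, the second stage optimizes a reward that is "shaped" by the stage-one model: the policy is rewarded for answers a reward model scores highly, but penalized for drifting too far from the DPO-trained reference. The sketch below shows that standard RLHF-style reward shaping; the variable names and the coefficient are illustrative assumptions.

```python
# Conceptual sketch of stage two: the policy is rewarded by a reward model but
# penalized for drifting away from the stage-one DPO model, which acts as the
# reference ("role model"). Names and the coefficient are illustrative.

def shaped_reward(reward_model_score: float,
                  policy_logp: float,
                  dpo_reference_logp: float,
                  kl_coef: float = 0.05) -> float:
    # A KL-style penalty: positive when the policy assigns the answer much
    # higher probability than the DPO reference does.
    drift_penalty = policy_logp - dpo_reference_logp
    return reward_model_score - kl_coef * drift_penalty
```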

Standalone DPO struggles to handle answer pairs that are hard to distinguish, while conventional PPO (Proximal Policy Optimization) is prone to the distribution shift problem mentioned earlier.

MPO, through this two-stage approach, compensates for the weaknesses of both DPO and RLHF, enabling more effective improvement of AI's reasoning capabilities.

Astonishing Results and the Expansion of AI's "Thought"

Small Models Overtaking Giants

The results of MPO were truly remarkable. Researchers applied MPO to a relatively small model called "InternVL2-8B," producing "InternVL2-8B-MPO."

As a result, it achieved an accuracy roughly 9 points higher than the original model on MathVista, a benchmark for multimodal reasoning. Even more astonishingly, its performance reached a level comparable to "InternVL2-76B," a model about ten times larger.

It's like a compact car demonstrating the power of a large truck.

This has immeasurable implications for researchers with limited resources. While developing high-performance AI models previously required enormous computational resources, the emergence of methods like MPO has expanded the possibility for more people to participate in AI research and develop powerful technologies.

When I heard this news, I felt that a future where AI becomes more accessible has come much closer.

Positive Impact on Text-Only Tasks Too

There was another interesting discovery. Although the MMPR dataset was designed for tasks involving both images and text, training on it also improved the AI's performance on text-only tasks.

It's as if practicing jigsaw puzzles somehow improved your writing ability.

This result significantly challenges our previous notions about "types of intelligence." It suggests the possibility that different kinds of intelligence can influence each other and transfer skills across domains.

I myself feel that generative AI is the ultimate mashup tool, and I'm prototyping new services every day. Perhaps this "multimodal learning" is a crucial key for AI to truly become "intelligent." Perhaps AI's "thinking" could be useful in your work too.

Expanding Applications and Our Ethical Responsibility

AI's Power to Shape the Future

I feel that the development of technologies like MPO holds the potential to bring revolutionary changes to various sectors of society.

  • Medical field: AI will not only analyze medical images like X-rays and scans but also understand their meaning and reason about them. For example, it might not just identify a fracture but also determine its severity, the possibility of complications, and even suggest treatment options. This would be a groundbreaking advancement, especially in regions with limited access to specialists, bringing high-quality medical care to more people.
  • Engineering: AI will be able to assist in designing more efficient and sustainable structures.
  • Finance: AI might deeply analyze market trends and assist in making smarter investment decisions.
  • Creative fields: By understanding visual composition and aesthetic nuances, AI could create stunning works of art or collaborate with artists to push the boundaries of creativity.

These applications are just the tip of the iceberg. Technologies like MPO will enable the use of AI in new fields we haven't even imagined yet.

From my perspective as a system integrator, the magnitude of this technology's impact on society gives me a profound sense of responsibility.

Coexisting with Smart AI

However, the development of this technology must always be accompanied by "caution." While AI offers immeasurable benefits, it also carries the risk of unintentionally promoting biases or making decisions that negatively impact individuals or society.

That's why we need to deepen discussions about the ethical aspects now, while this technology is still in its early stages.

We must ensure that principles such as fairness, transparency, and accountability are firmly incorporated, and strive for AI to be developed and deployed for the benefit of all. This is a shared responsibility for researchers, developers, policymakers, and every one of us as citizens.

The future of AI is by no means predetermined. Its form will change significantly based on the choices we make today.

Keep exploring the world of multimodal AI and MPO, asking questions, and imagining its possibilities. I sincerely hope that you, having read this column, will join in creating a wise future for AI.

Follow me!

Photo by: Anthony Tran