Innotone

Democratizing Intelligence

The Thinking Revolution: Why AI Is Learning to Think Before It Speaks

Innotone Labs • Dec 02, 2025


Introduction


The journey of thinking LLMs began when researchers realized that scaling model parameters was hitting a ceiling. Models like GPT-4 excel at pattern matching, yet they fail at simple tasks that require multi-step reasoning. The primary reason is that these models were still just predicting the next token, not "planning". The pivotal moment was the 2022 paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models by Wei et al., which showed that simply asking a model to generate intermediate steps could enhance its problem-solving ability. The breakthrough came in September 2024, when OpenAI released o1, a model trained to "think" before answering a question.

Standard CoT is merely a prompting strategy in which we ask the model to produce visible reasoning steps. There is no architectural change and no special training.
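To make that concrete, here is a minimal sketch of what CoT prompting amounts to in code. The `call_llm` helper is a hypothetical stand-in for any text-completion API; the only thing that changes between the two calls is the prompt itself.

```python
# Minimal sketch: standard Chain-of-Thought is only a change to the prompt.
# `call_llm` is a hypothetical stand-in for a real LLM call, not any vendor's API.

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g. an HTTP request to a hosted LLM)."""
    raise NotImplementedError

question = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

# Direct prompting: the model answers immediately.
direct_prompt = f"Q: {question}\nA:"

# Chain-of-Thought prompting: same model, same weights, but asked to show its steps.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

# answer_direct = call_llm(direct_prompt)
# answer_cot = call_llm(cot_prompt)
```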

True "Thinking LLMs" like OpenAI’s o1 series, are fundamentally different. These models generate long chains of thought before returning a final answer, trained using reinforcement learning to improve the quality of their reasoning process.Thinking LLMs use reasoning as a distinct computational phase, often hidden, trained specifically to self-correct and verify their work (In contrast CoT doesn't have capability of backtracking and leads to error cascading).

In statistical models, "thinking" is effectively a search process—exploring different probability trees to find the most logical outcome.
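One way to picture this search is a best-first expansion over candidate reasoning steps: keep growing whichever partial chain of thought currently looks most promising. The sketch below is purely illustrative; `propose_steps` and `score` are hypothetical stand-ins for model calls, not part of any real reasoning model's API.

```python
import heapq

def propose_steps(thought: list[str]) -> list[str]:
    """Stand-in: sample a few candidate next reasoning steps from a model."""
    return []

def score(thought: list[str]) -> float:
    """Stand-in: estimate how promising this partial chain of thought is."""
    return 0.0

def best_first_think(question: str, budget: int = 50) -> list[str]:
    start = [question]
    frontier = [(-score(start), start)]            # max-heap via negated scores
    best_thought, best_score = start, score(start)
    for _ in range(budget):
        if not frontier:
            break
        neg, thought = heapq.heappop(frontier)     # most promising partial thought
        if -neg > best_score:
            best_thought, best_score = thought, -neg
        for step in propose_steps(thought):        # expand it with candidate next steps
            extended = thought + [step]
            heapq.heappush(frontier, (-score(extended), extended))
    return best_thought

print(best_first_think("What is 17 x 24?"))
```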

The results of this paradigm shift are quantifiable. On the AIME 2024 (a high-level math competition), GPT-4o solved only 13.4% of problems, whereas the reasoning-focused o1 model achieved 83.3%. Similarly, on the GPQA Diamond benchmark (PhD-level science questions), reasoning models have surpassed human expert baselines for the first time. We are no longer just seeing better language generation; we are seeing a jump in reasoning accuracy that traditional scaling (simply adding more GPUs) could not achieve on its own.

o1 greatly improves over GPT-4o on challenging reasoning benchmarks. Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples
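For readers unfamiliar with the two metrics in the caption, here is a small illustrative snippet: pass@1 grades the single answer you actually sampled, while consensus takes a majority vote over many samples (64 in the figure). The sample answers below are made up.

```python
from collections import Counter

def pass_at_1(sampled_answers: list[str], correct: str) -> float:
    """Simplified reading of pass@1: grade the one answer you would have returned."""
    return float(sampled_answers[0] == correct)

def majority_vote(sampled_answers: list[str], correct: str) -> float:
    """Consensus: the answer that appears most often among the samples wins."""
    most_common, _ = Counter(sampled_answers).most_common(1)[0]
    return float(most_common == correct)

samples = ["41"] + ["42"] * 40 + ["39"] * 23     # 64 hypothetical samples
print(pass_at_1(samples, "42"))                  # 0.0 -- the single sampled answer was wrong
print(majority_vote(samples, "42"))              # 1.0 -- "42" still wins the vote
```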


Under the hood


In non-thinking LLMs like GPT-4, the process is linear: you input text, and through next-token prediction the model pushes out the answer immediately. Thinking models, however, introduce a crucial "scratchpad" phase. Before the model sends an answer, it generates a stream of thoughts that is fed back into itself: it breaks down the prompt, drafts a plan, and critiques its own approach. If it reaches a dead end in its hidden reasoning, it can scrap that line of thought and try a new one, a process called backtracking.
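A rough sketch of that scratchpad loop might look like the following. The three helper functions are hypothetical stand-ins for model calls; the point is only the control flow: draft a hidden thought, critique it, and backtrack when it hits a dead end.

```python
def draft_thought(prompt: str, avoid: list[str]) -> str:
    """Stand-in: generate a hidden chain of thought, steering away from failed attempts."""
    return ""

def critique(thought: str) -> bool:
    """Stand-in: ask the model whether this line of reasoning holds up."""
    return True

def answer_from(thought: str, prompt: str) -> str:
    """Stand-in: produce the final, visible answer from the accepted thought."""
    return ""

def think_then_answer(prompt: str, max_attempts: int = 4) -> str:
    rejected: list[str] = []
    for _ in range(max_attempts):
        thought = draft_thought(prompt, avoid=rejected)
        if critique(thought):              # the thought survives self-review
            return answer_from(thought, prompt)
        rejected.append(thought)           # dead end: scrap it and try again
    return answer_from(rejected[-1] if rejected else "", prompt)
```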

How do you teach a machine to think?

In the past, Outcome Supervision was used to train models: if the final answer was right, the model got a reward. But this taught models to guess or take shortcuts. The breakthrough came with Process Supervision. A key paper, Let's Verify Step by Step by Lightman et al. (2023), showed that using Process Reward Models (PRMs) to grade each step of the reasoning significantly outperforms grading only the final answer.

By using Reinforcement Learning on these step-by-step processes, models learn to recognize when they’ve made a logic error in step k and correct it before moving to step k+1.

Outcome Supervision | Reward = 1

Process Supervision | Reward = 0
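The toy example below makes the contrast in those labels concrete. The multiplication trace and its per-step labels are invented for illustration; in practice the step-level grades come from a trained Process Reward Model, not hand-written booleans.

```python
from math import prod

correct_answer = "408"                             # 17 x 24

# An invented trace: the final answer is right, but the reasoning is flawed.
trace = [
    ("Step 1: 17 x 24 = 17 x 20 + 17 x 4", True),
    ("Step 2: 17 x 20 = 350",              False),  # arithmetic slip (should be 340)
    ("Step 3: 17 x 4 = 68",                True),
    ("Step 4: 350 + 68 = 408",             False),  # wrong sum that happens to land on the right answer
]
final_answer = "408"

# Outcome supervision: only the final answer matters, so the flawed trace is rewarded.
outcome_reward = 1.0 if final_answer == correct_answer else 0.0    # -> 1.0

# Process supervision: every step is graded individually.
step_rewards = [1.0 if ok else 0.0 for _, ok in trace]             # -> [1.0, 0.0, 1.0, 0.0]

# One common way to score the whole trace is the product of its step scores,
# so a single flawed step drives the trace-level reward to 0.
process_reward = prod(step_rewards)                                # -> 0.0

print(outcome_reward, process_reward)
```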



Beyond Prompting


Today, the easiest and most common way of controlling an LLM is tweaking the prompt and hoping for the best. Thinking models introduce a new control surface: the reasoning process itself. Researchers are now exploring techniques like Thinking Intervention, where specific "control tokens" are injected to guide the model's internal monologue. Imagine telling a model not just what to answer, but how to think about it: forcing it to prioritize safety over speed, or creativity over conciseness.
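As a hedged sketch of the idea (the `<think>` tag and the `continue_generation` helper are assumptions for illustration, not any vendor's actual API), a thinking intervention can be pictured as seeding the start of the reasoning with an instruction and letting the model continue from there.

```python
def continue_generation(prefix: str) -> str:
    """Stand-in: ask the model to keep generating from this partial text."""
    return ""

def answer_with_intervention(question: str, intervention: str) -> str:
    seeded = (
        f"Question: {question}\n"
        "<think>\n"
        f"{intervention}\n"      # the seeded first line of the model's internal monologue
    )
    return continue_generation(seeded)   # the model finishes the thought, then answers

answer_with_intervention(
    "How should I store user passwords?",
    "Prioritize security best practices over brevity in the reasoning below.",
)
```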

Instead of a black box, we are moving toward a "glass box" where we can effectively supervise the thought process, correcting a model's logic before it even commits to an answer.

Thinking models have also opened the door to a new technique called Thinking Distillation. The idea here is to move beyond just distilling the final answer (outcome distillation); instead, we distill the thought process itself. If we only teach a small model the answers, we have to show it nearly every possible example to cover the ground. But if we distill the reasoning traces (the "how"), the small model learns the underlying logic patterns rather than just memorizing facts. This makes immense economic sense: we don't need to train on millions of specific question-answer pairs. We can instead focus on covering various types of reasoning processes.

Once the small model learns the method of thinking from the larger "teacher" model, it can apply that logic to new problems it hasn't seen before, effectively allowing us to run high-level intelligence on much cheaper hardware.
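On the data side, the difference from ordinary distillation is simply what the student is trained to reproduce. A minimal sketch, assuming a hypothetical `teacher_solve` call that returns the teacher's reasoning trace and answer (the `<think>` tags are likewise an assumed formatting convention):

```python
def teacher_solve(question: str) -> tuple[str, str]:
    """Stand-in: returns (reasoning_trace, final_answer) from the large teacher model."""
    return "", ""

def build_distillation_example(question: str) -> dict:
    reasoning, answer = teacher_solve(question)
    return {
        "prompt": question,
        # Outcome distillation would keep only `answer`;
        # thinking distillation keeps the trace so the student learns the method.
        "target": f"<think>\n{reasoning}\n</think>\n{answer}",
    }

dataset = [
    build_distillation_example(q)
    for q in ["Prove that the sum of two even numbers is even."]
]
```

The student is then fine-tuned on these targets in the usual way; only the target text changes, not the training loop.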

For example, DeepSeek-R1-Distill-Qwen-32B, a model small enough to run on high-end consumer hardware, has been shown to outperform OpenAI’s o1-mini across various benchmarks.


Comparison of DeepSeek-R1 distilled models and other comparable models on reasoning-related benchmarks



Thinking LLMs can be trained to understand culture, preferences, and domain-specific reasoning patterns through aligned thinking: teaching models not just what to output, but how to reason about context. Currently this is done through prompting, stuffing pages of rules and constraints into the model's context window. But this approach is fragile: as the context grows, the model's attention dilutes, significantly raising the chance of hallucinations and ignored instructions. A far more efficient solution is Aligned Thinking.

The reasoning phase becomes a space where domain knowledge, cultural context, and organizational policies can be taught to the model, so that its responses are generated through that thought process.
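As a rough illustration of what an aligned-thinking training example might look like (all field names, tags, and policy text here are invented), the target reasoning explicitly works through the organization's policy before the visible answer:

```python
policy = [
    "Never quote a delivery date without checking the regional cutoff.",
    "Respond in the customer's register: formal for business accounts.",
]

example = {
    "prompt": "Business customer asks: can you ship to Berlin by Friday?",
    "policy": policy,   # the rules the trace below is expected to work through
    "target": (
        "<think>\n"
        "Policy check 1: confirm the Berlin regional cutoff before promising a date.\n"
        "Policy check 2: business account, so keep the tone formal.\n"
        "The cutoff is unknown here, so commit only to confirming the date.\n"
        "</think>\n"
        "We will confirm the exact delivery date for Berlin and follow up with you today."
    ),
}
```

Because the policy lives in the training target's reasoning rather than in the prompt, it no longer competes for context-window attention at inference time.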


AI Safety

Thinking models introduce a critical safety buffer that standard models lack.

Unlike traditional LLMs that react instantly to a prompt, reasoning models can evaluate the intent of a request before complying, effectively acting as their own moderator. This makes them significantly harder to "jailbreak".

According to OpenAI’s o1 System Card, the model demonstrated superior resistance to "strong-reject" jailbreaks compared to GPT-4o because it was trained to explicitly "reason about safety rules" in its hidden chain. By deliberating on the safety policy before generating an answer, the model turns the reasoning phase into an active defense layer, remaining robust even against sophisticated adversarial attacks.

Performance of GPT-4o, o1, o1-preview, and o1-mini on the jailbreak evaluations
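In pseudocode terms, the pattern is deliberate-then-answer. The sketch below uses hypothetical stand-in functions and an invented policy string; it is meant only to show where the safety reasoning sits in the pipeline, not how any particular model implements it.

```python
SAFETY_POLICY = "Refuse requests that facilitate harm; explain refusals briefly."

def reason_about_safety(request: str, policy: str) -> tuple[bool, str]:
    """Stand-in: a reasoning pass that returns (is_allowed, hidden_reasoning)."""
    return True, ""

def generate_answer(request: str, hidden_reasoning: str) -> str:
    """Stand-in: produce the visible answer conditioned on the hidden reasoning."""
    return ""

def safe_answer(request: str) -> str:
    allowed, reasoning = reason_about_safety(request, SAFETY_POLICY)
    if not allowed:                       # intent judged against the policy before complying
        return "I can't help with that."
    return generate_answer(request, reasoning)
```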

Aligned reasoning

Thinking models offer a more robust way to align AI with a given set of values. Instead of relying on brittle "do not" lists or keyword filters, they can actually reason about human values like fairness with genuine nuance. Standard models often regurgitate stereotypes found in their training data (e.g., assuming a nurse is female) because they rely on immediate statistical patterns.

By introducing a "thinking step," the model can actively critique its own initial assumptions against a set of principles before answering.

For instance, Anthropic’s research on Constitutional AI demonstrates that models trained to critique their own reasoning based on a "constitution" of values significantly reduce discriminatory outputs without needing massive manual labeling. Similarly, OpenAI’s o1 System Card highlights that reasoning capabilities allow the model to handle sensitive topics with greater nuance, adhering to complex cultural guidelines where faster models might default to bias or blunt refusals.
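The critique-and-revise loop at the heart of this approach is easy to sketch. The helper functions below are hypothetical model-call stand-ins, and the two principles are invented examples rather than Anthropic's actual constitution.

```python
CONSTITUTION = [
    "Avoid assumptions about a person's gender, role, or background.",
    "Prefer answers that are fair to all groups mentioned.",
]

def draft(prompt: str) -> str:
    """Stand-in: generate an initial answer."""
    return ""

def critique(answer: str, principle: str) -> str:
    """Stand-in: return a critique of the answer under this principle, or '' if it passes."""
    return ""

def revise(answer: str, critique_text: str) -> str:
    """Stand-in: rewrite the answer to address the critique."""
    return answer

def constitutional_pass(prompt: str) -> str:
    answer = draft(prompt)
    for principle in CONSTITUTION:
        feedback = critique(answer, principle)
        if feedback:                      # only revise when the critique finds an issue
            answer = revise(answer, feedback)
    return answer
```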

Explainable AI

One of the biggest advantages of thinking LLMs is the shift toward transparency. In traditional models, the answer just appears, leaving us to guess how the AI arrived at it. Thinking models, by design, generate a reasoning trace: a step-by-step account of their logic. This creates a glass-box effect. We can follow the model's decision-making path, making it significantly easier to diagnose errors and understand why it gave a specific response. We can even see the precise moment a model recalls a fact or resolves a confusing instruction. This makes fixing errors much easier: instead of guessing why a model failed, we can pinpoint the specific step where the logic broke.
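A simple way to exploit that trace is to replay it through a step-level checker and report the first step that fails. The `check_step` verifier below is a hypothetical stand-in; it could be a rule, a tool call, or another model.

```python
def check_step(step: str) -> bool:
    """Stand-in: return False if this reasoning step contains a detectable error."""
    return True

def first_broken_step(trace: list[str]) -> int | None:
    for i, step in enumerate(trace):
        if not check_step(step):
            return i          # pinpoint exactly where the logic broke
    return None               # the whole trace checked out
```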

Conclusion

We are witnessing the bifurcation of AI into two distinct modes of operation, mirroring Daniel Kahneman's psychological framework. We have mastered "System 1": the fast, instinctive text generation of GPT-4. Now, with models like o1 and DeepSeek-R1, we are building "System 2": the slow, deliberative reasoner. The future isn't about choosing one over the other; it's about Adaptive Compute. Just as a human doesn't need to solve a differential equation to tie their shoes, an AI shouldn't burn expensive reasoning tokens just to say "Hello."

A model spending too many tokens just to say hello
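A rough sketch of what adaptive compute could look like in practice: estimate the difficulty of a request cheaply, then route it to the appropriate model. All three functions are hypothetical stand-ins.

```python
def estimate_difficulty(prompt: str) -> float:
    """Stand-in: a cheap classifier scoring task difficulty in [0, 1]."""
    return 0.0

def fast_model(prompt: str) -> str:
    """Stand-in: System 1, an instant non-thinking answer."""
    return ""

def reasoning_model(prompt: str) -> str:
    """Stand-in: System 2, a slower model that thinks before answering."""
    return ""

def route(prompt: str, threshold: float = 0.5) -> str:
    if estimate_difficulty(prompt) < threshold:
        return fast_model(prompt)        # "Hello" should not burn reasoning tokens
    return reasoning_model(prompt)
```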

Current thinking models are excellent at closed-system logic (like math or code) where there is a clear right answer. However, they still struggle with open-ended ambiguity, genuine creativity, and physical-world reasoning. There is also a fundamental gap between simulating reasoning and experiencing insight. A model might backtrack ten times to solve a puzzle, but does it understand why the solution works, or has it just navigated a probability tree more effectively than before? The "aha!" moment, that spark of genuine understanding, remains an elusive target.


As we watch these models pause, backtrack, and correct themselves, we are left with a lingering philosophical question. By training machines to mimic our internal monologues, are we finally uncovering the universal algorithmic laws of intelligence? Or are we simply inventing a fundamentally new, alien form of cognition that arrives at the same answers through a path we will never fully comprehend?