Innotone

Democratizing Intelligence

From Divergent Probabilistic Outcome to Deterministic Thinking: How the Thinking Layer Solves Text Classification and Extraction

Innotone Labs • Dec 14, 2025


Introduction

When we think about AI today, we usually think of the "cool" stuff: chatbots writing marketing copy or generating stunning images. We rarely think about the "boring" work: sorting emails, reading invoices, or tagging support tickets. Why? Because to a human, these tasks feel trivial. We don’t have to "think" hard to know that an email is a bill and not a feature request. We don’t consciously "extract" the total amount from a receipt; our eyes just find it. Because our brains do this processing subconsciously, we assume it’s easy. We treat it as invisible.

But for a business, this "invisible" work, if left undone, creates real inefficiency. Every day, businesses are bombarded with an avalanche of text: emails, chat logs, reviews, and tickets. To a business, it’s a goldmine, but only if there is a system in place to mine the gold out of it.


This brings us to the two most underrated, yet foundational tasks in AI: Text Classification and Entity Extraction.

Text Classification answers: "Where does this text belong?"

  • Product Feedback: Categorizes user reviews into "Feature Request", "Critical Bug", etc.
  • Customer Support: Tags conversations as "Complaint", "Inquiry", "Request", etc.
  • Sales Inbound: Sorts emails into "Demo Request" (Hot) vs. "Unsubscribe" (Cold).
  • Data Governance: Scans millions of files to tag them as "Public" or "Confidential/PII".
  • Fintech: Converts cryptic bank lines like "ACH WDRL 5543" into clear labels like "Subscription" or "Utilities."

Entity Extraction answers: "What specific details are inside?"

  • Insurance Claims: Identifies Date of Incident, Policy Number, Fault Attribution, etc., from chats or emails.
  • Supply Chain: Reads unstructured email orders to extract SKUs, Quantities, and Delivery Dates for fulfillment.
  • KYC Details: Extracts KYC information such as Name, Address, and DOB from chats or emails.
  • Invoices & AP: Pulls Vendor Name, Total Amount, and Due Date from messy PDFs to auto-populate an ERP.
  • Contract Analysis: Audits vendor agreements to isolate Termination Clauses and Liability Caps for risk assessment.
  • Healthcare: Extracts Symptoms, Diagnoses, and Dosages from doctor notes to update patient records.
  • Recruiting: Parses resumes to isolate Hard Skills and Years of Experience for instant candidate matching.

Let's take the example of an email from a distributor that says: "Please dispatch 50 units of the Smart LED TVs to our Bengaluru warehouse by Friday, as discussed." The text classification model looks at the whole sentence and decides the Intent: is this a "Quote Request," a "Complaint," or a "Purchase Order"? Here, the correct label is "Purchase Order." From the same email, the entity extraction model precisely extracts "50" as the Quantity, "Smart LED TVs" as the Product, and "Bengaluru" as the Destination.
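
To make the split concrete, here is a minimal sketch in Python of the two kinds of output for that email; the field names are purely illustrative, not a specific product's schema.

email = ("Please dispatch 50 units of the Smart LED TVs to our Bengaluru "
         "warehouse by Friday, as discussed.")

# Classification assigns one label to the whole message.
classification_output = {"intent": "Purchase Order"}

# Extraction pulls out the specific details inside it.
extraction_output = {
    "quantity": 50,
    "product": "Smart LED TVs",
    "destination": "Bengaluru",
}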

Now let's understand how these two fundamental tasks are done today.

The Old Way: Supervised Learning

For the last few years, the industry standard for solving these problems has been Supervised Learning, specifically leveraging State-of-the-Art (SOTA) architectures based on Transformers. The workflow is rigorous and well-defined. Teams curate "Gold Standard" datasets, often requiring thousands of human-labeled examples. Then the data science team trains models like BERT or RoBERTa. For Text Classification, BERT (Bidirectional Encoder Representations from Transformers) remains the dominant baseline due to its ability to understand context bidirectionally. For Entity Extraction (technically known as Named Entity Recognition, or NER), architectures explicitly fine-tuned for token classification, such as RoBERTa or hybrid BiLSTM-CRF networks, are widely used. For example:

  • In Banking Support: For routing complex customer queries (e.g., distinguishing "Card Lost" from "Transaction Pending"), banks often use Dual Sentence Encoders. These systems, trained on the Banking77 dataset, have been shown to achieve accuracies of over 93%.
  • In Insurance & Finance: For processing scanned documents like invoices or KYC forms, Microsoft's LayoutLM has shown strong results. By understanding both the text and its spatial position on the page, it achieves an F1-score of ~95% on the SROIE (Scanned Receipts OCR and Information Extraction) benchmark, automating data entry for millions of claims.
  • In Law: Domain-specific models like Legal-BERT are used to analyze contracts, classifying specific clauses (like "Termination" or "Indemnity") with F1 scores exceeding 93% on the LEDGAR dataset.
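
To make the workflow above tangible, here is a minimal fine-tuning sketch using the Hugging Face Transformers and Datasets libraries; the label set and the two toy rows are placeholders for the thousands of labeled examples a real project would curate.

# A minimal supervised fine-tuning sketch. In production, train_data would
# hold thousands of human-labeled examples, not two toy rows.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

labels = ["Quote Request", "Complaint", "Purchase Order"]
train_data = Dataset.from_dict({
    "text": ["Please dispatch 50 units of the Smart LED TVs to our Bengaluru warehouse by Friday.",
             "The last shipment arrived damaged and nobody has responded to my emails."],
    "label": [2, 1],  # indices into `labels`
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ticket-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_data,
)
trainer.train()  # every new class or schema change means re-running this loop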

Challenges of Supervised Learning

While these models are powerful, there are a number of challenges in training and maintaining them. Google researchers famously described such systems as carrying massive Hidden Technical Debt.

  • The Cold Start Barrier: Creating a production-grade model typically requires 10,000+ labeled examples, consuming months of effort before a single prediction is made.
  • Brittleness to Textual Variations: These models don't "read"; they memorize statistics. As shown in the Stanford WILDS benchmark, a model with 90% training accuracy can drop to under 60% simply because the phrasing or vocabulary shifts slightly in the real world.
  • The Classification Failure: Imagine a sentiment analyzer for an online marketplace trained to associate words like "great," "genius," and "thanks" with Positive Sentiment. A user writes: "Great job packing the glass vase in a paper envelope. It arrived in a million pieces. Thanks a lot." The traditional model sees the keywords "Great job" and "Thanks" and confidently classifies this as 5-Star Praise. It completely misses the sarcasm because it relies on statistical word occurrences rather than logical reasoning. The result? A furious customer is routed to the "Testimonials" team instead of "Refunds."
  • The Static Class Failure: Imagine a customer service ticket classifier trained to route tickets into 5 fixed buckets: "Login," "Billing," "Mobile App," "API," and "Other." The company launches a major new feature called "AI Assistant." Immediately, thousands of users start emailing: "The AI Assistant is hallucinating," or "How do I turn off the AI?" Because the model’s output layer is mathematically fixed to those original 5 neurons, it literally cannot predict "AI Assistant." It is forced to send these tickets to the "Other" or "API" bucket. The model has hit a blind spot.
  • The Extraction Failure: Consider a travel chatbot built to extract Destination entities. It is trained to find city names. A user types: "I need a flight to London, but please avoid layovers in Paris." A standard Named Entity Recognition (NER) model scans the text and extracts two destinations: "London" and "Paris." It populates the booking engine with both, potentially offering the exact flight the user wanted to avoid.
  • The New Entity Failure: If a new regulation adds a "Carbon Tax" field to invoices, the model fails to extract it. Since the model's schema is frozen, we must annotate hundreds of new documents and retrain from scratch just to extract one new field.

The New Way: Prompt Engineering


The paradigm shift came with the arrival of LLMs like GPT, Gemini, and Claude. We moved from Supervised Learning (training a model on thousands of examples) to Schema-Driven Prompting (defining the data structure we want). The breakthrough wasn't just that these models could do contextual classification and extraction; it was the introduction of Structured Outputs (via tools like OpenAI’s response_format, Instructor, or Pydantic). With this, anyone who wants to classify a ticket against pre-defined SOPs can do so simply by writing a prompt and adding the detailed SOPs to it. For example:

# THE SCHEMA
from typing import Literal
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class DisputeTicket(BaseModel):
    category: Literal["upi_failure", "fraud_alert", "merchant_dispute"]
    priority: Literal["P0_Critical", "P1_High", "P2_Normal"]
    amount_at_risk: float
    next_action: str

# THE PROMPT
system_instruction = """
You are a Senior Risk Analyst for a UPI payment app. Analyze the user complaint.

Follow these SOPs strictly to determine the Priority and Category:

1. CATEGORY RULES:
   - If the user mentions "OTP shared" or "Scam", classify as 'fraud_alert'.
   - If money is debited but not credited to the merchant, classify as 'upi_failure'.

2. PRIORITY RULES:
   - Mark as 'P0_Critical' ONLY if the amount is > ₹50,000 OR if the user threatens legal action (RBI Ombudsman).
   - All other UPI failures are 'P2_Normal'.

3. EXTRACTION:
   - Extract the exact amount mentioned. If multiple amounts, take the sum.
"""

# THE INPUT
user_text = "I tried to send 75,000 Rs to my vendor via QR code. The app showed failed, but money is gone! This is a big amount, if I don't get it back by evening I will file a complaint with RBI."

# THE EXECUTION
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_instruction},
        {"role": "user", "content": user_text},
    ],
    response_format=DisputeTicket,
)

print(completion.choices[0].message.parsed)

Without ever training a model, we get the result:

{
  "category": "upi_failure",
  "priority": "P0_Critical",
  "amount_at_risk": 75000.0,
  "next_action": "Escalate to Senior Risk Officer."
}

A 2024 survey on Large Language Models for Information Extraction confirmed that LLMs guided only by instructions can match or exceed supervised SOTA models on Open Information Extraction tasks without seeing a single training example.

In real-world deployments, Klarna's OpenAI-powered assistant handles 2.3 million conversations (two-thirds of their volume). It classifies intent and routes tickets, doing the work of 700 full-time agents while maintaining customer satisfaction scores on par with humans. Similarly, Instacart tackled the user search query problem with LLMs. Their catalog has millions of items, but users search in natural language, where a keyword search would fail; Instacart solved this via Entity Extraction.

These systems solved almost all the problems faced in supervised model training, from the Cold Start problem to the static class and new entity failures. If a new class appears, one just needs to modify the prompt and add SOPs for that class. Similarly, when a new entity turns up in a chat, email, or invoice, we just add instructions for it and a field in our Pydantic class, and it starts appearing in our results, as sketched below. While prompting is a powerful way of solving these problems, it hits a wall as the complexity of the use case increases. As AI pioneer Andrew Ng has famously argued, getting to 80% accuracy takes a weekend; getting from 80% to 99%, where the business value actually lies, is often practically impossible with prompting alone.
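
For instance, here is a minimal sketch of how the prompting approach absorbs change, reusing the DisputeTicket schema from above. The new category and the new field are hypothetical, chosen only to illustrate the point.

# New classes and entities are schema edits, not retraining projects.
from typing import Literal, Optional
from pydantic import BaseModel, Field

class DisputeTicketV2(BaseModel):
    # New class: just widen the Literal; no new labels, no new training run.
    category: Literal["upi_failure", "fraud_alert", "merchant_dispute",
                      "chargeback_request"]
    priority: Literal["P0_Critical", "P1_High", "P2_Normal"]
    amount_at_risk: float
    # New entity: add a field (plus a line in the SOP prompt) and it starts
    # appearing in the structured output.
    merchant_name: Optional[str] = Field(
        default=None, description="Merchant or payee name, if mentioned")
    next_action: str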

Challenges of n-shot prompting

  • Prompt Bloating: Consider a customer service classification system that categorizes messages into Request, Complaint, or Inquiry. Initially, the prompt works well, but then we notice edge cases: customers saying "Can I get a refund?" get misclassified as Inquiry when it's actually a refund Request. So we add explicit instructions: "Questions about refunds are Requests, not Inquiries." Now the prompt works for that case, but suddenly messages like "Can I get a refund if I'm not satisfied?" (asking about refund policy, an Inquiry) and "Can I get a refund processed today?" (asking for action, a Request) both get classified as Request. We add another clarification about policy questions versus action requests. Then some other edge case emerges, and to solve it we add more tokens to the prompt. Before we realize it, our prompt has ballooned into thousands of tokens full of "if-then-else" logic, special cases, and contradictory instructions that confuse the model more than help it. Each fix creates new failure modes, and the prompt becomes an unmaintainable mess, a classic case of prompt bloating that undermines reliability and makes the system brittle.
  • Black Box Reasoning: Even when LLMs produce correct classifications, we can't explain why the model made that decision. A customer message gets flagged as "High Risk" and escalated to the legal team, but when they ask why, we have no clear answer. Was it specific words? The tone? Some pattern the model learned? Similarly, when errors happen, debugging is nearly impossible because we can't trace the reasoning. In regulated industries like healthcare, finance, or legal services, this is a serious problem. Auditors and compliance teams need documented logic showing that the system follows business rules and SOPs, not just statistical patterns that work "most of the time." Without interpretable reasoning, we can't build trust, satisfy regulatory requirements, or confidently deploy these systems in production, where decisions have real consequences.
  • Non-Determinism: LLMs suffer from unpredictability: the same input doesn't guarantee the same output. A model might classify "I want to cancel my service" as a Request today and an Inquiry tomorrow, even with identical prompts and parameters. Setting temperature to 0 helps reduce randomness but doesn't guarantee true determinism; factors like model updates, sampling variations, and even infrastructure changes can still introduce inconsistency. Structured outputs (like JSON schemas) make the format deterministic (whatever keys we want in the output, we add to the Pydantic class and they always appear), but the values in those keys might not be the same for the same input.
  • Cost and Latency: Running classification and extraction tasks through large models like GPT-4o becomes expensive at scale. At current pricing (~$2.50 per million input tokens, ~$10 per million output tokens), processing 100,000 customer support tickets daily with an average prompt of 1,500 tokens and 50-token responses costs roughly $12,750/month, and that's for a single use case; the back-of-the-envelope calculation after this list spells out the arithmetic. Multiply this across multiple workflows and costs balloon to tens of thousands monthly. Beyond cost, latency is equally problematic. API calls to large models typically take 1-3 seconds per request, which is far too slow for processing real-time chat messages or high-volume document streams where users expect instant responses. Batch processing helps somewhat, but doesn't solve interactive use cases.
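
As a quick sanity check on the monthly figure above, here is the arithmetic spelled out in Python, using the same volumes and prices assumed in the text.

# Back-of-the-envelope check of the ~$12,750/month figure
# (~$2.50 per 1M input tokens, ~$10 per 1M output tokens).
tickets_per_day = 100_000
input_tokens_per_ticket, output_tokens_per_ticket = 1_500, 50

input_cost_per_day = tickets_per_day * input_tokens_per_ticket / 1e6 * 2.50    # $375/day
output_cost_per_day = tickets_per_day * output_tokens_per_ticket / 1e6 * 10.0  # $50/day

monthly_cost = (input_cost_per_day + output_cost_per_day) * 30
print(f"~${monthly_cost:,.0f} per month")  # ~$12,750 per month
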
The severity of these challenges isn't just theoretical; leading industry analysts confirm them. Gartner's 2024 research predicts that 30% of GenAI projects will be abandoned after the proof-of-concept stage by the end of 2025, citing "escalating costs" and "inadequate risk controls" (the black box problem) as primary culprits. Meanwhile, McKinsey's State of AI 2025 report reveals a stark reality: while 88% of organizations are using AI, only ~30% have managed to scale it beyond the pilot phase. The report explicitly cites "inaccuracy" (hallucinations) as the most common risk organizations face, with fewer than half of companies having a clear way to mitigate it.

The Innotone Way: The "Thinking Layer" as the New Control Surface

The journey from supervised learning to LLM-based approaches represents clear progress, yet both hit fundamental walls. The first wall is reliability and control: supervised models struggle with the Cold Start Barrier, brittleness to textual variations, and static class limitations, while LLMs introduce prompt bloating and non-determinism. Further, both methods suffer from black-box decision-making: we can't explain why the model gave a particular response. The second wall is economics and security: the massive cost and latency of calling large model APIs, combined with the data security risks of sending sensitive information to external APIs.


To break through that first wall, reliability and control, we took a unique approach: working directly with the thinking layer of the model. We found a method to control the thinking of the model. Instead of training a model to simply guess the answer based on statistics (Input → Output), we train it to "think like you" (Input → Thinking Trace → Output). We force it to internalize your SOPs, heuristics, and decision logic as a repeatable thinking strategy, so the model isn’t just producing an answer, it’s following your way of arriving at the answer. Our approach turns model thinking into a control surface, something you can train, audit, improve, and evolve.

By iteration n, the model starts mimicking the desired thinking.
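
Concretely, a single training example under this paradigm pairs the input with a thinking trace and the final structured output. The sketch below is purely illustrative (it is not Innotone's actual data format) and reuses the UPI dispute SOPs from the earlier prompt.

# A hypothetical training example for the Input -> Thinking Trace -> Output
# paradigm. The SOP-driven reasoning is part of the supervision signal, not
# just the final label.
training_example = {
    "input": "I shared my OTP with a caller and 60,000 Rs was debited. Please help!",
    "thinking_trace": (
        "User explicitly mentions sharing an OTP, so per CATEGORY RULES this is "
        "'fraud_alert'. Amount at risk is 60,000, which is > 50,000, so per "
        "PRIORITY RULES this is 'P0_Critical'."
    ),
    "output": {
        "category": "fraud_alert",
        "priority": "P0_Critical",
        "amount_at_risk": 60000.0,
        "next_action": "Escalate to Senior Risk Officer.",
    },
}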

Here is how it solves the first set of problems:

  • Huge drop in Dataset Size: Earlier, we needed thousands of examples because the model had to statistically guess the relationship between words and labels. By explicitly showing the model the reasoning (e.g., "This is a Complaint because the user is expressing dissatisfaction with a promise, specifically regarding the delivery date"), the model learns the underlying principles instantly. We no longer need 500 examples of a "Complaint"; we need just a handful of examples that clearly articulate the logic, drastically reducing the data requirement.
  • Multiple SOPs without prompt bloating: Instead of stuffing complex SOPs into a 5,000-token prompt and hoping the model follows instructions, we "bake" the logic directly into the model's weights. The model learns to think through them. This means we can handle complex, multi-layered decision trees without the nightmare of prompt bloating. The complex rules become the model's "intuition," not its instructions.
  • Agile Adaptation: When a new class or a new entity appears, we don't start from scratch. Because the model already understands how to reason, we simply provide a few new data points that explain the logic of this new category. The model integrates this new mental model alongside its existing knowledge without catastrophic forgetting.
  • Black Box to Glass Box: Perhaps the most critical shift is visibility. Since the model generates a visible "Thinking Trace" before it outputs the final answer, the Black Box is gone. We can see exactly why a decision was made. If the model flags a transaction as fraud, it tells us: "I flagged this because the amount > ₹50k AND the location mismatch is high." If it makes a mistake, we don't guess; we look at the trace, spot the flaw in logic, and correct it. This makes the system fully auditable and trustworthy.

Cost, Latency and Data Security

To break through the second wall, economics and security, we focused on training and deploying smaller models in a sandboxed or on-prem environment. For most real-world text classification and entity extraction workflows, the required reasoning is rarely deep or open-ended. It does not require World Knowledge; it requires Domain Expertise. Because the reasoning boundaries are defined by specific SOPs, we can distill this intelligence into a Small Language Model (generally ranging from 2-8B parameters, depending on the complexity of the task at hand).

  • Cost Drop: Small models are cheap to run. Deploying a 2-8B parameter model costs pennies per million tokens compared to dollars for frontier LLMs. Further, since prompts are tiny, token processing drops drastically, which means even lower costs.
  • Low Latency: Small models are fast. What took 1-3 seconds with a cloud API can be done in milliseconds, and fewer tokens means faster processing. On top of this, we also use continuous batching, processing multiple requests simultaneously to push latency even lower for high-volume scenarios.
  • Data Security: Because the models are small and efficient, they can run on your infrastructure. Your data never leaves your environment. Deploy in your private cloud, your on-premise servers, or an isolated sandbox. The model processes everything locally, then returns structured results (see the deployment sketch after this list).
  • Proprietary Asset Protection: The model is trained to think like you; it mimics your unique logic and edge cases. Even if the model weights were leaked (and the chances of that are negligible), the model is so hyper-specialized to your internal data schema that it is practically useless to anyone else.
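
As a rough illustration (not Innotone's actual deployment stack), the structured-output call from earlier can simply be pointed at a small model served behind an OpenAI-compatible endpoint inside your own network, assuming that server supports structured outputs. The base URL and model name below are placeholders, and DisputeTicket and user_text are the same objects defined in the earlier example.

# Hypothetical on-prem setup: the same parse() call, aimed at a locally
# hosted, OpenAI-compatible server. The URL and model name are placeholders.
from openai import OpenAI

local_client = OpenAI(
    base_url="http://10.0.0.5:8000/v1",   # your private cloud / on-prem server
    api_key="not-needed-for-local-serving",
)

completion = local_client.beta.chat.completions.parse(
    model="dispute-ticket-slm",            # the distilled 2-8B domain model
    messages=[{"role": "user", "content": user_text}],
    response_format=DisputeTicket,         # same Pydantic schema as before
)
print(completion.choices[0].message.parsed)  # data never left your environment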

Conclusion

The next decade won't just be about AI; it will be the decade of agents. Autonomous systems that don't just answer questions but actually do work: routing tickets, processing invoices, updating records, making decisions, and orchestrating entire workflows without human intervention. As these agents reshape how businesses operate, one thing becomes clear: text classification and entity extraction aren't just useful features, they're the foundational infrastructure. Every agent workflow starts with understanding: What kind of request is this? What information does it contain? Where should it go? What action does it require? Before an agent can route, escalate, approve, or process anything, it must first classify and extract. These two tasks are the Backbone of the Agentic future. Classification is the Agent’s "Brain Stem" (routing logic), and Extraction is its "Senses" (inputs).

This transition to the "Thinking Layer as Control Surface" unlocks a future far more profound than just efficiency; it democratizes the creation of intelligence. The challenge now shifts from capability to empowerment: how do we enable every team, from Legal to Logistics, to bypass the data science bottleneck and build agents that mimic their exact decision-making intuition? As we look to the decade ahead, we are left with one defining question: will the winners in AI be the organizations that successfully translate their institutional wisdom into reliable, deterministic agents with classification and extraction layers as their backbone?