Improving LLM Risk Classification

An interactive guide to reducing false positives in customer complaint analysis.

The Problem: Over-Predicting Risk

A common issue when using Large Language Models (LLMs) to classify customer complaints is their tendency to over-predict risk, flagging many non-risk items as "RISK." This creates a high volume of false positives that overwhelms compliance teams. This guide provides a framework for fixing the problem.

Why Is This Happening?

The model is likely erring on the side of "better safe than sorry" due to several common causes:

  • Prompt Ambiguity: The model interprets any negative sentiment, frustration, or strong language as "risk" because the prompt's definition of risk isn't strict enough.
  • Lack of Grounding: The model doesn't have access to your specific internal regulatory definitions, so it relies on its general, and often overly cautious, knowledge.
  • No Negative Examples: The model hasn't been shown clear examples of what *is not* considered risk, so it struggles to draw the line.
  • Model Style: General-purpose, "helpful assistant" models are often tuned to be cautious and may flag items that are only vaguely concerning.
  • Unbalanced Data: If the model was fine-tuned, the training data might have had a disproportionately high number of risky examples, skewing its perception.

The 4-Phase Improvement Framework

This framework outlines a phased approach to systematically reduce false positives. You can start with the lightest-weight options and progress to more complex methods as needed.

Phase 1: Improve the Prompt (Easiest Fix)

This is the fastest and cheapest way to get significant improvements. A well-crafted prompt acts as the model's primary instruction set.

Key Techniques:

  • Use Strict Definitions: Be explicit about what "REGULATORY RISK" is and, more importantly, what it is not.
  • Use Negative Examples: Show the model examples of complaints (e.g., about billing, delays, support issues) that should be classified as "NO_RISK." This is critical.
  • Penalize False Positives: Add instructions like "Only mark risk if a violation is explicitly stated. Do not classify general dissatisfaction as risk."
  • Use Step-by-Step Reasoning: Ask the model to reason about its classification before giving the final answer. Have it wrap that reasoning in XML tags so you can strip it from the output.
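
The step-by-step reasoning technique needs a small parsing step on the response side. The sketch below assumes the prompt asks the model to put its reasoning in <reasoning> tags and its answer in <label> tags. Both tag names, and the NEEDS_REVIEW fallback, are hypothetical choices for illustration rather than part of the template in this guide.

```python
import re

# Hypothetical tag names; the prompt must instruct the model to use them.
LABEL_RE = re.compile(r"<label>\s*(RISK|NO_RISK)\s*</label>", re.IGNORECASE)
REASONING_RE = re.compile(r"<reasoning>.*?</reasoning>", re.DOTALL | re.IGNORECASE)


def extract_label(response: str) -> str:
    """Return the final label, ignoring anything inside the reasoning block."""
    # Remove the reasoning first so a label mentioned there cannot be matched.
    visible = REASONING_RE.sub("", response)
    match = LABEL_RE.search(visible)
    if match is None:
        # Assumption: unparseable output goes to a human instead of being guessed.
        return "NEEDS_REVIEW"
    return match.group(1).upper()


sample = (
    "<reasoning>The customer is angry about a delay but reports no violation. "
    "<label>RISK</label> would be wrong here.</reasoning>\n"
    "<label>NO_RISK</label>"
)
print(extract_label(sample))
```

Stripping the reasoning before searching for the label keeps a label that the model merely discusses from being mistaken for its answer.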

Prompt Improvement Template

You are a compliance classifier.

Definition of REGULATORY RISK:
A complaint contains regulatory risk ONLY IF the customer explicitly reports 
a violation of law/regulation, fraud, data breach, safety issue, or legal threat.

Do NOT classify general dissatisfaction, tone, frustration, delays, billing issues, 
support issues, or product bugs as regulatory risk unless a regulatory violation 
is explicitly stated.

Output one label: RISK or NO_RISK.

Examples:
Customer: "I'm furious, your app crashed and I was late on my payment!"
Label: NO_RISK

Customer: "I never received my order, this is a scam!"
Label: NO_RISK

Customer: "I am going to sue your company for this."
Label: RISK

Customer: "Your agent stole my credit card number, this is fraud."
Label: RISK

Now classify the following complaint:
[Insert Complaint Here]
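
The template above can be assembled in code so that the examples are easy to extend. This is a minimal sketch: the names INSTRUCTIONS, FEW_SHOT, and build_prompt are illustrative, and the call to the model itself is left out because it depends on your provider.

```python
from typing import List, Tuple

INSTRUCTIONS = """You are a compliance classifier.

Definition of REGULATORY RISK:
A complaint contains regulatory risk ONLY IF the customer explicitly reports
a violation of law/regulation, fraud, data breach, safety issue, or legal threat.

Do NOT classify general dissatisfaction, tone, frustration, delays, billing issues,
support issues, or product bugs as regulatory risk unless a regulatory violation
is explicitly stated.

Output one label: RISK or NO_RISK."""

# Few-shot examples taken from the template: (complaint text, label).
FEW_SHOT: List[Tuple[str, str]] = [
    ("I'm furious, your app crashed and I was late on my payment!", "NO_RISK"),
    ("I never received my order, this is a scam!", "NO_RISK"),
    ("I am going to sue your company for this.", "RISK"),
    ("Your agent stole my credit card number, this is fraud.", "RISK"),
]


def build_prompt(complaint: str, examples: List[Tuple[str, str]] = FEW_SHOT) -> str:
    """Assemble instructions, labeled examples, and the new complaint."""
    parts = [INSTRUCTIONS, "", "Examples:"]
    for text, label in examples:
        parts.append(f'Customer: "{text}"')
        parts.append(f"Label: {label}")
        parts.append("")
    parts.append("Now classify the following complaint:")
    parts.append(f'Customer: "{complaint.strip()}"')
    parts.append("Label:")
    return "\n".join(parts)


prompt = build_prompt("The delivery was two weeks late and nobody answered my emails.")
print(prompt)
```

Keeping the examples in a list makes it simple to add a new negative example whenever reviewers find a recurring false positive.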

Success Criteria: A well-tuned prompt can drop false positives by 20-40%.
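
To check a claim like this against your own data, compare false positive counts before and after a prompt change on the same labeled sample. The sketch below uses invented labels purely to show the arithmetic; the function names are illustrative.

```python
from typing import Sequence


def count_false_positives(gold: Sequence[str], predicted: Sequence[str]) -> int:
    """Count items the model flagged as RISK that reviewers labeled NO_RISK."""
    return sum(1 for g, p in zip(gold, predicted) if g == "NO_RISK" and p == "RISK")


def relative_reduction(before: int, after: int) -> float:
    """Fraction of the original false positives that were eliminated."""
    if before == 0:
        return 0.0
    return (before - after) / before


# Invented data: six non-risk complaints followed by two genuine risk complaints.
gold = ["NO_RISK"] * 6 + ["RISK"] * 2
baseline = ["RISK", "RISK", "RISK", "RISK", "RISK", "NO_RISK", "RISK", "RISK"]
revised = ["RISK", "RISK", "RISK", "NO_RISK", "NO_RISK", "NO_RISK", "RISK", "RISK"]

fp_before = count_false_positives(gold, baseline)
fp_after = count_false_positives(gold, revised)
print(f"False positives: {fp_before} -> {fp_after}")
print(f"Relative reduction: {relative_reduction(fp_before, fp_after):.0%}")
```

It is worth confirming on the same sample that the genuine RISK items are still flagged, since a prompt that simply flags less will also reduce false positives.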

Phase 2: Add Retrieval-Augmented Generation (RAG)

If prompting alone isn't enough, the next step is to "ground" the model with your specific internal knowledge. RAG fetches relevant information from a knowledge base (KB) and adds it to the prompt at runtime.

How to Implement:

  • Create a small RAG Knowledge Base: This KB should contain:
    • Your internal regulatory definitions.
    • Your full risk taxonomy.
    • 20-30 high-quality examples for *each* risk category.
  • Use Vector Embeddings: When a new complaint comes in, use its vector embedding to find the most relevant policy or example from your KB.
  • Update the Prompt: Add a new block to your prompt, like: "Use the following internal policy to help you classify: [Fetched Policy/Example Here]".
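
The retrieval step can be sketched as follows. To stay self-contained, this example replaces a real embedding model with a toy bag-of-words vector, and the three knowledge base entries are invented. In practice, embed would call whichever embedding model your RAG infrastructure uses.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented knowledge base entries, for illustration only.
KB = [
    "Data breach policy: a customer reporting unauthorized access to personal data is RISK.",
    "Billing disputes: complaints about charges or refunds without an alleged violation are NO_RISK.",
    "Legal threats: a customer stating an intent to sue or contact a regulator is RISK.",
]


def retrieve(complaint: str, kb: List[str] = KB, k: int = 1) -> List[str]:
    """Return the k knowledge base entries most similar to the complaint."""
    query = embed(complaint)
    ranked = sorted(kb, key=lambda doc: cosine(query, embed(doc)), reverse=True)
    return ranked[:k]


def add_policy_block(base_prompt: str, complaint: str) -> str:
    """Append the fetched policy and the complaint to the base prompt."""
    policy = "\n".join(retrieve(complaint))
    return (
        f"{base_prompt}\n\n"
        f"Use the following internal policy to help you classify:\n{policy}\n\n"
        f"Complaint: {complaint}"
    )


print(add_policy_block("You are a compliance classifier.",
                       "I am going to sue you and contact the regulator."))
```

A real deployment would precompute and index the knowledge base embeddings rather than embedding every entry on each request.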

Success Criteria: RAG can drop false positives by 40-60% by making classifications consistent with your policies.

Phase 3: Fine-Tune the Model

This is the most powerful method but also the most time-consuming. Fine-tuning updates the model's underlying weights so that it specializes in *your specific* classification task.

How to Implement:

  • Create a Labeled Dataset: This is the most critical step. You need thousands of examples, expertly labeled as "RISK" or "NO_RISK." This dataset *must be balanced* to avoid bias.
  • Fine-Tune an LLM: Use a smaller, efficient model (such as GPT-4o mini or similar) for this task. Training on your labeled data will specialize its behavior.
  • Validate Metrics: After tuning, you must rigorously validate its performance using precision and recall metrics on a held-out test set.
  • Threshold Tuning: If the model outputs probabilities (e.g., "RISK: 90%"), you can tune the decision threshold (e.g., only flag as "RISK" if probability > 95%).
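
The validation and threshold-tuning steps can be illustrated with a short calculation. The sketch assumes the tuned model returns a probability of RISK for each item in the held-out test set; the gold labels and scores below are invented.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[bool], scores: Sequence[float],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when items scoring above the threshold are flagged."""
    tp = fp = fn = 0
    for is_risk, score in zip(gold, scores):
        flagged = score > threshold
        if flagged and is_risk:
            tp += 1
        elif flagged and not is_risk:
            fp += 1
        elif not flagged and is_risk:
            fn += 1
    # Convention used here: report 1.0 when the denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Invented held-out set: True means reviewers labeled the complaint RISK.
gold = [True, True, True, False, False, False, False, False]
scores = [0.99, 0.97, 0.80, 0.93, 0.70, 0.40, 0.20, 0.05]

for threshold in (0.50, 0.90, 0.95):
    p, r = precision_recall(gold, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold removes false positives but can also drop genuine risk items, so the threshold should be chosen with both numbers in view.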

Success Criteria: Fine-tuning can achieve a 70-90% reduction in false positives, providing the highest accuracy.

Phase 4: Optional Calibration Layer

This is an advanced technique for when you need to precisely tune your false positive vs. false negative trade-off.

How to Implement:

  • Instead of taking the LLM's final "RISK" / "NO_RISK" answer, take its internal outputs (like embeddings or logits).
  • Feed these outputs as features into a simpler, more "tunable" model, like an XGBoost or Logistic Regression classifier.
  • This second-layer model's decision threshold can be precisely tuned to achieve your exact target for precision or F1-score.
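
A second-layer classifier along these lines might look like the sketch below. It hand-rolls a small logistic regression so that it runs without dependencies; in practice a library implementation would normally be used instead. The two features per complaint are hypothetical stand-ins for the embeddings or logits mentioned above, and all data is invented.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability of RISK for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X: List[List[float]], y: List[int],
                 lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias with plain batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def pick_threshold(probs: Sequence[float], y: Sequence[int],
                   target_precision: float) -> float:
    """Lowest threshold whose precision on this set meets the target."""
    for t in sorted(set(probs)):
        flagged = [yi for p, yi in zip(probs, y) if p >= t]
        if flagged and sum(flagged) / len(flagged) >= target_precision:
            return t
    return 1.0  # No threshold meets the target; flag almost nothing.


# Hypothetical features: [LLM risk score, explicit legal term present (0 or 1)].
X = [[0.9, 1], [0.8, 1], [0.95, 0], [0.7, 0], [0.6, 0], [0.85, 1], [0.3, 0], [0.9, 0]]
y = [1, 1, 0, 0, 0, 1, 0, 0]

w, b = train_logreg(X, y)
# For brevity the threshold is picked on the training data; use a separate
# validation set in practice.
probs = [predict_proba(w, b, x) for x in X]
threshold = pick_threshold(probs, y, target_precision=1.0)
print(f"threshold={threshold:.3f}")
```

Because the threshold is chosen against an explicit precision target, the false positive rate becomes a setting you control rather than a side effect of the prompt.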

Success Criteria: Gives you maximum control over the decision threshold to minimize false positives to a very low, specific level.

Which Approach Should You Use?

Each method offers a different balance of effort, cost, and accuracy, trading the effort required against the potential impact (reduction in false positives). The breakdown below details when to use each approach, along with its pros and cons.

[Chart: Effort vs. Impact of Techniques]

1. Prompt Improvement

Use When:
You are just starting, need a fast, free fix, or have not yet tried anything else.
Pros:
Fast, free, and easy to iterate.
Cons:
Has a limited ceiling; can't fix deep-seated model biases.

2. RAG (Retrieval-Augmented)

Use When:
You have existing policies, definitions, or a risk taxonomy that must be followed.
Pros:
Ensures classifications are "grounded" in your rules; highly consistent.
Cons:
Requires creating and maintaining a knowledge base (KB).

3. Fine-Tuning

Use When:
You have a large, clean, labeled dataset and need the highest possible accuracy.
Pros:
The most accurate and powerful method; specializes the model to your task.
Cons:
Expensive and time-consuming; requires significant data labeling effort.

4. Calibration Layer

Use When:
You must reduce false positives to a very low, specific level and need precise control.
Pros:
Highly tunable to trade off precision and recall.
Cons:
Adds engineering complexity to your system.

Recommended Path For Your Project

This recommendation assumes that you already have structured complaint data, internal agent comments, and RAG infrastructure in place.

👉 Recommended Strategy:

You are in a strong position to combine multiple techniques for an optimal result.

  1. Start With:
    • Prompt Enhancement: Apply all Phase 1 fixes immediately.
    • Incorporate Agent Notes: Use your agent comments to clarify complaints. An agent's factual account helps offset exaggeration in the customer's own wording.
    • Deploy RAG: Activate your existing RAG infrastructure, populating it with your risk policies and examples.
  2. Then Move To:
    • Fine-Tuning: Once you have used your RAG-enhanced system to create a large, *clean* labeled dataset, use it to fine-tune a model for maximum accuracy.
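
One way to incorporate agent notes is to pass them to the classifier as a separate, clearly labeled section alongside the complaint. This is a minimal sketch; the function name and section headings are illustrative assumptions, not a prescribed format.

```python
from typing import Optional


def build_classifier_input(complaint: str, agent_notes: Optional[str] = None) -> str:
    """Combine the customer's complaint with any internal agent notes."""
    sections = ["Customer complaint:", complaint.strip()]
    if agent_notes and agent_notes.strip():
        # Assumption: the notes are shown as context, separate from the complaint.
        sections += ["", "Internal agent notes:", agent_notes.strip()]
    return "\n".join(sections)


print(build_classifier_input(
    "You people stole my money, this is fraud!",
    "Customer was charged a late fee per the published fee schedule. No unauthorized charge found.",
))
```

Keeping the two sources in separate sections lets the prompt instruct the model on how much weight each should carry.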