Building a No-Vibe LLM Evaluation System: A Practical How-To Guide


Introduction

If you've ever relied on an LLM evaluation process that works like a vibe check, scoring outputs with vague metrics and subjective human judgment, you know the frustration. Hallucinations slip through, and decisions aren't reproducible. I built a lightweight evaluation layer in pure Python that replaces that guesswork with a structured approach. By scoring attribution, specificity, and relevance separately, it is designed to flag unsupported claims before they reach production. This guide walks you through building your own version, step by step.

Source: towardsdatascience.com

What You Need

  • Basic knowledge of Python (functions, lists, dictionaries)
  • Python 3.8+ installed on your machine
  • A few sample LLM outputs (text strings) and their expected source documents or knowledge base
  • Optional: a simple vector database or list of facts for attribution checks
  • No external libraries required: the Python standard library is enough

Step-by-Step Instructions

Step 1: Define Your Evaluation Criteria

Before writing code, clarify what each metric means in your context:

  • Attribution: Does the output cite or rely on provided sources? (Yes/No or a score)
  • Specificity: Does the output contain concrete details (numbers, names, dates) rather than vague generalities?
  • Relevance: Does the output directly answer the query or stay on topic?

Write these definitions down as clear rules. For example: “An output is attributed if at least 70% of its claims can be traced to a known source.”
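The example rule above can be written as a small function so that the cutoff is explicit rather than implied. This is a minimal sketch: the name `is_attributed` and the `min_fraction` parameter are illustrative assumptions, not part of the system built in the later steps.

```python
def is_attributed(claim_flags, min_fraction=0.7):
    """Return True if enough claims were traced to a known source.

    claim_flags: list of booleans, one per claim (True = traced to a source).
    min_fraction: hypothetical cutoff taken from the 70% example rule.
    """
    if not claim_flags:
        return False  # no claims means nothing was attributed
    traced = sum(claim_flags)
    return traced / len(claim_flags) >= min_fraction


# Three of four claims traced: 0.75 clears the 0.7 cutoff
print(is_attributed([True, True, True, False]))  # True
# One of two claims traced: 0.5 does not
print(is_attributed([True, False]))  # False
```

Writing the rule this way also makes it easy to record why a given cutoff was chosen.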

Step 2: Build a Function to Parse LLM Output

Create a Python function that breaks the LLM text into individual claims (sentences or clauses).

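import re

def parse_claims(text):
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop fragments of 10 characters or fewer, which are rarely full claims
    return [s for s in sentences if len(s) > 10]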

This gives you a list of claim strings to evaluate independently.
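As a quick check of what the splitter produces, here is a short usage sketch. The sample text is made up for illustration, and the function is repeated so the snippet runs on its own.

```python
import re

def parse_claims(text):
    # Same splitting rule as above: break after ., !, or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if len(s) > 10]

sample = "The Eiffel Tower is in Paris. Yes! It was completed in 1889."
claims = parse_claims(sample)
print(claims)
# ['The Eiffel Tower is in Paris.', 'It was completed in 1889.']
```

The short fragment "Yes!" is dropped by the length filter. Note that this rule treats every period followed by whitespace as a sentence boundary, so abbreviations such as "Dr." will be split as well.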

Step 3: Implement the Attribution Check

Attribution checks that each claim is backed by a source. You'll need a reference set of facts (e.g., a dictionary of {fact: source_id}). The function below checks whether any known fact appears in the claim as a case-insensitive substring, which is the simplest form of exact matching; fuzzy matching can replace it later.

def check_attribution(claim, knowledge_base):
    for fact, source in knowledge_base.items():
        if fact.lower() in claim.lower():
            return True, source
    return False, None

The function returns a boolean and the matching source ID (or None when nothing matches). For a more robust system, use TF-IDF or an embedding model, but pure Python works for a prototype.
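Substring matching fails as soon as a claim rewords a fact. One pure-Python step toward fuzzy matching is token overlap: the fraction of a fact's words that also appear in the claim. The sketch below is an assumption about how that could look, not the original system's method; `tokenize`, `check_attribution_fuzzy`, and `min_overlap` are hypothetical names.

```python
import re

def tokenize(text):
    # Lowercase and keep only alphanumeric word tokens
    return set(re.findall(r'[a-z0-9]+', text.lower()))

def check_attribution_fuzzy(claim, knowledge_base, min_overlap=0.8):
    """Return (True, source) if the best-matching fact's tokens mostly appear in the claim."""
    claim_tokens = tokenize(claim)
    best_score, best_source = 0.0, None
    for fact, source in knowledge_base.items():
        fact_tokens = tokenize(fact)
        if not fact_tokens:
            continue  # skip empty facts to avoid dividing by zero
        score = len(fact_tokens & claim_tokens) / len(fact_tokens)
        if score > best_score:
            best_score, best_source = score, source
    if best_score >= min_overlap:
        return True, best_source
    return False, None

kb = {"the eiffel tower was completed in 1889": "doc-1"}
# Reordered wording still matches: 6 of 7 fact tokens appear in the claim
print(check_attribution_fuzzy("Completed in 1889, the Eiffel Tower still stands.", kb))
# (True, 'doc-1')
# Unrelated claim: only 2 of 7 fact tokens appear
print(check_attribution_fuzzy("The Louvre is a museum in Paris.", kb))
# (False, None)
```

Token overlap has a clear limit: a claim that changes a single token, such as a different year, can still clear the cutoff, so numbers and dates may warrant a separate exact comparison.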

Step 4: Implement the Specificity Check

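Specificity measures detail. Count digit sequences and proper nouns, approximated here as capitalized words that aren't at the start of the sentence. Create a scoring function:

import re

def check_specificity(claim):
    # Count runs of digits (numbers, years, quantities)
    digits = len(re.findall(r'\d+', claim))
    # Count capitalized words, skipping the first word of the claim,
    # which is capitalized whether or not it is a proper noun
    later_words = claim.split()[1:]
    proper_nouns = sum(1 for w in later_words if re.match(r'[A-Z][a-z]+', w))
    return (digits + proper_nouns) > 2  # arbitrary threshold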

Adjust the threshold based on your domain.
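To make that adjustment easier, the count and the cutoff can be separated so the threshold becomes a parameter. This is a sketch under that assumption; `specificity_score` and `is_specific` are hypothetical names, and capitalized words are counted only after the first word, as the description above specifies.

```python
import re

def specificity_score(claim):
    # Digit runs plus capitalized words after the first word
    digits = len(re.findall(r'\d+', claim))
    capitalized = sum(1 for w in claim.split()[1:] if re.match(r'[A-Z][a-z]+', w))
    return digits + capitalized

def is_specific(claim, threshold=2):
    # threshold is a hypothetical parameter; 2 mirrors the cutoff used above
    return specificity_score(claim) > threshold

# Two digit runs (11, 1969) plus two capitalized words (Moon, July)
print(specificity_score("Apollo 11 landed on the Moon in July 1969."))  # 4
# No digits and no capitalized words after the first
print(specificity_score("It happened a long time ago somewhere."))  # 0
```

Returning the raw score alongside the boolean also gives you something to tune against in a test set.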


Step 5: Implement the Relevance Check

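Relevance compares the output to the user's query. Simple keyword overlap or character n-grams both work; the function below uses keyword overlap:

def check_relevance(output, query):
    query_words = set(query.lower().split())
    output_words = set(output.lower().split())
    if not query_words:
        return False  # an empty query has nothing to match against
    # Fraction of query words that also appear in the output
    overlap = len(query_words & output_words) / len(query_words)
    return overlap > 0.3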

Again, tune the threshold.
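Whole-word overlap misses inflected forms: "refund" and "refunds" count as different words. Character n-grams, the other option mentioned in this step, are more forgiving. The sketch below is one possible version; `char_ngrams` and `check_relevance_ngrams` are hypothetical names, and the defaults `n=3` and `threshold=0.3` are assumptions to tune.

```python
def char_ngrams(text, n=3):
    # Normalize case and whitespace, then collect every n-character window
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def check_relevance_ngrams(output, query, n=3, threshold=0.3):
    query_grams = char_ngrams(query, n)
    if not query_grams:
        return False  # query is empty or shorter than n characters
    output_grams = char_ngrams(output, n)
    # Fraction of the query's n-grams that also occur in the output
    overlap = len(query_grams & output_grams) / len(query_grams)
    return overlap > threshold

answer = "Refunds are processed within 14 days under our policies."
# No whole query word appears in the answer, but most trigrams do
print(check_relevance_ngrams(answer, "refund policy"))  # True
print(check_relevance_ngrams("The weather is sunny today.", "refund policy"))  # False
```

The trade-off is that short n-grams can match unrelated words that share letter sequences, so the threshold usually needs to be set higher than for whole-word overlap.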

Step 6: Combine into a Decision Layer

Create a single function that takes output, query, and knowledge base, then returns a pass/fail decision and a report.

def evaluate_output(output, query, knowledge_base):
    claims = parse_claims(output)
    results = []
    for claim in claims:
        attributed, source = check_attribution(claim, knowledge_base)
        results.append({
            'claim': claim,
            'attribution': attributed,
            'source': source,
            'specificity': check_specificity(claim),
            'relevance': check_relevance(claim, query),
        })
    # Decision: pass only if there is at least one claim and every claim meets all criteria
    # (all() on an empty list is True, so an empty output would otherwise pass)
    passed = bool(results) and all(
        r['attribution'] and r['specificity'] and r['relevance'] for r in results
    )
    return {'passed': passed, 'details': results}

This is the core layer that replaces “vibes” with reproducible decisions.
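A pass/fail flag alone does not say which check is doing the rejecting. One way to read the report is to count failures per criterion. The helper below is a sketch, with `summarize_failures` as a hypothetical name; the report is hand-written in the same shape that evaluate_output returns, so the snippet runs on its own.

```python
def summarize_failures(report):
    """Count how many claims failed each criterion in an evaluate_output-style report."""
    counts = {'attribution': 0, 'specificity': 0, 'relevance': 0}
    for row in report['details']:
        for criterion in counts:
            if not row[criterion]:
                counts[criterion] += 1
    return counts

# Hand-written report for illustration only
report = {
    'passed': False,
    'details': [
        {'claim': 'The Eiffel Tower was completed in 1889.',
         'attribution': True, 'specificity': True, 'relevance': True},
        {'claim': 'It is a very famous landmark.',
         'attribution': False, 'specificity': False, 'relevance': True},
    ],
}
print(summarize_failures(report))
# {'attribution': 1, 'specificity': 1, 'relevance': 0}
```

If one criterion accounts for most failures across many outputs, that check or its threshold is the first place to look.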

Step 7: Test and Iterate

Run your function on a set of known-good and known-bad examples. Adjust thresholds and criteria until false positives/negatives are minimized. Log every failure to improve your knowledge base and rules. Over time, you can add more sophisticated checks (e.g., contradiction detection) while keeping the same three-pillar architecture.
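The testing loop described above can be sketched as a small harness that tallies correct and incorrect decisions and keeps the misclassified examples for review. The names `confusion_counts` and `toy_evaluator` are hypothetical, the examples are made up, and the toy evaluator is a stand-in for your real decision function.

```python
def confusion_counts(evaluator, labeled_examples):
    """Tally evaluator decisions against labels.

    evaluator: callable taking (output, query) and returning True for pass.
    labeled_examples: list of (output, query, expected_pass) tuples.
    """
    counts = {'true_pass': 0, 'true_fail': 0, 'false_pass': 0, 'false_fail': 0}
    misclassified = []
    for output, query, expected in labeled_examples:
        actual = evaluator(output, query)
        if actual == expected:
            counts['true_pass' if actual else 'true_fail'] += 1
        else:
            counts['false_pass' if actual else 'false_fail'] += 1
            misclassified.append((output, query, expected))
    return counts, misclassified

def toy_evaluator(output, query):
    # Stand-in rule for illustration: pass anything that contains a digit
    return any(ch.isdigit() for ch in output)

examples = [
    ("The tower opened in 1889.", "When did it open?", True),
    ("It opened a long time ago.", "When did it open?", False),
    ("It opened in 1925.", "When did it open?", False),  # wrong year, should fail
]
counts, misclassified = confusion_counts(toy_evaluator, examples)
print(counts)
# {'true_pass': 1, 'true_fail': 1, 'false_pass': 1, 'false_fail': 0}
```

The misclassified list is what feeds the failure log: each entry points either to a missing fact in the knowledge base or to a rule that needs adjusting.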

Tips for Success

  • Start small: Don't try to cover every edge case. Build for one use case first, then expand.
  • Keep it modular: Each check should be independently testable. You can replace simple functions with ML models later.
  • Use threshold tuning: Run a grid search over your test set to find the best cutoff values for specificity and relevance.
  • Document your criteria: Write down why you chose each threshold—this makes the system reproducible and debuggable.
  • Combine with human review: For high-stakes applications, use this layer as a first filter, then pass borderline cases to a human.
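
The threshold-tuning tip can be sketched as a one-dimensional grid search: try each candidate cutoff against labeled scores and keep the most accurate one. The function name `grid_search_threshold` is hypothetical and the scores below are made-up illustration data, not measured results.

```python
def grid_search_threshold(scored_examples, candidates):
    """Return (best_threshold, best_accuracy) for the rule 'score > threshold'.

    scored_examples: list of (score, expected_pass) pairs.
    candidates: iterable of thresholds to try; ties keep the earliest candidate.
    """
    if not scored_examples:
        raise ValueError("need at least one labeled example")
    best_threshold, best_accuracy = None, -1.0
    for threshold in candidates:
        correct = sum((score > threshold) == expected
                      for score, expected in scored_examples)
        accuracy = correct / len(scored_examples)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy

# Made-up (relevance overlap, should_pass) pairs for illustration only
scored = [(0.10, False), (0.25, False), (0.35, True), (0.40, True), (0.55, True)]
print(grid_search_threshold(scored, [0.1, 0.2, 0.3, 0.4, 0.5]))
# (0.3, 1.0)
```

With a small test set, several thresholds often tie or overfit, so it helps to hold back some labeled examples and confirm the chosen cutoff on them.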