SVCH Research Review — June 2026

Your CRM Data Already Knows Which Leads Will Convert.
An LLM Finally Unlocks It.

NDCG@1 score: 0.936 (the top-ranked lead converts about 9 times in 10)

Paper: “LLM-based Hierarchical Preference Learning for Intelligent Sales Lead Scoring”
arXiv: 2606.04387 | Published: June 2026
Researchers: Zhang · Liu · Sun · Zhang · Cao • Li Auto Inc.

Every sales leader has the same frustration. Your CRM is full of data. Call notes. Email threads. Follow-up records. Meeting outcomes. Win-loss reasons. All of it sitting in text fields. And your current lead scoring system cannot read any of it.

This paper from Li Auto’s AI team changes that. Their LLM-based hierarchical preference framework, HPL-Rank, achieves an NDCG@1 of 0.936 on a 25,000-lead automotive dataset, outperforming XGBoost by 8.3% and beating zero-shot GPT-4o pairwise ranking (0.850). The model reads unstructured CRM notes, understands funnel stages, and ranks leads the way sales reps actually think: comparatively.

For a Chief AI Officer evaluating AI in revenue operations, this is the first paper that takes the actual sales problem seriously, not a simplified proxy of it.

What Your System Is Ignoring Right Now

Traditional lead scoring treats these as invisible

Call notes from sales reps
Email thread summaries
Follow-up interaction logs
Meeting outcomes and objections
Win-loss reasons in free text
Funnel stage transition notes

Why This Paper Fills a Blind Spot

E-commerce recommendation is simple: users see something, click or buy, done. High-stakes sales is different. Automotive. Real estate. Enterprise B2B. The decision cycle runs weeks or months. A lead touches multiple reps, multiple demonstrations, and multiple follow-ups. The conversion signal is spread across dozens of unstructured CRM entries, not a single click.

Three problems have made AI lead scoring fail in these domains until now:

PROBLEM 1

Sparse supervision signals. High-stakes sales have few conversions relative to total leads. Standard ML models overfit to noise when training data is thin and label distribution is skewed toward non-converts.

PROBLEM 2

A semantic gap in unstructured CRM logs. The richest conversion signals live in free-text fields. XGBoost and other gradient-boosted tree models cannot read raw text directly. Structured features alone miss the context that determines which leads are genuinely warm.

PROBLEM 3

Wrong optimization target. Sales reps don’t think in absolute probabilities. They think comparatively: “this lead is hotter than that one.” Point-wise scoring produces a number. Pairwise ranking produces the answer your team actually uses: who to call first.
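The difference between the two targets can be made concrete with a small sketch. It assumes a RankNet-style pairwise logistic loss, a common choice for preference learning; this review does not state the paper's exact objective, and both function names are illustrative.

```python
import math


def pointwise_log_loss(score: float, converted: bool) -> float:
    """Cross-entropy for one lead: pushes its score toward an absolute probability."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -math.log(p) if converted else -math.log(1.0 - p)


def pairwise_logistic_loss(score_preferred: float, score_other: float) -> float:
    """Assumed RankNet-style loss: small when the preferred lead scores higher."""
    return math.log1p(math.exp(-(score_preferred - score_other)))


# Only the margin between the two leads matters, not the absolute scores.
same_margin_a = pairwise_logistic_loss(2.0, 0.0)
same_margin_b = pairwise_logistic_loss(5.0, 3.0)
wrong_order = pairwise_logistic_loss(0.0, 2.0)
print(same_margin_a, same_margin_b, wrong_order)
```

The first two calls give the same loss because the margin is the same, while the wrong ordering is penalized more heavily. That is the sense in which pairwise training optimizes "who to call first" rather than a calibrated number.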

Methodology, Explained Simply

The model answers one question: “given two leads, which should your rep call first?” It learns from pairwise preferences, the same mental model reps use when they say “call Smith before Jones.” And it does this within a hierarchical funnel, not a flat list.

Ingest raw CRM data

Call notes, email summaries, follow-up records, and funnel stage labels. No cleaning, no manual tagging. The raw text your reps already type into Salesforce or HubSpot becomes the training input.
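As a rough illustration of this step, a CRM record can be flattened into a single text block for an LLM. The `LeadRecord` fields and the `lead_to_text` layout below are hypothetical and not taken from the paper; the only processing is whitespace trimming and dropping empty notes.

```python
from dataclasses import dataclass, field


@dataclass
class LeadRecord:
    """Hypothetical shape of one exported CRM lead."""
    lead_id: str
    funnel_stage: str
    notes: list[str] = field(default_factory=list)


def lead_to_text(lead: LeadRecord) -> str:
    """Serialize a lead into plain text, keeping the reps' notes verbatim."""
    lines = [
        f"Lead {lead.lead_id}",
        f"Funnel stage: {lead.funnel_stage}",
        "CRM notes:",
    ]
    # Skip empty entries; otherwise leave the free text untouched.
    lines += [f"- {note.strip()}" for note in lead.notes if note.strip()]
    return "\n".join(lines)


example = LeadRecord(
    lead_id="A17",
    funnel_stage="test drive completed",
    notes=["Asked about financing options. ", ""],
)
print(lead_to_text(example))
```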

Hierarchical funnel segmentation

Leads are grouped by funnel stage. A lead at “test drive completed” and a lead at “initial online inquiry” are scored separately within their stage, then ranked across stages. In the ablation, flattening everything into one list lowers the score from 0.936 to 0.912, a 0.024 NDCG drop.
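One way to picture the stage-then-rank idea is the sketch below. It makes two loud assumptions that this review does not confirm: that later funnel stages are ranked ahead of earlier ones, and that each lead already has a within-stage score. The stage list (including "showroom visit") and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical funnel, earliest stage first.
FUNNEL_ORDER = ["initial online inquiry", "showroom visit", "test drive completed"]


def rank_hierarchically(leads, stage_order=FUNNEL_ORDER):
    """Rank (lead_id, stage, score) tuples within each stage, then across stages.

    Assumption: leads in later stages come first; scores only break ties
    between leads that share a stage.
    """
    by_stage = defaultdict(list)
    for lead_id, stage, score in leads:
        by_stage[stage].append((lead_id, score))

    unknown = set(by_stage) - set(stage_order)
    if unknown:
        raise ValueError(f"Unknown funnel stages: {sorted(unknown)}")

    ranked = []
    for stage in reversed(stage_order):
        in_stage = sorted(by_stage[stage], key=lambda item: item[1], reverse=True)
        ranked.extend(lead_id for lead_id, _ in in_stage)
    return ranked


leads = [
    ("a", "initial online inquiry", 0.9),
    ("b", "test drive completed", 0.4),
    ("c", "test drive completed", 0.7),
]
print(rank_hierarchically(leads))
```

Under these assumptions, lead "a" lands last despite its high score, because its score is only comparable to other leads at the same stage.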

Pairwise preference training

The LLM learns from pairwise comparisons: which of these two leads converted? This is the same comparison your sales reps make intuitively. Removing this training mode drops NDCG by 0.033 (from 0.936 to 0.903), the largest single-factor impact in the ablation study.
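A minimal sketch of how such comparisons could be derived from historical outcomes follows. It assumes pairs are formed only between a converted and a non-converted lead in the same funnel stage; the paper's actual pair-sampling strategy is not described in this review, and `build_preference_pairs` is an illustrative name.

```python
from itertools import product


def build_preference_pairs(history):
    """Turn (lead_id, stage, converted) rows into (preferred_id, other_id) pairs.

    Assumption: a converted lead is preferred over every non-converted lead
    in the same stage; leads in different stages are never paired.
    """
    by_stage = {}
    for lead_id, stage, converted in history:
        winners, losers = by_stage.setdefault(stage, ([], []))
        (winners if converted else losers).append(lead_id)

    pairs = []
    for winners, losers in by_stage.values():
        pairs.extend(product(winners, losers))
    return pairs


history = [
    ("a", "test drive completed", True),
    ("b", "test drive completed", False),
    ("c", "test drive completed", False),
    ("d", "initial online inquiry", True),
]
print(build_preference_pairs(history))
```

Note that one conversion among many non-converts yields many training pairs, which is one reason pairwise setups are often used when positive labels are scarce.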

Ranked lead list for your pipeline

The output is a ranked lead list ordered by conversion probability within funnel stage. Your reps start at the top. No retraining on new data sources, no new data collection. The 25,000-lead automotive dataset used only data every company with a CRM already has.

Results: How It Compares

NDCG@1 Benchmark Comparison — 25,000 Lead Automotive Dataset

GPT-4o, pairwise (zero-shot): 0.850
XGBoost, best traditional ML: 0.860
HPL-Rank, this paper’s model: 0.936

+8.3% over best competitor • Advantage holds at NDCG@3 and NDCG@5 • Ablation confirms both innovations independently required
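For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k. It assumes binary relevance (1 if the lead converted, 0 otherwise) and the standard log2 position discount; the paper's exact gain definition is not given in this review. The sample lists are made up.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items, in ranked order."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG of the model's ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical outcomes, listed in the order the model ranked the leads.
top_lead_converted = [1, 0, 1, 0, 0]
top_lead_missed = [0, 1, 1, 0, 0]

print(ndcg_at_k(top_lead_converted, 1))
print(ndcg_at_k(top_lead_missed, 1))
print(ndcg_at_k(top_lead_converted, 3))
```

Under the binary-relevance assumption, NDCG@1 for a single list is 1 when the top-ranked lead converted and 0 when it did not, so an average of 0.936 reads as the top pick being right in roughly 9 of 10 lists.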

Score breakdown by model

HPL-Rank (this paper): 0.936
XGBoost (best traditional ML): 0.860
GPT-4o pairwise (zero-shot): 0.850
HPL-Rank without hierarchy: 0.912
HPL-Rank without pairwise training: 0.903

+8.3% over XGBoost, the best traditional ML baseline
−0.033 NDCG drop without pairwise training (largest ablation factor)
−0.024 NDCG drop without hierarchical funnel design
25K leads in dataset: real automotive CRM data from Li Auto
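The two ablation figures follow directly from the scores listed above. The short check below uses only numbers quoted in this review; the dictionary keys are just labels.

```python
# NDCG@1 values as quoted in the score breakdown.
full_model = 0.936
ablations = {
    "without hierarchy": 0.912,
    "without pairwise training": 0.903,
}

# Drop relative to the full model, rounded to the precision the review reports.
drops = {name: round(full_model - score, 3) for name, score in ablations.items()}
largest = max(drops, key=drops.get)

for name, drop in drops.items():
    print(f"{name}: -{drop:.3f}")
print("largest single factor:", largest)
```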

Key Takeaways for Revenue and AI Leaders

Your CRM unstructured data is a lead-scoring goldmine you are not using

Call notes, email summaries, follow-up records. Every rep types these into the system. Traditional scoring ignores them. This model runs on them. The cost of implementation is integrating your CRM text into an LLM pipeline, not collecting new data.

Pairwise ranking beats pointwise scoring because it matches how sales teams think

When a rep says “call Smith before Jones,” that is a pairwise comparison, not an absolute score. The 0.033 NDCG drop from removing preference training confirms this is structural, not incidental. If your current AI scoring tool outputs a probability score rather than a ranked order, it is solving a different problem.

Ask your AI vendor one diagnostic question

“Does your model score leads hierarchically by funnel stage, or does it flatten everything into one list?” Most vendors flatten. The paper shows that flattening costs 0.024 NDCG. Hierarchical scoring is the structural differentiator, and it is a question any serious vendor should be able to answer immediately.

A fine-tuned domain model can beat GPT-4o on the task that actually matters

HPL-Rank scores 0.936. GPT-4o zero-shot scores 0.850. For a Chief AI Officer evaluating build-vs-buy on AI in revenue operations, this is a concrete data point: general-purpose LLMs are not the answer for specialized ranking tasks when domain-specific training data exists.

If your lead scoring doesn’t read CRM notes, you are leaving signal on the table

The model in this paper works with data your team already logs. You already have the training signal. The question is not whether your organization is ready to collect new data. The question is whether your current vendor is using the data you already have.

Thanks to the Researchers

Chenyu Zhang, Li Auto Inc.
Yiwen Liu, Li Auto Inc.
Yin Sun, Li Auto Inc.
Xinyuan Zhang, Li Auto Inc.
Yuji Cao, Li Auto Inc.

Frequently Asked Questions

What does this mean for a Chief AI Officer?

A Chief AI Officer evaluating AI in revenue operations now has a concrete benchmark to hold vendors against. NDCG@1 of 0.936 on real CRM data is the best result reported in this paper’s comparison for high-stakes sales lead scoring. Any vendor pitching a lead scoring solution should be able to show you their NDCG score on a held-out test set, not just their demo conversion rate. If they cannot, that is a red flag.

Does this approach only work for automotive sales?

The paper uses a 25,000-lead automotive dataset from Li Auto, so the results are validated only in that domain. The methodology, however, is designed for any high-stakes, multi-touch sales process: real estate, enterprise B2B, financial services, healthcare equipment. The key condition is that your sales cycle produces CRM text data across multiple funnel stages. If it does, the framework should transfer with little structural change.

How does an AI Assessment for companies from Silicon Valley Certification Hub evaluate AI in sales operations?

The AI Assessment for companies at Silicon Valley Certification Hub includes a structured review of your current lead scoring approach, the data sources it uses or ignores, and the gap between your system’s outputs and how your reps actually prioritize pipeline. We help you frame the right vendor questions and evaluate proposals against published benchmarks like this one, not just sales presentations.

What is the implementation cost for a company that wants to test this approach?

The primary cost is engineering: connecting your CRM text fields to an LLM fine-tuning pipeline and establishing a pairwise labeling process from historical conversion data. No new data collection is required. The paper used data that every company with a modern CRM already logs: call records, pipeline stages, and conversion outcomes. The compute cost of fine-tuning a small LLM on 25,000 leads is modest by enterprise standards.

What should revenue and AI leaders do this quarter?

Start with one diagnostic question to your current AI lead scoring vendor: does the model read unstructured CRM text, and does it score leads hierarchically by funnel stage? If the answer to either is no, you have a concrete gap to close. This paper gives you the benchmark to make that conversation specific, not conceptual.

Want to know how this applies to your company?

At Silicon Valley Certification Hub, we help you align AI + Strategy. Our team works directly with your directors and teams to assess AI readiness, identify gaps, and build a clear path forward — tailored to your business context.

Book a time with our CEO, Alejandro Cuauhtemoc-Mejia

Silicon Valley Certification Hub | 3000 El Camino Real, Building 4, Palo Alto, CA

Alejandro Cuauhtemoc-Mejia
