Every sales leader has the same frustration. Your CRM is full of data. Call notes. Email threads. Follow-up records. Meeting outcomes. Win-loss reasons. All of it sitting in text fields. And your current lead scoring system cannot read any of it.
This paper from Li Auto’s AI team changes that. Their LLM-based hierarchical preference framework, HPL-Rank, achieves an NDCG@1 of 0.936 on a 25,000-lead automotive dataset, outperforming XGBoost by 8.3% and beating zero-shot GPT-4o (0.850) head-to-head. The model reads unstructured CRM notes, understands funnel stages, and ranks leads the way sales reps actually think: comparatively.
For a Chief AI Officer evaluating AI in revenue operations, this is the first paper that takes the actual sales problem seriously, not a simplified proxy of it.
What Your System Is Ignoring Right Now
Traditional lead scoring treats these as invisible:
Email thread summaries
Follow-up interaction logs
Meeting outcomes and objections
Win-loss reasons in free text
Funnel stage transition notes
Why This Paper Fills a Blind Spot
E-commerce recommendation is simple: users see something, click or buy, done. High-stakes sales is different. Automotive. Real estate. Enterprise B2B. The decision cycle runs weeks or months. A lead touches multiple reps, multiple demonstrations, and multiple follow-ups. The conversion signal is spread across dozens of unstructured CRM entries, not a single click.
Three problems have made AI lead scoring fail in these domains until now:
Sparse supervision signals. High-stakes sales produce few conversions relative to total leads. Standard ML models overfit to noise when training data is thin and the label distribution is skewed toward non-converts.
A semantic gap in unstructured CRM logs. The richest conversion signals live in free-text fields. XGBoost and gradient-boosted tree models cannot read them. Structured features alone miss the context that determines which leads are genuinely warm.
Wrong optimization target. Sales reps don’t think in absolute probabilities. They think comparatively: “this lead is hotter than that one.” Point-wise scoring produces a number. Pairwise ranking produces the answer your team actually uses: who to call first.
Methodology, Explained Simply
The model answers one question: “given two leads, which should your rep call first?” It learns from pairwise preferences, the same mental model reps use when they say “call Smith before Jones.” And it does this within a hierarchical funnel, not a flat list.
Ingest raw CRM data
Call notes, email summaries, follow-up records, and funnel stage labels. No cleaning, no manual tagging. The raw text your reps already type into Salesforce or HubSpot becomes the training input.
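As a rough illustration of this ingestion step, the sketch below flattens one lead's funnel stage and raw notes into a single text block that could be handed to an LLM. The `Lead` record, its field names, and `lead_to_text` are hypothetical; the paper's actual input format is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Lead:
    # Hypothetical record layout; these field names are assumptions,
    # not the paper's schema or any CRM vendor's API.
    lead_id: str
    funnel_stage: str
    notes: list[str] = field(default_factory=list)


def lead_to_text(lead: Lead) -> str:
    """Flatten one lead's stage label and raw notes into one text block."""
    lines = [f"Funnel stage: {lead.funnel_stage}"]
    # Notes stay in logged order; only surrounding whitespace and
    # empty entries are dropped.
    lines += [f"- {note.strip()}" for note in lead.notes if note.strip()]
    return "\n".join(lines)


lead = Lead(
    "L-001",
    "test drive completed",
    ["Called 3/2, asked about financing.", "  ", "Spouse wants a second test drive."],
)
text = lead_to_text(lead)
print(text)
```

The point of the sketch is that the text is passed through nearly verbatim; any heavier preprocessing would be a separate design decision.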
Hierarchical funnel segmentation
Leads are grouped by funnel stage. A lead at “test drive completed” and a lead at “initial online inquiry” are scored separately within their stage, then ranked across stages. Flattening everything into one list loses 0.024 NDCG.
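One way to picture stage-aware ranking is the minimal sketch below: leads are grouped by stage, sorted by score within each stage, and the stages are then concatenated in a fixed priority order. The priority order, the tuple layout, and `rank_hierarchically` are illustrative assumptions; the article does not say how the paper combines rankings across stages.

```python
from collections import defaultdict

# Assumed stage priority, most advanced first. The stage names come from the
# example in the text; the ordering rule is an illustrative assumption.
STAGE_ORDER = ["test drive completed", "initial online inquiry"]


def rank_hierarchically(leads):
    """leads: list of (lead_id, stage, within_stage_score) tuples."""
    by_stage = defaultdict(list)
    for lead_id, stage, score in leads:
        by_stage[stage].append((score, lead_id))

    ranked = []
    for stage in STAGE_ORDER:
        # Within a stage, the highest-scoring lead comes first.
        ordered = sorted(by_stage[stage], key=lambda pair: pair[0], reverse=True)
        ranked.extend(lead_id for _, lead_id in ordered)
    return ranked


leads = [
    ("A", "initial online inquiry", 0.9),
    ("B", "test drive completed", 0.4),
    ("C", "test drive completed", 0.7),
]
ranking = rank_hierarchically(leads)
print(ranking)  # ['C', 'B', 'A']
```

Note that lead A has the highest raw score but ranks last, because scores are only compared within a stage. A flat list would have put A first.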
Pairwise preference training
The LLM learns from pairwise comparisons: which of these two leads converted? This is the same comparison your sales reps make intuitively. Removing this training mode drops NDCG by 0.033, the largest single-factor impact in the ablation study.
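To make the pairwise setup concrete, here is a minimal sketch of how preference pairs might be derived from historical outcomes: for every two leads where exactly one converted, the converted lead is marked as preferred. Restricting pairs to leads in the same funnel stage is an assumption made for this example, and `build_preference_pairs` is a hypothetical helper, not the paper's procedure.

```python
from itertools import combinations


def build_preference_pairs(leads):
    """leads: list of (lead_id, stage, converted) tuples.

    Returns (preferred_id, other_id) pairs.
    """
    pairs = []
    for (id_a, stage_a, conv_a), (id_b, stage_b, conv_b) in combinations(leads, 2):
        # Skip cross-stage pairs (an assumption of this sketch) and pairs
        # where both leads had the same outcome, which carry no preference.
        if stage_a != stage_b or conv_a == conv_b:
            continue
        pairs.append((id_a, id_b) if conv_a else (id_b, id_a))
    return pairs


history = [
    ("A", "inquiry", True),
    ("B", "inquiry", False),
    ("C", "inquiry", False),
    ("D", "test drive", True),
]
pairs = build_preference_pairs(history)
print(pairs)  # [('A', 'B'), ('A', 'C')]
```

Even with a single conversion, the example yields two training pairs, which hints at why pairwise framing can help when conversions are sparse.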
Ranked lead list for your pipeline
The output is a ranked lead list ordered by conversion probability within funnel stage. Your reps start at the top. No retraining on new data sources, no new data collection. The 25,000-lead automotive dataset used only data every company with a CRM already has.
Results: How It Compares
NDCG@1 Benchmark Comparison: 25,000-Lead Automotive Dataset
[Chart: NDCG@1 score by model. Bar labels: Pairwise (zero-shot), Best traditional ML, This paper’s model. Plotted scores: 0.936, 0.860, 0.850, 0.912, 0.903.]
+8.3% over best competitor • Advantage holds at NDCG@3 and NDCG@5 • Ablation confirms both innovations independently required
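For readers unfamiliar with the metric, the sketch below computes NDCG@k for a single ranked list, using the common linear-gain, log2-discount definition. Whether the paper uses this exact variant is an assumption, and the toy relevance labels are invented for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top k positions (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """relevances: outcome labels in the order the model ranked the leads."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Toy example: 1 = converted, 0 = did not convert.
top_hit = ndcg_at_k([1, 0, 0], k=1)   # converted lead ranked first
top_miss = ndcg_at_k([0, 1, 0], k=1)  # converted lead ranked second
print(top_hit, top_miss)
```

With binary labels, NDCG@1 for a single list is 1.0 when the top-ranked lead converted and 0.0 when it did not, so a reported score such as 0.936 would be an average over many ranked lists.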
Key Takeaways for Revenue and AI Leaders
Your CRM unstructured data is a lead-scoring goldmine you are not using
Call notes, email summaries, follow-up records. Every rep types these into the system. Traditional scoring ignores them. This model runs on them. The cost of implementation is integrating your CRM text into an LLM pipeline, not collecting new data.
Pairwise ranking beats pointwise scoring because it matches how sales teams think
When a rep says “call Smith before Jones,” that is a pairwise comparison, not an absolute score. The 0.033 NDCG drop from removing preference training confirms this is structural, not incidental. If your current AI scoring tool outputs a probability score rather than a ranked order, it is solving a different problem.
Ask your AI vendor one diagnostic question
“Does your model score leads hierarchically by funnel stage, or does it flatten everything into one list?” Most vendors flatten. The paper shows that flattening costs 0.024 NDCG. Hierarchical scoring is the structural differentiator, and it is a question any serious vendor should be able to answer immediately.
A fine-tuned domain model can beat GPT-4o on the task that actually matters
HPL-Rank scores 0.936. GPT-4o zero-shot scores 0.850. For a Chief AI Officer evaluating build-vs-buy on AI in revenue operations, this is a concrete data point: general-purpose LLMs are not the answer for specialized ranking tasks when domain-specific training data exists.
If your lead scoring doesn’t read CRM notes, you are leaving signal on the table
The model in this paper works with data your team already logs. You already have the training signal. The question is not whether your organization is ready to collect new data. The question is whether your current vendor is using the data you already have.
Frequently Asked Questions
What does this mean for a Chief AI Officer?
A Chief AI Officer evaluating AI in revenue operations now has a concrete benchmark to hold vendors against. NDCG@1 of 0.936 on real CRM data is the current state of the art for high-stakes sales lead scoring. Any vendor pitching a lead scoring solution should be able to show you their NDCG score on a held-out test set, not just their demo conversion rate. If they cannot, that is a red flag.
Does this approach only work for automotive sales?
The paper uses a 25,000-lead automotive dataset from Li Auto, but the methodology applies to any high-stakes, multi-touch sales process: real estate, enterprise B2B, financial services, healthcare equipment. The key condition is that your sales cycle produces CRM text data across multiple funnel stages. If it does, the framework should transfer, although the reported results cover only automotive data.
How does an AI Assessment for companies from Silicon Valley Certification Hub evaluate AI in sales operations?
The AI Assessment for companies at Silicon Valley Certification Hub includes a structured review of your current lead scoring approach, the data sources it uses or ignores, and the gap between your system’s outputs and how your reps actually prioritize pipeline. We help you frame the right vendor questions and evaluate proposals against published benchmarks like this one, not just sales presentations.
What is the implementation cost for a company that wants to test this approach?
The primary cost is engineering: connecting your CRM text fields to an LLM fine-tuning pipeline and establishing a pairwise labeling process from historical conversion data. No new data collection is required. The paper used data that every company with a modern CRM already logs: call records, pipeline stages, and conversion outcomes. The compute cost of fine-tuning a small LLM on 25,000 leads is modest by enterprise standards.
What should revenue and AI leaders do this quarter?
Start with one diagnostic question to your current AI lead scoring vendor: does the model read unstructured CRM text, and does it score leads hierarchically by funnel stage? If the answer to either is no, you have a concrete gap to close. This paper gives you the benchmark to make that conversation specific, not conceptual.
Want to know how this applies to your company?
At Silicon Valley Certification Hub, we help you align AI + Strategy. Our team works directly with your directors and teams to assess AI readiness, identify gaps, and build a clear path forward — tailored to your business context.
Book a time with our CEO, Alejandro Cuauhtemoc-Mejia
Silicon Valley Certification Hub | 3000 El Camino Real, Building 4, Palo Alto, CA