A review of “Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?”
The paper asks a question that sounds simple: can large language models consistently make better investing decisions than the market? The authors argue that many earlier studies gave overly optimistic answers because they tested on short periods, too few stocks, or setups with hidden bias. Their response is to build a more demanding evaluation framework and see what survives.
The authors are Waylon Li and Tiejun Ma from the School of Informatics, The University of Edinburgh; Hyeonjun Kim from the Global Finance Research Center, Sungkyunkwan University; and Mihai Cucuringu from the Department of Mathematics, UCLA Anderson School of Management and the Department of Statistics, University of Oxford. The paper is titled Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?
The authors say many published results are inflated by survivorship bias, look-ahead bias, and data-snooping bias. That is academic language for a familiar problem: a model can look brilliant if the exam was too easy or slightly rigged.
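To make the first of those biases concrete, here is a minimal sketch of how survivorship bias can flatter an average. The tickers and returns are invented for illustration and do not come from the paper.

```python
from statistics import mean

# Hypothetical multi-year total returns for a made-up universe.
# None of these tickers or numbers are real or taken from the paper.
universe = {
    "AAA": 0.80,
    "BBB": 0.35,
    "CCC": 0.10,
    "DDD": -0.90,  # assumed to be delisted during the test period
    "EEE": -1.00,  # assumed to go bankrupt during the test period
}
delisted = {"DDD", "EEE"}

# A biased study looks only at the names that still exist at the end.
survivors_only = [r for ticker, r in universe.items() if ticker not in delisted]
# An honest study keeps everything that was investable at the start.
full_universe = list(universe.values())

survivor_mean = mean(survivors_only)
full_mean = mean(full_universe)

print(f"Survivors only: {survivor_mean:.1%}")
print(f"Full universe:  {full_mean:.1%}")
```

The strategy did nothing different in the two cases. Only the exam changed.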
The methodology is straightforward in spirit, even if the implementation is technically serious.
The authors introduce a backtesting framework called FINSABER and use it to test LLM-based investing strategies over two decades and across more than 100 symbols. They compare these LLM strategies against traditional rule-based methods, predictor-based approaches like ARIMA and XGBoost, reinforcement learning methods, and passive buy-and-hold benchmarks. They also try to reduce three major sources of bias: survivorship bias, look-ahead bias, and data-snooping bias.
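The comparison described above can be sketched in a few lines. This is not FINSABER and not the authors' code. It is a toy backtest under stated assumptions: prices come from a seeded random walk, the "strategy" is a hypothetical 50-day moving-average rule, and the function names are made up for this example. The one detail worth noticing is that the position held from day t to day t+1 depends only on prices up to day t, which is the basic guard against look-ahead bias.

```python
import random


def simulate_prices(n_days, seed=0, drift=0.0003, vol=0.01):
    """Synthetic daily closes from a seeded random walk (illustrative only)."""
    rng = random.Random(seed)
    prices = [100.0]
    for _ in range(n_days - 1):
        prices.append(prices[-1] * (1.0 + rng.gauss(drift, vol)))
    return prices


def moving_average_signal(prices, window):
    """signal[t] is 1 if close[t] is above its trailing `window`-day mean, else 0.

    Only prices up to and including day t are used.
    """
    signal = []
    for t in range(len(prices)):
        if t + 1 < window:
            signal.append(0)  # not enough history yet: stay in cash
        else:
            avg = sum(prices[t + 1 - window : t + 1]) / window
            signal.append(1 if prices[t] > avg else 0)
    return signal


def backtest(prices, signal):
    """Total return when the position held over day t -> t+1 is signal[t]."""
    equity = 1.0
    for t in range(len(prices) - 1):
        daily_return = prices[t + 1] / prices[t] - 1.0
        equity *= 1.0 + signal[t] * daily_return
    return equity - 1.0


prices = simulate_prices(252 * 5)  # roughly five years of trading days
buy_and_hold = backtest(prices, [1] * len(prices))
ma_rule = backtest(prices, moving_average_signal(prices, window=50))

print(f"Buy and hold:        {buy_and_hold:.1%}")
print(f"Moving-average rule: {ma_rule:.1%}")
```

A real evaluation would add transaction costs, many symbols, and rolling test windows. The point of the sketch is only that every strategy, however sophisticated, should be scored with the same loop as the passive benchmark.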
That matters because earlier studies often tested only a handful of stocks for a few months. This paper does the opposite. It asks: what happens when the strategy has to survive a much longer stretch of history, a broader universe of assets, and market conditions that are not conveniently curated? That is a far better way to judge whether something is useful in operations, finance, or any real decision environment.
The headline result is not very flattering for the LLM hype cycle. The authors write that “previously reported LLM advantages deteriorate significantly” once the testing becomes broader and longer. In other words, a lot of the magic fades when the evaluation becomes less generous. Their conclusion is even clearer: the “perceived superiority” of LLM-based investing methods weakens under more robust long-term testing.
One of the most interesting parts of the paper is the market regime analysis. The authors split the market into bull, bear, and sideways years and then ask whether the strategies behave intelligently under each condition.
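A regime split of this kind can be sketched as follows. The 10% threshold, the year labels, and all the return figures are assumptions made for illustration. The paper's actual classification rule and results may differ.

```python
from collections import defaultdict


def label_regime(annual_return, threshold=0.10):
    """Label a year by the market's annual return (threshold is an assumption)."""
    if annual_return > threshold:
        return "bull"
    if annual_return < -threshold:
        return "bear"
    return "sideways"


def excess_by_regime(market, strategy, threshold=0.10):
    """Average (strategy - market) annual return within each regime."""
    buckets = defaultdict(list)
    for year, market_return in market.items():
        regime = label_regime(market_return, threshold)
        buckets[regime].append(strategy[year] - market_return)
    return {regime: sum(vals) / len(vals) for regime, vals in buckets.items()}


# Made-up annual returns for a market index and a hypothetical strategy.
market = {"Y1": 0.22, "Y2": -0.18, "Y3": 0.03, "Y4": 0.15, "Y5": -0.25, "Y6": -0.04}
strategy = {"Y1": 0.10, "Y2": -0.24, "Y3": 0.02, "Y4": 0.09, "Y5": -0.30, "Y6": -0.01}

result = excess_by_regime(market, strategy)
for regime in ("bull", "bear", "sideways"):
    print(f"{regime:>8}: {result[regime]:+.1%} vs. market")
```

Grouping by regime is useful because a single overall average can hide the fact that a strategy behaves very differently when the market is rising, falling, or flat.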
This is where the LLM systems look especially shaky. The paper finds they are too cautious when markets are rising and too aggressive when markets are falling. That is almost impressively bad timing. The authors summarize it bluntly: current LLM strategies “miss upside in bull markets and incur heavy losses in bear markets.”
This result is important because it moves the discussion away from raw model complexity and toward something more practical: risk control. The authors argue that future systems should focus less on making the framework more elaborate and more on improving trend detection and regime-aware risk management. That is a very useful lesson beyond investing. In operations, the best AI system is not the one that sounds smartest. It is the one that adapts well when conditions change.
That is also why this paper is relevant for understanding AI for operations. Finance is basically an extreme operations environment: noisy data, changing conditions, delayed feedback, and costly mistakes. If an AI strategy cannot hold up under long-horizon testing, regime shifts, and realistic evaluation, it probably should not be trusted in other high-stakes operational systems either. This paper is really about a larger principle: before asking whether AI is powerful, ask whether the test was honest.
The main takeaways are simple.
First, many LLM-based investing claims look weaker once bias is reduced and testing becomes more realistic.
Second, long-term robustness matters more than short-term excitement.
Third, market awareness and risk controls seem more important than piling on architectural complexity.
And finally, this paper is a good reminder that evaluation design is not a side issue. In AI, it is often the whole story.
Paper: Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?
Link: https://arxiv.org/pdf/2505.07078v1
Frequently Asked Questions
What does this mean for a Chief AI Officer?
This research signals that AI hype in financial investing often outpaces reality—a pattern your organization should expect across other high-stakes domains. Your role is to demand rigorous, bias-resistant evaluation frameworks before deploying any AI system, rather than accepting early-stage published results as proof of capability. The study’s emphasis on long-term, multi-asset backtesting is a template for how you should validate AI claims internally.
Why do earlier AI investing studies show such inflated results compared to this paper’s findings?
Earlier studies typically suffered from survivorship bias (analyzing only companies that survived), look-ahead bias (accidentally using future information when testing on past data), and data-snooping bias (running so many strategies that some appear successful by chance alone). The FINSABER framework guards against these by testing across 20+ years and 100+ stocks, with strict temporal controls so that a strategy only sees information available at the time of each decision. Without such controls, even a mediocre strategy can appear market-beating.
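Data-snooping is the least intuitive of the three, so here is a small simulation of it. Everything in it is an assumption for illustration: the "strategies" are pure zero-mean noise with no skill at all, and the counts, volatility, and seed are arbitrary. It is not a reproduction of anything in the paper.

```python
import random


def random_strategy_return(rng, n_days, vol=0.01):
    """Total return of a strategy whose daily returns are zero-mean noise."""
    equity = 1.0
    for _ in range(n_days):
        equity *= 1.0 + rng.gauss(0.0, vol)
    return equity - 1.0


def snooping_experiment(n_strategies=200, n_days=252, seed=42):
    """Pick the best of many skill-free strategies, then re-test it on fresh data."""
    rng = random.Random(seed)
    in_sample = [random_strategy_return(rng, n_days) for _ in range(n_strategies)]
    out_of_sample = [random_strategy_return(rng, n_days) for _ in range(n_strategies)]
    best = max(range(n_strategies), key=lambda i: in_sample[i])
    average_in_sample = sum(in_sample) / n_strategies
    return in_sample[best], out_of_sample[best], average_in_sample


best_in, best_out, avg_in = snooping_experiment()
print(f"Best strategy, in-sample:      {best_in:+.1%}")
print(f"Same strategy, out-of-sample:  {best_out:+.1%}")
print(f"Average strategy, in-sample:   {avg_in:+.1%}")
```

The winner is selected purely by luck, so its in-sample number says nothing about the next period. With enough candidates, an impressive "best" result is close to guaranteed even when no strategy has any edge.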
How should we assess whether our AI systems have similar hidden evaluation biases?
Start by auditing your testing methodology through the lens of Silicon Valley Certification Hub’s AI Assessment for companies framework, which emphasizes temporal isolation, representative data samples, and explicit bias documentation. Ask whether your AI system was tested on data it could not have seen during training, whether edge cases are represented fairly, and whether comparable baseline methods were included. Many organizations discover significant validity issues only when forced to articulate these questions formally.
What should executives do next if they’re considering AI-driven decision-making in their organization?
Require your AI and data teams to pre-register their evaluation methodology and success criteria before testing begins, mirroring the standards this research applies to financial models. Demand comparison against simple rule-based baselines and require multi-year, multi-scenario validation before deployment, especially in domains where decisions move capital or affect stakeholders. Skepticism of published AI breakthroughs is not resistance to innovation—it is the price of responsible deployment.