A review of “Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?”
The paper asks a question that sounds simple: can large language models consistently make better investing decisions than the market? The authors argue that many earlier studies gave overly optimistic answers because they tested on short periods, too few stocks, or setups with hidden bias. Their response is to build a more demanding evaluation framework and see what survives.
The authors are Waylon Li and Tiejun Ma from the School of Informatics, The University of Edinburgh; Hyeonjun Kim from the Global Finance Research Center, Sungkyunkwan University; and Mihai Cucuringu from the Department of Mathematics, UCLA Anderson School of Management and the Department of Statistics, University of Oxford. The paper is titled “Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?”
The authors say many published results are inflated by survivorship bias, look-ahead bias, and data-snooping bias. That is academic language for a familiar problem: a model can look brilliant if the exam was too easy or slightly rigged.
The methodology is straightforward in spirit, even if the implementation is technically serious.
The authors introduce a backtesting framework called FINSABER and use it to test LLM-based investing strategies over two decades and across more than 100 symbols. They compare these LLM strategies against traditional rule-based methods, predictor-based approaches like ARIMA and XGBoost, reinforcement learning methods, and passive buy-and-hold benchmarks. They also try to reduce three major sources of bias: survivorship bias, look-ahead bias, and data-snooping bias.
That matters because earlier studies often tested only a handful of stocks for a few months. This paper does the opposite. It asks: what happens when the strategy has to survive a much longer stretch of history, a broader universe of assets, and market conditions that are not conveniently curated? That is a far better way to judge whether something is useful in operations, finance, or any real decision environment.
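To make the walk-forward discipline concrete, here is a minimal sketch of a backtest loop in which the trading signal may only see data up to the decision date, which rules out look-ahead bias by construction. This is purely illustrative and not the paper's FINSABER framework: the return series is synthetic and the two strategies (buy-and-hold and a simple momentum rule) are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns for one symbol over roughly 20 years
# (illustrative only, not real market data).
returns = rng.normal(0.0003, 0.01, size=252 * 20)

def backtest(signal_fn, returns, warmup=252):
    """Walk-forward backtest: the signal at day t may only use
    returns[:t], so the strategy can never peek at the future."""
    equity = 1.0
    for t in range(warmup, len(returns)):
        position = signal_fn(returns[:t])   # decided before seeing day t
        equity *= 1.0 + position * returns[t]
    return equity

def buy_and_hold(history):
    return 1.0  # always fully invested

def momentum(history):
    # Long if the trailing one-year return is positive, else in cash.
    return 1.0 if history[-252:].sum() > 0 else 0.0

print(backtest(buy_and_hold, returns))
print(backtest(momentum, returns))
```

A longer horizon and a broader symbol universe, as the paper advocates, amount to running this loop over many more years and assets so that a lucky stretch cannot carry the result.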
The headline result is not very flattering for the LLM hype cycle. The authors write that “previously reported LLM advantages deteriorate significantly” once the testing becomes broader and longer. In other words, a lot of the magic fades when the evaluation becomes less generous. Their conclusion is even clearer: the “perceived superiority” of LLM-based investing methods weakens under more robust long-term testing.
One of the most interesting parts of the paper is the market regime analysis. The authors split the market into bull, bear, and sideways years and then ask whether the strategies behave intelligently under each condition.
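A regime split of this kind can be sketched in a few lines. The thresholds and yearly returns below are illustrative assumptions, not the paper's actual classification rule or data:

```python
def label_regime(annual_return, bull=0.15, bear=-0.05):
    """Label a year by its total return (thresholds are illustrative)."""
    if annual_return >= bull:
        return "bull"
    if annual_return <= bear:
        return "bear"
    return "sideways"

# Hypothetical annual returns for three years.
years = {2008: -0.38, 2013: 0.30, 2015: 0.01}
labels = {y: label_regime(r) for y, r in years.items()}
print(labels)  # {2008: 'bear', 2013: 'bull', 2015: 'sideways'}
```

Once each year carries a label, a strategy's returns can be averaged within each regime, which is what lets the authors ask whether a strategy behaves sensibly in bull, bear, and sideways conditions separately.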
This is where the LLM systems look especially shaky. The paper finds they are too cautious when markets are rising and too aggressive when markets are falling. That is almost impressively bad timing. The authors summarize it bluntly: current LLM strategies “miss upside in bull markets and incur heavy losses in bear markets.”
This result is important because it moves the discussion away from raw model complexity and toward something more practical: risk control. The authors argue that future systems should focus less on making the framework more elaborate and more on improving trend detection and regime-aware risk management. That is a very useful lesson beyond investing. In operations, the best AI system is not the one that sounds smartest. It is the one that adapts well when conditions change.
That is also why this paper is relevant for understanding AI for operations. Finance is basically an extreme operations environment: noisy data, changing conditions, delayed feedback, and costly mistakes. If an AI strategy cannot hold up under long-horizon testing, regime shifts, and realistic evaluation, it probably should not be trusted in other high-stakes operational systems either. This paper is really about a larger principle: before asking whether AI is powerful, ask whether the test was honest.
The main takeaways are simple.
First, many LLM-based investing claims look weaker once bias is reduced and testing becomes more realistic.
Second, long-term robustness matters more than short-term excitement.
Third, market awareness and risk controls seem more important than piling on architectural complexity.
And finally, this paper is a good reminder that evaluation design is not a side issue. In AI, it is often the whole story.
Paper: Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?