Will Large Language Models Propel Financial Statement Analysis Into a New Era?

Since the introduction of ChatGPT, large language models (LLMs) have completed a variety of tasks with impressive results, from summarizing complex documents to conducting sentiment analysis to extracting relevant information. But are LLMs up to even greater challenges like, say, making informed financial decisions?

In a new paper, we explore whether LLMs can handle financial statement analysis, a job traditionally performed by human professionals. Specifically, we investigate whether an LLM, such as GPT-4, can predict the direction of a company’s future earnings based on such an analysis. Though most financial analyses rely largely on narrative elements like management discussions or industry-specific information, we test LLMs’ capabilities using purely numerical data. Remarkably, we find that GPT-4 not only performs as well as state-of-the-art machine learning models, but also frequently outperforms humans in predicting earnings changes.

Can Generative AI Replicate Human Analysis?

Financial statement analysis involves examining a firm’s balance sheet and income statement to assess its financial health and predict its future performance. Typically, this requires a combination of skills such as quantitative analysis, critical thinking, and contextual understanding. To test whether GPT-4 could replicate this process, we provided it with (anonymized) financial statements and prompted it to analyze these documents and forecast future earnings changes.

To rigorously benchmark GPT-4’s performance, we compared it with human analysts and specialized machine learning models. The results were striking. Despite being provided with limited input data, GPT-4 achieved an accuracy of 60 percent in predicting the direction of future earnings when using a “Chain-of-Thought” (CoT) prompt that mimics human methods. This is significantly higher than the accuracy of the median financial analyst, which was between 53 and 57 percent in our sample. Indeed, GPT-4 outperformed analysts in situations where human biases or inefficiencies in processing information are more likely.
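To make the accuracy metric concrete, here is a minimal Python sketch of how directional accuracy can be computed: the share of firm-years in which the predicted direction (increase or decrease) matches the sign of the realized earnings change. The function name and the sample values are hypothetical and purely illustrative; they are not the study's data or code.

```python
from typing import Sequence


def directional_accuracy(predicted_up: Sequence[bool],
                         earnings_change: Sequence[float]) -> float:
    """Share of observations where the predicted direction matches
    the sign of the realized earnings change.

    Assumption for this sketch: a change of exactly zero counts as
    "not an increase".
    """
    if len(predicted_up) != len(earnings_change):
        raise ValueError("inputs must have the same length")
    if len(predicted_up) == 0:
        raise ValueError("inputs must not be empty")
    hits = sum(pred == (chg > 0)
               for pred, chg in zip(predicted_up, earnings_change))
    return hits / len(predicted_up)


# Hypothetical predictions and realized changes for five firm-years.
preds = [True, False, True, True, False]
changes = [0.12, -0.05, -0.02, 0.30, 0.01]
accuracy = directional_accuracy(preds, changes)
print(f"Directional accuracy: {accuracy:.0%}")
```

In this toy sample, three of the five predictions match the realized direction.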

The accuracy of GPT-4 was on par with that of an Artificial Neural Network (ANN) trained specifically for earnings prediction, which was also about 60 percent accurate. However, GPT-4 performed better in cases that are more difficult to evaluate, such as small or unprofitable companies, where human analysts often struggle. These results suggest that GPT-4’s predictions are not merely coincidental but based on meaningful insights generated from the data.

Asset pricing tests also show that GPT-4’s predictions outperform conventional benchmarks, underscoring the model’s potential utility in uncovering new information in financial markets.

Why Are LLMs Successful?

Our study suggests that GPT-4’s strength lies in its ability to generate meaningful narrative insights from raw numerical data. By analyzing trends in financial ratios and situating them within broader business concepts, GPT-4 mirrors the deductive reasoning process of humans. This unique capability allows GPT-4 to infer outcomes even from unfamiliar data patterns, a crucial skill in analysis where each company’s situation can be unique.

Moreover, our findings indicate that GPT-4’s predictions are not simply a product of its training data. Several tests suggest that our results are unlikely to be artifacts of the model’s memory. For example, we show that GPT-4’s ability to guess the identity or fiscal year of a company based on anonymized financial data is negligible, indicating that its predictive power stems from actual analysis rather than memorization. We also show that the predictive performance holds when the model is tested on data from outside its training window.

Practical Implications and Future Directions

The implications of our findings could be significant for the future of financial analysis. LLMs like GPT-4 could be directly useful for investors, financial advisors, and other stakeholders in making more informed decisions. The potential applications extend to even broader areas such as risk assessment and portfolio management. For instance, our study finds that using GPT-4’s forecasts to inform trading strategies results in a higher Sharpe ratio and significant alpha, suggesting economic value beyond traditional quantitative methods.
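For readers unfamiliar with the Sharpe ratio, the sketch below shows one common way to compute it: the mean excess return divided by its standard deviation, scaled to an annual figure. It assumes monthly excess returns (returns net of the risk-free rate) and uses made-up numbers; the function name, the monthly frequency, and the sample returns are illustrative assumptions, not the study's strategy or results.

```python
import math
import statistics
from typing import Sequence


def annualized_sharpe(excess_returns: Sequence[float],
                      periods_per_year: int = 12) -> float:
    """Annualized Sharpe ratio: mean excess return over its sample
    standard deviation, scaled by the square root of the number of
    periods per year (12 for monthly data, an assumption here)."""
    if len(excess_returns) < 2:
        raise ValueError("need at least two return observations")
    mean = statistics.fmean(excess_returns)
    stdev = statistics.stdev(excess_returns)  # sample standard deviation
    if stdev == 0:
        raise ValueError("returns have zero variance")
    return mean / stdev * math.sqrt(periods_per_year)


# Hypothetical monthly excess returns of a trading strategy.
monthly_excess = [0.01, 0.03, -0.01, 0.02, 0.00]
sharpe = annualized_sharpe(monthly_excess)
print(f"Annualized Sharpe ratio: {sharpe:.2f}")
```

A higher value means more return per unit of risk. Alpha, by contrast, would require regressing the strategy's returns on risk factors, which this sketch does not attempt.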

As generative AI continues to evolve, there is a compelling case for integrating LLMs into financial decision-making processes. Our research demonstrates that LLMs are no longer just support tools; they can perform analytical tasks on financial data at levels comparable to, and sometimes higher than, their human counterparts. This suggests that the role of AI in finance is likely to expand, potentially reshaping how financial analysis is conducted in the future.

This post comes to us from Alex G. Kim, Maximilian Muhn, and Valeri V. Nikolaev at the University of Chicago’s Booth School of Business. It is based on their recent paper, “Financial Statement Analysis with Large Language Models,” available here.