Financial Statement Analysis with Large Language Models

The University of Chicago, Booth School of Business

Research design

The authors provide standardised and anonymised financial statements (balance sheet and income statement) to GPT-4 and instruct the model to analyse them to determine the direction of future earnings. No narrative or industry-specific information is provided.

Comparison with human analysts

GPT-4's performance is compared to that of human financial analysts.

Strengths and weaknesses of LLM vs. human analysts

Human analysts add value through soft information and broader context that is not available to the model, whereas GPT-4's insights are most valuable when humans struggle to forecast or when analyst forecasts are prone to bias or inefficiency.

Comparison with specialized ML models

GPT-4's accuracy is on par with or slightly higher than machine learning models trained specifically for earnings prediction, including logistic regression and state-of-the-art artificial neural networks (ANNs). GPT-4 and ANNs are found to be complementary, with GPT-4 performing well when ANNs struggle, especially for small or loss-making companies.

Reasons for GPT-4's success

The authors rule out the hypothesis that GPT-4's performance is driven by its memory. Instead, they find that GPT-4 generates useful narrative insights, such as ratio analysis, which are informative about future performance. These narratives, derived from CoT reasoning, are responsible for the model's superior performance.

Economic usefulness

The authors demonstrate the economic usefulness of GPT-4's forecasts by analysing their value in predicting stock price movements. Long-short strategies based on GPT-4 forecasts outperform the market and generate significant alphas and Sharpe ratios, particularly for small companies.

Analysis

Implications for financial analysis

The study suggests that LLMs like GPT-4 can play a role in financial decision-making, potentially transforming the way financial statement analysis is performed and earnings forecasts are developed.

The ability of LLMs to generate valuable insights without relying on narrative context highlights their potential to complement and even outperform human analysts.

Limits of LLMs

The paper provides evidence on the ability of LLMs to excel in quantitative tasks that require intuition and human-like reasoning, extending their capabilities beyond their native textual domain.

The authors argue that this points towards the emergence of Artificial General Intelligence and that the boundaries of LLMs are broader than previously thought. I think this is a stretch, but that is their view.

Conceptual Underpinnings

Financial analysts' approach to earnings forecasting

  • Analysts begin with a systematic analysis of financial statements, often using standardised templates for consistency and accuracy.

  • They establish a baseline understanding of a company's financial position and performance by assessing factors such as operating performance and capital structure.

  • Analysts then contextualise the financial data by drawing upon their industry knowledge and private information about the firm before issuing forecasts. Their objective is to accurately forecast company earnings.

Limitations of human analysts

  • Despite generally outperforming time series models in producing credible annual earnings forecasts, financial analysts are naturally prone to errors and biases.

  • Analysts may make technical errors, questionable economic judgments, or overreact to recent events, highlighting the complexity of processing large volumes of data efficiently.

Potential of LLMs in financial statement analysis

  • General-purpose language models, such as ChatGPT, hold promise in facilitating financial statement analysis and associated tasks like earnings forecasting and decision-making.

  • LLMs are noted for their knowledge across various domains and ability to quickly and efficiently process large quantities of data.

  • They have demonstrated proficiency in answering CFA or CPA exam questions, processing large sets of financial data, and predicting certain economic outcomes.

Challenges faced by LLMs in financial statement analysis

  • Financial statement analysis is a broad task that requires common sense, intuition, reasoning, and judgment, whereas machines typically excel in narrow, well-defined tasks.

  • LLMs are not specifically trained to analyse financial information and have struggled with understanding the numeric domain.

  • Humans are more capable of incorporating their knowledge of broader context, such as soft information, industry knowledge, and regulatory, political, and macroeconomic factors.

Potential advantage of LLMs

  • LLMs' training on a vast body of general knowledge, encompassing business cases, financial theories, and economic contexts, may allow them to infer insights even from unfamiliar data patterns.

  • This broader theoretical foundation could provide an advantage in the complex domain of financial analysis, where human experience, intuition, and judgment are valuable.

The conceptual underpinnings section highlights the significance of financial statement analysis, the role of human analysts, and the potential challenges and opportunities for LLMs in this domain.

It sets the stage for the paper's investigation into whether an LLM can successfully perform financial statement analysis tasks at a level comparable to professional human analysts, despite the inherent challenges posed by the complex and judgment-based nature of the task.

Methodology and Data

The methodology and data section of the paper provides a detailed explanation of how the authors use a large language model (LLM), specifically GPT-4, to analyse financial statements and predict earnings changes.

Earnings prediction task

  • Earnings prediction is a complex task that combines qualitative and quantitative analyses and involves professional judgment.

  • The authors model how analysts make earnings predictions using a chain-of-thought (CoT) prompt with GPT-4.

  • They focus on a relatively narrow information set that includes numerical information reported on the face of two primary financial statements (balance sheet and income statement), without textual information or broader context.

  • This approach allows them to test the limits of the model when analysing financials and deriving insights from numeric data, which LLMs are not designed or trained to do.

Prompts for financial statement analysis (FSA) and earnings prediction

a. "Simple" prompt

  • Instructs the LLM to analyse the two financial statements of a company and determine the direction of future earnings.

  • Does not provide further guidance on how to approach the prediction task.

b. Chain-of-Thought (CoT) prompt

Breaks the problem down into steps that parallel those followed by human analysts, embedding the methodology in the prompt and guiding the model to mimic human-like reasoning.

  • Instructs the model to take on the role of a financial analyst and perform financial statement analysis by:

i. Identifying notable changes in certain financial statement items.

ii. Computing key financial ratios without explicitly limiting the set of ratios.

iii. Providing economic interpretations of the computed ratios.

Based on the quantitative information and insights, the model is instructed to predict whether earnings are likely to increase or decrease in the subsequent period and produce a paragraph elaborating its rationale.

The model is also prompted to provide the predicted magnitude of earnings change (large, moderate, or small) and the confidence in its answer (ranging from zero to one).
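To make this structure concrete, here is a minimal sketch of how such a chain-of-thought prompt could be assembled in Python. The wording and the `build_cot_prompt` helper are illustrative assumptions, not the authors' exact prompt.

```python
# Hypothetical sketch of a CoT prompt mirroring the steps described above.
# The wording is illustrative, not the authors' exact prompt.

def build_cot_prompt(balance_sheet: str, income_statement: str) -> str:
    """Assemble a chain-of-thought prompt for directional earnings prediction."""
    return f"""
You are a financial analyst. You are given a firm's standardised, anonymised
balance sheet and income statement. Perform a financial statement analysis:

1. Identify notable changes in key financial statement items.
2. Compute the financial ratios you consider relevant (do not limit yourself
   to a fixed list) and show the calculations.
3. Provide economic interpretations of the computed ratios.
4. Based on the above, predict whether earnings will increase or decrease in
   the next fiscal year, and write a short paragraph explaining your rationale.
5. State the expected magnitude of the change (large, moderate, or small) and
   your confidence in the prediction on a scale from 0 to 1.

Balance sheet:
{balance_sheet}

Income statement:
{income_statement}
"""
```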

GPT-4 configuration

  • The authors use gpt-4-0125-preview, the most updated GPT model by OpenAI at the time of their experiment.

  • Temperature parameter is set to zero to ensure minimal variability in the model's responses.

  • Max tokens are not specified, and the top-p sampling parameter is set to one.

  • The logprobs option is enabled to obtain token-level log-probability values (a minimal request sketch follows this list).
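A minimal request sketch using these settings, assuming the OpenAI Python SDK (v1.x); the prompt variable and output handling are illustrative, and this is not the authors' actual code.

```python
# Illustrative request using the settings described above (assumes openai>=1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

cot_prompt = "..."  # the chain-of-thought prompt assembled as in the earlier sketch

response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[{"role": "user", "content": cot_prompt}],
    temperature=0,   # minimise variability across runs
    top_p=1,         # no nucleus-sampling truncation
    logprobs=True,   # return token-level log probabilities
)

prediction = response.choices[0].message.content
token_logprobs = [t.logprob for t in response.choices[0].logprobs.content]
```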

Data:

Compustat annual financial data (1968-2021)

  • The authors use the entire universe of Compustat annual financial data from 1968 to 2021 fiscal years.

  • They set aside data for 2022 to predict 2023 fiscal year earnings to test the robustness of the model's performance outside GPT's training window (ending in April 2023).

  • Filters are applied to ensure data quality and consistency, resulting in 150,678 observations from 15,401 distinct firms.

  • For each firm-year, the balance sheet and income statement are reconstructed using Compustat data, following Capital IQ's balancing model, and any identifying information is omitted.

IBES data (1983-2021)

  • For the analysis involving analyst forecasts, the authors use data from IBES, starting the sample in 1983.

  • Individual forecasts are extracted, and monthly consensus forecasts are constructed.

  • The sample is restricted to firm-years with at least three analyst forecasts, resulting in 39,533 firm-year observations (a construction sketch follows this list).
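A hedged sketch of how such monthly consensus forecasts and the three-analyst filter could be constructed with pandas; the column names and figures are hypothetical and do not correspond to actual IBES field names.

```python
import pandas as pd

# Hypothetical individual analyst forecasts; column names are illustrative.
forecasts = pd.DataFrame({
    "firm_id":        [1, 1, 1, 1, 2, 2],
    "fiscal_year":    [2020] * 6,
    "forecast_month": [1, 1, 2, 2, 1, 1],
    "eps_forecast":   [1.10, 1.20, 1.15, 1.18, 0.50, 0.55],
})

# Monthly consensus: median forecast per firm-year-month.
consensus = (forecasts
             .groupby(["firm_id", "fiscal_year", "forecast_month"], as_index=False)
             .agg(consensus_eps=("eps_forecast", "median")))

# Keep firm-years covered by at least three individual forecasts, as in the paper's filter.
counts = (forecasts.groupby(["firm_id", "fiscal_year"], as_index=False)
          .agg(n_forecasts=("eps_forecast", "count")))
consensus = consensus.merge(counts, on=["firm_id", "fiscal_year"])
consensus = consensus[consensus["n_forecasts"] >= 3]
print(consensus)
```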

Descriptive statistics

  • Panel A describes the full sample (1968-2021), revealing that approximately 55.5% of observations report an actual increase in earnings (Target), while GPT prediction (Pred GPT) implies an average of 53.0% of observations will experience an earnings increase.

  • Panel B is restricted to the analyst sample (1983-2021) and includes analyst forecasts issued within one, three, and six months from the previous year's earnings release.

  • Compared to GPT, financial analysts tend to be slightly more pessimistic in their forecasts.

  • Companies in the Analyst Sample are, on average, larger in size, have a lower book-to-market ratio, higher leverage, and lower earnings volatility compared to the full sample, but are similar in terms of the actual frequency of EPS increases.

The methodology and data section highlights the authors' approach to using GPT-4 for financial statement analysis and earnings prediction, focusing on a narrow information set to test the model's limits.

The use of both simple and CoT prompts allows for a comparison of the model's performance with and without guided reasoning. The data from Compustat and IBES provides a comprehensive sample for testing the model's predictions and comparing them to human analysts' forecasts.

Performance versus the analysts

The paper compares the performance of GPT-4 in predicting the direction of future earnings based on financial statement analysis to that of financial analysts.

The authors use several methods to evaluate the model's performance and explore the complementarity between human analysts and GPT. Here's a critical assessment of the main results:

Prediction accuracy

  • GPT-4 with a simple prompt achieves an accuracy of 52.33% and an F1-score of 54.52% (see the metric sketch after this list), which is on par with the first-month consensus forecasts by financial analysts following the earnings release.

  • When using chain-of-thought (CoT) prompts, GPT-4 achieves an accuracy of 60.35%, outperforming analyst predictions by 7 percentage points, even without access to narrative or contextual information available to analysts.

  • While the results are impressive, it's important to note that the paper does not provide a detailed explanation of how the CoT prompts were designed or optimised.

  • The specific instructions used in the CoT prompts could have a significant impact on the model's performance, and more transparency in this regard would strengthen the findings.
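For reference, the directional accuracy and F1-score reported above can be computed from binary up/down labels as in the minimal scikit-learn sketch below; the labels shown are illustrative, not the paper's data.

```python
# Illustrative evaluation of directional predictions against realised outcomes.
from sklearn.metrics import accuracy_score, f1_score

# 1 = earnings increased, 0 = earnings decreased (illustrative labels)
actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 1, 0, 1]

print(f"accuracy: {accuracy_score(actual, predicted):.4f}")
print(f"F1-score: {f1_score(actual, predicted):.4f}")
```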

Complementarity between human analysts and GPT

The authors explore instances where forecasts are erroneous and find that GPT-4's predictions are more likely to be inaccurate for smaller firms, firms with higher leverage ratios, loss-making firms, and firms with volatile earnings.

However, the magnitude of these effects is smaller for human analysts, suggesting that they benefit from access to soft information and additional context.

  • The incremental informativeness analysis shows that both GPT-4 and analyst forecasts are positively associated with future outcomes, and their combined use improves the adjusted R-squared, indicating complementarity (a regression sketch follows this list).

  • While these findings are valuable, the paper does not look into the specific types of soft information or context that human analysts might rely on. A more detailed discussion of these factors could provide a clearer understanding of the relative strengths and weaknesses of GPT-4 and human analysts.
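A hedged sketch of one way such an incremental-informativeness comparison could be run with statsmodels: regress the realised earnings-change indicator on each forecast separately and then jointly, and compare adjusted R-squared. The data and column names are illustrative, not the authors' specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical firm-year panel; column names are illustrative.
df = pd.DataFrame({
    "target":       [1, 0, 1, 1, 0, 1, 0, 1, 0, 0],  # 1 if EPS rose next year
    "pred_gpt":     [1, 0, 0, 1, 0, 1, 1, 1, 0, 0],
    "pred_analyst": [1, 1, 1, 1, 0, 0, 0, 1, 0, 1],
})

r2 = {}
for name, formula in {
    "GPT only":     "target ~ pred_gpt",
    "Analyst only": "target ~ pred_analyst",
    "Combined":     "target ~ pred_gpt + pred_analyst",
}.items():
    r2[name] = smf.ols(formula, data=df).fit().rsquared_adj

# Complementarity would show up as the combined model's adjusted R² exceeding either alone.
print(r2)
```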

Overall, the paper presents compelling evidence that GPT-4 can outperform human analysts in predicting the direction of future earnings based on financial statement analysis, even without access to the same level of contextual information.

The authors also demonstrate the complementarity between GPT-4 and human analysts, highlighting the potential for the model to add value in situations where humans struggle.

However, the paper could benefit from more transparency in the design of the CoT prompts and a more detailed discussion of the specific factors that contribute to the relative strengths and weaknesses of GPT-4 and human analysts.

Additionally, the authors could explore the potential limitations of GPT-4, such as its ability to handle novel or unusual financial situations that may not be well-represented in its training data.

Despite these limitations, the paper makes a significant contribution to the literature on the application of large language models in financial analysis and provides a strong foundation for future research in this area.

Predictive Ability

This section aims to understand the sources of GPT-4's predictive ability and explores two potential explanations: the model's memory and its ability to generate narrative insights based on numeric data. The authors attempt to rule out the possibility of look-ahead bias and investigate whether the model's generated texts are informative.

Here are some potential biases and threats to validity that could undermine the experimental findings:

Look-ahead bias

  • The authors argue that their research design is relatively immune to look-ahead bias because they use a consistent anonymised format for financial statements, making it difficult for the model to infer a firm's identity or the specific year.

  • However, there might be subtle patterns or combinations of financial ratios that are unique to certain companies or industries, which GPT-4 could potentially recognise based on its training data. If this is the case, the model's predictive ability could be overstated.

  • To further investigate this, the authors could conduct additional experiments with synthetic financial data that preserve the overall statistical properties of the original data but break any potential links to specific companies or industries, as sketched below.
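One simple way to build such synthetic statements is sketched below: rescaling each firm's statements by a single random factor preserves every ratio and year-over-year trend (the signal GPT-4 is supposed to use) while breaking the link between reported magnitudes and any real firm. This is my illustration, not a procedure from the paper, and the figures are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def synthesise(statements: pd.DataFrame) -> pd.DataFrame:
    """Rescale one firm's line items (rows) by fiscal year (columns) with a single
    random factor, so absolute magnitudes no longer match any real firm while
    every ratio and year-over-year trend is preserved."""
    return statements * rng.uniform(0.3, 3.0)

# Hypothetical two-year income statement excerpt (figures are made up).
example = pd.DataFrame({"2020": [1000.0, 600.0, 400.0], "2021": [1150.0, 700.0, 450.0]},
                       index=["revenue", "cost_of_goods_sold", "gross_profit"])
print(synthesise(example))
```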

Experiment design

  • The authors use chain-of-thought (CoT) prompts to guide GPT-4 in analysing financial statements, which could inadvertently introduce bias in the model's predictions.

  • The specific instructions provided in the CoT prompts might steer the model towards focusing on certain financial ratios or trends that are already known to be predictive of future earnings, thus inflating its measured performance (although, from a practical standpoint, such guidance would be welcome).

  • To address this concern, the authors could experiment with different sets of CoT prompts that vary in their level of specificity and guidance to assess the sensitivity of the results to the prompt design.

Selection bias

  • The paper uses the entire universe of Compustat annual financial data from 1968 to 2021, which could introduce selection bias if the dataset is not representative of the broader population of companies.

  • If GPT-4's training data overrepresents certain types of companies or industries that are more predictable, the model's performance might be overstated.

  • To mitigate this issue, the authors could conduct robustness tests using alternative datasets or by stratifying the sample based on company characteristics to ensure that the results are not driven by specific subsets of the data.

Temporal bias

  • The authors find that GPT-4's predictive accuracy decreases over time, with sharp drops during international macroeconomic downturns.

  • If the model's training data is skewed towards more recent years or if it overrepresents certain economic conditions, its predictive ability might not generalise well to other time periods or market environments.

  • To address this concern, the authors could perform additional tests using rolling windows or by explicitly controlling for macroeconomic factors to assess the stability of the model's performance across different time periods, as sketched below.
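A minimal sketch of such a temporal-stability check, computing directional accuracy by fiscal year and smoothing it with a rolling window; the data and column names are illustrative.

```python
import pandas as pd

# Hypothetical per-observation predictions; columns are illustrative.
preds = pd.DataFrame({
    "fiscal_year": [1995, 1995, 1996, 1996, 1997, 1997, 1998, 1998, 1999, 1999],
    "target":      [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "pred_gpt":    [1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
})

yearly = (preds.assign(correct=(preds["target"] == preds["pred_gpt"]).astype(int))
               .groupby("fiscal_year")["correct"].mean())

# Three-year rolling directional accuracy highlights any temporal drift.
print(yearly.rolling(window=3).mean())
```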

Information leakage

  • While the authors aim to use only numerical data from financial statements, there is a risk that the process of extracting and preprocessing the data could inadvertently introduce information leakage.

  • For example, if the data preparation process involves any form of normalisation or scaling based on future information, it could artificially inflate the model's predictive performance.

  • To mitigate this risk, the authors should carefully review their data preprocessing pipeline and ensure that all transformations are based solely on information available at the time of prediction.

Confounding factors

  • The authors argue that GPT-4's predictive ability stems from its capacity to generate narrative insights based on numeric data.

  • However, there might be confounding factors that drive both the model's generated texts and its predictive performance, such as the underlying quality or complexity of the financial statements.

  • To disentangle these effects, the authors could control for various measures of financial statement quality or complexity in their analyses and assess whether GPT-4's predictive ability persists after accounting for these factors.

While the authors have taken steps to address potential biases and alternative explanations, further tests and robustness checks could help strengthen the validity of their findings. By considering and addressing these potential issues, the paper can provide more convincing evidence of GPT-4's ability to generate valuable insights from financial statements and its potential for enhancing financial analysis.

Trading Strategy Performance

In this section, the authors investigate the practical value of GPT-4-based financial statement analysis by evaluating the performance of trading strategies based on the model's output.

They argue that if GPT-4's forecasts contain incremental information about future profitability, they should also predict future stock returns.

The authors compare the performance of three types of strategies: one based on GPT-4 forecasts, and two others based on artificial neural network (ANN) and logistic regression forecasts that rely on numeric information.

Here's a detailed explanation of the methodology and results:

Methodology:

a. Portfolio formation:

  • The authors form portfolios on June 30 of each year, allowing approximately three months for the market to process the reported financial information, and hold the portfolios for one year.

  • For ANN and logistic regression strategies, stocks are sorted into ten portfolios based on the predicted probabilities of earnings increase. The strategies take long positions in the top decile stocks and short positions in the bottom decile.

  • For GPT-4 strategies, the authors use binary directional predictions, magnitude predictions, and average log probabilities of tokens to form portfolios. They select stocks predicted to experience a "moderate" or "large" earnings increase, sort them by log probability values, and retain the top 10% with the highest expected confidence for long positions. Similarly, they select stocks predicted to experience a "moderate" or "large" earnings decrease, sort them by log probability values, and short the top 10% with the highest expected confidence (see the selection sketch after this list).
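A hedged sketch of the GPT-based selection rule described above; the column names and the tiny sample are illustrative, and details such as the June 30 formation date and the one-year holding period are handled outside this snippet.

```python
import pandas as pd

# Hypothetical GPT output per stock; column names are illustrative.
preds = pd.DataFrame({
    "ticker":      ["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"],
    "direction":   ["increase", "increase", "decrease", "decrease", "increase", "decrease"],
    "magnitude":   ["large", "moderate", "small", "large", "moderate", "moderate"],
    "avg_logprob": [-0.05, -0.20, -0.60, -0.10, -0.35, -0.15],  # higher = more confident
})

# Keep only "moderate" or "large" predicted changes, as described above.
strong = preds[preds["magnitude"].isin(["large", "moderate"])]
inc = strong[strong["direction"] == "increase"]
dec = strong[strong["direction"] == "decrease"]

# Long the most confident 10% of predicted increases; short the most confident 10% of decreases.
longs  = inc.nlargest(max(1, int(0.10 * len(inc))), "avg_logprob")
shorts = dec.nlargest(max(1, int(0.10 * len(dec))), "avg_logprob")
print(longs["ticker"].tolist(), shorts["ticker"].tolist())
```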

b. Performance evaluation:

  • The authors compute Sharpe ratios for equal-weighted and value-weighted portfolios, with monthly rebalancing for the latter.

  • They also calculate monthly alphas for each strategy based on five different factor models, ranging from the Capital Asset Pricing Model (CAPM) to the Fama-French five-factor model plus momentum (a minimal sketch of both measures follows this list).
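A minimal sketch of the two performance measures, the annualised Sharpe ratio and a factor-model alpha, using statsmodels on randomly generated monthly data; the factor column names follow the usual Fama-French-plus-momentum convention, but the data are purely illustrative and this is not the authors' code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 120  # ten years of hypothetical monthly observations

# Hypothetical long-short excess returns and factor returns; column names are illustrative.
factors = pd.DataFrame(rng.normal(0.0, 0.03, size=(n, 6)),
                       columns=["mktrf", "smb", "hml", "rmw", "cma", "umd"])
factors["ls_ret"] = 0.008 + 0.2 * factors["mktrf"] + rng.normal(0.0, 0.02, size=n)

# Annualised Sharpe ratio of the long-short portfolio.
sharpe = np.sqrt(12) * factors["ls_ret"].mean() / factors["ls_ret"].std()

# Monthly alpha from the Fama-French five-factor model plus momentum.
fit = smf.ols("ls_ret ~ mktrf + smb + hml + rmw + cma + umd", data=factors).fit()
alpha_bps = fit.params["Intercept"] * 1e4

print(f"Sharpe ratio: {sharpe:.2f}, monthly alpha: {alpha_bps:.0f} bps")
```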

Results

a. Sharpe ratios:

  • Equal-weighted portfolios based on GPT-4 predictions achieve a Sharpe ratio of 3.36, substantially higher than ANN-based (2.54) and logistic regression-based (2.05) portfolios.

  • For value-weighted portfolios, ANN performs better (Sharpe ratio of 1.79) than GPT-4 (1.47), with both outperforming logistic regressions (0.81).

b. Alphas:

  • Equal-weighted portfolios generate higher alphas in general.

  • After controlling for five factors and momentum, equal-weighted portfolios based on GPT-4 predictions generate a monthly alpha of 84 basis points (10% annually), higher than ANN (60 basis points) and logistic regression (43 basis points) strategies.

  • For value-weighted portfolios, ANN-based portfolios perform better than GPT-4, with monthly alphas of 50 basis points and 37 basis points, respectively, after controlling for five factors and momentum.

c. Cumulative returns:

  • The authors plot the cumulative log returns of equal-weighted portfolios based on GPT-4 predictions from 1968 to 2021, showing that the long portfolio substantially outperforms the short portfolio.

  • The long-short portfolio consistently outperforms the market portfolio, even when the market experiences negative cumulative returns.

The results demonstrate the potential value of GPT-4-based fundamental analysis in stock markets.

The stronger performance of GPT-4 compared to ANN for equal-weighted strategies and the weaker performance for value-weighted strategies suggest that GPT-4 may have an advantage in uncovering value in smaller stocks.

This finding is consistent with the authors' earlier results showing that GPT-4 appears to have an edge in analysing smaller and relatively more volatile companies.

Overall, this section provides compelling evidence for the practical utility of GPT-4-based financial statement analysis in generating profitable trading strategies.

The authors' approach to portfolio formation and performance evaluation is comprehensive and well-justified, making a strong case for the potential of large language models in enhancing investment decision-making.

Example Output

The example output provided by GPT-4 in Appendix C demonstrates the model's step-by-step approach to analysing financial statements and predicting future earnings.

Let's break down the analysis:

Panel A. Trend Analysis:

  • GPT-4 identifies significant trends in the company's financial performance over the past three years.

  • The model notes a consistent upward trend in sales, indicating strong market demand for the company's products or services.

  • However, the cost of goods sold has also increased substantially, potentially affecting profitability.

  • GPT-4 observes that the gross profit has increased, albeit at a slower pace, suggesting that the company has been able to maintain a degree of pricing power or cost efficiency.

Panel B. Ratio Analysis:

  • GPT-4 calculates and interprets key financial ratios to assess the company's performance.

  • The model notes an improvement in the operating margin, which indicates better cost management or increased efficiency.

  • GPT-4 also points out the increase in the asset turnover ratio, suggesting improved efficiency in utilising assets (the ratio definitions are sketched after this list).

  • The model highlights a relative decline in sales efficiency, which could be a concern and may require further investigation.

  • GPT-4 concludes that the company shows potential for improved profitability if cost management is maintained.
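For reference, the two ratios GPT-4 highlights are straightforward to compute; the figures in the sketch below are made up for illustration and are not taken from the paper's example output.

```python
# Illustrative ratio calculations (the figures are made up, not from the paper's example).
revenue          = 1_200.0
operating_income =   180.0
total_assets     = 1_500.0

operating_margin = operating_income / revenue   # profitability of core operations
asset_turnover   = revenue / total_assets       # efficiency of asset utilisation

print(f"operating margin: {operating_margin:.1%}, asset turnover: {asset_turnover:.2f}x")
```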

Panel C. Rationale:

  • GPT-4 synthesises the insights from the trend and ratio analyses to make a prediction about future earnings.

  • Based on the observed revenue growth trend and the improvement in operating margin, the model expects the company to continue managing its operating expenses effectively.

  • GPT-4 introduces some uncertainty into the prediction by acknowledging that the magnitude of EPS growth depends on the company's ability to maintain a degree of pricing power or cost efficiency.

  • The model predicts a "moderate" change in EPS, showing potential for improved profitability, with a prediction certainty of 0.7.

The example output demonstrates GPT-4's ability to perform a structured analysis of financial statements, identifying key trends, calculating and interpreting relevant ratios, and synthesising the information to make a prediction about future earnings.

The model's reasoning closely resembles the thought process of a human analyst, considering both quantitative and qualitative factors.

Interestingly, GPT-4 not only provides a binary prediction (increase or decrease) but also offers insights into the magnitude of the expected change and the level of certainty associated with its prediction. This additional information could be valuable for decision-makers in assessing the reliability and potential impact of the model's predictions.

Overall, the example output showcases GPT-4's capability to generate meaningful and nuanced insights from financial statements, highlighting its potential as a tool for enhancing financial analysis and decision-making.

The model's step-by-step approach and clear articulation of its reasoning make the output interpretable and actionable, which are crucial factors in the practical application of AI-based financial analysis.
