Can Large Language Models Reason and Plan?
The answer according to this research is 'no'
The March 2024 paper, Can Large Language Models Reason and Plan? by Subbarao Kambhampati, examines whether Large Language Models (LLMs) can perform planning and reasoning tasks traditionally associated with higher cognitive functions.
Despite LLMs' impressive linguistic capabilities, the author argues they are essentially sophisticated n-gram models that perform approximate retrieval rather than principled reasoning.
This paper reinforces our view that generative AI is an augmentation to human work and endeavour, not a replacement.
The study tested LLMs such as GPT-3, GPT-3.5, and GPT-4 using planning instances from the International Planning Competition and found that, although the accuracy of generated plans improved across model versions, the results were far from definitive evidence of genuine planning capability.
The paper distinguishes between LLMs generating correct responses through memorisation or pattern recognition and performing actual reasoning. To probe this further, the study obfuscated the planning problems, replacing familiar object and action names with meaningless tokens while keeping the underlying problem structure intact. This significantly reduced GPT-4's performance, suggesting reliance on retrieval rather than planning.
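To make that test concrete, here is a minimal sketch of how such an obfuscation could be constructed: symbol names in a toy Blocks World problem are consistently replaced with random tokens, so the logical structure a planner needs is untouched while the surface patterns a retrieval-based model could latch onto are destroyed. The PDDL-style snippet and helper names below are illustrative assumptions, not the paper's actual test material.

```python
import random
import re
import string

# A toy PDDL-style Blocks World fragment (illustrative, not the paper's exact domain).
problem = """(:objects red-block blue-block green-block)
(:init (on-table red-block) (on blue-block red-block) (clear blue-block))
(:goal (on green-block blue-block))"""

def gibberish(rng: random.Random, length: int = 8) -> str:
    """Produce a meaningless identifier with no surface-level associations."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def obfuscate(text: str, names: list[str], seed: int = 0) -> tuple[str, dict[str, str]]:
    """Consistently rename the given symbols, preserving the problem's structure."""
    rng = random.Random(seed)
    # Replace longer names first so e.g. "on-table" is not clobbered by "on".
    mapping = {name: gibberish(rng) for name in sorted(names, key=len, reverse=True)}
    for original, replacement in mapping.items():
        text = re.sub(rf"\b{re.escape(original)}\b", replacement, text)
    return text, mapping

names = ["red-block", "blue-block", "green-block", "on-table", "on", "clear"]
obfuscated, mapping = obfuscate(problem, names)
print(obfuscated)  # same logical problem, unfamiliar surface form
```

A system that genuinely plans should handle the original and the renamed version equally well; the sharp drop the paper reports for GPT-4 is what points towards retrieval rather than reasoning.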
Two methods were explored to potentially enhance LLMs' planning performance: fine-tuning on planning data and prompting with hints or feedback from external verifiers. Fine-tuning, however, did not yield significant improvement, suggesting that it merely produces better approximate retrieval rather than genuine planning.
The paper advocates for a framework where LLMs' generative capabilities are combined with external verifiers to ensure the correctness and soundness of the planning and reasoning outputs, a setup referred to as LLM-Modulo frameworks.
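As a rough illustration of what such a setup looks like in code, the skeleton below implements the basic generate-test loop: an LLM proposes candidate plans, and a sound external verifier accepts or rejects them, feeding its critiques back into the next prompt. The propose_plan and verify_plan callables are placeholder names of my own (in practice an LLM API call and something like a classical plan validator); the paper describes the architecture, not this particular interface.

```python
from typing import Callable, Optional

def llm_modulo_solve(
    problem: str,
    propose_plan: Callable[[str, list[str]], str],        # LLM: problem + prior critiques -> candidate plan
    verify_plan: Callable[[str, str], tuple[bool, str]],   # external verifier: (problem, plan) -> (valid?, critique)
    max_rounds: int = 5,
) -> Optional[str]:
    """Generate-test loop: the LLM is used only as an idea generator;
    correctness is certified by an external, sound verifier."""
    critiques: list[str] = []
    for _ in range(max_rounds):
        candidate = propose_plan(problem, critiques)
        valid, critique = verify_plan(problem, candidate)
        if valid:
            return candidate           # soundness comes from the verifier, not the LLM
        critiques.append(critique)     # back-prompt the LLM with the verifier's feedback
    return None                        # no verified plan found within the budget
```

The design point is that the LLM only needs to be a good guesser; the guarantee of correctness lives entirely in the verifier, which could be an automated plan checker or a human expert, in contrast to the self-verification approach discussed below.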
The study concludes that while LLMs exhibit some level of problem-solving capability, their performance in planning and reasoning tasks is largely based on approximate retrieval and memorisation, not genuine reasoning or planning as traditionally understood in AI.
The paper critiques the common practice of iterative prompting with a human in the loop, which can produce a Clever Hans effect, where the human's steering, rather than the LLM's own reasoning, guides the model to the correct outcome.
This approach is contrasted with self-improvement methods in which LLMs critique and refine their own outputs. However, the author finds that such self-verification can actually worsen performance, because LLMs produce both false positives and false negatives when judging their own candidate solutions.
The author instead suggests an LLM-Modulo framework in which LLMs generate potential solutions that are vetted by external verifiers or expert humans, ensuring a sound outcome. The paper also reflects on the broader implications of LLMs in AI, suggesting they can serve as knowledge sources for domain-specific information, a role previously filled by human knowledge engineers.
In summary, while LLMs demonstrate some level of problem-solving ability, their effectiveness in planning and reasoning is largely attributed to their retrieval capabilities rather than genuine reasoning or planning.
The LLM-Modulo framework is proposed as a principled way to leverage LLMs' idea generation for reasoning tasks, supported by external verification to ensure accuracy and soundness.