Steering Large Language Models between Code Execution and Textual Reasoning

Bibliographic Details
Published in: arXiv.org
Main Authors: Chen, Yongchao; Jhamtani, Harsh; Sharma, Srinagesh; Fan, Chuchu; Wang, Chi
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 04.10.2024
Online Access: Get full text

Abstract: While much recent research focuses on enhancing the textual reasoning capabilities of Large Language Models (LLMs) by optimizing multi-agent frameworks or reasoning chains, several benchmark tasks can be solved with 100% success through direct coding, which is more scalable and avoids the computational overhead of textual iterating and searching. Textual reasoning has inherent limitations on tasks involving math, logic, optimization, and search, and these limitations are unlikely to be overcome simply by scaling up model and data size. The recently released OpenAI GPT Code Interpreter and multi-agent frameworks such as AutoGen have demonstrated remarkable proficiency at integrating code generation and execution to solve complex tasks with LLMs. However, based on our experiments with 7 existing popular methods for steering code/text generation in both single- and multi-turn settings, across 14 tasks and 6 types of LLMs (including the new o1-preview), there is currently no optimal method for correctly steering LLMs to write code when needed. We discover interesting patterns in when models use code versus textual reasoning as task complexity and model size vary, which even give rise to an astonishing inverse scaling law. We also find that results from LLM-written code are not always better than those from textual reasoning, even when the task could be solved through code. To mitigate these issues, we propose three methods that better steer LLM code/text generation and achieve notable improvements. The token-length and runtime costs of all methods are discussed in detail. We believe that steering LLM code/text generation is a critical problem for future research, with much room for further improvement. The project page, datasets, and code are available at https://yongchao98.github.io/CodeSteer/.
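The code-versus-text routing that the abstract describes can be pictured, under broad assumptions, as a dispatcher that executes a fenced code block when the model emits one and otherwise falls back to the textual answer. This is a minimal illustrative sketch only, not the paper's three proposed steering methods, whose details are in the full text:

```python
import re
import io
import contextlib

def steer(response: str) -> str:
    """Route an LLM response between the two answer modes:
    execute any fenced Python block it contains (code path),
    otherwise return the text as-is (textual-reasoning path).
    Illustrative sketch; exec() is safe only for trusted input."""
    match = re.search(r"```python\n(.*?)```", response, re.DOTALL)
    if match is None:
        return response.strip()  # textual reasoning path
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(match.group(1), {})  # code-execution path
    return buffer.getvalue().strip()

# A task like "sum of the first 100 integers", where code is exact:
coded = "```python\nprint(sum(range(1, 101)))\n```"
print(steer(coded))                  # -> 5050
print(steer("The answer is 5050."))  # -> The answer is 5050.
```

The sketch shows why routing matters: on arithmetic-style tasks the code path is exact by construction, while the text path simply trusts whatever the model wrote.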
Copyright: 2024. This work is published under http://creativecommons.org/publicdomain/zero/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Discipline: Physics
EISSN: 2331-8422
Genre: Working Paper/Pre-Print
Ingest Date: Thu Oct 10 20:58:31 EDT 2024
Open Access: true
Peer Reviewed: false
Scholarly: false
Subject Terms: Large language models; Multiagent systems; Optimization; Reasoning; Scaling laws; Searching; Steering; Task complexity
URI: https://www.proquest.com/docview/3113849961