Communications of the ACM - Artificial Intelligence, December 5, 2024
On Program Synthesis and Large Language Models

This article examines views that large language models (LLMs) such as ChatGPT have sparked in software development, in particular the debate over whether programming will become obsolete. Some argue that LLM-based AI systems will replace conventional programming and that future software will mainly be trained rather than programmed. The article rebuts this view, noting that generating correct program code is, in theory, a complex and hard problem constrained by computational complexity theory. Even existing program-synthesis methods have limitations: some are approximative, some consume excessive resources, and some apply only to narrow domains. The author argues that the idea of using natural language (such as English) as a programming language is not new, and that the semantic ambiguity of natural language makes it unsuitable for program analysis and verification. Although LLMs can sometimes generate reasonable code, they also risk producing meaningless code, because they are trained on existing code bases. The end of programming, therefore, has not arrived, and computer science retains its research value.

🤔 **The rise of large language models (LLMs) has sparked debate about the future of programming, with some arguing that programming will be replaced by AI systems and that software will be built by training rather than programming.** On this view, most software other than simple programs will consist of AI systems, and even those simple programs will be generated automatically by AI, making programming unnecessary.

💻 **From the standpoint of computational complexity theory, generating correct program code is a hard problem.** For example, deciding whether there exists a term of the simply typed λ-calculus with a given type is PSPACE-complete, which means any algorithm for the problem is likely to need exponential running time. Likewise, generating programs from specifications written as fully quantified Boolean formulae (QBF) faces the same PSPACE-complete obstacle.

⚠️ **Existing program-synthesis methods all have limitations and cannot fully solve the program-generation problem.** Some are approximative and cannot guarantee a program satisfying the full specification; others consume excessive resources; still others handle only small specification languages or narrow domains. None of these methods was ever hailed as "the end of programming"; they are simply tools in the program-development toolbox.

🗣️ **The idea of using natural language (such as English) as a programming language is not new, and natural language is semantically ambiguous.** Its semantic complexity makes it unsuitable for precise program specification and verification. Research on program analysis and verification in computer science centers on formal specification languages, not natural language.

💡 **Although large language models (LLMs) can sometimes generate reasonable code, they also risk producing meaningless code.** Because LLMs are trained on existing code bases, their output is limited by the training data, and neither the correctness nor the reliability of the generated code can be guaranteed.

Much has been made of the abilities of the new developments in machine intelligence and in particular of what chatbots such as ChatGPT that are based on large language models (LLMs) are capable of. While these new pieces of software are impressive when it comes to generating text, some people in the computing community take this observation much further and, in my opinion, much too far. They claim programming will be a thing of the past. In a January 2023 Communications column, Matt Welsh put forward this opinion: “Programming will be obsolete. I believe the conventional idea of ‘writing a program’ is headed for extinction, and indeed, for all but very specialized applications, most software, as we know it, will be replaced by AI systems that are trained rather than programmed. In situations where one needs a ‘simple’ program (after all, not everything should require a model of hundreds of billions of parameters running on a cluster of GPUs), those programs will, themselves, be generated by an AI rather than coded by hand.”14 This quote contains two claims. First, that most software in the future that is not “simple” will take the form of AI systems. Secondly, that any software not of this form will be automatically generated. A consequence of this rather sweeping claim is that since there is no need to program, there is no need to study programming or properties of programs. Computer science can finally go away!

The first claim made by Welsh can never be refuted. For what does it mean that a piece of software is “simple”? The phrasing of the claim largely implies any piece of software that does not take the form of an AI system is simple.

I will therefore not concentrate on this but remark that unless one is of the opinion that systems software such as operating systems, game engines, and SaaS platforms constitute “simple software,” the first statement implies that these fundamental examples of software can be replaced by AI systems. This is in itself dubious.

The second claim is of a different nature and concerns the generation of program code. In an earlier opinion piece in Communications, Daniel Yellin addressed the shortcomings of the use of LLMs from the point of view of software development practice.15 In this Opinion column, I will instead focus on the fundamental limitations imposed by theorems from the theory of computation and what they tell us about generating program code from specification. The conclusion is the same: The end of programming is not nigh.

Programming in English?

Around the same time as Welsh, Andrej Karpathy wrote on Twitter: “The hottest new programming language is English.”7 This idea did not arise with any of the current software that uses LLMs. The idea that AI software using LLMs can be used for “programming in a natural language” dates back to at least July 2020, when GPT-3 came into existence and Osama Qarem wrote: “Using GPT-3 will still be programming. It will be coding in a language that uses English as its syntax. Let’s consider that for a moment: Human spoken languages are less concise, less clear and more prone to misunderstanding. People will need to practice the right nouns, adjectives, verbs, and so forth to use when talking to GPT-3 to get the outcome they want. You would need to learn to “debug” your English sentences according to how GPT-3 understands it, and not according to what you really meant.”12

But the idea is much older. In the FORTRAN report from 1954,1 the authors claimed that “FORTRAN should virtually eliminate coding and debugging.” As we know, this did not happen. John Backus, the primary author of FORTRAN, instead went on to design a succession of influential programming languages and this work earned him the ACM A.M. Turing Award. A decade later, in 1966, Jean Sammet wrote in Communications that English should be the programming language of the future.11 This also did not happen. Edsger W. Dijkstra considered the idea “foolish.”4

On the other hand, we now see widely published examples of seemingly correct program code generated by English-language queries to ChatGPT and many are impressed by this. One may be tempted to think that in this day and age we are finally witnessing the end of programming and the advent of English (or natural language in general) as the main language for creating programs.

However, generating program code from specifications is not a trivial task. Even simple versions of the approach are intractable in the very precise sense of computational complexity theory if we are concerned about generating code that is correct.

Generating Correct Program Code Is Hard

The problem of generating correct program code from a specification is well known in computer science and in fact it is a central one. It is the problem of program synthesis. Gulwani et al. write6 “Program Synthesis is the task of automatically finding programs from the underlying programming language that satisfy user intent expressed in some form of constraints. Unlike typical compilers that translate a fully specified high-level code to low-level machine representation using a syntax-directed translation, program synthesizers typically perform some form of search over the space of programs to generate a program that is consistent with a variety of constraints (for example, input-output examples, demonstrations, natural language, partial programs, and assertions).”

A lot is already known about the synthesis problem in its various guises, and in particular we know it is hard, where “hard” means computationally hard in the sense of computational complexity theory.

In 1979, Richard Statman proved13 that the problem of deciding whether there exists some term of the simply typed λ-calculus that has a given type τ is PSPACE-complete.
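Statman's problem, type inhabitation, asks whether a term of a given type exists. The depth-limited search below is a minimal sketch of such a synthesis procedure; the tuple encoding of types and all names are my own illustration, not taken from the literature, and the depth bound makes the search incomplete.

```python
# Depth-limited inhabitation search for the simply typed λ-calculus.
# Types are base-type names (strings) or arrows ("->", source, target).

def inhabit(ctx, goal, depth=8, fresh=[0]):
    """Return a λ-term (as a string) of type `goal` in context `ctx`,
    or None if none is found within the depth bound."""
    if depth == 0:
        return None
    # Arrow goal: introduce a fresh variable (λ-abstraction).
    if isinstance(goal, tuple):
        _, src, tgt = goal
        fresh[0] += 1  # mutable default used as a name counter
        x = f"x{fresh[0]}"
        body = inhabit(ctx + [(x, src)], tgt, depth - 1, fresh)
        return None if body is None else f"(\\{x}. {body})"
    # Base-type goal: try each hypothesis whose result type matches it.
    for name, ty in ctx:
        args = []
        while isinstance(ty, tuple):  # peel σ1 -> ... -> σk -> b
            args.append(ty[1])
            ty = ty[2]
        if ty != goal:
            continue
        term, ok = name, True
        for a in args:  # recursively inhabit each argument type σi
            sub = inhabit(ctx, a, depth - 1, fresh)
            if sub is None:
                ok = False
                break
            term = f"({term} {sub})"
        if ok:
            return term
    return None

arrow = lambda s, t: ("->", s, t)
# (a -> b) -> a -> b is inhabited (by application itself):
print(inhabit([], arrow(arrow("a", "b"), arrow("a", "b"))))
# a -> b is not inhabited in the empty context:
print(inhabit([], arrow("a", "b")))
```

Even this toy version exhibits the search-over-programs character that Gulwani et al. describe, and the worst-case cost of the search is where the PSPACE-hardness bites.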

The exact resource requirements for decision algorithms for such problems remain an open problem in computational complexity theory. However, the general consensus is that a decision algorithm for a PSPACE-complete problem will have a running time exponential in the size of the input. So for even moderately large inputs, any decision algorithm for a PSPACE-complete problem is likely to be very slow. For a specification S of size n, an algorithm will need on the order of 2^n steps to produce a program satisfying S, if such a program exists.
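A quick back-of-the-envelope computation shows what 2^n means in practice; the step rate of 10^9 per second is an optimistic assumption for illustration only:

```python
# Exponential running time in the size n of the specification:
# each extra symbol roughly doubles the work.
for n in (10, 30, 50, 70):
    steps = 2 ** n
    seconds = steps / 1e9  # assuming an optimistic 10^9 steps/second
    print(f"n = {n:2d}: ~{steps:.1e} steps, ~{seconds:.1e} s")
```

Already at n = 70 the running time is on the order of 10^12 seconds, tens of thousands of years.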

One could instead use a form of quantified propositional logic, fully quantified Boolean formulae (QBF), as a specification language and ask if a given formula ϕ has a model. If the logical formula speaks about programs, this model will be a program. However, this problem is also PSPACE-complete, as shown by Meyer.9 A consequence of these theorems from the mathematical theory of computation, and many others of the same kind, is that it is extremely likely that any algorithm for program synthesis of non-trivial program properties will require an unreasonable amount of resources.
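A fully quantified Boolean formula can be decided by a simple recursion over its quantifiers, which makes the character of the problem visible: polynomial space, but worst-case time exponential in the number of quantifiers. This is a minimal sketch with an illustrative tuple encoding of my own:

```python
# Recursive evaluator for fully quantified Boolean formulae (QBF).
# Formula nodes: ("var", name), ("not", f), ("and", f, g), ("or", f, g),
# ("forall", name, f), ("exists", name, f).

def qbf_true(f, env=None):
    """Decide the truth of a closed QBF."""
    env = env or {}
    tag = f[0]
    if tag == "var":
        return env[f[1]]
    if tag == "not":
        return not qbf_true(f[1], env)
    if tag == "and":
        return qbf_true(f[1], env) and qbf_true(f[2], env)
    if tag == "or":
        return qbf_true(f[1], env) or qbf_true(f[2], env)
    if tag == "forall":  # both assignments of the variable must succeed
        return all(qbf_true(f[2], {**env, f[1]: b}) for b in (False, True))
    if tag == "exists":  # some assignment of the variable must succeed
        return any(qbf_true(f[2], {**env, f[1]: b}) for b in (False, True))
    raise ValueError(f"unknown node {tag!r}")

# ∀x. ∃y. (x ∨ y) ∧ (¬x ∨ ¬y): for every x, a y different from x exists.
phi = ("forall", "x", ("exists", "y",
       ("and", ("or", ("var", "x"), ("var", "y")),
               ("or", ("not", ("var", "x")), ("not", ("var", "y"))))))
print(qbf_true(phi))  # → True
```

Each quantifier doubles the number of recursive branches, so a prefix of n quantifiers yields up to 2^n evaluations of the matrix, exactly the blow-up the complexity results predict.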

The End Is Not Nigh

It should not come as a surprise that computer science has not given up on program synthesis. Far from it. Gulwani et al.6 give a comprehensive survey of existing approaches to program synthesis. All of these approaches have limitations.

Some approaches to program synthesis are approximative. This means that they will sometimes not be able to construct a program that satisfies the full specification given. This is the case for tools for static program analysis—and moreover, such tools will always consider particular program properties (such as non-interference or liveness) only.
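A toy example of this kind of approximation is a sign analysis for arithmetic: it is sound, but it must sometimes answer "unknown" even when the concrete result is determined. The encoding below is my own illustration, not a real analysis tool:

```python
# Abstract sign analysis: values are "+", "-", "0", or "T" (unknown).

def sign(n):
    """Abstract a concrete integer to its sign."""
    return "+" if n > 0 else "-" if n < 0 else "0"

def abstract_add(a, b):
    """Add two abstract signs."""
    if a == "0": return b
    if b == "0": return a
    if a == b:   return a
    return "T"  # "+" plus "-": the analysis must give up

print(abstract_add(sign(3), sign(5)))   # → '+'
print(abstract_add(sign(3), sign(-5)))  # → 'T', although 3 + (-5) < 0
```

The second answer is correct but imprecise: the analysis trades completeness for a guaranteed, cheap answer, which is the bargain all approximative methods strike.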

Other approaches will sometimes require an unreasonable amount of resources. In some cases the synthesis will require a lot of memory or a lot of time in order to build a program. This is the case for approaches that use SMT solvers and for forms of type-based development such as those that are based on the Hindley-Milner type system.10
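The engine at the core of Hindley-Milner-style inference is first-order unification over type terms. The sketch below is a minimal illustration (my own encoding, and it omits the occurs check that a real implementation needs); the resource problems mentioned above arise when let-polymorphism is layered on top of this:

```python
# First-order unification. Type variables are strings starting with "?";
# other types are names ("int") or constructors like ("->", s, t).

def resolve(t, subst):
    """Chase a variable through the substitution to its binding."""
    while isinstance(t, str) and t.startswith("?") and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst=None):
    """Return a substitution making a and b equal, or None on clash."""
    subst = dict(subst or {})
    stack = [(a, b)]
    while stack:
        s, t = (resolve(x, subst) for x in stack.pop())
        if s == t:
            continue
        if isinstance(s, str) and s.startswith("?"):
            subst[s] = t           # bind the variable (no occurs check)
        elif isinstance(t, str) and t.startswith("?"):
            subst[t] = s
        elif (isinstance(s, tuple) and isinstance(t, tuple)
              and s[0] == t[0] and len(s) == len(t)):
            stack.extend(zip(s[1:], t[1:]))  # unify argument-wise
        else:
            return None            # clash, e.g. "int" vs ("->", ...)
    return subst

# ?a -> int against bool -> ?b: maps ?a to "bool" and ?b to "int".
print(unify(("->", "?a", "int"), ("->", "bool", "?b")))
```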

Yet other approaches can only handle specifications written in a small language for which synthesis is possible within reasonable resource constraints or within narrow problem domains. This is the case for tools for generating compilers and interpreters, where the use of regular expressions and attribute grammars allows those who design, define and implement programming languages to specify the expected behavior of an interpreter or a compiler.
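The scanner half of that narrow-domain story can be shown in a few lines: the entire lexer is derived from a declarative table of regular expressions, in the style of generated scanners. Token names and the grammar are illustrative; a real scanner would also report unmatched input:

```python
# A lexer for arithmetic expressions specified purely by regexes.
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("OP",     r"[+\-*/()]"),
    ("SKIP",   r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    """Yield (kind, text) pairs; the scanner is generated from the
    declarative table above, not hand-coded per token."""
    for m in TOKEN_RE.finditer(src):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("12 + 3*4")))
# → [('NUMBER', '12'), ('OP', '+'), ('NUMBER', '3'),
#    ('OP', '*'), ('NUMBER', '4')]
```

Because regular expressions are a small, decidable specification language, this kind of synthesis stays within reasonable resource bounds, which is precisely why the approach works only in such narrow domains.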

Not only are all of these approaches specialized, they also share one important characteristic: They were never hailed as a sign of “the end of programming” (let alone the end of computer science!) but were rightfully viewed as another important tool in the toolbox of program development. It is unreasonable to expect the generative powers of ChatGPT and similar pieces of software to surpass these limitations, let alone to provide a general tool for program synthesis.

But why is English not the way out? Because we know that existing formal specification languages are (at least up to a choice of syntax) sublanguages of English and probably every other natural language used today (including my native language, Danish), so the problem of program synthesis can be no easier for English (or Danish) than for these quite simple specification languages.

English has the disadvantage of being semantically ambiguous so one has to be extremely careful, as many software developers will know. There is a reason why English-language specifications are not central to the research areas of program analysis and program verification in computer science.

Some may now argue that LLMs can still become a useful tool under certain circumstances. And indeed anyone who has used AI software of this kind in the setting of programming will have seen that sometimes it can produce reasonable code that appears to be correct. At other times, the AI software will spout program code that is meaningless as well as useless.

It is no coincidence that we see this behavior, as the LLMs are trained on existing code. GitHub Copilot is trained on code that appears in public repositories on GitHub.5 This means that whatever code is being generated will reflect the style of coding that the chatbot was exposed to during the training phase—which in itself is worth noting as a limitation of this particular approach.

Even if chatbots should become a useful tool for programmers, their existence cannot mean the end of programming. It may well be the case that this combination of a built-in repository of code examples and the ability to easily retrieve code examples based on it can be used for generating ideas and for forms of “pair programming” as recently studied by Bird et al.2

But this is not program synthesis, rather it gives software using LLMs the role of more modest and possibly useful tools. We are now seeing a flurry of interest in combining traditional approaches to program analysis with LLMs, and if anything, this points to the limitations inherent in both approaches.3,8

In my opinion, the real use of natural language in software development lies in the exploratory dialogue that occurs between software developers and the end users of the software being developed. Here, the goal is to eventually understand the requirements so they can be transformed into running program code that meets the requirements. Here, too, software based on LLMs may have a role as a shared tool for facilitating a dialogue, but whatever this is, programming it is not.

The important problem of program correctness is still outstanding and remains one of the most important problems in computer science.
