Programming Contest Yields Surprising Results

Posted by Rob Knies

Microsoft Research Blog

What if they held a programming competition involving hundreds of teams—and everybody won?

That might seem improbable, but when viewed through the prism of the contest’s value to its participants, that was precisely what happened during the ICFP Programming Contest 2013.

Four Japanese teams and a Russian team claimed top honors during the contest, held over a 72-hour period from Aug. 8 to 11. The event, held in conjunction with the Association for Computing Machinery’s International Conference on Functional Programming, which took place in Boston from Sept. 25 to 27, was organized by two members of the Research in Software Engineering (RiSE) team at Microsoft Research Redmond.

The winning teams were announced on Sept. 26. But in reality, given the wily effectiveness of the triumphant entries, the true winner was the field of programming research itself.

“The winning teams crafted very special-purpose, hand-tuned solutions,” explains Nikhil Swamy, one of the contest organizers, along with RiSE colleague Michał Moskal, “and were able to vastly outperform the general-purpose solutions that are the state of the art in the research literature.”

The contest, first held in 1998, was the 16th in an annual series, generally hosted by academic institutions such as Carnegie Mellon University in the United States, Utrecht University in the Netherlands, Leipzig University of Applied Science in Germany, Tohoku University in Japan, and the University of St. Andrews in Scotland. Microsoft Research is the contest’s first industrial organizer.

Competing entirely over the Internet, more than 300 teams of programmers from around the world were asked to complete a series of programming tasks, using any programming languages and tools they desired, to address an extremely challenging scenario in program synthesis. Results were assessed using Microsoft Research’s Z3 theorem prover, running on Windows Azure, which compared submitted solutions against the actual secret programs to determine correctness.

“It was quite a task to get the system running and, more importantly, scalable,” Moskal says. “In the spirit of the contest, we used six different programming languages—the cloud service was written in TypeScript, a statically typed language compiling to JavaScript. Parts of the web page were in plain JavaScript. I also did a spiffy front end in TouchDevelop, a web-based environment we work on that lets you program on a smartphone, a tablet, or a laptop. Nikhil wrote the parser and equivalence checker for solutions in F#, which is particularly suitable for such symbol-manipulation tasks. The equivalence checker generated formulas in the SMT-LIB2 language, which were then passed to Z3, itself written in C++.”
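As a sketch of what such a generated formula might look like, the helper below is a hypothetical Python illustration (not the contest's actual F# checker): it builds an SMT-LIB2 query asking whether two 64-bit bitvector expressions can ever disagree. If Z3 answers `unsat`, the two programs are equivalent on every input.

```python
# Hypothetical sketch: build an SMT-LIB2 query asking whether two candidate
# programs (given as SMT bitvector expressions over the input `x`) can ever
# disagree. An `unsat` answer from Z3 means the programs are equivalent.
def equivalence_query(expr_a: str, expr_b: str) -> str:
    return "\n".join([
        "(set-logic QF_BV)",
        "(declare-const x (_ BitVec 64))",
        # Satisfiable iff some 64-bit input distinguishes the two programs.
        f"(assert (not (= {expr_a} {expr_b})))",
        "(check-sat)",
    ])

# Example pair: shifting left by one versus multiplying by two -- these agree
# on all 64-bit inputs, so Z3 would report unsat for this query.
query = equivalence_query("(bvshl x #x0000000000000001)",
                          "(bvmul x #x0000000000000002)")
print(query)
```

The actual checker would feed this text to a Z3 process and read back `sat` (a counterexample input exists) or `unsat` (the guess is correct).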

The contest reflected a resurgence of interest in program synthesis.

“The problem we posed to the contestants is one that is under active research in the computer-science community,” Swamy says. “The program-synthesis problem has been studied for several decades and has seen some recent advancements and increased interest in the last five years or so.”

The generic problem description was “Guess the implementation of a black-box function implemented in a simple functional language through querying a web server for information about the input/output behavior of the function.” The motivation for the contest came from ongoing research in the RiSE group on Pex/Pex4Fun, a tool that automatically synthesizes tests for C# programs, developed by Nikolai Tillmann and Peli de Halleux, and on work in program synthesis pioneered by Sumit Gulwani and his RiSE colleagues, whose related project, Flash Fill, became a heralded new feature in Excel 2013.
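A minimal sketch of that black-box setting, using an illustrative grammar of unary bitvector operators rather than the contest's actual language: enumerate small programs in order of size and keep the first one that matches the oracle on every probed input.

```python
from itertools import product

# Toy sketch of the contest's black-box setting (the grammar is illustrative,
# not the contest's actual language): enumerate compositions of small unary
# 64-bit operators and return the first one that matches the oracle on every
# probed input.
UNARY_OPS = {
    "not":  lambda v: ~v & 0xFFFFFFFFFFFFFFFF,
    "shl1": lambda v: (v << 1) & 0xFFFFFFFFFFFFFFFF,
    "shr1": lambda v: v >> 1,
    "shr4": lambda v: v >> 4,
}

def synthesize(oracle, probes, max_depth=3):
    """Return a list of op names whose left-to-right composition matches
    `oracle` on all probe inputs, or None if none is found."""
    for depth in range(1, max_depth + 1):
        for ops in product(UNARY_OPS, repeat=depth):
            def run(v, ops=ops):
                for name in ops:
                    v = UNARY_OPS[name](v)
                return v
            if all(run(x) == oracle(x) for x in probes):
                return list(ops)
    return None

# Secret program: shift left once, then complement.
secret = lambda v: ~((v << 1) & 0xFFFFFFFFFFFFFFFF) & 0xFFFFFFFFFFFFFFFF
print(synthesize(secret, [0, 1, 0xDEADBEEF, 2**63]))  # -> ['shl1', 'not']
```

The search space here grows exponentially with program depth, which is exactly why the general-purpose approaches discussed below hit a wall at modest program sizes.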

“For the class of problems we posed,” Swamy says, “recent research papers report that state-of-the-art general-purpose solutions are able to correctly synthesize programs that are on the order of 16 to 18 instructions long within about 30 minutes [on a difficulty progression that scales exponentially with the size of the problem, meaning that a problem of size 17 is at least twice as hard as a problem of size 16]. We had considered that 16-to-18 number to be close to the upper bound of what contestants would be able to solve.”
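The bracketed point about exponential scaling can be illustrated with a back-of-the-envelope count. Assuming a toy grammar with 2 leaves, 4 unary operators, and 4 binary operators (illustrative numbers, not the contest's grammar), the number of syntax trees of a given size obeys a simple recurrence, and each extra node multiplies the candidate space severalfold:

```python
from functools import lru_cache

# Back-of-the-envelope sketch (grammar sizes are illustrative): count syntax
# trees of size n built from 2 leaves, 4 unary operators, and 4 binary
# operators, using the recurrence
#   T(1) = 2
#   T(n) = 4*T(n-1) + 4 * sum_{i=1}^{n-2} T(i) * T(n-1-i)
@lru_cache(maxsize=None)
def trees(n: int) -> int:
    if n == 1:
        return 2
    total = 4 * trees(n - 1)                                  # unary root
    total += 4 * sum(trees(i) * trees(n - 1 - i)              # binary root
                     for i in range(1, n - 1))
    return total

# The per-node growth factor stays well above 2, so size 17 is indeed
# "at least twice as hard" as size 16 for a brute-force enumerator.
for n in (10, 16, 17):
    print(n, trees(n), round(trees(n) / trees(n - 1), 1))
```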

The organizers, though, were quite surprised—and pleasantly so.

“The winning team was successfully synthesizing programs of size 40 and above in less than five minutes,” Swamy reveals, “by studying the structure of training problems that were made available and ingeniously exploiting particular characteristics of the problems we posed.”

Gauging the performance of the competing teams, some of which numbered as many as 20 people, was not a simple task.

“Determining whether the guessed program B is equal to the secret program A is a difficult technical problem,” Swamy confirms. “In fact, this problem is impossible to solve in general for arbitrary programs, a classic result in computer science going back to the work of Kurt Gödel and Alan Turing. However, for certain special classes of programs, deciding the equivalence of A and B is possible to do, though it can be computationally very expensive.

“Recent work from the RiSE group, in particular the Z3 theorem prover, is designed to be able to solve these kinds of problems efficiently. Without a solver like Z3, it would be impossible to consider setting a problem of this sort to the contestants, because there would be no way to evaluate and grade their solutions effectively.”
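One practical complement to a solver, sketched here purely as an illustration (not a description of the organizers' actual pipeline): probe the two programs on random inputs first. Any disagreement is a cheap, concrete counterexample, while agreement on many probes proves nothing and still requires a solver such as Z3 for the final verdict.

```python
import random

# Illustrative sketch: before calling an expensive solver, probe two candidate
# programs on random 64-bit inputs. A disagreement yields a concrete
# counterexample; agreement on every probe is NOT a proof of equivalence.
def find_counterexample(prog_a, prog_b, trials=10_000, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.getrandbits(64)
        if prog_a(x) != prog_b(x):
            return x          # witness input on which the programs differ
    return None               # inconclusive: hand the pair to the solver

mask = 0xFFFFFFFFFFFFFFFF
# x << 1 and x * 2 agree on every 64-bit input (mod 2**64)...
print(find_counterexample(lambda x: (x << 1) & mask,
                          lambda x: (x * 2) & mask))   # -> None
# ...while x >> 1 and x >> 4 are trivially easy to tell apart.
print(find_counterexample(lambda x: x >> 1,
                          lambda x: x >> 4) is not None)  # -> True
```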

Over the contest’s 72 hours, Z3 received about a million requests and successfully decided all but about 300 problem instances within a RiSE-imposed time limit of 20 seconds, the overwhelming majority in a matter of milliseconds.

“To scale our use of Z3 to handle simultaneous requests from all the teams, we built a web service on Azure’s elastic computing platform with the ability to run Z3 on a large number of cores simultaneously,” Swamy says. “At peak usage, we were receiving about 40 requests per second, and these were easily being handled by Z3 on Azure using about 23 of the allocated 128 cores.”

One goal of the contest is to demonstrate the capabilities of the contestants’ favorite programming languages and tools. Over the years, contest winners have used Haskell, OCaml, C++, Cilk, Java, and F#. This year’s 72-hour winner, Unagi — The Synthesis, used Java, C#, C++, PHP, Shell, Ruby, and Haskell while running their solution on Amazon’s EC2 cloud.

In addition, the winning team can claim that its preferred language is “the programming tool of choice for discriminating hackers.”

Beyond such levity, though, the competition could signal significant implications for the program-synthesis field.

“By exploiting the structure of the problem,” Swamy says, “Unagi was able to do about 1 million times better than general-purpose solutions! That is very remarkable indeed.

“It suggests, for example, that if there are program fragments that are very important to optimize—to make more efficient or to make correct—then handcrafted program-synthesis solutions using great computing power—Unagi used about a thousand hours of computing time—may be a viable way to proceed.”