Coffee: Boost Your Code LLMs by Fixing Bugs with Feedback

🔔News

♻ [2024-02-21]: We have updated our paper and released the second version of our draft on Arxiv!

🛎[2024-02-22]: We have made our code available on GitHub! Check out our Github repository for more details.

✨[2023-12-24]: We have open-sourced the CoffeePots-Critic Model.

🍪[2023-12-24]: We have open-sourced the CoffeePots-Editor Model.

📑[2023-11-19]: We have uploaded the first version of our preprint to arxiv [Link to our paper]

☕[2023-11-11]: We have open-sourced Coffee dataset used in our project named Coffee.

Introduction

Code editing is an essential step towards reliable program synthesis to automatically correct critical errors generated from code LLMs. Recent studies have demonstrated that closed-source LLMs (i.e., ChatGPT and GPT-4) are capable of generating corrective feedback to edit erroneous inputs. However, it remains challenging for open-source code LLMs to generate feedback for code editing, since these models tend to adhere to the superficial formats of feedback and provide feedback with misleading information. Hence, the focus of our work is to leverage open-source code LLMs to generate helpful feedback with correct guidance for code editing. To this end, we present Coffee, a collected dataset specifically designed for code fixing with feedback. Using this dataset, we construct CoffeePots, a framework for COde Fixing with FEEdback via Preference-Optimized Tuning and Selection. The proposed framework aims to automatically generate helpful feedback for code editing while minimizing the potential risk of superficial feedback. The combination of Coffee and CoffeePots marks a significant advancement, achieving state-of-the-art performance on HumanEvalFix benchmark.

Overview

An example instance from COFFEE dataset.

Recent large language models have demonstrated promising capabilities in correcting their codes based on natural language feedback. However, this ability is currently only applicable to open-source models (e.g., GPT-3.5-Turbo and GPT-4), and not to closed-source models. This poses a significant safety and privacy concern as the codes cannot be uploaded to any external servers (e.g., OpenAI API server).

To this end, we introduce ☕COFFEE, a dataset for code editing with feedback. Our dataset includes diverse solutions to programming problems collected from an online competitive programming platform. For each solution, we additionally annotate natural language feedback to provide detailed explanations for the errors towards correct edits, and augment synthetic test cases to measure the correctness of the edited solutions.

Is Training on COFFEE with Supervised Finetuning Enough?

Pass@1 results of code editing with SFT feedback on the test set of COFFEE compared with editing with ChatGPT feedback and editing without any feedback settings.

No.

As we show in the figure above, we find that training a critic model that generates natural language feedback on a given erroneous code with next token prediction obejct cannot produce any helpful deedback. Editing code with feedback from SFT critic shows performance even worse than direct editing (i.e., editing w/o feedback). We posit that learning to generate accurate feedback is a very challenging goal that cannot be achieved only with next-token prediction object.

Human evaluation on the quality of feedback from SFT critic.

Further analysis of the feedback quality suggests that the feedback from the SFT-trained critic has not yet reached a satisfactory level. It remains only as 'Partially Correct' feedback, resulting the decreased performance in code editing as we show in the previous analysis.

Overview of COFFEEPOTs.

To resolve the aforementioned issue, we introduce COFFEEPOTS, a framework for COde Fixing with FEEdback via Preference-Optimized Tuning and Selection. We first use our COFFEE dataset to train code LLMs via supervised fine-tuning (SFT) for feedback augmented code editing. Then, we additionally leverage synthetic test cases in COFFEE to annotate preferred (i.e., helpful) solutions and apply preference alignment to guide the generation of helpful feedback.

Performances in editing machine-generated codes. We report pass@1 and ERR (in parentheses). We use ChatGPT (the first row) to generate codes for problems from several benchmark datasets for code generation.

Table above reports the model performance in editing solutions generated from ChatGPT for problems in HumanEvalSynthesize, MBPP, and APPS. CoffeePots outperforms all open-source baselines, including Code Llama (13B), the previous SOTA among open-source code LLMs. Furthermore, CoffeePots shows better results than feedback-augmented Code Llama (13B), i.e., prompted with Self-Refine and Self-Debug, suggesting the effectiveness of our strategy on generating feedback. In addition, while some open-source code LLMs show almost no improvement in MBPP and APPS (i.e., 0% ERR), CoffeePots shows moderate improvements on these benchmarks (i.e., up to 7.5% ERR). Compared to closed-source baselines (i.e., ChatGPT), CoffeePots achieves competitive results particularly on HumanEvalSynthesize and MBPP, showing that our framework can serve as a strong alternative to closed-source LLMs while being publicly available and much smaller in size.

Example

BibTeX


        @article{Anonymized,
        title={Coffee: Boost your code llms by fixing bugs with feedback},
        author={Anonymized},
        journal={Anonymized},
        year={2023}
        }

Coffee

Boost Your Code LLMs by Fixing Bugs with Feedback

Note that this project page is fully anonymized. Some links might not be available due to anonymization.

🔔News

Introduction

Coffee: dataset for COde Fixing with FEEdback

Overview

COFFEEPOTS: Aligning Feedback with Preferred Edits

Is Training on COFFEE with Supervised Finetuning Enough?

No.

Main Results

Example

BibTeX