
Commit 6c8fa1d

deploy: 0686b5b
1 parent: 5b75e44 · commit: 6c8fa1d

27 files changed, +4269 −3980 lines changed

paper-abstracts.json

Lines changed: 1 addition & 0 deletions
Original file line number → Diff line number
@@ -294,6 +294,7 @@
294 → 294
{"key": "nguyen2020suggesting", "year": "2020", "title":"Suggesting Natural Method Names to Check Name Consistencies", "abstract": "<p>Misleading names of the methods in a project or the APIs in a software library confuse developers about program functionality\nand API usages, leading to API misuses and defects. In this paper,we introduce MNire, a machine learning approach to check the\nconsistency between the name of a given method and its implementation. MNire first generates a candidate name and compares the\ncurrent name against it. If the two names are sufficiently similar, we consider the method as consistent. To generate the method name,\nwe draw our ideas and intuition from an empirical study on the nature of method names in a large dataset. Our key finding is that\nhigh proportions of the tokens of method names can be found in the three contexts of a given method including its body,\nthe interface (the method’s parameter types and return type), and the enclosing class’ name. Even when such tokens are not there,\nMNire uses the contexts to predict the tokens due to the high likelihoods of their co-occurrences. Our unique idea is to treat\nthe name generation as an abstract summarization on the tokens collected from the names of the program entities in the three\nabove contexts.</p>\n\n<p>We conducted several experiments to evaluate MNire in method name consistency checking and in method name\nrecommending on large datasets with +14M methods. In detecting inconsistency method names, MNire improves the state-of-the-art\napproach by 10.4% and 11% relatively in recall and precision, respectively. In method name recommendation, MNire improves relatively\nover the state-of-the-art technique, code2vec, in both recall (18.2% higher) and precision (11.1% higher). To assess MNire’s usefulness,\nwe used it to detect inconsistent methods and suggest new names in several active, GitHub projects. We made 50 pull requests (PRs) and received\n42 responses. Among them, five PRs were merged into the main branch, and 13 were approved for later merging. In total, in 31/42 cases,\nthe developer teams agree that our suggested names are more meaningful than the current names, showing MNire’s usefulness.</p>\n", "tags": ["naming"] },
295 → 295
{"key": "nie2021evaluation", "year": "2021", "title":"Impact of Evaluation Methodologies on Code Summarization", "abstract": "<p>There has been a growing interest in developing machine learning (ML) models for code summarization tasks, e.g., comment generation and method naming. Despite substantial increase in the effectiveness of ML models, the evaluation methodologies, i.e., the way people split datasets into training, validation, and test sets, were not well studied. Specifically, no prior work on code summarization considered the timestamps of code and comments during evaluation. This may lead to evaluations that are inconsistent with the intended use cases. In this paper, we introduce the time-segmented evaluation methodology, which is novel to the code summarization research community, and compare it with the mixed-project and cross-project methodologies that have been commonly used. Each methodology can be mapped to some use cases, and the time-segmented methodology should be adopted in the evaluation of ML models for code summarization. To assess the impact of methodologies, we collect a dataset of (code, comment) pairs with timestamps to train and evaluate several recent ML models for code summarization. Our experiments show that different methodologies lead to conflicting evaluation results. We invite the community to expand the set of methodologies used in evaluations.</p>\n", "tags": ["evaluation","dataset"] },
296 → 296
{"key": "nijkamp2022conversational", "year": "2022", "title":"A Conversational Paradigm for Program Synthesis", "abstract": "<p>Program synthesis strives to generate a computer program as a solution to a given problem specification. We propose a conversational program synthesis approach via large language models, which addresses the challenges of searching over a vast program space and user intent specification faced in prior approaches. Our new approach casts the process of writing a specification and program as a multi-turn conversation between a user and a system. It treats program synthesis as a sequence prediction problem, in which the specification is expressed in natural language and the desired program is conditionally sampled. We train a family of large language models, called CodeGen, on natural language and programming language data. With weak supervision in the data and the scaling up of data size and model size, conversational capacities emerge from the simple autoregressive language modeling. To study the model behavior on conversational program synthesis, we develop a multi-turn programming benchmark (MTPB), where solving each problem requires multi-step synthesis via multi-turn conversation between the user and the model. Our findings show the emergence of conversational capabilities and the effectiveness of the proposed conversational program synthesis paradigm. In addition, our model CodeGen (with up to 16B parameters trained on TPU-v4) outperforms OpenAI’s Codex on the HumanEval benchmark. We make the training library JaxFormer including checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.</p>\n", "tags": ["Transformer","synthesis"] },
297 (added)
{"key": "nijkamp2023codegen2", "year": "2023", "title":"CodeGen2: Lessons for Training LLMs on Programming and Natural Languages", "abstract": "<p>Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly.</p>\n\n<p>In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and, (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a “free lunch” hypothesis. For data distributions, the effect of a mixture distribution of programming and natural languages on model performance is explored.</p>\n\n<p>We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into four lessons. We will provide a final recipe for training and release CodeGen2 models in size 1B, 3.7B, 7B, and, 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen2</p>\n", "tags": ["Transformer"] },
297 → 298
{"key": "nitin2021direct", "year": "2021", "title":"DIRECT : A Transformer-based Model for Decompiled Identifier Renaming", "abstract": "<p>Decompiling binary executables to high-level code is an important step in reverse engineering scenarios, such as malware analysis and legacy code maintenance. However, the generated high-level code is difficult to understand since the original variable names are lost. In this paper, we leverage transformer models to reconstruct the original variable names from decompiled code. Inherent differences between code and natural language present certain challenges in applying conventional transformer-based architectures to variable name recovery. We propose DIRECT, a novel transformer-based architecture customized specifically for the task at hand. We evaluate our model on a dataset of decompiled functions and find that DIRECT outperforms the previous state-of-the-art model by up to 20%. We also present ablation studies evaluating the impact of each of our modifications. We make the source code of DIRECT available to encourage reproducible research.</p>\n", "tags": ["Transformer","decompilation"] },
298 → 299
{"key": "niu2022spt-code", "year": "2022", "title":"SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations", "abstract": "<p>Recent years have seen the successful application of large pre-trained modelsto code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding theirapplication to SE tasks. First, the majority of the pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existingpre-training tasks require a bilingual corpus composed of source code and the associated natural language description, which severely limits the amount of data for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstreamt asks. Experimental results demonstrate that SPT-Code achieves state-of-the-artperformance on five code-related downstream tasks after fine-tuning.</p>\n", "tags": ["Transformer","representation"] },
299 → 300
{"key": "nye2021program", "year": "2021", "title":"Program Synthesis with Large Language Models", "abstract": "<p>This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model’s ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model’s initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.</p>\n", "tags": ["Transformer","synthesis"] },
