
Commit 4f0ce75

deploy: 1e320c7
1 parent c045003 commit 4f0ce75

19 files changed: +4,186 −3,883 lines

paper-abstracts.json

Lines changed: 1 addition & 0 deletions
@@ -235,6 +235,7 @@
235 235
{"key": "li2022codereviewer", "year": "2022", "title":"CodeReviewer: Pre-Training for Automating Code Review Activities", "abstract": "<p>Code review is an essential part of the software development lifecycle since it aims at guaranteeing the quality of code. Modern code review activities necessitate developers viewing, understanding and even running the programs to assess logic, functionality, latency, style and other factors. It turns out that developers have to spend far too much time reviewing the code of their peers. Accordingly, it is in significant demand to automate the code review process. In this research, we focus on utilizing pre-training techniques for the tasks in the code review scenario. We collect a large-scale dataset of real world code changes and code reviews from open-source projects in nine of the most popular programming languages. To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review scenario. To evaluate our model, we focus on three key tasks related to code review activities, including code change quality estimation, review comment generation and code refinement. Furthermore, we establish a high-quality benchmark dataset based on our collected data for these three tasks and conduct comprehensive experiments on it. The experimental results demonstrate that our model outperforms the previous state-of-the-art pre-training approaches in all tasks. Further analysis shows that our proposed pre-training tasks and the multilingual pre-training dataset benefit the model in understanding code changes and reviews.</p>\n", "tags": ["review"] },
236 236
{"key": "li2022exploring", "year": "2022", "title":"Exploring Representation-Level Augmentation for Code Search", "abstract": "<p>Code search, which aims at retrieving the most relevant code fragment for a given natural language query, is a common activity in software development practice. Recently, contrastive learning is widely used in code search research, where many data augmentation approaches for source code (e.g., semantic-preserving program transformation) are proposed to learn better representations. However, these augmentations are at the raw-data level, which requires additional code analysis in the preprocessing stage and additional training costs in the training stage. In this paper, we explore augmentation methods that augment data (both code and query) at representation level which does not require additional data processing and training, and based on this we propose a general format of representation-level augmentation that unifies existing methods. Then, we propose three new augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Furthermore, we theoretically analyze the advantages of the proposed augmentation methods over traditional contrastive learning methods on code search. We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. The experimental results show that our approach can consistently boost the performance of the studied code search models.</p>\n", "tags": ["search","Transformer"] },
237 237
{"key": "li2023rethinking", "year": "2023", "title":"Rethinking Negative Pairs in Code Search", "abstract": "<p>Recently, contrastive learning has become a key component in fine-tuning code search models for software development efficiency and effectiveness. It pulls together positive code snippets while pushing negative samples away given search queries. Among contrastive learning loss functions, InfoNCE is the most widely used due to its better performance. However, the following problems in the negative samples of InfoNCE may deteriorate its representation learning: 1) the existence of false negative samples in large code corpora due to duplications; 2) the failure to explicitly differentiate between the potential relevance of negative samples. For example, given the query “quick sorting algorithm”, a bubble sort implementation is less “negative” than a file-saving function. In this paper, we tackle the above problems by proposing a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss function, we apply three methods to estimate the weights of negative pairs and show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE. Theoretically, we analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation. We further discuss the superiority of the proposed loss function over other design alternatives. Extensive experiments demonstrate the effectiveness of Soft-InfoNCE and the weight estimation methods under state-of-the-art code search models on a large-scale public dataset consisting of six programming languages.</p>\n", "tags": ["search","Transformer","retrieval","optimization","representation"] },
238 +
{"key": "li2024rewriting", "year": "2024", "title":"Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search", "abstract": "<p>In code search, the Generation-Augmented Retrieval (GAR) framework, which generates exemplar code snippets to augment queries, has emerged as a promising strategy to address the principal challenge of modality misalignment between code snippets and natural language queries, particularly with the demonstrated code generation capabilities of Large Language Models (LLMs). Nevertheless, our preliminary investigations indicate that the improvements conferred by such an LLM-augmented framework are somewhat constrained. This limitation could potentially be ascribed to the fact that the generated codes, albeit functionally accurate, frequently display a pronounced stylistic deviation from the ground truth code in the codebase. In this paper, we extend the foundational GAR framework and propose a simple yet effective method that additionally Rewrites the Code (ReCo) within the codebase for style normalization. Experimental results demonstrate that ReCo significantly boosts retrieval accuracy across sparse (up to 35.7%), zero-shot dense (up to 27.6%), and fine-tuned dense (up to 23.6%) retrieval settings in diverse search scenarios. To further elucidate the advantages of ReCo and stimulate research in code style normalization, we introduce Code Style Similarity, the first metric tailored to quantify stylistic similarities in code. Notably, our empirical findings reveal the inadequacy of existing metrics in capturing stylistic nuances.</p>\n", "tags": ["search","large language models","metrics"] },
238 239
{"key": "liguori2021shellcode_ia32", "year": "2021", "title":"Shellcode_IA32: A Dataset for Automatic Shellcode Generation", "abstract": "<p>We take the first step to address the task of automatically generating shellcodes, i.e., small pieces of code used as a payload in the exploitation of a software vulnerability, starting from natural language comments. We assemble and release a novel dataset (Shellcode_IA32), consisting of challenging but common assembly instructions with their natural language descriptions. We experiment with standard methods in neural machine translation (NMT) to establish baseline performance levels on this task.</p>\n", "tags": ["code generation","dataset"] },
239 240
{"key": "lin2017program", "year": "2017", "title":"Program Synthesis from Natural Language Using Recurrent Neural Networks", "abstract": "<p>Oftentimes, a programmer may have difficulty implementing a\ndesired operation. Even when the programmer can describe her\ngoal in English, it can be difficult to translate into code. Existing\nresources, such as question-and-answer websites, tabulate specific\noperations that someone has wanted to perform in the past, but\nthey are not effective in generalizing to new tasks, to compound\ntasks that require combining previous questions, or sometimes even\nto variations of listed tasks.</p>\n\n<p>Our goal is to make programming easier and more productive by\nletting programmers use their own words and concepts to express\nthe intended operation, rather than forcing them to accommodate\nthe machine by memorizing its grammar. We have built a system\nthat lets a programmer describe a desired operation in natural language, then automatically translates it to a programming language\nfor review and approval by the programmer. Our system, Tellina,\ndoes the translation using recurrent neural networks (RNNs), a\nstate-of-the-art natural language processing technique that we augmented with slot (argument) filling and other enhancements.</p>\n\n<p>We evaluated Tellina in the context of shell scripting. We trained\nTellina’s RNNs on textual descriptions of file system operations\nand bash one-liners, scraped from the web. Although recovering\ncompletely correct commands is challenging, Tellina achieves top-3\naccuracy of 80% for producing the correct command structure. In a\ncontrolled study, programmers who had access to Tellina outperformed those who did not, even when Tellina’s predictions were\nnot completely correct, to a statistically significant degree.</p>\n", "tags": ["bimodal","code generation"] },
240 241
{"key": "lin2018nl2bash", "year": "2018", "title":"NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System", "abstract": "<p>We present new data and semantic parsing methods for the problem of mapping English sentences to Bash commands (NL2Bash). Our long-term goal is to enable any user to easily solve otherwise repetitive tasks (such as file manipulation, search, and application-specific scripting) by simply stating their intents in English. We take a first step in this domain by providing a large new dataset of challenging but commonly used commands paired with their English descriptions, along with baseline methods to establish performance levels on this task.</p>\n", "tags": ["bimodal","code generation"] },

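The li2023rethinking entry above describes inserting weight terms into the InfoNCE denominator so that suspected false negatives push less hard. A minimal sketch of that idea for a single query, with hypothetical names and a default temperature, not the paper's reference implementation:

```python
import numpy as np

def soft_info_nce(sim_pos, sim_negs, weights=None, tau=0.07):
    """Soft-InfoNCE loss for one query (illustrative sketch only).

    sim_pos:  similarity between the query and its positive code snippet
    sim_negs: similarities between the query and the negative snippets
    weights:  per-negative weight terms; all ones recovers vanilla InfoNCE
    tau:      softmax temperature (assumed value, not from the paper)
    """
    sim_negs = np.asarray(sim_negs, dtype=float)
    if weights is None:
        weights = np.ones_like(sim_negs)  # vanilla InfoNCE special case
    pos = np.exp(sim_pos / tau)
    # Each weight scales one negative's contribution to the denominator,
    # e.g. down-weighting a near-duplicate that is likely a false negative.
    neg = np.sum(np.asarray(weights, dtype=float) * np.exp(sim_negs / tau))
    return -np.log(pos / (pos + neg))
```

With all weights set to 1 the expression reduces to the standard InfoNCE for one positive against its negatives; giving a suspected false negative a weight below 1 shrinks its term in the denominator and lowers the loss, so the model is no longer pushed to separate the query from a snippet that may actually be relevant.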