
Commit 5ec41f6 (1 parent: 38ae77e)

deploy: 2c433a9

28 files changed: +6293 −4094 lines

index.html

Lines changed: 2 additions & 0 deletions
@@ -129,6 +129,7 @@ <h4 id="-browse-papers-by-tag">🏷 Browse Papers by Tag</h4>
 <tag><a href="/tags.html#code generation">code generation</a></tag>
 <tag><a href="/tags.html#code similarity">code similarity</a></tag>
 <tag><a href="/tags.html#compilation">compilation</a></tag>
+<tag><a href="/tags.html#completion">completion</a></tag>
 <tag><a href="/tags.html#dataset">dataset</a></tag>
 <tag><a href="/tags.html#decompilation">decompilation</a></tag>
 <tag><a href="/tags.html#defect">defect</a></tag>
@@ -159,6 +160,7 @@ <h4 id="-browse-papers-by-tag">🏷 Browse Papers by Tag</h4>
 <tag><a href="/tags.html#refactoring">refactoring</a></tag>
 <tag><a href="/tags.html#repair">repair</a></tag>
 <tag><a href="/tags.html#representation">representation</a></tag>
+<tag><a href="/tags.html#retrieval">retrieval</a></tag>
 <tag><a href="/tags.html#review">review</a></tag>
 <tag><a href="/tags.html#search">search</a></tag>
 <tag><a href="/tags.html#static analysis">static analysis</a></tag>

paper-abstracts.json

Lines changed: 1 addition & 0 deletions
@@ -460,6 +460,7 @@
 {"key": "zhang2021bag", "year": "2021", "title":"Bag-of-Words Baselines for Semantic Code Search", "abstract": "<p>The task of semantic code search is to retrieve code snippets from a source code corpus based on an information need expressed in natural language. The semantic gap between natural language and programming languages has for long been regarded as one of the most significant obstacles to the effectiveness of keyword-based information retrieval (IR) methods. It is a common assumption that “traditional” bag-of-words IR methods are poorly suited for semantic code search: our work empirically investigates this assumption. Specifically, we examine the effectiveness of two traditional IR methods, namely BM25 and RM3, on the CodeSearchNet Corpus, which consists of natural language queries paired with relevant code snippets. We find that the two keyword-based methods outperform several pre-BERT neural models. We also compare several code-specific data pre-processing strategies and find that specialized tokenization improves effectiveness.</p>\n", "tags": ["search"] },
 {"key": "zhang2021disentangled.md", "year": "2021", "title":"Disentangled Code Representation Learning for Multiple Programming Languages", "abstract": "<p>Developing effective distributed representations of source code is fundamental yet challenging for many software engineering tasks such as code clone detection, code search, code translation and transformation. However, current code embedding approaches that represent the semantic and syntax of code in a mixed way are less interpretable and the resulting embedding can not be easily generalized across programming languages. In this paper, we propose a disentangled code representation learning approach to separate the semantic from the syntax of source code under a multi-programming-language setting, obtaining better interpretability and generalizability. Specially, we design three losses dedicated to the characteristics of source code to enforce the disentanglement effectively. We conduct comprehensive experiments on a real-world dataset composed of programming exercises implemented by multiple solutions that are semantically identical but grammatically distinguished. The experimental results validate the superiority of our proposed disentangled code representation, compared to several baselines, across three types of downstream tasks, i.e., code clone detection, code translation, and code-to-code search.</p>\n", "tags": ["representation"] },
 {"key": "zhang2022coditt5", "year": "2022", "title":"CoditT5: Pretraining for Source Code and Natural Language Editing", "abstract": "<p>Pretrained language models have been shown to be effective in many software-related generation tasks; however, they are not well-suited for editing tasks as they are not designed to reason about edits. To address this, we propose a novel pretraining objective which explicitly models edits and use it to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments. We fine-tune it on various downstream editing tasks, including comment updating, bug fixing, and automated code review. By outperforming pure generation-based models, we demonstrate the generalizability of our approach and its suitability for editing tasks. We also show how a pure generation model and our edit-based model can complement one another through simple reranking strategies, with which we achieve state-of-the-art performance for the three downstream editing tasks.</p>\n", "tags": ["Transformer","edit"] },
+{"key": "zhang2023repocoder", "year": "2023", "title":"RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation", "abstract": "<p>The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. While for automated code completion tools, it is difficult to utilize the useful information scattered in different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model, which allows for the effective utilization of repository-level information for code completion and grants the ability to generate code at various levels of granularity. Furthermore, RepoCoder utilizes a novel iterative retrieval-generation paradigm that bridges the gap between retrieval context and the intended completion target. We also propose a new benchmark RepoEval, which consists of the latest and high-quality real-world repositories covering line, API invocation, and function body completion scenarios. We test the performance of RepoCoder by using various combinations of code retrievers and generators. Experimental results indicate that RepoCoder significantly improves the zero-shot code completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research.</p>\n", "tags": ["completion","Transformer","retrieval"] },
 {"key": "zhao2018neural", "year": "2018", "title":"Neural-Augumented Static Analysis of Android Communication", "abstract": "<p>We address the problem of discovering communication links between applications in the popular Android mobile operating system, an important problem for security and privacy in Android. Any scalable static analysis in this complex setting is bound to produce an excessive amount of false-positives, rendering it impractical. To improve precision, we propose to augment static analysis with a trained neural-network model that estimates the probability that a communication link truly exists. We describe a neural-network architecture that encodes abstractions of communicating objects in two applications and estimates the probability with which a link indeed exists. At the heart of our architecture are type-directed encoders (TDE), a general framework for elegantly constructing encoders of a compound data type by recursively composing encoders for its constituent types. We evaluate our approach on a large corpus of Android applications, and demonstrate that it achieves very high accuracy. Further, we conduct thorough interpretability studies to understand the internals of the learned neural networks.</p>\n", "tags": ["program analysis"] },
 {"key": "zhao2019neural", "year": "2019", "title":"Neural Networks for Modeling Source Code Edits", "abstract": "<p>Programming languages are emerging as a challenging and interesting domain for machine learning. A core task, which has received significant attention in recent years, is building generative models of source code. However, to our knowledge, previous generative models have always been framed in terms of generating static snapshots of code. In this work, we instead treat source code as a dynamic object and tackle the problem of modeling the edits that software developers make to source code files. This requires extracting intent from previous edits and leveraging it to generate subsequent edits. We develop several neural networks and use synthetic data to test their ability to learn challenging edit patterns that require strong generalization. We then collect and train our models on a large-scale dataset of Google source code, consisting of millions of fine-grained edits from thousands of Python developers. From the modeling perspective, our main conclusion is that a new composition of attentional and pointer network components provides the best overall performance and scalability. From the application perspective, our results provide preliminary evidence of the feasibility of developing tools that learn to predict future edits.</p>\n", "tags": ["edit"] },
 {"key": "zhong2018generating", "year": "2018", "title":"Generating Regular Expressions from Natural Language Specifications: Are We There Yet?", "abstract": "<p>Recent state-of-the-art approaches automatically generate regular expressions from natural language specifications. Given that these approaches use only synthetic data in both training datasets and validation/test datasets, a natural question arises: are these approaches effective to address various real-world situations? To explore this question, in this paper, we conduct a characteristic study on comparing two synthetic datasets used by the recent research and a real-world dataset collected from the Internet, and conduct an experimental study on applying a state-of-the-art approach on the real-world dataset. Our study results suggest the existence of distinct characteristics between the synthetic datasets and the real-world dataset, and the state-of-the-art approach (based on a model trained from a synthetic dataset) achieves extremely low effectiveness when evaluated on real-world data, much lower than the effectiveness when evaluated on the synthetic dataset. We also provide initial analysis on some of those challenging cases and discuss future directions.</p>\n", "tags": ["bimodal","code generation"] },
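The zhang2021bag entry above reports that classic BM25 retrieval is a surprisingly strong baseline for code search. As a minimal sketch of BM25 scoring (a self-contained textbook implementation, not the paper's code; the whitespace tokenization and the `k1`/`b` defaults are illustrative assumptions):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    # Document frequency of each distinct query term.
    df = {t: sum(1 for d in corpus_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The paper's finding that "specialized tokenization improves effectiveness" would show up here in how `corpus_tokens` is produced (e.g. splitting camelCase identifiers), not in the scoring function itself.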
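The newly added zhang2023repocoder entry describes an iterative retrieval-generation loop: retrieve similar repository snippets, generate a draft completion, then retrieve again using the draft so the context matches the intended target. A hedged sketch of that paradigm, with a toy Jaccard-similarity retriever and a caller-supplied `generate` function standing in for the code language model (all names are hypothetical, not the paper's implementation):

```python
def jaccard(a, b):
    """Token-set similarity between two code strings (toy retriever metric)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve(query, repo_snippets, k=2):
    """Return the k repository snippets most similar to the query."""
    return sorted(repo_snippets, key=lambda s: jaccard(query, s), reverse=True)[:k]

def iterative_completion(unfinished, repo_snippets, generate, rounds=2):
    """Alternate retrieval and generation, RepoCoder-style."""
    completion = ""
    for _ in range(rounds):
        # Round 1 retrieves with the unfinished code alone; later rounds
        # append the previous draft so retrieval better matches the target.
        query = (unfinished + " " + completion).strip()
        context = retrieve(query, repo_snippets)
        prompt = "\n".join(context) + "\n" + unfinished
        completion = generate(prompt)
    return completion
```

The paper's retriever and generator are far stronger (a similarity-based code retriever plus a pre-trained code LM); this only illustrates the control flow of the iterative paradigm.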

0 commit comments

Comments
 (0)