
Commit e35ae85

deploy: 555a6c1

1 parent 13d17c6

28 files changed: +8071 -3992 lines

index.html (4 additions, 0 deletions)
@@ -119,6 +119,7 @@ <h3 id="machine-learning-on-source-code">Machine Learning on Source Code</h3>
 
 <h4 id="-browse-papers-by-tag">🏷 Browse Papers by Tag</h4>
 
+<tag><a href="/tags.html#Adapters">Adapters</a></tag>
 <tag><a href="/tags.html#adversarial">adversarial</a></tag>
 <tag><a href="/tags.html#API">API</a></tag>
 <tag><a href="/tags.html#autocomplete">autocomplete</a></tag>
@@ -127,7 +128,9 @@ <h4 id="-browse-papers-by-tag">🏷 Browse Papers by Tag</h4>
 <tag><a href="/tags.html#clone">clone</a></tag>
 <tag><a href="/tags.html#code completion">code completion</a></tag>
 <tag><a href="/tags.html#code generation">code generation</a></tag>
+<tag><a href="/tags.html#Code Refinement">Code Refinement</a></tag>
 <tag><a href="/tags.html#code similarity">code similarity</a></tag>
+<tag><a href="/tags.html#Code Summarization">Code Summarization</a></tag>
 <tag><a href="/tags.html#compilation">compilation</a></tag>
 <tag><a href="/tags.html#dataset">dataset</a></tag>
 <tag><a href="/tags.html#decompilation">decompilation</a></tag>
@@ -154,6 +157,7 @@ <h4 id="-browse-papers-by-tag">🏷 Browse Papers by Tag</h4>
 <tag><a href="/tags.html#naming">naming</a></tag>
 <tag><a href="/tags.html#optimization">optimization</a></tag>
 <tag><a href="/tags.html#pattern mining">pattern mining</a></tag>
+<tag><a href="/tags.html#Pre-trained Programming Language">Pre-trained Programming Language</a></tag>
 <tag><a href="/tags.html#pretraining">pretraining</a></tag>
 <tag><a href="/tags.html#program analysis">program analysis</a></tag>
 <tag><a href="/tags.html#refactoring">refactoring</a></tag>

paper-abstracts.json (1 addition, 0 deletions)
@@ -348,6 +348,7 @@
 {"key": "roziere2021dobf", "year": "2021", "title":"DOBF: A Deobfuscation Pre-Training Objective for Programming Languages", "abstract": "<p>Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 13% in unsupervised code translation, and 24% in natural language code search. Incidentally, we found that our pre-trained model is able to de-obfuscate fully obfuscated source files, and to suggest descriptive variable names.</p>\n", "tags": ["pretraining"] },
 {"key": "roziere2021leveraging", "year": "2021", "title":"Leveraging Automated Unit Tests for Unsupervised Code Translation", "abstract": "<p>With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java → Python and Python → C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.</p>\n", "tags": ["migration"] },
 {"key": "russell2018automated", "year": "2018", "title":"Automated Vulnerability Detection in Source Code Using Deep Representation Learning", "abstract": "<p>Increasing numbers of software vulnerabilities are discovered every year whether they are reported publicly or discovered internally in proprietary code. These vulnerabilities can pose serious risk of exploit and result in system compromise, information leaks, or denial of service. We leveraged the wealth of C and C++ open-source code available to develop a large-scale function-level vulnerability detection system using machine learning. To supplement existing labeled vulnerability datasets, we compiled a vast dataset of millions of open-source functions and labeled it with carefully-selected findings from three different static analyzers that indicate potential exploits. Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. We evaluated our tool on code from both real software packages and the NIST SATE IV benchmark dataset. Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection.</p>\n", "tags": ["program analysis"] },
+{"key": "saberi2023model", "year": "2023", "title":"Model-Agnostic Syntactical Information for Pre-Trained Programming Language Models", "abstract": "<p>Pre-trained Programming Language Models (PPLMs) have achieved many recent state-of-the-art results for code-related software engineering tasks. Though some studies use data flow or propose tree-based models that utilize the Abstract Syntax Tree (AST), most PPLMs do not fully utilize the rich syntactical information in source code; the input is still treated as a sequence of tokens. There are two issues: the first is computational inefficiency due to the quadratic relationship between input length and attention complexity. Second, any syntactical information, when needed as an extra input to the current PPLMs, requires the model to be pre-trained from scratch, wasting all the computational resources already spent on pre-training the current models. In this work, we propose Named Entity Recognition (NER) adapters, lightweight modules that can be inserted into Transformer blocks to learn type information extracted from the AST. These adapters can be used with current PPLMs such as CodeBERT, GraphCodeBERT, and CodeT5. We train the NER adapters using a novel Token Type Classification objective function (TTC). We insert our proposed adapters into CodeBERT, building CodeBERTER, and evaluate the performance on two tasks: code refinement and code summarization. CodeBERTER improves the accuracy of code refinement from 16.4 to 17.8 while using 20% of the training parameter budget of the fully fine-tuned approach, and improves the BLEU score of code summarization from 14.75 to 15.90 while using 77% fewer training parameters than full fine-tuning.</p>\n", "tags": ["Adapters","Pre-trained Programming Language","Code Refinement","Code Summarization"] },
 {"key": "sahu2022learning", "year": "2022", "title":"Learning to Answer Semantic Queries over Code", "abstract": "<p>During software development, developers need answers to queries about semantic aspects of code. Even though extractive question-answering using neural approaches has been studied widely in natural languages, the problem of answering semantic queries over code using neural networks has not yet been explored. This is mainly because there is no existing dataset with extractive question and answer pairs over code involving complex concepts and long chains of reasoning. We bridge this gap by building a new, curated dataset called CodeQueries, and proposing a neural question-answering methodology over code.\nWe build upon state-of-the-art pre-trained models of code to predict answer and supporting-fact spans. Given a query and code, only some of the code may be relevant to answer the query. We first experiment under an ideal setting where only the relevant code is given to the model and show that our models do well. We then experiment under three pragmatic considerations: (1) scaling to large-size code, (2) learning from a limited number of examples and (3) robustness to minor syntax errors in code. Our results show that while a neural model can be resilient to minor syntax errors in code, increasing size of code, presence of code that is not relevant to the query, and reduced number of training examples limit the model performance. We are releasing our data and models to facilitate future work on the proposed problem of answering semantic queries over code.</p>\n", "tags": ["static analysis","Transformer"] },
 {"key": "saini2018oreo", "year": "2018", "title":"Oreo: detection of clones in the twilight zone", "abstract": "<p>Source code clones are categorized into four types of increasing difficulty of detection, ranging from purely textual (Type-1) to purely semantic (Type-4). Most clone detectors reported in the literature work well up to Type-3, which accounts for syntactic differences. In between Type-3 and Type-4, however, there lies a spectrum of clones that, although still exhibiting some syntactic similarities, are extremely hard to detect – the Twilight Zone. Most clone detectors reported in the literature fail to operate in this zone. We present Oreo, a novel approach to source code clone detection that not only detects Type-1 to Type-3 clones accurately, but is also capable of detecting harder-to-detect clones in the Twilight Zone. Oreo is built using a combination of machine learning, information retrieval, and software metrics. We evaluate the recall of Oreo on BigCloneBench, and perform manual evaluation for precision. Oreo has both high recall and precision. More importantly, it pushes the boundary in detection of clones with moderate to weak syntactic similarity in a scalable manner.</p>\n", "tags": ["clone"] },
 {"key": "santos2018syntax", "year": "2018", "title":"Syntax and Sensibility: Using language models to detect and correct syntax errors", "abstract": "<p>Syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of experience that help them quickly resolve these frustrating errors. Standard LR parsers are of little help, typically resolving syntax errors and their precise location poorly. We propose a methodology that locates where syntax errors occur, and suggests possible changes to the token stream that can fix the error identified. This methodology finds syntax errors by using language models trained on correct source code to find tokens that seem out of place. Fixes are synthesized by consulting the language models to determine what tokens are more likely at the estimated error location. We compare <em>n</em>-gram and LSTM (long short-term memory) language models for this task, each trained on a large corpus of Java code collected from GitHub. Unlike prior work, our methodology does not assume that the problem source code comes from the same domain as the training data. We evaluated against a repository of real student mistakes. Our tool is able to find a syntactically-valid fix within its top-2 suggestions, often producing the exact fix that the student used to resolve the error. The results show that this tool and methodology can locate and suggest corrections for syntax errors. Our methodology is of practical use to all programmers, but will be especially useful to novices frustrated with incomprehensible syntax errors.</p>\n", "tags": ["repair","language model"] },
