
Commit a83b578

deploy: bbb9ab2
1 parent 92ce2fa commit a83b578

28 files changed (+3657, -3382 lines)

paper-abstracts.json

Lines changed: 1 addition & 0 deletions
@@ -330,6 +330,7 @@
{"key": "roziere2021dobf", "year": "2021", "title":"DOBF: A Deobfuscation Pre-Training Objective for Programming Languages", "abstract": "<p>Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 13% in unsupervised code translation, and 24% in natural language code search. Incidentally, we found that our pre-trained model is able to de-obfuscate fully obfuscated source files, and to suggest descriptive variable names.</p>\n", "tags": ["pretraining"] },
{"key": "roziere2021leveraging", "year": "2021", "title":"Leveraging Automated Unit Tests for Unsupervised Code Translation", "abstract": "<p>With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive to small changes; a single token can result in compilation failures or erroneous programs, unlike natural languages where small inaccuracies may not change the meaning of a sentence. To address this issue, we propose to leverage an automated unit-testing system to filter out invalid translations, thereby creating a fully tested parallel corpus. We found that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated, comfortably outperforming the state-of-the-art for all language pairs studied. In particular, for Java → Python and Python → C++ we outperform the best previous methods by more than 16% and 24% respectively, reducing the error rate by more than 35%.</p>\n", "tags": ["migration"] },
{"key": "russell2018automated", "year": "2018", "title":"Automated Vulnerability Detection in Source Code Using Deep Representation Learning", "abstract": "<p>Increasing numbers of software vulnerabilities are discovered every year whether they are reported publicly or discovered internally in proprietary code. These vulnerabilities can pose serious risk of exploit and result in system compromise, information leaks, or denial of service. We leveraged the wealth of C and C++ open-source code available to develop a large-scale function-level vulnerability detection system using machine learning. To supplement existing labeled vulnerability datasets, we compiled a vast dataset of millions of open-source functions and labeled it with carefully-selected findings from three different static analyzers that indicate potential exploits. Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. We evaluated our tool on code from both real software packages and the NIST SATE IV benchmark dataset. Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection.</p>\n", "tags": ["program analysis"] },
+{"key": "sahu2022learning", "year": "2022", "title":"Learning to Answer Semantic Queries over Code", "abstract": "<p>During software development, developers need answers to queries about semantic aspects of code. Even though extractive question-answering using neural approaches has been studied widely in natural languages, the problem of answering semantic queries over code using neural networks has not yet been explored. This is mainly because there is no existing dataset with extractive question and answer pairs over code involving complex concepts and long chains of reasoning. We bridge this gap by building a new, curated dataset called CodeQueries, and proposing a neural question-answering methodology over code. We build upon state-of-the-art pre-trained models of code to predict answer and supporting-fact spans. Given a query and code, only some of the code may be relevant to answer the query. We first experiment under an ideal setting where only the relevant code is given to the model and show that our models do well. We then experiment under three pragmatic considerations: (1) scaling to large-size code, (2) learning from a limited number of examples and (3) robustness to minor syntax errors in code. Our results show that while a neural model can be resilient to minor syntax errors in code, increasing size of code, presence of code that is not relevant to the query, and reduced number of training examples limit the model performance. We are releasing our data and models to facilitate future work on the proposed problem of answering semantic queries over code.</p>\n", "tags": ["static analysis","Transformer"] },
{"key": "saini2018oreo", "year": "2018", "title":"Oreo: detection of clones in the twilight zone", "abstract": "<p>Source code clones are categorized into four types of increasing difficulty of detection, ranging from purely textual (Type-1) to purely semantic (Type-4). Most clone detectors reported in the literature work well up to Type-3, which accounts for syntactic differences. In between Type-3 and Type-4, however, there lies a spectrum of clones that, although still exhibiting some syntactic similarities, are extremely hard to detect – the Twilight Zone. Most clone detectors reported in the literature fail to operate in this zone. We present Oreo, a novel approach to source code clone detection that not only detects Type-1 to Type-3 clones accurately, but is also capable of detecting harder-to-detect clones in the Twilight Zone. Oreo is built using a combination of machine learning, information retrieval, and software metrics. We evaluate the recall of Oreo on BigCloneBench, and perform manual evaluation for precision. Oreo has both high recall and precision. More importantly, it pushes the boundary in detection of clones with moderate to weak syntactic similarity in a scalable manner.</p>\n", "tags": ["clone"] },
{"key": "santos2018syntax", "year": "2018", "title":"Syntax and Sensibility: Using language models to detect and correct syntax errors", "abstract": "<p>Syntax errors are made by novice and experienced programmers alike; however, novice programmers lack the years of experience that help them quickly resolve these frustrating errors. Standard LR parsers are of little help, typically resolving syntax errors and their precise location poorly. We propose a methodology that locates where syntax errors occur, and suggests possible changes to the token stream that can fix the error identified. This methodology finds syntax errors by using language models trained on correct source code to find tokens that seem out of place. Fixes are synthesized by consulting the language models to determine what tokens are more likely at the estimated error location. We compare <em>n</em>-gram and LSTM (long short-term memory) language models for this task, each trained on a large corpus of Java code collected from GitHub. Unlike prior work, our methodology does not assume that the problem source code comes from the same domain as the training data. We evaluated against a repository of real student mistakes. Our tools are able to find a syntactically-valid fix within their top-2 suggestions, often producing the exact fix that the student used to resolve the error. The results show that this tool and methodology can locate and suggest corrections for syntax errors. Our methodology is of practical use to all programmers, but will be especially useful to novices frustrated with incomprehensible syntax errors.</p>\n", "tags": ["repair","language model"] },
{"key": "saraiva2015products", "year": "2015", "title":"Products, Developers, and Milestones: How Should I Build My N-Gram Language Model", "abstract": "<p>Recent work has shown that although programming languages enable source code to be rich and complex, most code tends to be repetitive and predictable. The use of natural language processing (NLP) techniques applied to source code such as n-gram language models show great promise in areas such as code completion, aiding impaired developers, and code search. In this paper, we address three questions related to different methods of constructing language models in an industrial context. Specifically, we ask: (1) Do application specific, but smaller language models perform better than language models across applications? (2) Are developer specific language models effective and do they differ depending on what parts of the codebase a developer is working in? (3) Finally, do language models change over time, i.e., does a language model from early development change later on in development? The answers to these questions enable techniques that make use of programming language models in development to choose the model training corpus more effectively.</p>\n\n<p>We evaluate these questions by building 28 language models across developers, time periods, and applications within Microsoft Office and present the results in this paper. We find that developer and application specific language models perform better than models from the entire codebase, but that temporality has little to no effect on language model performance.</p>\n", "tags": ["language model"] },
