Commit 551a362

committed deploy: a191fcb
1 parent 95e7b21 commit 551a362

20 files changed: +3920 / -3627 lines

paper-abstracts.json

Lines changed: 1 addition & 0 deletions
@@ -192,6 +192,7 @@
 {"key": "kharkar2022learning", "year": "2022", "title":"Learning to Reduce False Positives in Analytic Bug Detectors", "abstract": "<p>Due to increasingly complex software design and rapid iterative development, code defects and security vulnerabilities are prevalent in modern software. In response, programmers rely on static analysis tools to regularly scan their codebases and find potential bugs. In order to maximize coverage, however, these tools generally tend to report a significant number of false positives, requiring developers to manually verify each warning. To address this problem, we propose a Transformer-based learning approach to identify false positive bug warnings. We demonstrate that our models can improve the precision of static analysis by 17.5%. In addition, we validated the generalizability of this approach across two major bug types: null dereference and resource leak.</p>\n", "tags": ["Transformer","static analysis"] },
 {"key": "kim2020code", "year": "2020", "title":"Code Prediction by Feeding Trees to Transformers", "abstract": "<p>In this paper, we describe how to leverage Transformer, a recent neural architecture for learning from sequential data (such as text), for code completion. As in the realm of natural language processing, Transformers surpass the prediction accuracy achievable by RNNs; we provide an experimental confirmation of this over a Python dataset.</p>\n\n<p>Furthermore, we show that the way to obtain even better accuracy from Transformers is to expose the syntactic structure of code, which is easily recovered by parsing, to the neural network. This works significantly better than presenting the code as a linear token sequence, which is how Transformers were originally intended to be used.</p>\n\n<p>To accomplish this, we propose a novel enhancement to the self-attention mechanism of the Transformer. We enable the mechanism to learn weights—that is, how much to focus on each preceding token in the input—not only on the basis of a token’s value, but also on the basis of the spatial relationships, as in their positions in the abstract syntax tree, between each pair of tokens.</p>\n\n<p>We provide comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset, as well as on a Python corpus internal to Facebook.</p>\n", "tags": ["autocomplete"] },
 {"key": "koc2017learning", "year": "2017", "title":"Learning a Classifier for False Positive Error Reports Emitted by Static Code Analysis Tools", "abstract": "<p>The large scale and high complexity of modern software systems\nmake perfectly precise static code analysis (SCA) infeasible. Therefore SCA tools often over-approximate, so not to miss any real\nproblems. This, however, comes at the expense of raising false\nalarms, which, in practice, reduces the usability of these tools.</p>\n\n<p>To partially address this problem, we propose a novel learning\nprocess whose goal is to discover program structures that cause\na given SCA tool to emit false error reports, and then to use this\ninformation to predict whether a new error report is likely to be a\nfalse positive as well. To do this, we first preprocess code to isolate\nthe locations that are related to the error report. Then, we apply\nmachine learning techniques to the preprocessed code to discover\ncorrelations and to learn a classifier.</p>\n\n<p>We evaluated this approach in an initial case study of a widely-used SCA tool for Java. Our results showed that for our dataset\nwe could accurately classify a large majority of false positive error\nreports. Moreover, we identified some common coding patterns that\nled to false positive errors. We believe that SCA developers may be\nable to redesign their methods to address these patterns and reduce\nfalse positive error reports.</p>\n", "tags": ["static analysis"] },
+{"key": "kocetkov2022stack", "year": "2022", "title":"The Stack: 3TB of permissively licensed source code", "abstract": "<p>Large Language Models (LLMs) play an ever-increasing role in the field of\nArtificial Intelligence (AI)–not only for natural language processing but also\nfor code understanding and generation. To stimulate open and responsible\nresearch on LLMs for code, we introduce The Stack, a 3.1 TB dataset\nconsisting of permissively licensed source code in 30 programming languages.\nWe describe how we collect the full dataset, construct a permissively licensed\nsubset, and present promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that\n(1) near-deduplicating the data significantly boosts performance across all\nexperiments, and (2) it is possible to match previously reported HumanEval\nand MBPP performance using only permissively licensed data. We make the\ndataset available at https://hf.co/BigCode and give developers the possibility to have their code removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.</p>\n", "tags": ["dataset"] },
 {"key": "korbak2021energy", "year": "2021", "title":"Energy-Based Models for Code Generation under Compilability Constraints", "abstract": "<p>Neural language models can be successfully trained on source code, leading to applications such as code completion. However, their versatile autoregressive self-supervision objective overlooks important global sequence-level features that are present in the data such as syntactic correctness or compilability. In this work, we pose the problem of learning to generate compilable code as constraint satisfaction. We define an Energy-Based Model (EBM) representing a pre-trained generative model with an imposed constraint of generating only compilable sequences. We then use the KL-Adaptive Distributional Policy Gradient algorithm (Khalifa et al., 2021) to train a generative model approximating the EBM. We conduct experiments showing that our proposed approach is able to improve compilability rates without sacrificing diversity and complexity of the generated samples.</p>\n", "tags": ["code generation"] },
 {"key": "kovalenko2019pathminer", "year": "2019", "title":"PathMiner : A Library for Mining of Path-Based Representations of Code", "abstract": "<p>One recent, significant advance in modeling source code for machine learning algorithms has been the introduction of path-based representation – an approach consisting in representing a snippet of code as a collection of paths from its syntax tree. Such representation efficiently captures the structure of code, which, in turn, carries its semantics and other information.\nBuilding the path-based representation involves parsing the code and extracting the paths from its syntax tree; these steps build up to a substantial technical job. With no common reusable toolkit existing for this task, the burden of mining diverts the focus of researchers from the essential work and hinders newcomers in the field of machine learning on code.</p>\n\n<p>In this paper, we present PathMiner – an open-source library for mining path-based representations of code. PathMiner is fast, flexible, well-tested, and easily extensible to support input code in any common programming language. Preprint [https://doi.org/10.5281/zenodo.2595271]; released tool [https://doi.org/10.5281/zenodo.2595257].</p>\n", "tags": ["representation","grammar"] },
 {"key": "kremenek2007factor", "year": "2007", "title":"A Factor Graph Model for Software Bug Finding", "abstract": "<p>Automatic tools for finding software errors require\nknowledge of the rules a program must obey, or\n“specifications,” before they can identify bugs. We\npresent a method that combines factor graphs and\nstatic program analysis to automatically infer specifications directly from programs. We illustrate the\napproach on inferring functions in C programs that\nallocate and release resources, and evaluate the approach on three codebases: SDL, OpenSSH, and\nthe OS kernel for Mac OS X (XNU). The inferred\nspecifications are highly accurate and with them we\nhave discovered numerous bugs.</p>\n\n", "tags": ["program analysis"] },
