|
188 | 188 | {"key": "karampatsis2020big", "year": "2020", "title":"Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code", "abstract": "<p>Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.</p>\n", "tags": ["language model"] }, |
189 | 189 | {"key": "karampatsis2020scelmo", "year": "2020", "title":"SCELMo: Source Code Embeddings from Language Models", "abstract": "<p>Continuous embeddings of tokens in computer programs have been used to support a variety of software development tools, including readability, code search, and program repair. Contextual embeddings are common in natural language processing but have not been previously applied in software engineering. We introduce a new set of deep contextualized word representations for computer programs based on language models. We train a set of embeddings using the ELMo (embeddings from language models) framework of Peters et al (2018). We investigate whether these embeddings are effective when fine-tuned for the downstream task of bug detection. We show that even a low-dimensional embedding trained on a relatively small corpus of programs can improve a state-of-the-art machine learning system for bug detection.</p>\n", "tags": ["pretraining","defect"] }, |
190 | 190 | {"key": "karmakar2021what", "year": "2021", "title":"What do pre-trained code models know about code?", "abstract": "<p>Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation, code summarization, among others. However, whether the vector representations from these pre-trained models comprehensively encode characteristics of source code well enough to be applicable to a broad spectrum of downstream tasks remains an open question.</p>\n\n<p>One way to investigate this is with diagnostic tasks called probes. In this paper, we construct four probing tasks (probing for surface-level, syntactic, structural, and semantic information) for pre-trained code models. We show how probes can be used to identify whether models are deficient in (understanding) certain code properties, characterize different model layers, and get insight into the model sample-efficiency.</p>\n\n<p>We probe four models that vary in their expected knowledge of code properties: BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code, and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow). While GraphCodeBERT performs more consistently overall, we find that BERT performs surprisingly well on some code tasks, which calls for further investigation.</p>\n", "tags": ["Transformer"] }, |
| 191 | +{"key": "karmakar2022jemma", "year": "2022", "title":"JEMMA: An Extensible Java Dataset for ML4Code Applications", "abstract": "<p>Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code’s richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.</p>\n", "tags": ["dataset"] }, |
191 | 192 | {"key": "karpathy2015visualizing", "year": "2015", "title":"Visualizing and Understanding Recurrent Networks", "abstract": "<p>Recurrent Neural Networks (RNNs), and specifically a variant with Long Short-Term Memory (LSTM), are enjoying renewed interest as a result of successful\napplications in a wide range of machine learning problems that involve sequential\ndata. However, while LSTMs provide exceptional results in practice, the source\nof their performance and their limitations remain rather poorly understood. Using character-level language models as an interpretable testbed, we aim to bridge\nthis gap by providing an analysis of their representations, predictions and error\ntypes. In particular, our experiments reveal the existence of interpretable cells that\nkeep track of long-range dependencies such as line lengths, quotes and brackets.\nMoreover, our comparative analysis with finite horizon n-gram models traces the\nsource of the LSTM improvements to long-range structural dependencies. Finally,\nwe provide analysis of the remaining errors and suggests areas for further study.</p>\n\n", "tags": ["language model","code generation"] }, |
192 | 193 | {"key": "katz2019towards", "year": "2019", "title":"Towards Neural Decompilation", "abstract": "<p>We address the problem of automatic decompilation, converting a program in low-level representation back to a higher-level human-readable programming language. The problem of decompilation is extremely important for security researchers. Finding vulnerabilities and understanding how malware operates is much easier when done over source code.</p>\n\n<p>The importance of decompilation has motivated the construction of hand-crafted rule-based decompilers. Such decompilers have been designed by experts to detect specific control-flow structures and idioms in low-level code and lift them to source level. The cost of supporting additional languages or new language features in these models is very high.</p>\n\n<p>We present a novel approach to decompilation based on neural machine translation. The main idea is to automatically learn a decompiler from a given compiler. Given a compiler from a source language S to a target language T , our approach automatically trains a decompiler that can translate (decompile) T back to S . We used our framework to decompile both LLVM IR and x86 assembly to C code with high success rates. Using our LLVM and x86 instantiations, we were able to successfully decompile over 97% and 88% of our benchmarks respectively.</p>\n", "tags": ["decompilation"] }, |
193 | 194 | {"key": "key2022speak", "year": "2022", "title":"I Speak, You Verify: Toward Trustworthy Neural Program Synthesis", "abstract": "<p>We develop an approach for improving the trustworthiness and overall accuracy of program synthesizers based on large language models for source code. Given a natural language description of a programming problem, our method samples both candidate programs as well as candidate predicates specifying how the program should behave. We learn to analyze the agreement between programs and predicates to judge both which program is most likely to be correct, and also judge whether the language model is able to solve the programming problem in the first place. This latter capacity allows favoring high precision over broad recall: fostering trust by only proposing a program when the system is certain that it is correct.</p>\n", "tags": ["synthesis"] }, |
|