|
416 | 416 | {"key": "svyatkovskiy2020intellicode", "year": "2020", "title":"IntelliCode Compose: Code Generation Using Transformer", "abstract": "<p>In software development through integrated development environments (IDEs), code completion is one of the most widely used features. Nevertheless, majority of integrated development environments only support completion of methods and APIs, or arguments.\nIn this paper, we introduce IntelliCode Compose − a general-purpose multilingual code completion tool which is capable of predicting sequences of code tokens of arbitrary types, generating up to entire lines of syntactically correct code. It leverages state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, C#, JavaScript and TypeScript programming languages. IntelliCode Compose is deployed as a cloud-based web service. It makes use of client-side tree-based caching, efficient parallel implementation of the beam search decoder, and compute graph optimizations to meet edit-time completion suggestion requirements in the Visual Studio Code IDE and Azure Notebook.\nOur best model yields an average edit similarity of 86.7% and a perplexity of 1.82 for Python programming language.</p>\n", "tags": ["autocomplete","code generation","synthesis","language model","pretraining"] }, |
417 | 417 | {"key": "szafraniec2022code", "year": "2022", "title":"Code Translation with Compiler Representations", "abstract": "<p>In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation. However, they treat the code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code which have different semantics in different languages. The consequence is low quality translation, reducing the practicality of NMT, and stressing the need for an approach significantly increasing its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and up to 79% for the Java - Rust pair. We extend previous test sets for code translation, by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study using IRs as intermediary pivot for translation.</p>\n", "tags": ["Transformer","migration","decompilation"] }, |
418 | 418 | {"key": "tabassum2020code", "year": "2020", "title":"Code and Named Entity Recognition in StackOverflow", "abstract": "<p>There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, which lead to an absolute increase of +10 F-1 score over off-the-shelf BERT. We also present the SoftNER model which achieves an overall 79.10 F1 score for code and named entity recognition on StackOverflow data. Our SoftNER model incorporates a context-independent code token classifier with corpus-level features to improve the BERT-based tagging model.</p>\n", "tags": ["dataset","information extraction"] }, |
| 419 | +{"key": "tan2024llm4decompile", "year": "2024", "title":"LLM4Decompile: Decompiling Binary Code with Large Language Models", "abstract": "<p>Decompilation aims to restore compiled code to human-readable source code, but struggles with details like names and structure. Large language models (LLMs) show promise for programming tasks, motivating their application to decompilation. However, there does not exist any open-source LLM for decompilation. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. Therefore, we release the first open-access decompilation LLMs ranging from 1B to 33B pre-trained on 4 billion tokens of C source code and the corresponding assembly code. The open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes the importance of evaluating the decompilation model from the perspective of program semantics. Experiments indicate that our LLM4Decompile has demonstrated the capability to accurately decompile 21% of the assembly code, which achieves a 50% improvement over GPT-4. Our code, dataset, and models are released at this <a href=\"https://github.com/albertan017/LLM4Decompile\">https URL</a></p>\n", "tags": ["decompilation","translation","evaluation","large language models","LLM"] }, |
419 | 420 | {"key": "tarlow2019learning", "year": "2019", "title":"Learning to Fix Build Errors with Graph2Diff Neural Networks", "abstract": "<p>Professional software developers spend a significant amount oftime fixing builds, but this has received little attention as a prob-lem in automatic program repair. We present a new deep learningarchitecture, called Graph2Diff, for automatically localizing andfixing build errors. We represent source code, build configurationfiles, and compiler diagnostic messages as a graph, and then use aGraph Neural Network model to predict a diff. A diff specifies howto modify the code’s abstract syntax tree, represented in the neuralnetwork as a sequence of tokens and of pointers to code locations.Our network is an instance of a more general abstraction which wecall Graph2Tocopo, which is potentially useful in any developmenttool for predicting source code changes. We evaluate the model ona dataset of over 500k real build errors and their resolutions fromprofessional developers. Compared to the approach of DeepDelta, our approach tackles the harder task of predicting a moreprecise diff but still achieves over double the accuracy.</p>\n", "tags": ["edit","repair"] }, |
420 | 421 | {"key": "theeten2019import2vec", "year": "2019", "title":"Import2vec - Learning Embeddings for Software Libraries", "abstract": "<p>We consider the problem of developing suitable learning representations (embeddings) for library packages that capture semantic similarity among libraries. Such representations are known to improve the performance of downstream learning tasks (e.g. classification) or applications such as contextual search and analogical reasoning.</p>\n\n<p>We apply word embedding techniques from natural language processing (NLP) to train embeddings for library packages (“library vectors”). Library vectors represent libraries by similar context of use as determined by import statements present in source code. Experimental results obtained from training such embeddings on three large open source software corpora reveals that library vectors capture semantically meaningful relationships among software libraries, such as the relationship between frameworks and their plug-ins and libraries commonly used together within ecosystems such as big data infrastructure projects (in Java), front-end and back-end web development frameworks (in JavaScript) and data science toolkits (in Python).</p>\n", "tags": ["representation"] }, |
421 | 422 | {"key": "tian2020evaluating", "year": "2020", "title":"Evaluating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair", "abstract": "<p>A large body of the literature of automated program repair develops approaches where patches are generated to be validated against an oracle (e.g., a test suite). Because such an oracle can be imperfect, the generated patches, although validated by the oracle, may actually be incorrect. While the state of the art explore research directions that require dynamic information or rely on manually-crafted heuristics, we study the benefit of learning code representations to learn deep features that may encode the properties of patch correctness. Our work mainly investigates different representation learning approaches for code changes to derive embeddings that are amenable to similarity computations. We report on findings based on embeddings produced by pre-trained and re-trained neural networks. Experimental results demonstrate the potential of embeddings to empower learning algorithms in reasoning about patch correctness: a machine learning predictor with BERT transformer-based embeddings associated with logistic regression yielded an AUC value of about 0.8 in predicting patch correctness on a deduplicated dataset of 1000 labeled patches. Our study shows that learned representations can lead to reasonable performance when comparing against the state-of-the-art, PATCH-SIM, which relies on dynamic information. These representations may further be complementary to features that were carefully (manually) engineered in the literature.</p>\n", "tags": ["repair","Transformer"] }, |
|
0 commit comments