6 | 6 | {"key": "ahmad2021unified", "year": "2021", "title":"Unified Pre-training for Program Understanding and Generation", "abstract": "<p>Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation avails the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on language generation tasks, including code summarization, generation, translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection demonstrate PLBART’s effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming convention), logical flow (e.g., if block inside an else block is equivalent to else if block) that are crucial to program semantics and thus excels even with limited annotations.</p>\n", "tags": ["pretraining","Transformer"] }, |
7 | 7 | {"key": "ahmed2019learning", "year": "2019", "title":"Learning Lenient Parsing & Typing via Indirect Supervision", "abstract": "<p>Both professional coders and teachers frequently deal with imperfect (fragmentary, incomplete, ill-formed) code. Such fragments are common in StackOverflow; students also frequently produce ill-formed code, for which instructors, TAs (or students themselves) must find repairs. In either case, the developer experience could be greatly improved if such code could somehow be parsed & typed; this makes them more amenable to use within IDEs and allows early detection and repair of potential errors. We introduce a lenient parser, which can parse & type fragments, even ones with simple errors. Training a machine learner to leniently parse & type imperfect code requires a large training set of pairs of imperfect code and its repair (and/or type information); such training sets are limited by human effort and curation. In this paper, we present a novel indirectly supervised approach to train a lenient parser, without access to such human-curated training data. We leverage the huge corpus of mostly correct code available on Github, and the massive, efficient learning capacity of Transformer-based NN architectures. Using GitHub data, we first create a large dataset of fragments of code and corresponding tree fragments and type annotations; we then randomly corrupt the input fragments (while requiring correct output) by seeding errors that mimic corruptions found in StackOverflow and student data. Using this data, we train high-capacity transformer models to overcome both fragmentation and corruption. With this novel approach, we can achieve reasonable performance on parsing & typing StackOverflow fragments; we also demonstrate that our approach achieves best-in-class performance on a large dataset of student errors.</p>\n", "tags": ["types"] }, |
8 | 8 | {"key": "ahmed2022learning", "year": "2022", "title":"Learning code summarization from a small and local dataset", "abstract": "<p>Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of labeled examples, typically drawn from many projects. However, software phenomena can be very project-specific. Vocabulary, and other phenomena vary substantially with each project. Thus, training on project-specific data, and testing on the same project, is a promising idea. This hypothesis has to be evaluated carefully, e.g., in a time-series setting, to prevent training-test leakage. We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient (and thus prima facie well-suited for learning in a limited-sample same-project setting) and a maximalist hybrid approach, fine-tuning first on many projects in many languages and then training on the same-project. We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art, on many different projects in both Java and Python.</p>\n", "tags": ["Transformer","summarization"] }, |
| 9 | +{"key": "ahmed2033improving", "year": "2023", "title":"Improving Few-Shot Prompts with Relevant Static Analysis Products", "abstract": "<p>Large Language Models (LLM) are a new class of computation engines, “programmed” via prompt engineering. We are still learning how to best “program” these LLMs to help developers. We start with the intuition that developers tend to consciously and unconsciously have a collection of semantics facts in mind when working on coding tasks. Mostly these are shallow, simple facts arising from a quick read. For a function, examples of facts might include parameter and local variable names, return expressions, simple pre- and post-conditions, and basic control and data flow, etc.</p>\n\n<p>One might assume that the powerful multi-layer architecture of transformer-style LLMs makes them inherently capable of doing this simple level of “code analysis” and extracting such information, implicitly, while processing code: but are they, really? If they aren’t, could explicitly adding this information help? Our goal here is to investigate this question, using the code summarization task and evaluate whether automatically augmenting an LLM’s prompt with semantic facts explicitly, actually helps.</p>\n\n<p>Prior work shows that LLM performance on code summarization benefits from few-shot samples drawn either from the same-project or from examples found via information retrieval methods (such as BM25). While summarization performance has steadily increased since the early days, there is still room for improvement: LLM performance on code summarization still lags its performance on natural-language tasks like translation and text summarization.</p>\n\n<p>We find that adding semantic facts actually does help! This approach improves performance in several different settings suggested by prior work, including for two different Large Language Models. In most cases, improvement nears or exceeds 2 BLEU; for the PHP language in the challenging CodeSearchNet dataset, this augmentation actually yields performance surpassing 30 BLEU.</p>\n", "tags": ["summarization","Transformer"] }, |
9 | 10 | {"key": "alet2021largescale", "year": "2021", "title":"A large-scale benchmark for few-shot program induction and synthesis", "abstract": "<p>A landmark challenge for AI is to learn flexible, powerful representations from small numbers of examples. \nOn an important class of tasks, hypotheses in the form of programs provide extreme generalization capabilities from surprisingly few examples. However, whereas large natural few-shot learning image benchmarks have spurred progress in meta-learning for deep networks, there is no comparably big, natural program-synthesis dataset that can play a similar role. This is because, whereas images are relatively easy to label from internet meta-data or annotated by non-experts, generating meaningful input-output examples for program induction has proven hard to scale. In this work, we propose a new way of leveraging unit tests and natural inputs for small programs as meaningful input-output examples for each sub-program of the overall program. This allows us to create a large-scale naturalistic few-shot program-induction benchmark and propose new challenges in this domain. The evaluation of multiple program induction and synthesis algorithms points to shortcomings of current methods and suggests multiple avenues for future work.</p>\n", "tags": ["dataset","synthesis"] }, |
10 | 11 | {"key": "allal2022santacoder", "year": "2022", "title":"SantaCoder: don’t reach for the stars!", "abstract": "<p>The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code.1 This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII)\nredaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java,\nJavaScript, and Python subsets of The Stack (Kocetkov et al., 2022) and\nevaluate the models on MultiPL-E (Cassano et al., 2022), a text2code\nbenchmark available in 18 programming languages. We find that more\naggressive filtering of near-duplicates can further boost performance and,\nsurprisingly, that selecting files from repositories with 5+ GitHub stars\ndeteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and\nCodeGen-Multi-2.7B) in both left-to-right generation and infilling on the\nJava, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL\nlicense at https://hf.co/bigcode</p>\n", "tags": ["Transformer"] }, |
11 | 12 | {"key": "allamanis2013mining", "year": "2013", "title":"Mining Source Code Repositories at Massive Scale Using Language Modeling ", "abstract": "<p>The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle et al. The giga-token model is significantly better at the code suggestion task than previous models. More broadly, our approach provides a new “lens” for analyzing software projects, enabling new complexity metrics based on statistical analysis of large corpora. We call these metrics data-driven complexity metrics. We propose new metrics that measure the complexity of a code module and the topical centrality of a module to a software project. In particular, it is possible to distinguish reusable utility classes from classes that are part of a program’s core logic based solely on general information theoretic criteria.</p>\n", "tags": ["language model"] }, |
|