{"key": "edelmann2019neural", "year": "2019", "title":"Neural-Network Guided Expression Transformation", "abstract": "<p>Optimizing compilers, as well as other translator systems, often work by rewriting expressions according to equivalence-preserving rules. Given an input expression and its optimized form, finding the sequence of rules that were applied is a non-trivial task. Most of the time, the tools provide no proof of any kind of the equivalence between the original expression and its optimized form. In this work, we propose to reconstruct proofs of equivalence of simple mathematical expressions, after the fact, by finding paths of equivalence-preserving transformations between expressions. We propose to find those sequences of transformations using a search algorithm, guided by a neural network heuristic. Using a Tree-LSTM recursive neural network, we learn a distributed representation of expressions where the Manhattan distance between vectors approximately corresponds to the rewrite distance between expressions. We then show how the neural network can be used efficiently to search for transformation paths, leading to substantial gains in speed compared to an uninformed exhaustive search. In one of our experiments, our neural-network guided search algorithm is able to solve more instances with a 2-second timeout per instance than breadth-first search does with a 5-minute timeout per instance.</p>\n", "tags": ["optimization","grammar"] },
{"key": "ederhardt2019unsupervised", "year": "2019", "title":"Unsupervised Learning of API Aliasing Specifications", "abstract": "<p>Real-world applications make heavy use of powerful libraries\nand frameworks, posing a significant challenge for static analysis,\nas the library implementation may be very complex or unavailable.\nThus, obtaining specifications that summarize the behaviors of\nthe library is important, as it enables static analyzers to precisely\ntrack the effects of APIs on the client program without requiring\nthe actual API implementation.</p>\n\n<p>In this work, we propose a novel method\nfor discovering aliasing specifications of APIs by learning from a large\ndataset of programs. Unlike prior work, our method does not require\nmanual annotation, access to the library’s source code, or the ability to\nrun its APIs. Instead, it learns specifications in a fully unsupervised manner,\nby statically observing usages of APIs in the dataset. The core idea is to\nlearn a probabilistic model of interactions between API methods and aliasing\nobjects, enabling identification of additional likely aliasing relations,\nand to then infer aliasing specifications of APIs that explain these relations.\nThe learned specifications are then used to augment an API-aware points-to analysis.</p>\n\n<p>We implemented our approach in a tool called USpec and used it to automatically\nlearn aliasing specifications from millions of source code files.\nUSpec learned over 2000 specifications of various Java and Python APIs, in the process\nimproving the results of the points-to analysis and its clients.</p>\n", "tags": ["API","program analysis"] },
{"key": "efstathiou2019semantic", "year": "2019", "title":"Semantic Source Code Models Using Identifier Embeddings", "abstract": "<p>The emergence of online open source repositories in recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As an effect, in line with recent advances in machine learning research, software maintenance activities are switching from symbolic formal methods to data-driven methods. In this context, the rich semantics hidden in source code identifiers provide opportunities for building semantic representations of code which can assist tasks of code search and reuse. To this end, we deliver, in the form of pretrained vector space models, distributed code representations for six popular programming languages, namely, Java, Python, PHP, C, C++, and C#. The models are produced using fastText, a state-of-the-art library for learning word representations. Each model is trained on data from a single programming language; the code mined for producing all models amounts to over 13,000 repositories. We indicate dissimilarities between natural language and source code, as well as variations in coding conventions between the different programming languages we processed. We describe how these heterogeneities guided the data preprocessing decisions we took and the selection of the training parameters in the released models. Finally, we propose potential applications of the models and discuss their limitations.</p>\n", "tags": ["representation"] },
{"key": "eghbali2022crystalbleu", "year": "2022", "title":"CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code", "abstract": "<p>Recent years have brought a surge of work on predicting pieces\nof source code, e.g., for code completion, code migration, program\nrepair, or translating natural language into code. All this work faces\nthe challenge of evaluating the quality of a prediction w.r.t. some\noracle, typically in the form of a reference solution. A common\nevaluation metric is the BLEU score, an n-gram-based metric originally proposed for evaluating natural language translation, but\nadopted in software engineering because it can be easily computed\non any programming language and enables automated evaluation at\nscale. However, a key difference between natural and programming\nlanguages is that in the latter, completely unrelated pieces of code\nmay have many common n-grams simply because of the syntactic\nverbosity and coding conventions of programming languages. We\nobserve that these trivially shared n-grams hamper the ability of\nthe metric to distinguish between truly similar code examples and\ncode examples that are merely written in the same language. This\npaper presents CrystalBLEU, an evaluation metric based on BLEU,\nthat allows for precisely and efficiently measuring the similarity of\ncode. Our metric preserves the desirable properties of BLEU, such\nas being language-agnostic, able to handle incomplete or partially\nincorrect code, and efficient, while reducing the noise caused by\ntrivially shared n-grams. We evaluate CrystalBLEU on two datasets\nfrom prior work and on a new, labeled dataset of semantically equivalent programs. Our results show that CrystalBLEU can distinguish\nsimilar from dissimilar code examples 1.9–4.5 times more effectively, when compared to the original BLEU score and a previously\nproposed variant of BLEU for code.</p>\n", "tags": ["evaluation"] },
{"key": "ellis2021dreamcoder", "year": "2021", "title":"DreamCoder: bootstrapping inductive program synthesis with wake-sleep library learning", "abstract": "<p>We present a system for inductive program synthesis called DreamCoder, which inputs a corpus of synthesis problems each specified by one or a few examples, and automatically derives a library of program components and a neural search policy that can be used to efficiently solve other similar synthesis problems. The library and search policy bootstrap each other iteratively through a variant of “wake-sleep” approximate Bayesian learning. A new refactoring algorithm based on E-graph matching identifies common sub-components across synthesized programs, building a progressively deepening library of abstractions capturing the structure of the input domain. We evaluate on eight domains including classic program synthesis areas and AI tasks such as planning, inverse graphics, and equation discovery. We show that jointly learning the library and neural search policy leads to solving more problems, and solving them more quickly.</p>\n", "tags": ["synthesis","search"] },
{"key": "elnaggar2021codetrans", "year": "2021", "title":"CodeTrans: Towards Cracking the Language of Silicon's Code Through Self-Supervised Deep Learning and High Performance Computing", "abstract": "<p>Currently, a growing number of mature natural language processing applications make people’s life more convenient. Such applications are built with source code - the language in software engineering. However, the applications for understanding source code language to ease the software engineering process are under-researched. Simultaneously, the transformer model, especially its combination with transfer learning, has been proven to be a powerful technique for natural language processing tasks. These breakthroughs point out a promising direction for processing source code and cracking software engineering tasks. This paper describes CodeTrans - an encoder-decoder transformer model for tasks in the software engineering domain, that explores the effectiveness of encoder-decoder transformer models for six software engineering tasks, including thirteen sub-tasks. Moreover, we have investigated the effect of different training strategies, including single-task learning, transfer learning, multi-task learning, and multi-task learning with fine-tuning. CodeTrans outperforms the state-of-the-art models on all the tasks. To expedite future works in the software engineering domain, we have published our pre-trained models of CodeTrans.</p>\n", "tags": ["Transformer"] },
{"key": "feng2020codebert", "year": "2020", "title":"CodeBERT: A Pre-Trained Model for Programming and Natural Languages", "abstract": "<p>We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with a Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.</p>\n", "tags": ["pretraining"] },