{"key": "lee2020montage", "year": "2020", "title":"Montage: A Neural Network Language Model-Guided JavaScript Engine Fuzzer", "abstract": "<p>JavaScript (JS) engine vulnerabilities pose significant security threats affecting billions of web browsers. While fuzzing is a prevalent technique for finding such vulnerabilities, there have been few studies that leverage the recent advances in neural network language models (NNLMs). In this paper, we present Montage, the first NNLM-guided fuzzer for finding JS engine vulnerabilities. The key aspect of our technique is to transform a JS abstract syntax tree (AST) into a sequence of AST subtrees that can directly train prevailing NNLMs. We demonstrate that Montage is capable of generating valid JS tests, and show that it outperforms previous studies in terms of finding vulnerabilities. Montage found 37 real-world bugs, including three CVEs, in the latest JS engines, demonstrating its efficacy in finding JS engine bugs.</p>\n", "tags": ["fuzzing","language model"] },
{"key": "lee2021cotraining", "year": "2021", "title":"Co-Training for Commit Classification", "abstract": "<p>Commits in version control systems (e.g. Git) track changes in a software project. Commits comprise noisy user-generated natural language and code patches. Automatic commit classification (CC) has been used to determine the type of code maintenance activities performed, as well as to detect bug fixes in code repositories. Much prior work occurs in the fully-supervised setting – a setting that can be a stretch in resource-scarce situations presenting difficulties in labeling commits. In this paper, we apply co-training, a semi-supervised learning method, to take advantage of the two views available – the commit message (natural language) and the code changes (programming language) – to improve commit classification.</p>\n", "tags": ["Transformer","bimodal","defect"] },
{"key": "levy2017learning", "year": "2017", "title":"Learning to Align the Source Code to the Compiled Object Code", "abstract": "<p>We propose a new neural network architecture\nand use it for the task of statement-by-statement\nalignment of source code and its compiled object code. Our architecture learns the alignment\nbetween the two sequences – one being the translation of the other – by mapping each statement\nto a context-dependent representation vector and\naligning such vectors using a grid of the two sequence domains. Our experiments include short\nC functions, both artificial and human-written,\nand show that our neural network architecture\nis able to predict the alignment with high accuracy, outperforming known baselines. We also\ndemonstrate that our model is general and can\nlearn to solve graph problems such as the Traveling Salesman Problem.</p>\n", "tags": ["decompilation"] },
{"key": "lherondelle2022topical", "year": "2022", "title":"Topical: Learning Repository Embeddings from Source Code using Attention", "abstract": "<p>Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationship between software artefacts, MLOnCode\naugments the software developer’s capabilities with code autogeneration, code recommendation, code auto-tagging and other data-driven enhancements. For many of these tasks a script level\nrepresentation of code is sufficient, however, in many cases a repository level representation that takes into account various dependencies and repository structure is imperative, for example,\nauto-tagging repositories with topics or auto-documentation of repository code etc. Existing methods for computing repository level representations suffer from (a) reliance on natural language\ndocumentation of code (for example, README files) (b) naive aggregation of method/script-level representation, for example, by concatenation or averaging. This paper introduces Topical, a\ndeep neural network to generate repository level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph and the\nscript level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, on a dataset of publicly available GitHub repositories that\nwere crawled along with their ground truth topic tags. Our experiments show that the embeddings computed by Topical are able to outperform multiple baselines, including baselines\nthat naively combine the method-level representations through averaging or concatenation at the task of repository auto-tagging.\nFurthermore, we show that Topical’s attention mechanism outperforms naive aggregation methods when computing repository-level representations from script-level representation generated\nby existing methods. Topical is a lightweight framework for computing repository-level representation of code repositories that scales efficiently with the number of topics and dataset size.</p>\n", "tags": ["representation","topic modelling"] },
{"key": "li2016gated", "year": "2016", "title":"Gated Graph Sequence Neural Networks", "abstract": "<p>Graph-structured data appears frequently in domains including chemistry, natural\nlanguage semantics, social networks, and knowledge bases. In this work, we study\nfeature learning techniques for graph-structured inputs. Our starting point is previous work on Graph Neural Networks (Scarselli et al., 2009), which we modify\nto use gated recurrent units and modern optimization techniques and then extend\nto output sequences. The result is a flexible and broadly useful class of neural network models that has favorable inductive biases relative to purely sequence-based\nmodels (e.g., LSTMs) when the problem is graph-structured. We demonstrate the\ncapabilities on some simple AI (bAbI) and graph algorithm learning tasks. We\nthen show it achieves state-of-the-art performance on a problem from program\nverification, in which subgraphs need to be described as abstract data structures.</p>\n\n", "tags": ["GNN","program analysis"] },
{"key": "li2017code", "year": "2017", "title":"Code Completion with Neural Attention and Pointer Networks", "abstract": "<p>Intelligent code completion has become an essential tool to accelerate modern software development. To facilitate effective code completion for dynamically-typed programming languages, we apply neural language models by learning from large codebases, and investigate the effectiveness of attention mechanism on the code completion task. However, standard neural language models even with attention mechanism cannot correctly predict out-of-vocabulary (OoV) words thus restrict the code completion performance. In this paper, inspired by the prevalence of locally repeated terms in program source code, and the recently proposed pointer networks which can reproduce words from local context, we propose a pointer mixture network for better predicting OoV words in code completion. Based on the context, the pointer mixture network learns to either generate a within-vocabulary word through an RNN component, or copy an OoV word from local context through a pointer component. Experiments on two benchmarked datasets demonstrate the effectiveness of our attention mechanism and pointer mixture network on the code completion task.</p>\n\n", "tags": ["language model","autocomplete"] },
{"key": "li2017software", "year": "2017", "title":"Software Defect Prediction via Convolutional Neural Network", "abstract": "<p>To improve software reliability, software defect prediction is utilized to assist developers in finding potential bugs\nand allocating their testing efforts. Traditional defect prediction\nstudies mainly focus on designing hand-crafted features, which\nare input into machine learning classifiers to identify defective\ncode. However, these hand-crafted features often fail to capture\nthe semantic and structural information of programs. Such\ninformation is important in modeling program functionality and\ncan lead to more accurate defect prediction.\nIn this paper, we propose a framework called Defect Prediction\nvia Convolutional Neural Network (DP-CNN), which leverages\ndeep learning for effective feature generation. Specifically, based\non the programs’ Abstract Syntax Trees (ASTs), we first extract\ntoken vectors, which are then encoded as numerical vectors\nvia mapping and word embedding. We feed the numerical\nvectors into Convolutional Neural Network to automatically\nlearn semantic and structural features of programs. After that,\nwe combine the learned features with traditional hand-crafted\nfeatures, for accurate software defect prediction. We evaluate our\nmethod on seven open source projects in terms of F-measure in\ndefect prediction. The experimental results show that on average,\nDP-CNN improves the state-of-the-art method by 12%.</p>\n\n", "tags": ["defect"] },