{"key": "schuster2021you", "year": "2021", "title":"You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion", "abstract": "<p>Code autocompletion is an integral feature of modern code editors and IDEs. The latest generation of autocompleters uses neural language models, trained on public open-source code repositories, to suggest likely (not just statically feasible) completions given the current context.</p>\n\n<p>We demonstrate that neural code autocompleters are vulnerable to poisoning attacks. By adding a few specially-crafted files to the autocompleter’s training corpus (data poisoning), or else by directly fine-tuning the autocompleter on these files (model poisoning), the attacker can influence its suggestions for attacker-chosen contexts. For example, the attacker can “teach” the autocompleter to suggest the insecure ECB mode for AES encryption, SSLv3 for the SSL/TLS protocol version, or a low iteration count for password-based encryption. Moreover, we show that these attacks can be targeted: an autocompleter poisoned by a targeted attack is much more likely to suggest the insecure completion for files from a specific repo or specific developer.</p>\n\n<p>We quantify the efficacy of targeted and untargeted data- and model-poisoning attacks against state-of-the-art autocompleters based on Pythia and GPT-2. We then evaluate existing defenses against poisoning attacks and show that they are largely ineffective.</p>\n", "tags": ["autocomplete","adversarial"] },
{"key": "sharma2015nirmal", "year": "2015", "title":"NIRMAL: Automatic Identification of Software Relevant Tweets Leveraging Language Model", "abstract": "<p>Twitter is one of the most widely used social media\nplatforms today. It enables users to share and view short 140-character messages called “tweets”. About 284 million active\nusers generate close to 500 million tweets per day. Such rapid\ngeneration of user generated content in large magnitudes results\nin the problem of information overload. Users who are interested\nin information related to a particular domain have limited means\nto filter out irrelevant tweets and tend to get lost in the huge\namount of data they encounter. A recent study by Singer et\nal. found that software developers use Twitter to stay aware of\nindustry trends, to learn from others, and to network with other\ndevelopers. However, Singer et al. also reported that developers\noften find Twitter streams to contain too much noise which is a\nbarrier to the adoption of Twitter. In this paper, to help developers\ncope with noise, we propose a novel approach named NIRMAL,\nwhich automatically identifies software relevant tweets from a\ncollection or stream of tweets. Our approach is based on language\nmodeling which learns a statistical model based on a training\ncorpus (i.e., set of documents). We make use of a subset of posts\nfrom StackOverflow, a programming question and answer site, as\na training corpus to learn a language model. A corpus of tweets\nwas then used to test the effectiveness of the trained language\nmodel. The tweets were sorted based on the rank the model\nassigned to each of the individual tweets. The top 200 tweets\nwere then manually analyzed to verify whether they are software\nrelated or not, and then an accuracy score was calculated. The\nresults show that decent accuracy scores can be achieved by\nvarious variants of NIRMAL, which indicates that NIRMAL can\neffectively identify software related tweets from a huge corpus of\ntweets.</p>\n", "tags": ["information extraction"] },
{"key": "sharma2019feasibility", "year": "2019", "title":"On the Feasibility of Transfer-learning Code Smells using Deep Learning", "abstract": "<p><strong>Context</strong>: A substantial amount of work has been done to detect smells in source code using metrics-based and heuristics-based methods. Machine learning methods have been recently applied to detect source code smells; however, the current practices are considered far from mature.</p>\n\n<p><strong>Objective</strong>: First, explore the feasibility of applying deep learning models to detect smells without extensive feature engineering, just by feeding the source code in tokenized form. Second, investigate the possibility of applying transfer-learning in the context of deep learning models for smell detection.</p>\n\n<p><strong>Method</strong>: We use existing metric-based state-of-the-art methods for detecting three implementation smells and one design smell in C# code. Using these results as the annotated gold standard, we train smell detection models on three different deep learning architectures. These architectures use Convolution Neural Networks (CNNs) of one or two dimensions, or Recurrent Neural Networks (RNNs) as their principal hidden layers. For the first objective of our study, we perform training and evaluation on C# samples, whereas for the second objective, we train the models from C# code and evaluate the models over Java code samples. We perform the experiments with various combinations of hyper-parameters for each model.</p>\n\n<p><strong>Results</strong>: We find it feasible to detect smells using deep learning methods. Our comparative experiments find that there is no clearly superior method between CNN-1D and CNN-2D. We also observe that performance of the deep learning models is smell-specific. Our transfer-learning experiments show that transfer-learning is definitely feasible for implementation smells with performance comparable to that of direct-learning. This work opens up a new paradigm to detect code smells by transfer-learning especially for the programming languages where the comprehensive code smell detection tools are not available.</p>\n", "tags": ["representation","program analysis"] },
{"key": "sharma2022exploratory", "year": "2022", "title":"An Exploratory Study on Code Attention in BERT", "abstract": "<p>Many recent models in software engineering introduced deep neural models based on the Transformer architecture or use transformer-based Pre-trained Language Models (PLM) trained on code. Although these models achieve state-of-the-art results in many downstream tasks such as code summarization and bug detection, they are based on Transformer and PLM, which are mainly studied in the Natural Language Processing (NLP) field. The current studies rely on the reasoning and practices from NLP for these models in code, despite the differences between natural languages and programming languages. There is also limited literature on explaining how code is modeled. Here, we investigate the attention behavior of PLM on code and compare it with natural language. We pre-trained BERT, a Transformer based PLM, on code and explored what kind of information it learns, both semantic and syntactic. We run several experiments to analyze the attention values of code constructs on each other and what BERT learns in each layer. Our analyses show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the most attended token [CLS] in NLP. This observation motivated us to leverage identifiers to represent the code sequence instead of the [CLS] token when used for code clone detection. Our results show that employing embeddings from identifiers increases the performance of BERT by 605% and 4% F1-score in its lower layers and the upper layers, respectively. When identifiers’ embeddings are used in CodeBERT, a code-based PLM, the performance is improved by 21–24% in the F1-score of clone detection. The findings can benefit the research community by using code-specific representations instead of applying the common embeddings used in NLP, and open new directions for developing smaller models with similar performance.</p>\n\n", "tags": ["Transformer","representation","language model","interpretability","pretraining","clone"] },
{"key": "sharma2022lamner", "year": "2022", "title":"LAMNER: Code Comment Generation Using Character Language Model and Named Entity Recognition", "abstract": "<p>Code comment generation is the task of generating a high-level natural language description for a given code method/function. Although researchers have been studying multiple ways to generate code comments automatically, previous work mainly considers representing a code token in its entirety semantics form only (e.g., a language model is used to learn the semantics of a code token), and additional code properties such as the tree structure of a code are included as an auxiliary input to the model. There are two limitations: 1) Learning the code token in its entirety form may not be able to capture information succinctly in source code, and 2) The code token does not contain additional syntactic information, inherently important in programming languages. In this paper, we present LAnguage Model and Named Entity Recognition (LAMNER), a code comment generator capable of encoding code constructs effectively and capturing the structural property of a code token. A character-level language model is used to learn the semantic representation to encode a code token. For the structural property of a token, a Named Entity Recognition model is trained to learn the different types of code tokens. These representations are then fed into an encoder-decoder architecture to generate code comments. We evaluate the generated comments from LAMNER and other baselines on a popular Java dataset with four commonly used metrics. Our results show that LAMNER is effective and improves over the best baseline model in BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr by 14.34%, 18.98%, 21.55%, 23.00%, 10.52%, 1.44%, and 25.86%, respectively. Additionally, we fused LAMNER’s code representation with the baseline models, and the fused models consistently showed improvement over the non-fused models. The human evaluation further shows that LAMNER produces high-quality code comments.</p>\n\n", "tags": ["summarization","documentation","language model","types","representation"] },
{"key": "she2019neuzz", "year": "2019", "title":"NEUZZ: Efficient Fuzzing with Neural Program Smoothing", "abstract": "<p>Fuzzing has become the de facto standard technique for finding software vulnerabilities. However, even state-of-the-art fuzzers are not very efficient at finding hard-to-trigger software bugs. Most popular fuzzers use evolutionary guidance to generate inputs that can trigger different bugs. Such evolutionary algorithms, while fast and simple to implement, often get stuck in fruitless sequences of random mutations. Gradient-guided optimization presents a promising alternative to evolutionary guidance. Gradient-guided techniques have been shown to significantly outperform evolutionary algorithms at solving high-dimensional structured optimization problems in domains like machine learning by efficiently utilizing gradients or higher-order derivatives of the underlying function. However, gradient-guided approaches are not directly applicable to fuzzing as real-world program behaviors contain many discontinuities, plateaus, and ridges where the gradient-based methods often get stuck. We observe that this problem can be addressed by creating a smooth surrogate function approximating the discrete branching behavior of the target program. In this paper, we propose a novel program smoothing technique using surrogate neural network models that can incrementally learn smooth approximations of a complex, real-world program’s branching behaviors. We further demonstrate that such neural network models can be used together with gradient-guided input generation schemes to significantly improve the fuzzing efficiency. Our extensive evaluations demonstrate that NEUZZ significantly outperforms 10 state-of-the-art graybox fuzzers on 10 real-world programs both at finding new bugs and achieving higher edge coverage. NEUZZ found 31 unknown bugs that other fuzzers failed to find in 10 real world programs and achieved 3X more edge coverage than all of the tested graybox fuzzers for 24 hours running.</p>\n", "tags": ["fuzzing"] },
{"key": "shi2019learning", "year": "2019", "title":"Learning Execution through Neural Code Fusion", "abstract": "<p>As the performance of computer systems stagnates due to the end of Moore’s Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn representations of source code, these representations do not understand how code dynamically executes. In this work, we propose a new approach to use GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and complex data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related task (algorithm classification).</p>\n", "tags": ["representation"] },
{"key": "shi2022cv4code", "year": "2022", "title":"CV4Code: Sourcecode Understanding via Visual Code Representations", "abstract": "<p>We present CV4Code, a compact and effective computer vision method for sourcecode understanding. Our method leverages the contextual and the structural information available from the code snippet by treating each snippet as a two-dimensional image, which naturally encodes the context and retains the underlying structural information through an explicit spatial representation. To codify snippets as images, we propose an ASCII codepoint-based image representation that facilitates fast generation of sourcecode images and eliminates redundancy in the encoding that would arise from an RGB pixel representation. Furthermore, as sourcecode is treated as images, neither lexical analysis (tokenisation) nor syntax tree parsing is required, which makes the proposed method agnostic to any particular programming language and lightweight from the application pipeline point of view. CV4Code can even featurise syntactically incorrect code which is not possible from methods that depend on the Abstract Syntax Tree (AST). We demonstrate the effectiveness of CV4Code by learning Convolutional and Transformer networks to predict the functional task, i.e. the problem it solves, of the source code directly from its two-dimensional representation, and using an embedding from its latent space to derive a similarity score of two code snippets in a retrieval setup. Experimental results show that our approach achieves state-of-the-art performance in comparison to other methods with the same task and data configurations. For the first time we show the benefits of treating sourcecode understanding as a form of image processing task.</p>\n", "tags": ["code similarity","Transformer"] },