72 | 72 | {"key": "cai2020tag", "year": "2020", "title":"TAG : Type Auxiliary Guiding for Code Comment Generation", "abstract": "<p>Existing leading code comment generation approaches with the structure-to-sequence framework ignores the type information of the interpretation of the code, e.g., operator, string, etc. However, introducing the type information into the existing framework is non-trivial due to the hierarchical dependence among the type information. In order to address the issues above, we propose a Type Auxiliary Guiding encoder-decoder framework for the code comment generation task which considers the source code as an N-ary tree with type information associated with each node. Specifically, our framework is featured with a Type-associated Encoder and a Type-restricted Decoder which enables adaptive summarization of the source code. We further propose a hierarchical reinforcement learning method to resolve the training difficulties of our proposed framework. Extensive evaluations demonstrate the state-of-the-art performance of our framework with both the auto-evaluated metrics and case studies.</p>\n", "tags": ["bimodal","documentation"] }, |
73 | 73 | {"key": "cambronero2019deep", "year": "2019", "title":"When Deep Learning Met Code Search", "abstract": "<p>There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of embedding code and natural language queries, into real vectors and then using vector distance to approximate semantic correlation between code and the query. Multiple approaches exist for learning these embeddings, including unsupervised techniques, which rely only on a corpus of code examples, and supervised techniques, which use an aligned corpus of paired code and natural language descriptions. The goal of this supervision is to produce embeddings that are more similar for a query and the corresponding desired code snippet.</p>\n\n<p>Clearly, there are choices in whether to use supervised techniques at all, and if one does, what sort of network and training to use for supervision. This paper is the first to evaluate these choices systematically. To this end, we assembled implementations of state-of-the-art techniques to run on a common platform, training and evaluation corpora. To explore the design space in network complexity, we also introduced a new design point that is a minimal supervision extension to an existing unsupervised technique.</p>\n\n<p>Our evaluation shows that: 1. adding supervision to an existing unsupervised technique can improve performance, though not necessarily by much; 2. simple networks for supervision can be more effective that more sophisticated sequence-based networks for code search; 3. while it is common to use docstrings to carry out supervision, there is a sizeable gap between the effectiveness of docstrings and a more query-appropriate supervision corpus.</p>\n", "tags": ["search"] }, |
74 | 74 | {"key": "campbell2014syntax", "year": "2014", "title":"Syntax Errors Just Aren’t Natural: Improving Error Reporting with Language Models", "abstract": "<p>A frustrating aspect of software development is that compiler error messages often fail to locate the actual cause of a syntax error. An errant semicolon or brace can result in\nmany errors reported throughout the file. We seek to find the actual source of these syntax errors by relying on the consistency of software: valid source code is usually repetitive and unsurprising. We exploit this consistency by constructing a simple N-gram language model of lexed source code tokens. We implemented an automatic Java syntax-error locator using the corpus of the project itself and evaluated its performance on mutated source code from several projects. Our tool, trained on the past versions of a project, can effectively augment the syntax error locations produced by the native compiler. Thus we provide a methodology and tool that exploits the naturalness of software source code to detect syntax errors alongside the parser.</p>\n", "tags": ["repair","language model"] }, |
| 75 | +{"key": "casey2024survey", "year": "2024", "title":"A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks", "abstract": "<p>Machine learning techniques for cybersecurity-related software engineering tasks are becoming increasingly popular. The representation of source code is a key portion of the technique that can impact the way the model is able to learn the features of the source code. With an increasing number of these techniques being developed, it is valuable to see the current state of the field to better understand what exists and what’s not there yet. This paper presents a study of these existing ML-based approaches and demonstrates what type of representations were used for different cybersecurity tasks and programming languages. Additionally, we study what types of models are used with different representations. We have found that graph-based representations are the most popular category of representation, and Tokenizers and Abstract Syntax Trees (ASTs) are the two most popular representations overall. We also found that the most popular cybersecurity task is vulnerability detection, and the language that is covered by the most techniques is C. Finally, we found that sequence-based models are the most popular category of models, and Support Vector Machines (SVMs) are the most popular model overall.</p>\n", "tags": ["survey","cybersecurity","vulnerability"] }, |
75 | 76 | {"key": "cerulo2013hidden", "year": "2013", "title":"A Hidden Markov Model to Detect Coded Information Islands in Free Text", "abstract": "<p>Emails and issue reports capture useful knowledge about development practices, bug fixing, and change activities. Extracting such a content is challenging, due to the mix-up of\nsource code and natural language, unstructured text.</p>\n\n<p>In this paper we introduce an approach, based on Hidden Markov Models (HMMs), to extract coded information islands, such as source code, stack traces, and patches, from free text at a token level of granularity. We train a HMM for each category of information contained in the text, and adopt the Viterbi algorithm to recognize whether the sequence of tokens — e.g., words, language keywords, numbers, parentheses, punctuation marks, etc. — observed in a text switches among those HMMs. Although our implementation focuses on extracting source code from emails, the approach could be easily extended to include in principle any text-interleaved language.</p>\n\n<p>We evaluated our approach with respect to the state of art on a set of development emails and bug reports drawn from the software repositories of well known open source systems. Results indicate an accuracy between 82% and 99%, which is in line with existing approaches which, differently from ours, require the manual definition of regular expressions or parsers.</p>\n\n", "tags": ["information extraction"] }, |
76 | 77 | {"key": "cerulo2015irish", "year": "2015", "title":"Irish: A Hidden Markov Model to detect coded information islands in free text", "abstract": "<p>Developers’ communication, as contained in emails, issue trackers, and forums, is a precious source of information to support the development process. For example, it can\nbe used to capture knowledge about development practice or about a software project itself. Thus, extracting the content of developers’ communication can be useful to support\nseveral software engineering tasks, such as program comprehension, source code analysis, and software analytics. However, automating the extraction process is challenging, due to the unstructured nature of free text, which mixes different coding languages (e.g., source code, stack dumps, and log traces) with natural language parts.</p>\n\n<p>We conduct an extensive evaluation of Irish (InfoRmation ISlands Hmm), an approach we proposed to extract islands of coded information from free text at token granularity, with respect to the state of art approaches based on island parsing or island parsing combined with machine learners. The evaluation considers a wide set of natural language documents (e.g., textbooks, forum discussions, and development emails) taken from different contexts and encompassing different coding languages. Results indicate an F-measure of Irish between 74% and 99%; this is in line with existing approaches which, differently from Irish, require specific expertise for the definition of regular expressions or grammars.</p>\n\n", "tags": ["information extraction"] }, |
77 | 78 | {"key": "chae2016automatically", "year": "2016", "title":"Automatically generating features for learning program analysis heuristics", "abstract": "<p>We present a technique for automatically generating features for data-driven program analyses. Recently data-driven approaches for building a program analysis have been proposed, which mine existing codebases and automatically learn heuristics for finding a cost-effective abstraction for a given analysis task. Such approaches reduce the burden of the analysis designers, but they do not remove it completely; they still leave the highly nontrivial task of designing so called features to the hands of the designers. Our technique automates this feature design process. The idea is to use programs as features after reducing and abstracting them. Our technique goes through selected program-query pairs in codebases, and it reduces and abstracts the program in each pair to a few lines of code, while ensuring that the analysis behaves similarly for the original and the new programs with respect to the query. Each reduced program serves as a boolean feature for program-query pairs. This feature evaluates to true for a given program-query pair when (as a program) it is included in the program part of the pair. We have implemented our approach for three real-world program analyses. Our experimental evaluation shows that these analyses with automatically-generated features perform comparably to those with manually crafted features.</p>\n", "tags": ["representation"] }, |
|