
Commit 9988cc0
deploy: 2fdcb27
1 parent: c2b0ed0

31 files changed: +4408 −4131 lines

paper-abstracts.json

Lines changed: 1 addition & 0 deletions
@@ -80,6 +80,7 @@
80 80
{"key": "chibotaru2019scalable", "year": "2019", "title":"Scalable Taint Specification Inference with Big Code", "abstract": "<p>We present a new scalable, semi-supervised method for inferring\ntaint analysis specifications by learning from a large dataset of programs.\nTaint specifications capture the role of library APIs (source, sink, sanitizer)\nand are a critical ingredient of any taint analyzer that aims to detect\nsecurity violations based on information flow.</p>\n\n<p>The core idea of our method\nis to formulate the taint specification learning problem as a linear\noptimization task over a large set of information flow constraints.\nThe resulting constraint system can then be efficiently solved with\nstate-of-the-art solvers. Thanks to its scalability, our method can infer\nmany new and interesting taint specifications by simultaneously learning from\na large dataset of programs (e.g., as found on GitHub), while requiring \nfew manual annotations.</p>\n\n<p>We implemented our method in an end-to-end system,\ncalled Seldon, targeting Python, a language where static specification\ninference is particularly hard due to lack of typing information.\nWe show that Seldon is practically effective: it learned almost 7,000 API\nroles from over 210,000 candidate APIs with very little supervision\n(less than 300 annotations) and with high estimated precision (67%).\nFurther,using the learned specifications, our taint analyzer flagged more than\n20,000 violations in open source projects, 97% of which were\nundetectable without the inferred specifications.</p>\n", "tags": ["defect","program analysis"] },
81 81
{"key": "chirkova2020empirical", "year": "2020", "title":"Empirical Study of Transformers for Source Code", "abstract": "<p>Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i. e. follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and all consider different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model.</p>\n", "tags": ["Transformer"] },
82 82
{"key": "chirkova2021embeddings", "year": "2021", "title":"On the Embeddings of Variables in Recurrent Neural Networks for Source Code", "abstract": "<p>Source code processing heavily relies on the methods widely used in natural language processing (NLP), but involves specifics that need to be taken into account to achieve higher quality. An example of this specificity is that the semantics of a variable is defined not only by its name but also by the contexts in which the variable occurs. In this work, we develop dynamic embeddings, a recurrent mechanism that adjusts the learned semantics of the variable when it obtains more information about the variable’s role in the program. We show that using the proposed dynamic embeddings significantly improves the performance of the recurrent neural network, in code completion and bug fixing tasks.</p>\n", "tags": ["autocomplete"] },
+ 83
{"key": "chow2023beware", "year": "2023", "title":"Beware of the Unexpected: Bimodal Taint Analysis", "abstract": "<p>Static analysis is a powerful tool for detecting security vulnerabilities and other programming problems. Global taint tracking, in particular, can spot vulnerabilities arising from complicated data flow across multiple functions. However, precisely identifying which flows are problematic is challenging, and sometimes depends on factors beyond the reach of pure program analysis, such as conventions and informal knowledge. For example, learning that a parameter <code class=\"language-plaintext highlighter-rouge\">name</code> of an API function <code class=\"language-plaintext highlighter-rouge\">locale</code> ends up in a file path is surprising and potentially problematic. In contrast, it would be completely unsurprising to find that a parameter <code class=\"language-plaintext highlighter-rouge\">command</code> passed to an API function <code class=\"language-plaintext highlighter-rouge\">execaCommand</code> is eventually interpreted as part of an operating-system command. This paper presents Fluffy, a bimodal taint analysis that combines static analysis, which reasons about data flow, with machine learning, which probabilistically determines which flows are potentially problematic. The key idea is to let machine learning models predict from natural language information involved in a taint flow, such as API names, whether the flow is expected or unexpected, and to inform developers only about the latter. We present a general framework and instantiate it with four learned models, which offer different trade-offs between the need to annotate training data and the accuracy of predictions. We implement Fluffy on top of the CodeQL analysis framework and apply it to 250K JavaScript projects. Evaluating on five common vulnerability types, we find that Fluffy achieves an F1 score of 0.85 or more on four of them across a variety of datasets.</p>\n", "tags": ["static analysis"] },
83 84
{"key": "ciurumelea2020suggesting", "year": "2020", "title":"Suggesting Comment Completions for Python using Neural Language Models", "abstract": "<p>Source-code comments are an important communication medium between developers to better understand and maintain software. Current research focuses on auto-generating comments by summarizing the code. However, good comments contain additional details, like important design decisions or required trade-offs, and only developers can decide on the proper comment content. Automated summarization techniques cannot include information that does not exist in the code, therefore fully-automated approaches while helpful, will be of limited use. In our work, we propose to empower developers through a semi-automated system instead. We investigate the feasibility of using neural language models trained on a large corpus of Python documentation strings to generate completion suggestions and obtain promising results. By focusing on confident predictions, we can obtain a top-3 accuracy of over 70%, although this comes at the cost of lower suggestion frequency. Our models can be improved by leveraging context information like the signature and the full body of the method. Additionally, we are able to return good accuracy completions even for new projects, suggesting the generalizability of our approach.</p>\n", "tags": ["bimodal","autocomplete","documentation"] },
84 85
{"key": "clement2020pymt5", "year": "2020", "title":"PyMT5: multi-mode translation of natural language and Python code with transformers", "abstract": "<p>Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieved a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieved a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation.</p>\n", "tags": ["bimodal","code generation","summarization","documentation","language model","pretraining"] },
85 86
{"key": "clement2021distilling", "year": "2021", "title":"Distilling Transformers for Neural Cross-Domain Search", "abstract": "<p>Pre-trained transformers have recently clinched top spots in the gamut of natural language tasks and pioneered solutions to software engineering tasks. Even information retrieval has not been immune to the charm of the transformer, though their large size and cost is generally a barrier to deployment. While there has been much work in streamlining, caching, and modifying transformer architectures for production, here we explore a new direction: distilling a large pre-trained translation model into a lightweight bi-encoder which can be efficiently cached and queried. We argue from a probabilistic perspective that sequence-to-sequence models are a conceptually ideal—albeit highly impractical—retriever. We derive a new distillation objective, implementing it as a data augmentation scheme. Using natural language source code search as a case study for cross-domain search, we demonstrate the validity of this idea by significantly improving upon the current leader of the CodeSearchNet challenge, a recent natural language code search benchmark.</p>\n", "tags": ["search","Transformer"] },
