
Commit 96ea4a7

deploy: 4b70e76
1 parent ce69396 commit 96ea4a7

18 files changed, +3701 −3410 lines changed

index.html

Lines changed: 1 addition & 1 deletion
@@ -223,7 +223,7 @@ <h4 id="contributors-to-the-website">Contributors to the website</h4>
<li><a href="http://www.cs.technion.ac.il/~urialon/">Uri Alon</a> Technion, Israel</li>
<li><a href="https://shakedbr.cswp.cs.technion.ac.il/">Shaked Brody</a> Technion, Israel</li>
<li><a href="https://bdqnghi.github.io/">Nghi D. Q. Bui</a> Singapore Management University, Singapore</li>
-<li><a href="https://rajaswa.github.io/">Rajaswa Patil</a> TCS Research, India</li>
+<li><a href="https://rajaswa.github.io/">Rajaswa Patil</a> Microsoft PROSE</li>
</ul>

</div>

paper-abstracts.json

Lines changed: 1 addition & 0 deletions
@@ -271,6 +271,7 @@
{"key": "murali2017bayesian", "year": "2018", "title":"Bayesian Sketch Learning for Program Synthesis", "abstract": "<p>We present a Bayesian statistical approach to the problem of automatic program synthesis. Our synthesizer starts\nby learning, offline and from an existing corpus, a probabilistic model of real-world programs. During synthesis,\nit is provided some ambiguous and incomplete evidence about the nature of the programming task that the user\nwants automated, for example sets of API calls or data types that are relevant for the task. Given this input, the\nsynthesizer infers a posterior distribution over type-safe programs that assigns higher likelihood to programs\nthat, according to the learned model, are more likely to match the evidence.</p>\n\n<p>We realize this approach using two key ideas. First, our learning techniques operate not over code but\nsyntactic abstractions, or sketches, of programs. During synthesis, we infer a posterior distribution over sketches,\nthen concretize samples from this distribution into type-safe programs using combinatorial techniques. Second,\nour statistical model explicitly models the full intent behind a synthesis task as a latent variable. To infer\nsketches, we first estimate a posterior distribution on the intent, then use samples from this posterior to generate\na distribution over possible sketches. We show that our model can be implemented effectively using the new\nneural architecture of Bayesian encoder-decoders, which can be trained with stochastic gradient descent and\nyields a simple inference procedure.</p>\n\n<p>We implement our ideas in a system, called BAYOU , for the synthesis of API-heavy Java methods. We train\nBAYOU on a large corpus of Android apps, and find that the trained system can often synthesize complex\nmethods given just a few API method names or data types as evidence. The experiments also justify the design\nchoice of using a latent intent variable and the levels of abstraction at which sketches and evidence are defined.</p>\n", "tags": ["code generation","API"] },
{"key": "murali2017finding", "year": "2017", "title":"Finding Likely Errors with Bayesian Specifications", "abstract": "<p>We present a Bayesian framework for learning probabilistic specifications from large, unstructured code corpora, and\na method to use this framework to statically detect anomalous, hence likely buggy, program behavior. The distinctive\ninsight here is to build a statistical model that correlates all\nspecifications hidden inside a corpus with the syntax and\nobserved behavior of programs that implement these specifications. During the analysis of a particular program, this\nmodel is conditioned into a posterior distribution that prioritizes specifications that are relevant to this program. This\nallows accurate program analysis even if the corpus is highly\nheterogeneous. The problem of finding anomalies is now\nframed quantitatively, as a problem of computing a distance\nbetween a “reference distribution” over program behaviors\nthat our model expects from the program, and the distribution over behaviors that the program actually produces.</p>\n\n<p>We present a concrete embodiment of our framework that\ncombines a topic model and a neural network model to learn\nspecifications, and queries the learned models to compute\nanomaly scores. We evaluate this implementation on the\ntask of detecting anomalous usage of Android APIs. Our\nencouraging experimental results show that the method can\nautomatically discover subtle errors in Android applications\nin the wild, and has high precision and recall compared to\ncompeting probabilistic approaches.</p>\n", "tags": ["program analysis","API"] },
{"key": "nadeem2022codedsi", "year": "2022", "title":"CodeDSI: Differentiable Code Search", "abstract": "<p>Reimplementing solutions to previously solved software engineering problems is not only inefficient but also introduces inadequate and error-prone code. Many existing methods achieve impressive performance on this issue by using autoregressive text-generation models trained on code. However, these methods are not without their flaws. The generated code from these models can be buggy, lack documentation, and introduce vulnerabilities that may go unnoticed by developers. An alternative to code generation – neural code search – is a field of machine learning where a model takes natural language queries as input and, in turn, relevant code samples from a database are returned. Due to the nature of this pre-existing database, code samples can be documented, tested, licensed, and checked for vulnerabilities before being used by developers in production. In this work, we present CodeDSI, an end-to-end unified approach to code search. CodeDSI is trained to directly map natural language queries to their respective code samples, which can be retrieved later. In an effort to improve the performance of code search, we have investigated docid representation strategies, impact of tokenization on docid structure, and dataset sizes on overall code search performance. Our results demonstrate CodeDSI strong performance, exceeding conventional robust baselines by 2-6% across varying dataset sizes.</p>\n", "tags": ["search"] },
+{"key": "naik2022probing", "year": "2022", "title":"Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis", "abstract": "<p>Representational Similarity Analysis is a method from cognitive neuroscience, which helps in comparing representations from two different sources of data. In this paper, we propose using Representational Similarity Analysis to probe the semantic grounding in language models of code. We probe representations from the CodeBERT model for semantic grounding by using the data from the IBM CodeNet dataset. Through our experiments, we show that current pre-training methods do not induce semantic grounding in language models of code, and instead focus on optimizing form-based patterns. We also show that even a little amount of fine-tuning on semantically relevant tasks increases the semantic grounding in CodeBERT significantly. Our ablations with the input modality to the CodeBERT model show that using bimodal inputs (code and natural language) over unimodal inputs (only code) gives better semantic grounding and sample efficiency during semantic fine-tuning. Finally, our experiments with semantic perturbations in code reveal that CodeBERT is able to robustly distinguish between semantically correct and incorrect code.</p>\n", "tags": ["interpretability","language model","evaluation","Transformer"] },
{"key": "nair2020funcgnn", "year": "2020", "title":"funcGNN: A Graph Neural Network Approach to Program Similarity", "abstract": "<p>Program similarity is a fundamental concept, central to the solution of software engineering tasks such as software plagiarism, clone identification, code refactoring and code search. Accurate similarity estimation between programs requires an in-depth understanding of their structure, semantics and flow. A control flow graph (CFG), is a graphical representation of a program which captures its logical control flow and hence its semantics. A common approach is to estimate program similarity by analysing CFGs using graph similarity measures, e.g. graph edit distance (GED). However, graph edit distance is an NP-hard problem and computationally expensive, making the application of graph similarity techniques to complex software programs impractical. This study intends to examine the effectiveness of graph neural networks to estimate program similarity, by analysing the associated control flow graphs. We introduce funcGNN, which is a graph neural network trained on labeled CFG pairs to predict the GED between unseen program pairs by utilizing an effective embedding vector. To our knowledge, this is the first time graph neural networks have been applied on labeled CFGs for estimating the similarity between high-level language programs. Results: We demonstrate the effectiveness of funcGNN to estimate the GED between programs and our experimental analysis demonstrates how it achieves a lower error rate (0.00194), with faster (23 times faster than the quickest traditional GED approximation method) and better scalability compared with the state of the art methods. funcGNN posses the inductive learning ability to infer program structure and generalise to unseen programs. The graph embedding of a program proposed by our methodology could be applied to several related software engineering problems (such as code plagiarism and clone identification) thus opening multiple research directions.</p>\n", "tags": ["GNN","clone"] },
{"key": "nguyen2013lexical", "year": "2013", "title":"Lexical Statistical Machine Translation for Language Migration", "abstract": "<p>Prior research has shown that source code also exhibits naturalness, i.e. it is written by humans and is likely to be\nrepetitive. The researchers also showed that the n-gram language model is useful in predicting the next token in a source\nfile given a large corpus of existing source code. In this paper, we investigate how well statistical machine translation\n(SMT) models for natural languages could help in migrating source code from one programming language to another.\nWe treat source code as a sequence of lexical tokens and\napply a phrase-based SMT model on the lexemes of those\ntokens. Our empirical evaluation on migrating two Java\nprojects into C# showed that lexical, phrase-based SMT\ncould achieve high lexical translation accuracy ( BLEU from\n81.3-82.6%). Users would have to manually edit only 11.9-15.8% of the total number of tokens in the resulting code to\ncorrect it. However, a high percentage of total translation\nmethods (49.5-58.6%) is syntactically incorrect. Therefore,\nour result calls for a more program-oriented SMT model that\nis capable of better integrating the syntactic and semantic\ninformation of a program to support language migration.</p>\n", "tags": ["migration","API"] },
{"key": "nguyen2013statistical", "year": "2013", "title":"A Statistical Semantic Language Model for Source Code", "abstract": "<p>Recent research has successfully applied the statistical n-gram language model to show that source code exhibits a\ngood level of repetition. The n-gram model is shown to have\ngood predictability in supporting code suggestion and completion. However, the state-of-the-art n-gram approach to\ncapture source code regularities/patterns is based only on\nthe lexical information in a local context of the code units.\nTo improve predictability, we introduce SLAMC, a novel statistical semantic language model for source code. It incorporates semantic information into code tokens and models the\nregularities/patterns of such semantic annotations, called sememes, rather than their lexemes. It combines the local context in semantic n-grams with the global technical concerns/functionality into an n-gram topic model, together with pairwise associations of program elements. Based on SLAMC,\nwe developed a new code suggestion method, which is empirically evaluated on several projects to have relatively 18–68%\nhigher accuracy than the state-of-the-art approach.</p>\n\n", "tags": ["language model"] },
