09:30–09:45  Opening Remarks  
09:45–11:00  Talk Session 1 (Chair: Brian Bullins)  
Omar Montasser  Adversarially Robust Learning with Unknown Perturbation Sets
Abstract. We study the problem of learning predictors that are robust to adversarial examples with respect to an unknown perturbation set, relying instead on interaction with an adversarial attacker or access to attack oracles, examining different models for such interactions. We obtain upper bounds on the sample complexity and upper and lower bounds on the number of required interactions, or number of successful attacks, in different interaction models, in terms of the VC and Littlestone dimensions of the hypothesis class of predictors, and without any assumptions on the perturbation set.


Kumar Kshitij Patel  A Stochastic Newton Method for Distributed Optimization
Abstract. We analyze a stochastic Newton algorithm for distributed convex optimization. At the heart of our approach is recent work showing that least-squares objectives can be optimized to high accuracy using a parallel algorithm with only a single round of communication. Our algorithm expresses the Newton update as the solution to a least-squares problem, which we optimize using stochastic gradients and stochastic Hessian-vector products for the objective, both of which can typically be computed efficiently. We analyze our method for quasi-self-concordant objectives (e.g. logistic regression), and demonstrate that it can in some instances achieve faster convergence rates than comparable first-order methods while requiring less communication and a similar amount of computation.
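As a rough illustration of the idea in this abstract (a sketch, not the paper's algorithm), a Newton step can be computed by running stochastic gradient descent on a quadratic model, touching the objective only through gradient and Hessian-vector oracles; `grad_fn` and `hvp_fn` are hypothetical stand-ins for minibatch estimates:

```python
import numpy as np

def stochastic_newton_direction(grad_fn, hvp_fn, dim, steps=500, lr=0.1):
    """Approximate the Newton direction d = -H^{-1} g by minimizing the
    least-squares-style quadratic q(d) = 0.5 * d^T H d + g^T d, whose
    gradient H d + g needs one Hessian-vector product per step."""
    d = np.zeros(dim)
    for _ in range(steps):
        d -= lr * (hvp_fn(d) + grad_fn())
    return d
```

In a distributed setting the appeal is that each step consumes only cheap stochastic oracle calls rather than a full Hessian.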


11:00–12:00  Keynote Talk: John Langford  
Latent State Discovery
Abstract. If you add a sensor to a learning agent, does the problem that it solves become more difficult? Or does it become easier? Intuitions differ here. In one view, more information is better, so better decisions can be made, yielding better performance. In another, the learning theory of Markov Decision Processes suggests that it becomes more difficult because the 'state space' is larger. This conundrum can be resolved with the notion of a latent state: some underlying elements from which an observation space is created. But what is a good definition of a latent state? When and how can you learn a latent state?


12:00–13:30  Poster Session 1 + Lunch Break  Held in the TTIC Gather.Town  
Sudarshan Babu  Meta-Learning via Learning with Distributed Memory
Abstract. We demonstrate that meta-learning can be achieved via end-to-end training of deep neural networks with memory distributed across layers. Here, the persistent state of this memory assumes the entire burden of guiding task adaptation. Moreover, its distributed nature is instrumental in orchestrating adaptation. In ablation experiments, providing relevant feedback to only memory within deep layers significantly hinders performance. In contrast, allowing memory units to guide adaptation throughout the network is an effective strategy that simplifies meta-learning, often cast as a bilevel optimization problem, to standard end-to-end training. We show that using distributed memory as the mechanism for meta-learning can outperform gradient-based, prototype-based, and other memory-based meta-learning strategies on a variety of online tasks. Additionally, our adaptation strategy naturally handles online learning scenarios wherein there is a significant delay between observing a sample and its corresponding label, a setting in which other approaches struggle. Experiments demonstrate the efficacy of adaptation via distributed memory across a wide range of learning tasks, ranging from classification to online few-shot semantic segmentation.


Falcon Dai  Loop Estimator for Discounted Values in Markov Reward Processes
Abstract. At the working heart of policy iteration algorithms commonly used and studied in the discounted setting of reinforcement learning, the policy evaluation step estimates the value of states with samples from a Markov reward process induced by following a Markov policy in a Markov decision process. We propose a simple and efficient estimator called the loop estimator that exploits the regenerative structure of Markov reward processes without explicitly estimating a full model. Our method enjoys a space complexity of $O(1)$ when estimating the value of a single positive recurrent state $s$, unlike TD with $O(S)$ or model-based methods with $O(S^2)$. Moreover, the regenerative structure enables us to show, without relying on the generative model approach, that the estimator has an instance-dependent convergence rate of $\tilde{O}(\sqrt{\tau_s/T})$ over steps $T$ on a single sample path, where $\tau_s$ is the maximal expected hitting time to state $s$. In preliminary numerical experiments, the loop estimator outperforms model-free methods, such as TD(k), and is competitive with the model-based estimator.
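A toy sketch of the regenerative idea described above (my own reading of it, not necessarily the paper's exact estimator): excursions from a recurrent state $s$ back to itself are i.i.d., so averaging the discounted reward per excursion and the discount accumulated per excursion lets one solve for the value of $s$ from $V(s) = \mathbb{E}[R_{\text{loop}}] + \mathbb{E}[\gamma^{\tau}] V(s)$, using $O(1)$ memory:

```python
import numpy as np

def loop_estimate(P, r, s, gamma, num_loops=2000, seed=0):
    """Toy regenerative estimator of the discounted value of state s in a
    Markov reward process with transition matrix P and reward vector r.
    Only two running sums are kept, regardless of the number of states."""
    rng = np.random.default_rng(seed)
    n = len(r)
    total_R, total_disc = 0.0, 0.0
    for _ in range(num_loops):
        x, R, disc = s, 0.0, 1.0
        while True:
            R += disc * r[x]           # discounted reward inside this loop
            x = rng.choice(n, p=P[x])  # step the chain
            disc *= gamma
            if x == s:                 # excursion ends on return to s
                break
        total_R += R
        total_disc += disc
    mean_R, mean_disc = total_R / num_loops, total_disc / num_loops
    return mean_R / (1.0 - mean_disc)  # solve V = E[R] + E[gamma^tau] V
```

For a two-state chain this matches the exact discounted value to within sampling error.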


Gene Li  Exponential Family Model-Based Reinforcement Learning via Score Matching
Abstract. We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression. SMRL achieves $\tilde O(d\sqrt{H^3 T})$ regret, where $H$ is the length of each episode and $T$ is the total number of interactions.


Naren Manoj  Excess Capacity and Backdoor Poisoning
Abstract. A backdoor data poisoning attack is an adversarial attack in a supervised learning setting. Specifically, the attacker injects several watermarked, mislabeled training examples into a training set. The watermark does not impact the testtime performance of the classifier on typical data; however, the classifier reliably classifies watermarked examples as a label of the adversary's choice. Such attacks pose a threat to applications where the training data may not be fully trusted, such as in federated learning. To gain a better foundational understanding of backdoor data poisoning attacks, we present a formal theoretical framework within which one can discuss backdoor data poisoning attacks. We then use this to analyze important statistical and computational issues surrounding these attacks. On the statistical front, we identify a combinatorial parameter we call the memorization capacity that captures the intrinsic vulnerability of a learning problem to a backdoor attack. This allows us to argue about the robustness of several natural learning problems to backdoor attacks. Our results favoring the attacker involve presenting explicit constructions of backdoor attacks, and our robustness results show that some natural problem settings cannot yield successful backdoor attacks. From a computational standpoint, we show that under certain assumptions, adversarial training can detect the presence of backdoors in a training set. We then show that under similar assumptions, two closely related problems we call backdoor filtering and robust generalization are nearly equivalent. This implies that it is both necessary and sufficient to design algorithms that can identify watermarked examples in the training set in order to obtain a learning algorithm that both generalizes well to unseen data and is robust to backdoors. 

Ankita Pasad  Layer-wise Analysis of a Self-supervised Speech Representation Model
Abstract. Recently proposed self-supervised learning approaches have been successful for pretraining speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the type or extent of information encoded in the pretrained representations themselves. Developing such insights can help understand the capabilities and limits of these models and enable the research community to more efficiently develop their usage for downstream applications. In this work, we begin to fill this gap by examining one recent and successful pretrained model (wav2vec 2.0), via its intermediate representation vectors, using a suite of analysis tools. We use the metrics of canonical correlation, mutual information, and performance on simple downstream tasks with nonparametric probes, in order to (i) query for acoustic and linguistic information content, (ii) characterize the evolution of information across model layers, and (iii) understand how fine-tuning the model for automatic speech recognition (ASR) affects these observations. Our findings motivate modifying the fine-tuning protocol for ASR, which produces improved word error rates in a low-resource setting.
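A minimal sketch of one probe of the kind mentioned above: the mean canonical correlation between two sets of representation vectors (e.g. two layers' outputs on the same utterances). This is a generic textbook computation, not the paper's actual tooling, and the shapes are hypothetical:

```python
import numpy as np

def mean_canonical_correlation(X, Y):
    """Mean canonical correlation between two views (rows = examples).
    After centering and whitening each view via SVD, the singular values
    of the product of the two orthonormal bases are exactly the canonical
    correlations (cosines of the principal angles between subspaces)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux = np.linalg.svd(X, full_matrices=False)[0]
    Uy = np.linalg.svd(Y, full_matrices=False)[0]
    corrs = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.clip(corrs, 0.0, 1.0).mean())
```

Two layers that encode the same information up to a linear transform score near 1; unrelated representations score near 0.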


Han Shao  Robust Learning Under Clean-label Attack
Abstract. We study the problem of robust learning under clean-label data-poisoning attacks, where the attacker injects (an arbitrary set of) correctly labeled examples into the training set to fool the algorithm into making mistakes on specific test instances at test time. The learning goal is to minimize the attackable rate (the probability mass of attackable test instances), which is more difficult than optimal PAC learning. As we show, any robust algorithm with diminishing attackable rate can achieve the optimal dependence on $\epsilon$ in its PAC sample complexity, i.e., $O(1/\epsilon)$. On the other hand, the attackable rate might be large even for some optimal PAC learners, e.g., SVM for linear classifiers. Furthermore, we show that the class of linear hypotheses is not robustly learnable when the data distribution has zero margin, and is robustly learnable in the case of positive margin but requires sample complexity exponential in the dimension. For a general hypothesis class with bounded VC dimension, if the attacker is limited to adding at most $t>0$ poison examples, the optimal robust learning sample complexity grows almost linearly with $t$.


Bowen Shi  Fingerspelling Detection in American Sign Language
Abstract. Fingerspelling, in which words are signed letter by letter, is an important component of American Sign Language. Most previous work on automatic fingerspelling recognition has assumed that the boundaries of fingerspelling regions in signing videos are known beforehand. In this paper, we consider the task of fingerspelling detection in raw, untrimmed sign language videos. This is an important step towards building real-world fingerspelling recognition systems. We propose a benchmark and a suite of evaluation metrics, some of which reflect the effect of detection on the downstream fingerspelling recognition task. In addition, we propose a new model that learns to detect fingerspelling via multi-task training, incorporating pose estimation and fingerspelling recognition (transcription) along with detection, and compare this model to several alternatives. The model outperforms all alternative approaches across all metrics, establishing a state of the art on the benchmark.


Kavya Ravichandran  Combinatorial Factor Analysis
Abstract. To understand a piece of information, we often seek to break it down into the components that comprise it to better understand the structure underlying its generation. In some cases (say, determining factors that influence performance in school), underlying attributes might reasonably be modeled as combining linearly to produce the observables, and we could solve a matrix factorization problem to derive some proposed attributes. In other cases, however, attributes might come together in a combinatorial or nonlinear way to produce the output. In this work, we formally define a model in which the underlying components are placed on a canvas, front to back, to generate the output. We consider the following process: consider a set of “objects” (bit strings); select a subset of them, locations on a canvas to place them, and an order in which to do so. When "rendering" the output "image", objects on top block from view objects further back. Now, given a large set of images, we wish to determine a small set of objects that generated these images. We formally define this setting and analyze algorithms for learning such a set of objects under different assumptions. In the future, we hope to use this understanding to analyze adversarial attacks on such a model, as well.
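The generative process described above can be made concrete with a toy one-dimensional renderer (a hypothetical sketch for illustration; the abstract's setting is more general): objects are bit strings placed back to front, and each later placement occludes whatever lies beneath it.

```python
def render(canvas_len, placements, blank=-1):
    """Render "objects" (tuples of bits) onto a 1-D canvas.

    `placements` is an ordered list of (object, position) pairs, back to
    front: objects placed later overwrite, i.e. occlude, earlier ones.
    Cells never covered by any object keep the `blank` marker."""
    canvas = [blank] * canvas_len
    for obj, pos in placements:
        for i, bit in enumerate(obj):
            if 0 <= pos + i < canvas_len:
                canvas[pos + i] = bit
    return canvas
```

The learning problem in the abstract is then the inverse: given many rendered canvases, recover a small set of generating objects.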


David Yunis  Toward Understanding SGD Trajectories on Loss Surfaces
Abstract. Given their high-dimensional nature, the loss surfaces of neural networks used in practice can be hard to understand. Curves of low loss between parameters from different training runs have been observed before [Draxler et al., Garipov et al.], but for any two optima, linear low-loss curves do not always exist [Keskar et al.]. Interestingly, when a single optimization run is split into two branches after a small number of iterations, *linear mode connectivity* (LMC) occurs: a linear path of low loss exists between the final checkpoints of either branch [Frankle et al.]. LMC is observed across domains and architectures. In our work we find that LMC is a special case of a more general phenomenon: random convex combinations of endpoints of different optimization branches result in low-loss models. Moreover, all the functions in this subspace have similar errors, with the number of disagreements decreasing the further into training the branches are split. This suggests that mode connectivity is a proxy for identifiability of the functions that these models define, and that even though neural network optimization is non-convex, after a small number of iterations it appears to share its behavior with convex optimization, at least in a special subspace.
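The basic probe behind such observations, evaluating loss along convex combinations of two parameter vectors, can be sketched as follows (a hypothetical helper, not the authors' code):

```python
import numpy as np

def interpolation_losses(loss_fn, theta_a, theta_b, num=11):
    """Evaluate `loss_fn` along the segment between two parameter
    vectors; a flat, low curve indicates linear mode connectivity,
    while a high bump in the middle is a loss barrier."""
    lams = np.linspace(0.0, 1.0, num)
    losses = np.array(
        [loss_fn((1.0 - l) * theta_a + l * theta_b) for l in lams]
    )
    return lams, losses
```

For a convex loss the interior of the segment can never exceed the worse endpoint, which is the sense in which the abstract's observation resembles convex behavior.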


13:30–14:30  Talk Session 2 (Chair: Audrey Sedal)  
Shashank Srivastava  Near-linear Time Decoding of Ta-Shma's Codes via Splittable Regularity
Abstract. The Gilbert–Varshamov bound non-constructively establishes the existence of codes of distance $1/2 - \epsilon/2$ and rate $\Omega(\epsilon^2)$. In a breakthrough result, Ta-Shma gave an explicit code construction matching these parameters. While decoding algorithms based on the Sum of Squares hierarchy were known for these codes, we derive new algorithms that run in near-linear time. Our algorithms follow from new structural and algorithmic results for collections of $k$-tuples possessing a "structured expansion" property, which we call splittability. We obtain a new weak regularity decomposition for (possibly sparse) splittable collections $W \subseteq [n]^k$, similar to the weak regularity decomposition for dense structures by Frieze and Kannan (FOCS 1996). These decompositions are computable in near-linear time $\tilde{O}(|W|)$ and form a key component of our algorithmic results.

Ziwei Xie  Deep Graph Learning of Interfacial Contacts from Sequence Embedding and 3D Structure
Abstract. Inter-protein (interfacial) contact prediction is very useful for in silico structural characterization of protein-protein interactions. Although inter-protein contact prediction has been improved by deep learning, its accuracy is not as good as intra-chain contact prediction. Here we propose a new deep learning method, CaGCN, for contact prediction of both homodimeric and heterodimeric protein complexes, by learning rotationally invariant residue-level representations from protein tertiary structures and integrating contextual embeddings of multiple sequence alignments (MSAs). Tested on the 13th and 14th CASP-CAPRI datasets, the average top L/10 precision achieved by CaGCN is 54.35% on the homodimeric targets and 51.56% on all the targets. In contrast, DeepHomo, the latest deep learning method for homodimeric contact prediction, achieves a top L/10 precision of only 30.43%. Another inter-protein contact predictor, BIPSPI, achieves a top L/10 precision of only 14.69% on all targets. Further experiments show that the contacts predicted by CaGCN may improve the selection of docking decoys by 58% in terms of TM-score.


14:30–15:30  Poster Session 2 (Same Posters)  Held in the TTIC Gather.Town  
15:30–16:30  Panel Discussion  
Suriya Gunasekar, Nika Haghtalab, Eric Jonas, John Langford 
Please submit questions via sli.do  
16:30–16:40  Awards 