#### Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls

We propose a rank-$k$ variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball. Our algorithm replaces the top singular-vector computation ($1$-SVD) in Frank-Wolfe with a top-$k$ singular-vector computation ($k$-SVD), which can be done by repeatedly applyin…

#### Learning Differentially Private Language Models Without Losing Accuracy

We demonstrate that it is possible to train large recurrent language models with user-level differential privacy guarantees without sacrificing predictive accuracy. Our work builds on recent advances in the training of deep networks on user-partitioned data and privacy accounting for stochastic gra…

#### Asynchronous Decentralized Parallel Stochastic Gradient Descent

Recent work shows that decentralized parallel stochastic gradient decent (D-PSGD) can outperform its centralized counterpart both theoretically and practically. While asynchronous parallelism is a powerful technology to improve the efficiency of parallelism in distributed machine learning platforms…

#### Characterization of Gradient Dominance and Regularity Conditions for Neural Networks

The past decade has witnessed a successful application of deep learning to solving many challenging problems in machine learning and artificial intelligence. However, the loss functions of deep neural networks (especially nonlinear networks) are still far from being well understood from a theoretic…

#### SQG-Differential Evolution for difficult optimization problems under a tight function evaluation budget

In the context of industrial engineering it is important to integrate efficient computational optimization methods in the product development process. Some of the most challenging simulation based engineering design optimization problems are characterized by: a large number of design variables, the…

#### Stochastic Weighted Function Norm Regularization

Deep neural networks (DNNs) have become increasingly important due to their excellent empirical performance on a wide range of problems. However, regularization is generally achieved by indirect means, largely due to the complex set of functions defined by a network and the difficulty in measuring …

#### Calibrated Boosting-Forest

Excellent ranking power along with well calibrated probability estimates are needed in many classification tasks. In this paper, we introduce a technique, Calibrated Boosting-Forest that captures both. This novel technique is an ensemble of gradient boosting machines that can support both continuou…

#### Understanding Generalization and Stochastic Gradient Descent

This paper tackles two related questions at the heart of machine learning; how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work is inspired by Zhang et al. (2017), who showed deep networks can easily mem…

#### Convergence diagnostics for stochastic gradient descent with constant step size

Iterative procedures in stochastic optimization are typically comprised of a transient phase and a stationary phase. During the transient phase the procedure converges towards a region of interest, and during the stationary phase the procedure oscillates in a convergence region, commonly around a s…

#### Accelerated Block Coordinate Proximal Gradients with Applications in High Dimensional Statistics

Nonconvex optimization problems arise in different research fields and arouse lots of attention in signal processing, statistics and machine learning. In this work, we explore the accelerated proximal gradient method and some of its variants which have been shown to converge under nonconvex context…

#### On the (Statistical) Detection of Adversarial Examples

Machine Learning (ML) models are applied in a variety of tasks such as network intrusion detection or Malware classification. Yet, these models are vulnerable to a class of malicious inputs known as adversarial examples. These are slightly perturbed inputs that are classified incorrectly by the ML …

#### Spontaneous Symmetry Breaking in Neural Networks

We propose a framework to understand the unprecedented performance and robustness of deep neural networks using field theory. Correlations between the weights within the same layer can be described by symmetries in that layer, and networks generalize better if such symmetries are broken to reduce t…

#### Flow: Architecture and Benchmarking for Reinforcement Learning in Traffic Control

car gradient Reinforcement Learning

Flow is a new computational framework, built to support a key need triggered by the rapid growth of autonomy in ground traffic: controllers for autonomous vehicles in the presence of complex nonlinear dynamics in traffic. Leveraging recent advances in deep Reinforcement Learning (RL), Flow enables …

#### Manifold Regularization for Kernelized LSTD

gradient Reinforcement Learning

Policy evaluation or value function or Q-function approximation is a key procedure in reinforcement learning (RL). It is a necessary component of policy iteration and can be used for variance reduction in policy gradient methods. Therefore its quality has a significant impact on most RL algorithms.…

#### SuperSpike: Supervised learning in multi-layer spiking neural networks

A vast majority of computation in the brain is performed by spiking neural networks. Despite the ubiquity of such spiking, we currently lack an understanding of how biological spiking neural circuits learn and compute in-vivo, as well as how we can instantiate such capabilities in artificial spikin…

#### On orthogonality and learning recurrent networks with long term dependencies

It is well known that it is challenging to train deep neural networks and recurrent neural networks for tasks that exhibit long term dependencies. The vanishing or exploding gradient problem is a well known issue associated with these challenges. One approach to addressing vanishing and exploding g…

#### High-dimensional dynamics of generalization error in neural networks

We perform an average case analysis of the generalization dynamics of large neural networks trained using gradient descent. We study the practically-relevant “high-dimensional” regime where the number of free parameters in the network is on the order of or even larger than the number of…

#### Fast and Strong Convergence of Online Learning Algorithms

In this paper, we study the online learning algorithm without explicit regularization terms. This algorithm is essentially a stochastic gradient descent scheme in a reproducing kernel Hilbert space (RKHS). The polynomially decaying step size in each iteration can play a role of regularization to en…

#### On- and Off-Policy Monotonic Policy Improvement

gradient Reinforcement Learning

Monotonic policy improvement and off-policy learning are two main desirable properties for reinforcement learning algorithms. In this study, we show that the monotonic policy improvement is guaranteed from on- and off-policy mixture data. Based on the theoretical result, we provide an algorithm whi…

#### Function space analysis of deep learning representation layers

In this paper we propose a function space approach to Representation Learning and the analysis of the representation layers in deep learning architectures. We show how to compute a weak-type Besov smoothness index that quantifies the geometry of the clustering in the feature space. This approach wa…

#### Unifying Local and Global Change Detection in Dynamic Networks

Many real-world networks are complex dynamical systems, where both local (e.g., changing node attributes) and global (e.g., changing network topology) processes unfold over time. Local dynamics may provoke global changes in the network, and the ability to detect such effects could have profound imp…

#### SGD for robot motion? The effectiveness of stochastic optimization on a new benchmark for biped locomotion tasks

Trajectory optimization and posture generation are hard problems in robot locomotion, which can be non-convex and have multiple local optima. Progress on these problems is further hindered by a lack of open benchmarks, since comparisons of different solutions are difficult to make. In this paper we…

#### On denoising autoencoders trained to minimise binary cross-entropy

Denoising autoencoders (DAEs) are powerful deep learning models used for feature extraction, data generation and network pre-training. DAEs consist of an encoder and decoder which may be trained simultaneously to minimise a loss (function) between an input and the reconstruction of a corrupted vers…

#### Recurrent Network-based Deterministic Policy Gradient for Solving Bipedal Walking Challenge on Rugged Terrains

gradient LSTM Reinforcement Learning RNN

This paper presents the learning algorithm based on the Recurrent Network-based Deterministic Policy Gradient. The Long-Short Term Memory is utilized to enable the Partially Observed Markov Decision Process framework. The novelty are improvements of LSTM networks: update of multi-step temporal diff…

#### Beyond Log-concavity: Provable Guarantees for Sampling Multi-modal Distributions using Simulated Tempering Langevin Monte Carlo

A key task in Bayesian statistics is sampling from distributions that are only specified up to a partition function (i.e., constant of proportionality). However, without any assumptions, sampling (even approximately) can be #P-hard, and few works have provided “beyond worst-case” guaran…