Available Publications


Miscellaneous

Backpropagation

Reinforcement Learning

Unsupervised Learning

Genetic Algorithms


Online Local Gain Adaptation for Multi-Layer Perceptrons

by N. N. Schraudolph
Technical Report IDSIA-09-98
IDSIA, Lugano 1998

We introduce a new method for adapting the step size of each individual weight in a multi-layer perceptron trained by stochastic gradient descent. Our technique derives from the K1 algorithm for linear systems (Sutton, 1992), which in turn is based on a diagonalized Kalman Filter. We expand upon Sutton's work in two regards: K1 is a) extended to nonlinear systems, and b) made more efficient by linearizing an exponentiation operation. The resulting ELK1 (extended, linearized K1) algorithm is computationally little more expensive than alternative proposals (Zimmermann, 1994; Almeida et al., 1997, 1998), and does not require an arbitrary smoothing parameter. On a first benchmark problem ELK1 clearly outperforms these alternatives, as well as stochastic gradient descent with momentum, even when the number of floating-point operations required per weight update is taken into account. Unlike the method of Almeida et al., ELK1 does not require statistical independence between successive training patterns.


A Fast, Compact Approximation of the Exponential Function

by N. N. Schraudolph
Technical Report IDSIA-07-98
IDSIA, Lugano 1998

Neural network simulations often spend a large proportion of their time computing exponential functions. Since the exponentiation routines of typical math libraries are rather slow, their replacement with a fast approximation can greatly reduce the overall computation time. This note describes how exponentiation can be approximated by manipulating the components of a standard (IEEE-754) floating-point representation. This models the exponential function as well as a lookup table with linear interpolation, but is significantly faster and more compact.


Processing Images by Semi-Linear Predictability Minimization

by M. Eldracher, N. N. Schraudolph, and J. Schmidhuber
Technical Report IDSIA-77-97
IDSIA, Lugano 1997

In the predictability minimization approach (Schmidhuber, 1992), input patterns are fed into a system consisting of adaptive, initially unstructured feature detectors. There are also adaptive predictors constantly trying to predict current feature detector outputs from other feature detector outputs. Simultaneously, however, the feature detectors try to become as unpredictable as possible, resulting in a co-evolution of predictors and feature detectors.

This report describes the implementation of a visual processing system trained by semi-linear predictability minimization, and presents many experiments that examine its response to artificial and real-world images. In particular, we observe that under a wide variety of conditions, predictability minimization results in the development of well-known visual feature detectors.


On Centering Neural Network Weight Updates

by N.N. Schraudolph
Technical Report IDSIA-19-97
IDSIA, Lugano 1997

To appear in: Müller, Orr, and Caruana (eds.),
Tricks of the Trade: How to Make Neural Networks Really Work (working title)
Lecture Notes in Computer Science, Springer Verlag, Berlin 1998

It has long been known that neural networks can learn faster when their input and hidden unit activity is centered about zero; recently we have extended this approach to also encompass the centering of error signals (Schraudolph & Sejnowski, 1996). Here we generalize this notion to all factors involved in the weight update, leading us to propose centering the slope of hidden unit activation functions as well. Slope centering removes the linear component of backpropagated error; this improves credit assignment in networks with shortcut connections. Benchmark results show that this can speed up learning significantly without adversely affecting the trained network's generalization ability.


Reinforcement Learning with Self-Modifying Policies

by J. Schmidhuber, J. Zhao, and N.N. Schraudolph
In: Thrun and Pratt (eds.), Learning to Learn
Kluwer Academic Publishers, Norwell 1998

A learner's modifiable components are called its policy. An algorithm that modifies the policy is a learning algorithm. If the learning algorithm has modifiable components represented as part of the policy, then we speak of a self-modifying policy (SMP). SMPs can modify the way they modify themselves etc. They are of interest in situations where the initial learning algorithm itself can be improved by experience - this is what we call ``learning to learn''. How can we force some (stochastic) SMP to trigger better and better self-modifications? The success-story algorithm (SSA) addresses this question in a lifelong reinforcement learning context. During the learner's life-time, SSA is occasionally called at times computed according to SMP itself. SSA uses backtracking to undo those SMP-generated SMP-modifications that have not been empirically observed to trigger lifelong reward accelerations (measured up until the current SSA call - this evaluates the long-term effects of SMP-modifications setting the stage for later SMP-modifications). SMP-modifications that survive SSA represent a lifelong success history. Until the next SSA call, they build the basis for additional SMP-modifications. Solely by self-modifications our SMP/SSA-based learners solve a complex task in a partially observable environment (POE) whose state space is far bigger than most reported in the POE literature.


Tempering Backpropagation Networks: Not All Weights are Created Equal

by N.N. Schraudolph and T.J. Sejnowski
Advances in Neural Information Processing Systems 8:563-569
MIT Press, Cambridge 1996

Backpropagation learning algorithms typically collapse the network's structure into a single vector of weight parameters to be optimized. We suggest that their performance may be improved by utilizing the structural information instead of discarding it, and introduce a framework for ``tempering'' each weight accordingly.

In the tempering model, activation and error signals are treated as approximately independent random variables. The characteristic scale of weight changes is then matched to that of the residuals, allowing structural properties such as a node's fan-in and fan-out to affect the local learning rate and backpropagated error. The model also permits calculation of an upper bound on the global learning rate for batch updates, which in turn leads to different update rules for bias vs. non-bias weights.

This approach yields hitherto unparalleled performance on the family relations benchmark, a deep multi-layer network: for both batch learning with momentum and the delta-bar-delta algorithm, convergence at the optimal learning rate is sped up by more than an order of magnitude.


Empirical Entropy Manipulation for Real-World Problems

by P. Viola, N. N. Schraudolph, and T. J. Sejnowski
Advances in Neural Information Processing Systems 8:851-857
MIT Press, Cambridge 1996

No finite sample is sufficient to determine the density, and therefore the entropy, of a signal directly. Some assumption about either the functional form of the density or about its smoothness is necessary. Both amount to a prior over the space of possible density functions. By far the most common approach is to assume that the density has a parametric form.

By contrast we derive a differential learning rule called EMMA that optimizes entropy by way of kernel density estimation. Entropy and its derivative can then be calculated by sampling from this density estimate. The resulting parameter update rule is surprisingly simple and efficient.

We will show how EMMA can be used to detect and correct corruption in magnetic resonance images (MRI). This application is beyond the scope of existing parametric entropy models.


Optimization of Entropy with Neural Networks

by N. N. Schraudolph. Ph.D. dissertation,
University of California, San Diego 1995
(introduction available separately)

The goal of unsupervised learning algorithms is to discover concise yet informative representations of large data sets; the minimum description length principle and exploratory projection pursuit are two representative attempts to formalize this notion. When implemented with neural networks, both suggest the minimization of entropy at the network's output as an objective for unsupervised learning.

The empirical computation of entropy or its derivative with respect to parameters of a neural network unfortunately requires explicit knowledge of the local data density; this information is typically not available when learning from data samples. This dissertation discusses and applies three methods for making density information accessible in a neural network: parametric modelling, probabilistic networks, and nonparametric estimation.

By imposing their own structure on the data, parametric density models implement impoverished but tractable forms of entropy such as the log-variance. We have used this method to improve the adaptive dynamics of an anti-Hebbian learning rule which has proven successful in extracting disparity from random stereograms.

In probabilistic networks, node activities are interpreted as the defining parameters of a stochastic process. The entropy of the process can then be calculated from its parameters, and hence optimized. The popular logistic activation function defines a binomial process in this manner; by optimizing the information gain of this process we derive a novel nonlinear Hebbian learning algorithm.

The nonparametric technique of Parzen window or kernel density estimation leads us to an entropy optimization algorithm in which the network adapts in response to the distance between pairs of data samples. We discuss distinct implementations for data-limited or memory-limited operation, and describe a maximum likelihood approach to setting the kernel shape, the regularizer for this technique. This method has been applied with great success to the problem of pose alignment in computer vision.

These experiments demonstrate a range of techniques that allow neural networks to learn concise representations of empirical data by minimizing its entropy. We have found that simple gradient descent in various entropy-based objective functions can lead to novel and useful algorithms for unsupervised neural network learning.


Plasticity-Mediated Competitive Learning

by N.N. Schraudolph and T.J. Sejnowski
Advances in Neural Information Processing Systems 7:475-480
MIT Press, Cambridge 1995

Differentiation between the nodes of a competitive learning network is conventionally achieved through competition on the basis of neural activity. Simple inhibitory mechanisms are limited to sparse representations, while decorrelation and factorization schemes that support distributed representations are computationally unattractive. By letting neural plasticity mediate the competitive interaction instead, we obtain diffuse, nonadaptive alternatives for fully distributed representations. We use this technique to simplify and improve our binary information gain optimization algorithm for feature extraction (Schraudolph & Sejnowski, 1993); the same approach could be used to improve other learning algorithms.


Temporal Difference Learning of Position Evaluation in the Game of Go

by N.N. Schraudolph, P. Dayan, and T.J. Sejnowski
Advances in Neural Information Processing Systems 6:817-824
Morgan Kaufmann, San Francisco 1994

Despite their facility at most other games of strategy, computers remain inept players of the game of Go. Its high branching factor defeats the tree search approach used in computer chess, while its long-range spatiotemporal interactions make position evaluation extremely challenging. Further development of conventional Go programs is hampered by their knowledge-intensive nature. We demonstrate a viable alternative by training neural networks to evaluate Go positions via temporal difference (TD) learning.

We developed network architectures that reflect the spatial organisation of both input and reinforcement signals on the Go board, and training protocols that provide exposure to competent (though unlabelled) play. These techniques yield far better performance than undifferentiated networks trained by self-play alone. A network with less than 500 weights learned within 3,000 games of 9x9 Go a position evaluation function that enables a primitive one-ply search to defeat a commercial Go program at a low playing level.

Note: Jay Scott has reviewed this paper for his archive on Machine Learning in Games.


Unsupervised Discrimination of Clustered Data via Optimization of Binary Information Gain

by N.N. Schraudolph and T.J. Sejnowski
Advances in Neural Information Processing Systems 5:499-506
Morgan Kaufmann, San Mateo 1993

We present the information-theoretic derivation of a learning algorithm that clusters unlabelled data with linear discriminants. In contrast to methods that try to preserve information about the input patterns, we maximize the information gained from observing the output of robust binary discriminators implemented with sigmoid nodes. We derive a local weight adaptation rule via gradient ascent in this objective, demonstrate its dynamics on some simple data sets, relate our approach to previous work and suggest directions in which it may be extended.


Competitive Anti-Hebbian Learning of Invariants

by N.N. Schraudolph and T.J. Sejnowski
Advances in Neural Information Processing Systems 4:1017-1024
Morgan Kaufmann, San Mateo 1992

Although the detection of invariant input structure is vital to many recognition tasks, connectionist learning rules tend to focus on directions of high variance (principal components). The prediction paradigm can be used to reconcile this dichotomy; here we offer a more direct, unsupervised approach to invariant learning based on an anti-Hebbian learning rule, and demonstrate its success in extracting coherent depth information from random stereograms.


Dynamic Parameter Encoding for Genetic Algorithms

by N.N. Schraudolph and R.K. Belew
Machine Learning, 9:9-21
Kluwer, Norwell 1992

The common use of static binary place-value codes for real-valued parameters of the phenotype in Holland's genetic algorithm (GA) forces either the sacrifice of representational precision for efficiency of search or vice versa. Dynamic Parameter Encoding (DPE) is a mechanism that avoids this dilemma by using convergence statistics derived from the GA population to adaptively control the mapping from fixed-length binary genes to real values. DPE is shown to be empirically effective and amenable to analysis; we explore the problem of premature convergence in GAs through two convergence models.


A User's Guide to GAucsd 1.4

by N.N. Schraudolph and J.J. Grefenstette
Technical Report CS92-249
Computer Science & Engr. Department
University of California, San Diego 1992

This document describes the GAucsd system for function optimization developed by Nici Schraudolph at UCSD on the basis of Genesis 4.5, a genetic algorithm package written by John J. Grefenstette. Genetic algorithms appear to hold a lot of promise as general purpose adaptive search procedures. However, the authors disclaim any warranties of fitness for a particular problem. The purpose of making this system available is to encourage the experimental use of genetic algorithms on realistic optimization problems, and thereby to identify the their strengths and weaknesses.


Evolving Networks: Using the Genetic Algorithm with Connectionist Learning

by R.K. Belew, J. McInerney, and N.N. Schraudolph
Artificial Life II, 511-547
Addison-Wesley, New York 1991

It is appealing to consider hybrids of neural network learnning algorithms with evolutionary search procedures simply because nature has successfully done so, suggesting that such hybrids may be more efficient than either technique applied in isolation. We survey recent work in this area and report our own experiments on using the GA to search the space of initial conditions for backpropagation networks.

We find that use of the GA provides much greater confidence in the face of the dependence on initial conditions that plague gradient techniques, and allows a reduction of individual training time by as much as two orders of magnitude. We conclude that the GA's global sampling characteristics complement connectionist local search techniques well, leading to efficient and robust hybrids.


03/98 - Nic Schraudolph