Anon., Functionals and the Functional Derivative, 2013, pp. 1-123.
(pdf)
Anon., Div, Grad, Curl (spherical), 2014, pp. 1-5.
(pdf)
M. I. A. Lourakis and A. A. Argyros, Is Levenberg-Marquardt the most efficient optimization algorithm for implementing bundle adjustment?, IEEE Xplore.
In order to obtain optimal 3D structure and viewing parameter estimates, bundle adjustment is often used as the last step of feature-based structure and motion estimation algorithms. Bundle adjustment involves the formulation of a large scale, yet sparse minimization problem, which is traditionally solved using a sparse variant of the Levenberg-Marquardt optimization algorithm that avoids storing and operating on zero entries. This paper argues that considerable computational benefits can be gained by substituting the sparse …
(web, pdf)
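For reference, the core step that bundle adjustment solves at much larger, sparse scale is the damped normal-equation update of Levenberg-Marquardt. Below is a minimal dense sketch in Python/NumPy on a toy exponential fit; the model, data, and damping schedule are assumptions for illustration, and this is not the paper's sparse variant.

```python
# Minimal dense Levenberg-Marquardt sketch for a toy nonlinear least-squares
# problem (fitting y = a * exp(b * x)).  Illustrates the damped normal
# equations (J^T J + lam * I) delta = -J^T r; not the paper's sparse variant.
import numpy as np

def residuals(p, x, y):
    a, b = p
    return a * np.exp(b * x) - y

def jacobian(p, x):
    a, b = p
    e = np.exp(b * x)
    return np.column_stack([e, a * x * e])   # d r / d a, d r / d b

def levenberg_marquardt(p, x, y, lam=1e-3, iters=50):
    for _ in range(iters):
        r = residuals(p, x, y)
        J = jacobian(p, x)
        A = J.T @ J + lam * np.eye(len(p))   # damped normal equations
        delta = np.linalg.solve(A, -J.T @ r)
        p_new = p + delta
        if np.sum(residuals(p_new, x, y) ** 2) < np.sum(r ** 2):
            p, lam = p_new, lam * 0.5        # accept step, reduce damping
        else:
            lam *= 10.0                      # reject step, increase damping
    return p

x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(1.5 * x) + 0.01 * np.random.default_rng(0).normal(size=20)
print(levenberg_marquardt(np.array([1.0, 1.0]), x, y))   # ~ [2.0, 1.5]
```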
J. R. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, Carnegie Mellon University, 1994.
(web, pdf)
Atilim Gunes Baydin et al., Automatic differentiation in machine learning: a survey, arXiv:1502.05767v4 [cs.SC], 2015.
Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply "autodiff", is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs. AD is a small but established field with applications in areas including computational fluid dynamics, atmospheric sciences, and engineering design optimization. Until very recently, the fields of machine learning and AD have largely been unaware of each other and, in some cases, have independently discovered each other's results. Despite its relevance, general-purpose AD has been missing from the machine learning toolbox, a situation slowly changing with its ongoing adoption under the names "dynamic computational graphs" and "differentiable programming". We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms "autodiff", "automatic differentiation", and "symbolic differentiation" as these are encountered more and more in machine learning settings.
Published as: Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind, Automatic differentiation in machine learning: a survey, The Journal of Machine Learning Research, 18(153):1-43, 2018.
(web, pdf)
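To make the distinction from symbolic and numerical differentiation concrete, here is a minimal sketch of forward-mode AD via dual numbers, one of the techniques the survey covers. The Dual class and the example function f are illustrative assumptions, not code from the paper.

```python
# Toy forward-mode automatic differentiation with dual numbers: each value
# carries (primal, derivative) and the chain rule is applied per operation,
# so an ordinary program is differentiated exactly, without symbolic algebra.
import math

class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x):
    return x * x + 3.0 * sin(x)     # an ordinary numeric program

x = Dual(2.0, 1.0)                  # seed dx/dx = 1
y = f(x)
print(y.val, y.dot)                 # f(2) and f'(2) = 2*2 + 3*cos(2)
```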
Simon Becker et al., Geometry of Energy Landscapes and the Optimizability of Deep Neural Networks, Physical Review Letters, vol. 124(10), p. 108301, 2020.
Deep neural networks are workhorse models in machine learning with multiple layers of nonlinear functions composed in series. Their loss function is highly nonconvex, yet empirically even gradient descent minimization is sufficient to arrive at accurate and predictive models. It is hitherto unknown why deep neural networks are easily optimizable. We analyze the energy landscape of a spin glass model of deep neural networks using random matrix theory and algebraic geometry. We analytically show that the multilayered structure holds the key to optimizability: Fixing the number of parameters and increasing network depth, the number of stationary points in the loss function decreases, minima become more clustered in parameter space, and the trade-off between the depth and width of minima becomes less severe. Our analytical results are numerically verified through comparison with neural networks trained on a set of classical benchmark datasets. Our model uncovers generic design principles of machine learning models.
(web, pdf)
Aurore Blelly et al., Stopping criteria, initialization, and implementations of BFGS and their effect on the BBOB test suite, Proceedings of the Genetic and Evolutionary Computation Conference Companion, New York, NY, USA, 2018, pp. 1513-1517.
(web, pdf)
R. W. Hasse, One-Dimensional Wave Packet Solutions of Time-Dependent Frictional or Optical Potential Schrödinger Equations, Comput. Phys. Commun., 11 (1976) 353-362.
(web, pdf)
Fred James, MINUIT Tutorial, 2006, pp. 1-44.
A large class of problems in many different fields of research can be reduced to the problem of finding the smallest value taken on by a function of one or more variable parameters. Examples come from fields as far apart as industrial processing (minimization of production costs) and general relativity (determination of geodesics by minimizing the path length between two points in curved space–time). But the classic example which occurs so often in scientific research is the estimation of unknown parameters in a theory by minimizing the difference (χ²) between theory and experimental data. In all these examples, the function to be minimized is determined by considerations proper to the particular field being investigated, which are not addressed here; the main goal is to study the problem of minimization itself.
(pdf)
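A hedged sketch of the canonical χ² fit described in the abstract: estimate model parameters by minimizing χ²(p) = Σ_i ((y_i − model(x_i, p)) / σ_i)². The straight-line model, data, and uncertainties below are made up for illustration, and scipy.optimize.minimize stands in for MINUIT's minimizer; in practice MINUIT itself is available in Python through the iminuit package.

```python
# Toy chi-square parameter estimation: fit a straight line to noisy data by
# minimizing chi2(p) = sum(((y - model(x, p)) / sigma)**2) over p = (a, b).
import numpy as np
from scipy.optimize import minimize

def model(x, p):
    a, b = p
    return a * x + b                         # the "theory" being fitted

def chi2(p, x, y, sigma):
    return np.sum(((y - model(x, p)) / sigma) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 25)
sigma = 0.3 * np.ones_like(x)
y = model(x, (2.0, -1.0)) + rng.normal(scale=sigma)

res = minimize(chi2, x0=[1.0, 0.0], args=(x, y, sigma))
print(res.x)     # fitted (a, b), close to (2.0, -1.0)
print(res.fun)   # chi2 at the minimum, roughly the number of degrees of freedom
```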
Diederik P. Kingma and Jimmy Ba, Adam: A Method for Stochastic Optimization, arXiv:1412.6980 [cs.LG], 2014.
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, requires little memory, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
(web, pdf)
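A minimal NumPy sketch of the update the abstract describes: exponentially decaying averages of the gradient and squared gradient with bias correction. The toy quadratic objective is an assumption for illustration; the default hyper-parameters are the paper's suggested values.

```python
# One Adam parameter update: first-moment (m) and second-moment (v) estimates
# of the gradient, bias-corrected, then a per-coordinate scaled step.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    grad = 2.0 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)   # close to [0, 0]
```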
R. B. Wu et al., Data-driven gradient algorithm for high-precision quantum control, Physical Review A.
In the quest to achieve scalable quantum information processing technologies, gradient-based optimal control algorithms (e.g., GRAPE) are broadly used for implementing high-precision quantum gates, but their performance is often hindered by deterministic or random errors in the system model and the control electronics. In this paper, we show that GRAPE can be taught to be more effective by jointly learning from the design model and the experimental data obtained from process tomography. The resulting data-driven gradient optimization algorithm (d-GRAPE) can in principle correct all deterministic gate errors, with a mild efficiency loss. The d-GRAPE algorithm may become more powerful with broadband controls that involve a large number of control parameters, while other algorithms usually slow down due to the increased size of the search space. These advantages are demonstrated by simulating the implementation of a two-qubit CNOT gate.
(web, pdf)
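For orientation, here is a toy sketch of the plain GRAPE loop that d-GRAPE builds on: gradient ascent on gate fidelity over piecewise-constant control amplitudes for a single qubit. The one-control σx Hamiltonian, the X-gate target, and the finite-difference gradients are simplifying assumptions; actual GRAPE uses analytic gradients, and the paper's d-GRAPE additionally corrects the design model with process-tomography data, neither of which is shown here.

```python
# Toy GRAPE-style loop: maximize gate fidelity over piecewise-constant
# control amplitudes u_k for H_k = u_k * sigma_x on a single qubit.
import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
U_target = np.array([[0, 1], [1, 0]], dtype=complex)   # target: X gate
N, dt = 10, 0.1                                        # time slices, slice length

def propagator(u):
    U = np.eye(2, dtype=complex)
    for uk in u:
        U = expm(-1j * uk * sx * dt) @ U               # piecewise-constant evolution
    return U

def fidelity(u):
    d = 2
    return np.abs(np.trace(U_target.conj().T @ propagator(u))) ** 2 / d ** 2

def grape(u, lr=1.0, iters=200, h=1e-6):
    for _ in range(iters):
        grad = np.zeros_like(u)
        for k in range(len(u)):                        # finite-difference gradient
            up = u.copy()
            up[k] += h
            grad[k] = (fidelity(up) - fidelity(u)) / h
        u = u + lr * grad                              # ascend the fidelity
    return u

u = grape(0.1 * np.ones(N))
print(fidelity(u))    # approaches 1.0 as the controls realize the X gate
```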