1.5. Scipy: high-level scientific computing
Authors: Adrien Chauve, Andre Espaze, Emmanuelle Gouillart, Gaël Varoquaux, Ralf Gommers
The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc.

scipy can be compared to other standard scientific-computing libraries, such as the GSL (GNU Scientific Library for C and C++), or Matlab's toolboxes. scipy is the core package for scientific routines in Python; it is meant to operate efficiently on numpy arrays, so that numpy and scipy work hand in hand.
Before implementing a routine, it is worth checking if the desired data processing is not already implemented in Scipy. As non-professional programmers, scientists often tend to re-invent the wheel, which leads to buggy, non-optimal, difficult-to-share and unmaintainable code. By contrast, Scipy's routines are optimized and tested, and should therefore be used when possible.
Chapters contents
- File input/output: scipy.io
- Special functions: scipy.special
- Linear algebra operations: scipy.linalg
- Fast Fourier transforms: scipy.fftpack
- Statistics and random numbers: scipy.stats
- Interpolation: scipy.interpolate
- Optimization and fit: scipy.optimize
- Numerical integration: scipy.integrate
- Summary exercises on scientific computing
Warning
This tutorial is far from an introduction to numerical computing.
As enumerating the different submodules and functions in scipy would be very boring, we concentrate instead on a few examples to give a general idea of how to use scipy for scientific computing.

scipy is composed of task-specific sub-modules:
scipy.cluster     | Vector quantization / Kmeans
scipy.constants   | Physical and mathematical constants
scipy.fftpack     | Fourier transform
scipy.integrate   | Integration routines
scipy.interpolate | Interpolation
scipy.io          | Data input and output
scipy.linalg      | Linear algebra routines
scipy.ndimage     | n-dimensional image package
scipy.odr         | Orthogonal distance regression
scipy.optimize    | Optimization
scipy.signal      | Signal processing
scipy.sparse      | Sparse matrices
scipy.spatial     | Spatial data structures and algorithms
scipy.special     | Any special mathematical functions
scipy.stats       | Statistics
Tip
They all depend on numpy, but are mostly independent of each other. The standard way of importing Numpy and these Scipy modules is:
>>> import numpy as np
>>> from scipy import stats # same for other sub-modules
The main scipy namespace mostly contains functions that are really numpy functions (try scipy.cos is np.cos). Those are exposed for historical reasons only; there's usually no reason to use import scipy in your code.
1.5.1. File input/output: scipy.io
Loading and saving matlab files:
>>> from scipy import io as spio
>>> a = np.ones((3, 3))
>>> spio.savemat('file.mat', {'a': a}) # savemat expects a dictionary
>>> data = spio.loadmat('file.mat', struct_as_record=True)
>>> data['a']
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
Reading images:
>>> from scipy import misc
>>> misc.imread('fname.png')
array(...)
>>> # Matplotlib also has a similar function
>>> import matplotlib.pyplot as plt
>>> plt.imread('fname.png')
array(...)
See also:
- Load text files: numpy.loadtxt() / numpy.savetxt()
- Clever loading of text/csv files: numpy.genfromtxt() / numpy.recfromcsv()
- Fast and efficient, but numpy-specific, binary format: numpy.save() / numpy.load()
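As a minimal sketch of that numpy-specific binary format (the file name 'data.npy' is just an illustration):
>>> data = np.arange(10)
>>> np.save('data.npy', data)   # write a fast, numpy-only binary file
>>> np.load('data.npy')         # read it back unchanged
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])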
1.5.2. Special functions: scipy.special
Special functions are transcendental functions. The docstring of the scipy.special module is well-written, so we won't list all functions here. Frequently used ones are:
- Bessel function, such as scipy.special.jn() (nth integer order Bessel function)
- Elliptic function (scipy.special.ellipj() for the Jacobian elliptic function, ...)
- Gamma function: scipy.special.gamma(), also note scipy.special.gammaln() which will give the log of Gamma to a higher numerical precision.
- Erf, the area under a Gaussian curve: scipy.special.erf()
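A few values that are easy to check by hand (a minimal sketch, not an exhaustive tour):
>>> from scipy import special
>>> special.gamma(5)   # Gamma(n) equals (n-1)! for integer n
24.0
>>> special.erf(0.)    # erf vanishes at the origin
0.0
>>> special.jn(0, 0.)  # the 0th order Bessel function equals 1 at 0
1.0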
1.5.3. Linear algebra operations: scipy.linalg
The scipy.linalg module provides standard linear algebra operations, relying on an underlying efficient implementation (BLAS, LAPACK).
The scipy.linalg.det() function computes the determinant of a square matrix:
>>> from scipy import linalg
>>> arr = np.array([[1, 2],
...                 [3, 4]])
>>> linalg.det(arr)
-2.0
>>> arr = np.array([[3, 2],
...                 [6, 4]])
>>> linalg.det(arr)
0.0
>>> linalg.det(np.ones((3, 4)))
Traceback (most recent call last):
...
ValueError: expected square matrix
The scipy.linalg.inv() function computes the inverse of a square matrix:
>>> arr = np.array([[1, 2],
...                 [3, 4]])
>>> iarr = linalg.inv(arr)
>>> iarr
array([[-2. ,  1. ],
       [ 1.5, -0.5]])
>>> np.allclose(np.dot(arr, iarr), np.eye(2))
True
Finally, computing the inverse of a singular matrix (its determinant is zero) will raise LinAlgError:
>>> arr = np.array([[3, 2],
...                 [6, 4]])
>>> linalg.inv(arr)
Traceback (most recent call last):
...
...LinAlgError: singular matrix
More advanced operations are available, for example singular-value decomposition (SVD):
>>> arr = np.arange(9).reshape((3, 3)) + np.diag([1, 0, 1])
>>> uarr, spec, vharr = linalg.svd(arr)
The resulting array spectrum is:
>>> spec
array([ 14.88982544,   0.45294236,   0.29654967])
The original matrix can be re-composed by matrix multiplication of the outputs of svd with np.dot:
>>> sarr = np.diag(spec)
>>> svd_mat = uarr.dot(sarr).dot(vharr)
>>> np.allclose(svd_mat, arr)
True
SVD is commonly used in statistics and signal processing. Many other standard decompositions (QR, LU, Cholesky, Schur), as well as solvers for linear systems, are available in scipy.linalg.
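For instance, a linear system A x = b can be solved directly with scipy.linalg.solve, which is preferable to forming the inverse explicitly (a minimal sketch; the matrix and right-hand side are made up for illustration):
>>> A = np.array([[3., 2.],
...               [1., -1.]])
>>> b = np.array([5., 0.])
>>> x = linalg.solve(A, b)       # solves A x = b via LAPACK
>>> np.allclose(np.dot(A, x), b)
True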
1.5.4. Fast Fourier transforms: scipy.fftpack
The scipy.fftpack module computes fast Fourier transforms. As an illustration, a (noisy) input signal may look like:
>>> time_step = 0.02
>>> period = 5.
>>> time_vec = np.arange(0, 20, time_step)
>>> sig = np.sin(2 * np.pi / period * time_vec) + \
... 0.5 * np.random.randn(time_vec.size)
The observer doesn't know the signal frequency, only the sampling time step of the signal sig. The signal is supposed to come from a real function, so the Fourier transform will be symmetric.
The scipy.fftpack.fftfreq() function will generate the sampling frequencies and scipy.fftpack.fft() will compute the fast Fourier transform:
>>> from scipy import fftpack
>>> sample_freq = fftpack.fftfreq(sig.size, d=time_step)
>>> sig_fft = fftpack.fft(sig)
Because the resulting power is symmetric, only the positive part of the spectrum needs to be used for finding the frequency:
>>> pidxs = np.where(sample_freq > 0)
>>> freqs = sample_freq[pidxs]
>>> power = np.abs(sig_fft)[pidxs]
The signal frequency can be found by:
>>> freq = freqs[power.argmax()]
>>> np.allclose(freq, 1./period) # check that correct freq is found
True
Now the high-frequency noise will be removed from the Fourier transformed signal:
>>> sig_fft[np.abs(sample_freq) > freq] = 0
The resulting filtered signal can be computed by the
scipy.fftpack.ifft()
function:
>>> main_sig = fftpack.ifft(sig_fft)
The result can be viewed with:
>>> import pylab as plt
>>> plt.figure()
<matplotlib.figure.Figure object at 0x...>
>>> plt.plot(time_vec, sig)
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.plot(time_vec, main_sig, linewidth=3)
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.xlabel('Time [s]')
<matplotlib.text.Text object at 0x...>
>>> plt.ylabel('Amplitude')
<matplotlib.text.Text object at 0x...>
numpy.fft
Numpy also has an implementation of FFT (numpy.fft). However, in general the scipy one should be preferred, as it uses more efficient underlying implementations.
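As a quick consistency check (a minimal sketch on the signal defined above), the two implementations agree:
>>> np.allclose(np.fft.fft(sig), fftpack.fft(sig))
True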
Worked example: Crude periodicity finding
Exercise: Denoise moon landing image
- Examine the provided image moonlanding.png, which is heavily contaminated with periodic noise. In this exercise, we aim to clean up the noise using the Fast Fourier Transform.
- Load the image using pylab.imread().
- Find and use the 2-D FFT function in scipy.fftpack, and plot the spectrum (Fourier transform) of the image. Do you have any trouble visualising the spectrum? If so, why?
- The spectrum consists of high and low frequency components. The noise is contained in the high-frequency part of the spectrum, so set some of those components to zero (use array slicing).
- Apply the inverse Fourier transform to see the resulting image (a minimal transform-pair sketch follows this exercise).
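As a starting point, and not a solution to the exercise, here is a sketch of the 2-D transform pair on a synthetic array; the slicing bounds are arbitrary placeholders:
>>> im = np.random.randn(64, 64)         # stand-in for the image
>>> im_fft = fftpack.fft2(im)            # 2-D FFT
>>> im_fft[10:54, :] = 0                 # zero out a block of frequencies
>>> im_new = fftpack.ifft2(im_fft).real  # back to image space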
1.5.5. Statistics and random numbers: scipy.stats
The module scipy.stats contains statistical tools and probabilistic descriptions of random processes. Random number generators for various random processes can be found in numpy.random.
1.5.5.1. Histogram and probability density function
Given observations of a random process, their histogram is an estimator of the random process’s PDF (probability density function):
>>> a = np.random.normal(size=1000)
>>> bins = np.arange(-4, 5)
>>> histogram = np.histogram(a, bins=bins, density=True)[0]
>>> bins = 0.5*(bins[1:] + bins[:-1])
>>> from scipy import stats
>>> bins_ = np.arange(-4, 5, 0.1)
>>> b = stats.norm.pdf(bins_) # norm is a distribution
>>> plt.plot(bins, histogram)
[<matplotlib.lines.Line2D object at ...>]
>>> plt.plot(bins_, b)
[<matplotlib.lines.Line2D object at ...>]
If we know that the random process belongs to a given family of random processes, such as normal processes, we can do a maximum-likelihood fit of the observations to estimate the parameters of the underlying distribution. Here we fit a normal process to the observed data:
>>> loc, std = stats.norm.fit(a)
>>> loc
0.0314345570...
>>> std
0.9778613090...
Exercise: Probability distributions
Generate 1000 random variates from a gamma distribution with a shape parameter of 1, then plot a histogram from those samples. Can you plot the pdf on top (it should match)?
Extra: the distributions have a number of useful methods. Explore them by reading the docstring or by using IPython tab completion. Can you find the shape parameter of 1 back by using the fit method on your random variates?
1.5.5.2. Percentiles
The median is the value with half of the observations below, and half above:
>>> np.median(a)
0.04041769593...
It is also called the percentile 50, because 50% of the observations are below it:
>>> stats.scoreatpercentile(a, 50)
0.0404176959...
Similarly, we can calculate the percentile 90:
>>> stats.scoreatpercentile(a, 90)
1.3185699120...
The percentile is an estimator of the CDF: cumulative distribution function.
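Note that numpy offers an equivalent function; as a quick cross-check (a minimal sketch):
>>> np.allclose(np.percentile(a, 90), stats.scoreatpercentile(a, 90))
True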
1.5.5.3. Statistical tests
A statistical test is a decision indicator. For instance, if we have two sets of observations that we assume are generated from Gaussian processes, we can use a T-test to decide whether the two sets of observations are significantly different:
>>> a = np.random.normal(0, 1, size=100)
>>> b = np.random.normal(1, 1, size=10)
>>> stats.ttest_ind(a, b)
(array(-3.177574054...), 0.0019370639...)
Tip
The resulting output is composed of:
- The T statistic value: its sign reflects which process has the larger mean, and its magnitude is related to the significance of the difference between the two random processes.
- The p value: the probability of observing a difference at least this large if the two processes were identical. The closer it is to zero, the more likely it is that the processes have different means.
See also
The chapter on statistics introduces much more elaborate tools for statistical testing and statistical data loading and visualization outside of scipy.
1.5.6. Interpolation: scipy.interpolate
The scipy.interpolate module is useful for fitting a function to experimental data and thus evaluating points where no measurement exists. The module is based on the FITPACK Fortran subroutines from the netlib project.
Imagine experimental data close to a sine function:
>>> measured_time = np.linspace(0, 1, 10)
>>> noise = (np.random.random(10)*2 - 1) * 1e-1
>>> measures = np.sin(2 * np.pi * measured_time) + noise
The scipy.interpolate.interp1d class can build a linear interpolation function:
>>> from scipy.interpolate import interp1d
>>> linear_interp = interp1d(measured_time, measures)
Then the linear_interp instance needs to be evaluated at the times of interest:
>>> computed_time = np.linspace(0, 1, 50)
>>> linear_results = linear_interp(computed_time)
A cubic interpolation can also be selected by providing the kind optional keyword argument:
>>> cubic_interp = interp1d(measured_time, measures, kind='cubic')
>>> cubic_results = cubic_interp(computed_time)
The results are now gathered in the following Matplotlib figure:
[Figure: measured points with linear and cubic interpolation]
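A minimal plotting sketch to reproduce such a figure (the marker and label choices are ours):
>>> plt.plot(measured_time, measures, 'o', label='measures')
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.plot(computed_time, linear_results, label='linear interp')
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.plot(computed_time, cubic_results, label='cubic interp')
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.legend()
<matplotlib.legend.Legend object at 0x...>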
scipy.interpolate.interp2d is similar to scipy.interpolate.interp1d, but for 2-D arrays. Note that for the interp family, the computed time must stay within the measured time range. See the summary exercise on Maximum wind speed prediction at the Sprogø station for a more advanced spline interpolation example.
1.5.7. Optimization and fit: scipy.optimize
Optimization is the problem of finding a numerical solution to a minimization or equality. The scipy.optimize module provides useful algorithms for function minimization (scalar or multi-dimensional), curve fitting and root finding.
>>> from scipy import optimize
Finding the minimum of a scalar function
Let’s define the following function:
>>> def f(x):
... return x**2 + 10*np.sin(x)
and plot it:
>>> x = np.arange(-10, 10, 0.1)
>>> plt.plot(x, f(x))
>>> plt.show()
This function has a global minimum around -1.3 and a local minimum around 3.8.
The general and efficient way to find a minimum for this function is to conduct a gradient descent starting from a given initial point. The BFGS algorithm is a good way of doing this:
>>> optimize.fmin_bfgs(f, 0)
Optimization terminated successfully.
Current function value: -7.945823
Iterations: 5
Function evaluations: 24
Gradient evaluations: 8
array([-1.30644003])
A possible issue with this approach is that, if the function has local minima, the algorithm may find one of them instead of the global minimum, depending on the initial point:
>>> optimize.fmin_bfgs(f, 3, disp=0)
array([ 3.83746663])
If we don’t know the neighborhood of the global minimum to choose the initial
point, we need to resort to costlier global optimization. To find the global
minimum, we use scipy.optimize.basinhopping()
(which combines a local
optimizer with stochastic sampling of starting points for the local optimizer):
New in version 0.12.0 of Scipy: basinhopping.
>>> optimize.basinhopping(f, 0)
nfev: 1725
minimization_failures: 0
fun: -7.9458233756152845
x: array([-1.30644001])
message: ['requested number of basinhopping iterations completed successfully']
njev: 575
nit: 100
Another available (but much less efficient) global optimizer is scipy.optimize.brute() (brute force optimization on a grid). More efficient algorithms for different classes of global optimization problems exist, but this is out of the scope of scipy. Some useful packages for global optimization are OpenOpt, IPOPT, PyGMO and PyEvolve.
Note
scipy used to contain the routine anneal; it was deprecated in SciPy 0.14.0 and removed in SciPy 0.16.0.
To find the local minimum, let's constrain the variable to the interval (0, 10) using scipy.optimize.fminbound():
>>> xmin_local = optimize.fminbound(f, 0, 10)
>>> xmin_local
3.8374671...
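scipy.optimize also provides root finding. A minimal sketch on the same function f (which has a root at 0 and another near -2.47):
>>> root = optimize.fsolve(f, 1)      # our initial guess is 1
>>> root
array([ 0.])
>>> root2 = optimize.fsolve(f, -2.5)  # another guess converges to the second root
>>> np.allclose(f(root2), 0)
True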
Note
Finding minima of functions is discussed in more detail in the advanced chapter: Mathematical optimization: finding minima of functions.
Exercise: 2-D minimization
The six-hump camelback function
f(x, y) = (4 - 2.1*x**2 + x**4/3)*x**2 + x*y + (4*y**2 - 4)*y**2
has multiple global and local minima. Find the global minima of this function.
Hints:
- Variables can be restricted to -2 < x < 2 and -1 < y < 1.
- Use numpy.meshgrid() and pylab.imshow() to find visually the regions.
- Use scipy.optimize.fmin_bfgs() or another multi-dimensional minimizer.
How many global minima are there, and what is the function value at those points? What happens for an initial guess of (x, y) = (0, 0)?
See the summary exercise on Non linear least squares curve fitting: application to point extraction in topographical lidar data for another, more advanced example.
1.5.8. Numerical integration: scipy.integrate
The most generic integration routine is scipy.integrate.quad():
>>> from scipy.integrate import quad
>>> res, err = quad(np.sin, 0, np.pi/2)
>>> np.allclose(res, 1)
True
>>> np.allclose(err, 1 - res)
True
Other integration schemes are available: fixed_quad, quadrature, romberg.
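For example, fixed-order Gaussian quadrature (a minimal sketch; fixed_quad returns the integral estimate and None):
>>> from scipy.integrate import fixed_quad
>>> val, _ = fixed_quad(np.sin, 0, np.pi/2, n=5)  # 5-point Gaussian quadrature
>>> np.allclose(val, 1)
True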
scipy.integrate also features routines for integrating Ordinary Differential Equations (ODE). In particular, scipy.integrate.odeint() is a general-purpose integrator using LSODA (Livermore Solver for Ordinary Differential equations with Automatic method switching for stiff and non-stiff problems); see the ODEPACK Fortran library for more details.
odeint solves first-order ODE systems of the form:
dy/dt = rhs(y1, y2, .., t0,...)
As an introduction, let us solve the ODE dy/dt = -2y between t = 0 and t = 4, with the initial condition y(t=0) = 1. First the function computing the derivative of the position needs to be defined:
>>> def calc_derivative(ypos, time, counter_arr):
... counter_arr += 1
... return -2 * ypos
...
An extra argument counter_arr has been added to illustrate that the function may be called several times for a single time step, until solver convergence. The counter array is defined as:
>>> counter = np.zeros((1,), dtype=np.uint16)
The trajectory will now be computed:
>>> from scipy.integrate import odeint
>>> time_vec = np.linspace(0, 4, 40)
>>> yvec, info = odeint(calc_derivative, 1, time_vec,
... args=(counter,), full_output=True)
Thus the derivative function has been called more than 40 times (which was the number of time steps):
>>> counter
array([129], dtype=uint16)
and the cumulative number of iterations for each of the first 10 time steps can be obtained by:
>>> info['nfe'][:10]
array([31, 35, 43, 49, 53, 57, 59, 63, 65, 69], dtype=int32)
Note that the solver requires more iterations for the first time step.
The solution yvec for the trajectory can now be plotted:
[Figure: computed trajectory yvec over time_vec]
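Since this ODE has the exact solution y(t) = exp(-2 t), the numerical trajectory can also be checked against it (a quick sketch; the tolerance is a loose, arbitrary choice):
>>> np.allclose(yvec.ravel(), np.exp(-2 * time_vec), atol=1e-4)
True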
Another example with scipy.integrate.odeint() will be a damped spring-mass oscillator (2nd order oscillator).
The position of a mass attached to a spring obeys the 2nd order ODE y'' + 2 eps omega_0 y' + omega_0**2 y = 0, with omega_0**2 = k/m, where k is the spring constant, m the mass, and eps = c / (2 m omega_0), with c the damping coefficient.
For this example, we choose the parameters as:
>>> mass = 0.5 # kg
>>> kspring = 4 # N/m
>>> cviscous = 0.4 # N s/m
so the system will be underdamped, because:
>>> eps = cviscous / (2 * mass * np.sqrt(kspring/mass))
>>> eps < 1
True
For the scipy.integrate.odeint() solver, the 2nd order equation needs to be transformed into a system of two first-order equations for the vector Y = (y, y'). It will be convenient to define nu = 2 eps omega_0 = c / m and Omega = omega_0**2 = k / m:
>>> nu_coef = cviscous / mass # nu
>>> om_coef = kspring / mass # Omega
Thus the function will calculate the velocity and acceleration by:
>>> def calc_deri(yvec, time, nu, om):
... return (yvec[1], -nu * yvec[1] - om * yvec[0])
...
>>> time_vec = np.linspace(0, 10, 100)
>>> yinit = (1, 0)
>>> yarr = odeint(calc_deri, yinit, time_vec, args=(nu_coef, om_coef))
The final position and velocity are shown in the following Matplotlib figure:
[Figure: position and velocity of the damped oscillator]
These two examples were only Ordinary Differential Equations (ODE). However, there is no Partial Differential Equations (PDE) solver in Scipy. Some Python packages for solving PDEs are available, such as fipy or SfePy.
1.5.9. Summary exercises on scientific computing
The summary exercises use mainly Numpy, Scipy and Matplotlib. They provide some real-life examples of scientific computing with Python. Now that the basics of working with Numpy and Scipy have been introduced, the interested user is invited to try these exercises.
Exercises: