1.5. Scipy : high-level scientific computing

Authors: Adrien Chauve, Andre Espaze, Emmanuelle Gouillart, Gaël Varoquaux, Ralf Gommers

Scipy

The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc.

scipy can be compared to other standard scientific-computing libraries, such as the GSL (GNU Scientific Library for C and C++), or Matlab’s toolboxes. scipy is the core package for scientific routines in Python; it is meant to operate efficiently on numpy arrays, so that numpy and scipy work hand in hand.

Before implementing a routine, it is worth checking if the desired data processing is not already implemented in Scipy. As non-professional programmers, scientists often tend to re-invent the wheel, which leads to buggy, non-optimal, difficult-to-share and unmaintainable code. By contrast, Scipy‘s routines are optimized and tested, and should therefore be used when possible.

Warning

This tutorial is far from an introduction to numerical computing. As enumerating the different submodules and functions in scipy would be very boring, we concentrate instead on a few examples to give a general idea of how to use scipy for scientific computing.

scipy is composed of task-specific sub-modules:

scipy.cluster Vector quantization / Kmeans
scipy.constants Physical and mathematical constants
scipy.fftpack Fourier transform
scipy.integrate Integration routines
scipy.interpolate Interpolation
scipy.io Data input and output
scipy.linalg Linear algebra routines
scipy.ndimage n-dimensional image package
scipy.odr Orthogonal distance regression
scipy.optimize Optimization
scipy.signal Signal processing
scipy.sparse Sparse matrices
scipy.spatial Spatial data structures and algorithms
scipy.special Any special mathematical functions
scipy.stats Statistics

Tip

They all depend on numpy, but are mostly independent of each other. The standard way of importing Numpy and these Scipy modules is:

>>> import numpy as np
>>> from scipy import stats # same for other sub-modules

The main scipy namespace mostly contains functions that are really numpy functions (try scipy.cos is np.cos). Those are exposed for historical reasons only; there’s usually no reason to use import scipy in your code.

1.5.1. File input/output: scipy.io

  • Loading and saving matlab files:

    >>> from scipy import io as spio
    
    >>> a = np.ones((3, 3))
    >>> spio.savemat('file.mat', {'a': a}) # savemat expects a dictionary
    >>> data = spio.loadmat('file.mat', struct_as_record=True)
    >>> data['a']
    array([[ 1., 1., 1.],
    [ 1., 1., 1.],
    [ 1., 1., 1.]])
  • Reading images:

    >>> from scipy import misc
    
    >>> misc.imread('fname.png')
    array(...)
    >>> # Matplotlib also has a similar function
    >>> import matplotlib.pyplot as plt
    >>> plt.imread('fname.png')
    array(...)

See also:

1.5.2. Special functions: scipy.special

Special functions are transcendental functions. The docstring of the scipy.special module is well-written, so we won’t list all functions here. Frequently used ones are:

  • Bessel function, such as scipy.special.jn() (nth integer order Bessel function)
  • Elliptic function (scipy.special.ellipj() for the Jacobian elliptic function, ...)
  • Gamma function: scipy.special.gamma(), also note scipy.special.gammaln() which will give the log of Gamma to a higher numerical precision.
  • Erf, the area under a Gaussian curve: scipy.special.erf()

1.5.3. Linear algebra operations: scipy.linalg

The scipy.linalg module provides standard linear algebra operations, relying on an underlying efficient implementation (BLAS, LAPACK).

  • The scipy.linalg.det() function computes the determinant of a square matrix:

    >>> from scipy import linalg
    
    >>> arr = np.array([[1, 2],
    ... [3, 4]])
    >>> linalg.det(arr)
    -2.0
    >>> arr = np.array([[3, 2],
    ... [6, 4]])
    >>> linalg.det(arr)
    0.0
    >>> linalg.det(np.ones((3, 4)))
    Traceback (most recent call last):
    ...
    ValueError: expected square matrix
  • The scipy.linalg.inv() function computes the inverse of a square matrix:

    >>> arr = np.array([[1, 2],
    
    ... [3, 4]])
    >>> iarr = linalg.inv(arr)
    >>> iarr
    array([[-2. , 1. ],
    [ 1.5, -0.5]])
    >>> np.allclose(np.dot(arr, iarr), np.eye(2))
    True

    Finally computing the inverse of a singular matrix (its determinant is zero) will raise LinAlgError:

    >>> arr = np.array([[3, 2],
    
    ... [6, 4]])
    >>> linalg.inv(arr)
    Traceback (most recent call last):
    ...
    ...LinAlgError: singular matrix
  • More advanced operations are available, for example singular-value decomposition (SVD):

    >>> arr = np.arange(9).reshape((3, 3)) + np.diag([1, 0, 1])
    
    >>> uarr, spec, vharr = linalg.svd(arr)

    The resulting array spectrum is:

    >>> spec    
    
    array([ 14.88982544, 0.45294236, 0.29654967])

    The original matrix can be re-composed by matrix multiplication of the outputs of svd with np.dot:

    >>> sarr = np.diag(spec)
    
    >>> svd_mat = uarr.dot(sarr).dot(vharr)
    >>> np.allclose(svd_mat, arr)
    True

    SVD is commonly used in statistics and signal processing. Many other standard decompositions (QR, LU, Cholesky, Schur), as well as solvers for linear systems, are available in scipy.linalg.

1.5.4. Fast Fourier transforms: scipy.fftpack

The scipy.fftpack module allows to compute fast Fourier transforms. As an illustration, a (noisy) input signal may look like:

>>> time_step = 0.02
>>> period = 5.
>>> time_vec = np.arange(0, 20, time_step)
>>> sig = np.sin(2 * np.pi / period * time_vec) + \
... 0.5 * np.random.randn(time_vec.size)

The observer doesn’t know the signal frequency, only the sampling time step of the signal sig. The signal is supposed to come from a real function so the Fourier transform will be symmetric. The scipy.fftpack.fftfreq() function will generate the sampling frequencies and scipy.fftpack.fft() will compute the fast Fourier transform:

>>> from scipy import fftpack
>>> sample_freq = fftpack.fftfreq(sig.size, d=time_step)
>>> sig_fft = fftpack.fft(sig)

Because the resulting power is symmetric, only the positive part of the spectrum needs to be used for finding the frequency:

>>> pidxs = np.where(sample_freq > 0)
>>> freqs = sample_freq[pidxs]
>>> power = np.abs(sig_fft)[pidxs]

[source code, hires.png, pdf]

../_images/fftpack_frequency.png

The signal frequency can be found by:

>>> freq = freqs[power.argmax()]
>>> np.allclose(freq, 1./period) # check that correct freq is found
True

Now the high-frequency noise will be removed from the Fourier transformed signal:

>>> sig_fft[np.abs(sample_freq) > freq] = 0

The resulting filtered signal can be computed by the scipy.fftpack.ifft() function:

>>> main_sig = fftpack.ifft(sig_fft)

The result can be viewed with:

>>> import pylab as plt
>>> plt.figure()
<matplotlib.figure.Figure object at 0x...>
>>> plt.plot(time_vec, sig)
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.plot(time_vec, main_sig, linewidth=3)
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.xlabel('Time [s]')
<matplotlib.text.Text object at 0x...>
>>> plt.ylabel('Amplitude')
<matplotlib.text.Text object at 0x...>

[source code, hires.png, pdf]

../_images/fftpack_signals.png

numpy.fft

Numpy also has an implementation of FFT (numpy.fft). However, in general the scipy one should be preferred, as it uses more efficient underlying implementations.

Worked example: Crude periodicity finding

[source code, hires.png, pdf]

../_images/periodicity_finder_00.png

[source code, hires.png, pdf]

../_images/periodicity_finder_01.png

Worked example: Gaussian image blur

Convolution:

f_1(t) = \int dt'\, K(t-t') f_0(t')

\tilde{f}_1(\omega) = \tilde{K}(\omega) \tilde{f}_0(\omega)

[source code, hires.png, pdf]

../_images/image_blur.png

Exercise: Denoise moon landing image

../_images/moonlanding.png
  1. Examine the provided image moonlanding.png, which is heavily contaminated with periodic noise. In this exercise, we aim to clean up the noise using the Fast Fourier Transform.
  2. Load the image using pylab.imread().
  3. Find and use the 2-D FFT function in scipy.fftpack, and plot the spectrum (Fourier transform of) the image. Do you have any trouble visualising the spectrum? If so, why?
  4. The spectrum consists of high and low frequency components. The noise is contained in the high-frequency part of the spectrum, so set some of those components to zero (use array slicing).
  5. Apply the inverse Fourier transform to see the resulting image.

1.5.5. Statistics and random numbers: scipy.stats

The module scipy.stats contains statistical tools and probabilistic descriptions of random processes. Random number generators for various random process can be found in numpy.random.

1.5.5.1. Histogram and probability density function

Given observations of a random process, their histogram is an estimator of the random process’s PDF (probability density function):

>>> a = np.random.normal(size=1000)
>>> bins = np.arange(-4, 5)
>>> histogram = np.histogram(a, bins=bins, normed=True)[0]
>>> bins = 0.5*(bins[1:] + bins[:-1])
>>> from scipy import stats
>>> bins_ = np.arange(-4, 5, 0.1)
>>> b = stats.norm.pdf(bins_) # norm is a distribution
>>> plt.plot(bins, histogram)
[<matplotlib.lines.Line2D object at ...>]
>>> plt.plot(bins_, b)
[<matplotlib.lines.Line2D object at ...>]

[source code, hires.png, pdf]

../_images/normal_distribution.png

If we know that the random process belongs to a given family of random processes, such as normal processes, we can do a maximum-likelihood fit of the observations to estimate the parameters of the underlying distribution. Here we fit a normal process to the observed data:

>>> loc, std = stats.norm.fit(a)
>>> loc
0.0314345570...
>>> std
0.9778613090...

Exercise: Probability distributions

Generate 1000 random variates from a gamma distribution with a shape parameter of 1, then plot a histogram from those samples. Can you plot the pdf on top (it should match)?

Extra: the distributions have a number of useful methods. Explore them by reading the docstring or by using IPython tab completion. Can you find the shape parameter of 1 back by using the fit method on your random variates?

1.5.5.2. Percentiles

The median is the value with half of the observations below, and half above:

>>> np.median(a)     
0.04041769593...

It is also called the percentile 50, because 50% of the observation are below it:

>>> stats.scoreatpercentile(a, 50)     
0.0404176959...

Similarly, we can calculate the percentile 90:

>>> stats.scoreatpercentile(a, 90)     
1.3185699120...

The percentile is an estimator of the CDF: cumulative distribution function.

1.5.5.3. Statistical tests

A statistical test is a decision indicator. For instance, if we have two sets of observations, that we assume are generated from Gaussian processes, we can use a T-test to decide whether the two sets of observations are significantly different:

>>> a = np.random.normal(0, 1, size=100)
>>> b = np.random.normal(1, 1, size=10)
>>> stats.ttest_ind(a, b)
(array(-3.177574054...), 0.0019370639...)

Tip

The resulting output is composed of:

  • The T statistic value: it is a number the sign of which is proportional to the difference between the two random processes and the magnitude is related to the significance of this difference.
  • the p value: the probability of both processes being identical. If it is close to 1, the two process are almost certainly identical. The closer it is to zero, the more likely it is that the processes have different means.

See also

The chapter on statistics introduces much more elaborate tools for statistical testing and statistical data loading and visualization outside of scipy.

1.5.6. Interpolation: scipy.interpolate

The scipy.interpolate is useful for fitting a function from experimental data and thus evaluating points where no measure exists. The module is based on the FITPACK Fortran subroutines from the netlib project.

By imagining experimental data close to a sine function:

>>> measured_time = np.linspace(0, 1, 10)
>>> noise = (np.random.random(10)*2 - 1) * 1e-1
>>> measures = np.sin(2 * np.pi * measured_time) + noise

The scipy.interpolate.interp1d class can build a linear interpolation function:

>>> from scipy.interpolate import interp1d
>>> linear_interp = interp1d(measured_time, measures)

Then the scipy.interpolate.linear_interp instance needs to be evaluated at the time of interest:

>>> computed_time = np.linspace(0, 1, 50)
>>> linear_results = linear_interp(computed_time)

A cubic interpolation can also be selected by providing the kind optional keyword argument:

>>> cubic_interp = interp1d(measured_time, measures, kind='cubic')
>>> cubic_results = cubic_interp(computed_time)

The results are now gathered on the following Matplotlib figure:

[source code, hires.png, pdf]

../_images/scipy_interpolation.png

scipy.interpolate.interp2d is similar to scipy.interpolate.interp1d, but for 2-D arrays. Note that for the interp family, the computed time must stay within the measured time range. See the summary exercise on Maximum wind speed prediction at the Sprogø station for a more advance spline interpolation example.

1.5.7. Optimization and fit: scipy.optimize

Optimization is the problem of finding a numerical solution to a minimization or equality.

The scipy.optimize module provides useful algorithms for function minimization (scalar or multi-dimensional), curve fitting and root finding.

>>> from scipy import optimize

Finding the minimum of a scalar function

Let’s define the following function:

>>> def f(x):
... return x**2 + 10*np.sin(x)

and plot it:

>>> x = np.arange(-10, 10, 0.1)
>>> plt.plot(x, f(x))
>>> plt.show()

[source code, hires.png, pdf]

../_images/scipy_optimize_example1.png

This function has a global minimum around -1.3 and a local minimum around 3.8.

The general and efficient way to find a minimum for this function is to conduct a gradient descent starting from a given initial point. The BFGS algorithm is a good way of doing this:

>>> optimize.fmin_bfgs(f, 0)
Optimization terminated successfully.
Current function value: -7.945823
Iterations: 5
Function evaluations: 24
Gradient evaluations: 8
array([-1.30644003])

A possible issue with this approach is that, if the function has local minima the algorithm may find these local minima instead of the global minimum depending on the initial point:

>>> optimize.fmin_bfgs(f, 3, disp=0)
array([ 3.83746663])

If we don’t know the neighborhood of the global minimum to choose the initial point, we need to resort to costlier global optimization. To find the global minimum, we use scipy.optimize.basinhopping() (which combines a local optimizer with stochastic sampling of starting points for the local optimizer):

New in version 0.12.0: basinhopping was added in version 0.12.0 of Scipy

>>> optimize.basinhopping(f, 0)  
nfev: 1725
minimization_failures: 0
fun: -7.9458233756152845
x: array([-1.30644001])
message: ['requested number of basinhopping iterations completed successfully']
njev: 575
nit: 100

Another available (but much less efficient) global optimizer is scipy.optimize.brute() (brute force optimization on a grid). More efficient algorithms for different classes of global optimization problems exist, but this is out of the scope of scipy. Some useful packages for global optimization are OpenOpt, IPOPT, PyGMO and PyEvolve.

Note

scipy used to contain the routine anneal, it has been deprecated since SciPy 0.14.0 and removed in SciPy 0.16.0.

To find the local minimum, let’s constraint the variable to the interval (0, 10) using scipy.optimize.fminbound():

>>> xmin_local = optimize.fminbound(f, 0, 10)
>>> xmin_local
3.8374671...

Note

Finding minima of function is discussed in more details in the advanced chapter: Mathematical optimization: finding minima of functions.

Exercise: 2-D minimization

[source code, hires.png, pdf]

../_images/scipy_optimize_sixhump.png

The six-hump camelback function

f(x, y) = (4 - 2.1x^2 + \frac{x^4}{3})x^2 + xy + (4y^2 - 4)y^2

has multiple global and local minima. Find the global minima of this function.

Hints:

How many global minima are there, and what is the function value at those points? What happens for an initial guess of (x, y) = (0, 0) ?

See the summary exercise on Non linear least squares curve fitting: application to point extraction in topographical lidar data for another, more advanced example.

1.5.8. Numerical integration: scipy.integrate

The most generic integration routine is scipy.integrate.quad():

>>> from scipy.integrate import quad
>>> res, err = quad(np.sin, 0, np.pi/2)
>>> np.allclose(res, 1)
True
>>> np.allclose(err, 1 - res)
True

Others integration schemes are available with fixed_quad, quadrature, romberg.

scipy.integrate also features routines for integrating Ordinary Differential Equations (ODE). In particular, scipy.integrate.odeint() is a general-purpose integrator using LSODA (Livermore Solver for Ordinary Differential equations with Automatic method switching for stiff and non-stiff problems), see the ODEPACK Fortran library for more details.

odeint solves first-order ODE systems of the form:

dy/dt = rhs(y1, y2, .., t0,...)

As an introduction, let us solve the ODE \frac{dy}{dt} = -2 y between t = 0 \dots 4, with the initial condition y(t=0) = 1. First the function computing the derivative of the position needs to be defined:

>>> def calc_derivative(ypos, time, counter_arr):
... counter_arr += 1
... return -2 * ypos
...

An extra argument counter_arr has been added to illustrate that the function may be called several times for a single time step, until solver convergence. The counter array is defined as:

>>> counter = np.zeros((1,), dtype=np.uint16)

The trajectory will now be computed:

>>> from scipy.integrate import odeint
>>> time_vec = np.linspace(0, 4, 40)
>>> yvec, info = odeint(calc_derivative, 1, time_vec,
... args=(counter,), full_output=True)

Thus the derivative function has been called more than 40 times (which was the number of time steps):

>>> counter
array([129], dtype=uint16)

and the cumulative number of iterations for each of the 10 first time steps can be obtained by:

>>> info['nfe'][:10]
array([31, 35, 43, 49, 53, 57, 59, 63, 65, 69], dtype=int32)

Note that the solver requires more iterations for the first time step. The solution yvec for the trajectory can now be plotted:

Another example with scipy.integrate.odeint() will be a damped spring-mass oscillator (2nd order oscillator). The position of a mass attached to a spring obeys the 2nd order ODE y'' + 2 \varepsilon \omega_0  y' + \omega_0^2 y = 0 with \omega_0^2 = k/m with k the spring constant, m the mass and \varepsilon = c/(2 m \omega_0) with c the damping coefficient. For this example, we choose the parameters as:

>>> mass = 0.5  # kg
>>> kspring = 4 # N/m
>>> cviscous = 0.4 # N s/m

so the system will be underdamped, because:

>>> eps = cviscous / (2 * mass * np.sqrt(kspring/mass))
>>> eps < 1
True

For the scipy.integrate.odeint() solver the 2nd order equation needs to be transformed in a system of two first-order equations for the vector Y = (y, y'). It will be convenient to define \nu = 2 \varepsilon * \omega_0 = c / m and \Omega = \omega_0^2 = k/m:

>>> nu_coef = cviscous / mass  # nu
>>> om_coef = kspring / mass # Omega

Thus the function will calculate the velocity and acceleration by:

>>> def calc_deri(yvec, time, nu, om):
... return (yvec[1], -nu * yvec[1] - om * yvec[0])
...
>>> time_vec = np.linspace(0, 10, 100)
>>> yinit = (1, 0)
>>> yarr = odeint(calc_deri, yinit, time_vec, args=(nu_coef, om_coef))

The final position and velocity are shown on the following Matplotlib figure:

[source code, hires.png, pdf]

../_images/odeint_damped_spring_mass.png

These two examples were only Ordinary Differential Equations (ODE). However, there is no Partial Differential Equations (PDE) solver in Scipy. Some Python packages for solving PDE’s are available, such as fipy or SfePy.

1.5.9. Summary exercises on scientific computing

The summary exercises use mainly Numpy, Scipy and Matplotlib. They provide some real-life examples of scientific computing with Python. Now that the basics of working with Numpy and Scipy have been introduced, the interested user is invited to try these exercises.

Exercises: