# Preprocessing

Digitised signals in their raw form may contain errors or gaps, and may contain more detail than
is actually useful. Before performing any processing, it is sometimes useful to perform
*preprocessing* in order to clean the data.

## Outliers

Some sensors, such as the sensitive instruments used to measure gas exchange, can produce
erroneous large values in response to electromagnetic disturbances or other external influences.
These spurious values, known as *spikes* or *outliers* can bias the overall signal and have a
damaging effect on any later processing. The typical solution is to remove them; however, this
raises the question of how big a spike needs to be before it is removed. Although the answer may
vary from one application to another, a typical rule is to classify any value which deviates
from the mean by more than three standard deviations as a spike. This is illustrated in Figure 4
where the mean is represented by the green dashed line and the orange dashed line corresponds
to the mean plus three times the standard deviation. There is one obvious spike which should be
removed. If it is important to keep the row so that no timesteps are lost, the spike value could
be replaced with the mean of the preceding and following values.

## Gaps

Gaps may appear in time series data for many reasons including sensor faults. Various strategies
can be employed for dealing with them, the simplest being to *forward fill* or *backward fill*.
In a forward-fill strategy, the previous measured value is used to fill the gap, and in a
backward-fill strategy, the next valid measured value is used. Forward filling is easy to
implement in a microprocessor sketch, and represents a simple choice of approach.

For more accuracy, an *interpolation* approach can be employed. With this strategy, the missing
values are calculated based on the last valid value before the gap and the first valid value
following the gap. Different mathematical functions can be used, but the simplest choice is
*linear interpolation* where the difference between the two known limits is divided by the
number of missing values plus one to find the change in value per timestep. The missing values are
then calculated by successively adding this value to the previous one. For example, if a value
of 10 is observed followed by three missing values and then 14, the step value is
\((14-10)/(3+1)=1\). Each missing value is therefore calculated by adding 1 to the preceding value.

## Smoothing

It is sometimes inconvenient if a signal has many abrupt peaks and troughs. Often, the quantity
of interest is actually the underlying trend rather than the fine detail. A good way to smooth
a choppy signal is to use a *moving average*. This involves sliding a window of fixed size over
original data and replacing the centre value with the mean of all the values that fall within the
window. Figure 5 illustrates the use of Excel to calculate a moving average with window size of 15
(MA_{15}) for a set of sea surface temperature readings.

When calculating a moving average, a little data is lost from the start and the end of the original range. The effect of this smoothing operation is shown in Figure 6 where the blue curve represents the orginal data, and the orange curve is the moving average. The larger the window, the more the original curve is smoothed.