Imputation

Overview

Imputation Scenarios

MPSTime supports univariate time-series imputation with three key patterns of missing data:

  1. Individual missing points (e.g., values missing at $t = 10, 20, 80$)
  2. Single contiguous blocks (e.g., values missing from $t = 25-70$)
  3. Multiple contiguous blocks (e.g., values missing from $t = 5-10$, $t = 25-50$ and $t = 80-90$)

MPSTime can also handle any combination of these patterns. For instance, you might need to impute a single contiguous block from $t = 10-30$, plus individual missing points at $t = 50$ and $t=80$.

Setup

The first step is to train an MPS. Here, we'll train an MPS in an unsupervised manner (no class labels) using an adapted version of the noisy trendy sinusoid dataset from the Classification tutorial.

using MPSTime, Random 
rng = Xoshiro(1); # fix rng seed
ntimepoints = 100; # specify number of samples per instance.
ntrain_instances = 300; # specify num training instances
ntest_instances = 200; # specify num test instances
X_train = trendy_sine(ntimepoints, ntrain_instances; sigma=0.2, slope=3, period=15, rng=rng)[1];
X_test = trendy_sine(ntimepoints, ntest_instances; sigma=0.2, slope=3, period=15, rng=rng)[1];
# hyper parameters and training
opts = MPSOptions(
    d=12, 
    chi_max=40, 
    sigmoid_transform=false # disabling preprocessing generally improves imputation performance.
) 
mps, info, test_states = fitMPS(X_train, opts);

Next, we initialize an imputation problem. This does a lot of necessary pre-computation:

imp = init_imputation_problem(mps, X_test)
Initialising train states.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
                         Summary:

 - Dataset has 300 training samples and 200 testing samples.
Slicing MPS into individual states...
 - 1 class(es) were detected.
 - Time independent encoding - Legendre - detected.
 - d = 12, chi_max = 40
Re-encoding the training data to get the encoding arguments...

 Created 1 ImputationProblem struct(s) containing class-wise mps and test samples.

A summary of the imputation problem setup is printed to verify the model parameters and dataset information. For multi-class data, you can pass y_test to init_imputation_problem in order to exploit the labels / class information while doing imputation.

Imputing missing values

Floating Point Error

Depending on the dataset, the results of fitMPS can be noticeably affected by what machine it is running on. If you're trying to replicate these tutorials, expect an average uncertainty of 1-2%. In rare cases, it can go up to 5%-10% for single imputation windows in particularly large datasets. We strongly recommend either using higher precision computing (pass dtype=BigFloat or Complex{BigFloat} to MPSOptions), or the evaluate function to resample your data and average the result. This is generally not significant for scientific computing applications as for real word datasets, the floating point error of up to a few percent is much less than the resampling error caused by choosing different train/test splits.

Single-block Imputation

Now, decide how you want to impute the missing data. The necessary options are:

  • class::Integer: The class of the time-series instance we are going to impute, leave as zero for "unlabelled" data (i.e., all data belong to the same class).
  • impute_sites: The MPS sites (time points) that are missing (inclusive).
  • instance_idx: The time-series instance from the chosen class in the test set.
  • method: The imputation method to use. Can be trajectories (ITS), median, mode, mean, etc...

In this example, we will consider a single block of contiguous missing values, simulated from a missing-at-random mechanism (MAR). We will use the median to impute the missing values, as well as computing a 1-Nearest Neighbor Imputation (1-NNI) benchmark for comparison:

rng = Xoshiro(42) # Fix RNG
class = 0
pm = 0.8 # 80% missing data
instance_idx = 1 # pick a time series instance in test set
_, impute_sites = mar(X_test[instance_idx, :], pm; rng=rng) # simulate MAR mechanism
method = :median

imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
    imp,
    class,
    instance_idx,
    impute_sites,
    method;
    NN_baseline=true, # whether to also do a baseline imputation using the (first) Nearest Neighbour benchmark
    plot_fits=true, # whether to plot the fits
    get_wmad=true # when method=:median, this uses the Weighted Median Absolute Deviation (WMAD) to compute the prediction error.
);

# Pretty-print the stats
using PrettyTables
pretty_table(stats[1]; header=["Metric", "Value"], header_crayon=crayon"yellow bold", tf=tf_unicode_rounded);
╭─────────┬──────────╮
│  Metric │    Value │
├─────────┼──────────┤
│     MAE │ 0.182723 │
│    MAPE │ 0.392298 │
│  NN_MAE │ 0.229749 │
│ NN_MAPE │ 0.484372 │
╰─────────┴──────────╯

Several outputs are returned from MPS_impute:

  • imputed_ts: The imputed time-series instance, containing the original data points and the predicted values.
  • pred_err: The prediction error for each imputed value, given a known ground-truth.
  • target_ts: The original time-series instance containing missing values.
  • stats: A collection of statistical metrics (MAE and MAPE) evaluating imputation performance with respect to a ground truth. Includes benchmark performance when NN_baseline=true.
  • plots: Stores plot object(s) in an array for visualization when plot_fits=true.

The MAE and MAPE in the 'stats' table are the Mean Absolute Error and Mean Absolute Percentage Error for the MPS prediction, while the NN_ prefix corresponds to the same errors for the 1-Nearest Neighbours benchmark. In this case, MAPE is an unreliable measure of error as the data goes through zero.

To plot the imputed time series, we can call the plot function as follows:

using Plots
plots[1]

The solid orange line depicts the "ground-truth" (observed) time-series values, the dotted blue line is the MPS-imputed data points and the dotted red line is the 1-NN benchmark. The blue shading indicates the uncertainty due to encoding error.

There are a lot of other options, and many more imputation methods to choose from! See MPS_impute for more details.

Multi-block Imputation

Building on the previous example of single-block imputation, MPSTime can also be used to impute missing values in multiple blocks of contiguous points. For example, consider missing points between $t = 10-25$, $t = 40-60$ and $t = 75-90$:

class = 0
impute_sites = vcat(collect(10:25), collect(40:60), collect(65:90))
instance_idx = 32
method = :median

imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
    imp,
    class,
    instance_idx,
    impute_sites,
    method;
    NN_baseline=true, # whether to also do a baseline imputation using the (first) Nearest Neighbour benchmark
    plot_fits=true, # whether to plot the fits
    get_wmad=true,
);
pretty_table(stats[1]; header=["Metric", "Value"], header_crayon=crayon"yellow bold", tf=tf_unicode_rounded);
╭─────────┬──────────╮
│  Metric │    Value │
├─────────┼──────────┤
│     MAE │ 0.199318 │
│    MAPE │ 0.222385 │
│  NN_MAE │ 0.276009 │
│ NN_MAPE │ 0.377745 │
╰─────────┴──────────╯
plots[1]

Individual Point Imputation

To impute individual points rather than ranges of consecutive points (blocks), we can simply pass their respective time points into the imputation function as a vector:

julia> impute_sites = [10]; # only impute t = 10
julia> impute_sites = [10, 25, 50]; # impute multiple individual points

Plotting Trajectories

To plot individual trajectories from the conditional distribution, use method=:ITS. Here, we'll plot 10 randomly selected trajectories for the missing points by setting the num_trajectories keyword:

class = 0
impute_sites = collect(10:90)
instance_idx = 59
method = :ITS

imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
    imp,
    class,
    instance_idx,
    impute_sites,
    method;
    NN_baseline=false, # whether to also do a baseline imputation using 1-NN
    plot_fits=true, # whether to plot the fits
    num_trajectories=10, # number of trajectories to plot
    rejection_threshold=2.5 # limits how unlikely we allow the random trajectories to be.
    # there are more options! see [`MPS_impute`](@ref)
);

stats
10-element Vector{Any}:
 Dict(:MAE => 0.3379187129486158, :MAPE => 0.5417349387056579)
 Dict(:MAE => 0.28142164786903906, :MAPE => 0.5284433919968549)
 Dict(:MAE => 0.32508625997795637, :MAPE => 0.6677156834479838)
 Dict(:MAE => 0.2590505415150819, :MAPE => 0.5175741141323031)
 Dict(:MAE => 0.38308884647433666, :MAPE => 0.3649478432250414)
 Dict(:MAE => 0.2520764281706289, :MAPE => 0.5122443865478925)
 Dict(:MAE => 0.29016616725543537, :MAPE => 0.4919483590734182)
 Dict(:MAE => 0.26135393274535457, :MAPE => 0.48943630062743604)
 Dict(:MAE => 0.29409516181993045, :MAPE => 0.5403912798238135)
 Dict(:MAE => 0.2789132922226536, :MAPE => 0.4387091550037073)
plots[1]

Plotting cumulative distribution functions

It can be interesting to inspect the probability distribution being sampled from at each missing time point. To enable this, we provide the get_cdfs function, which works very similarly to MPS_impute, only it returns the CDF at each missing time point in the encoding domain. Currently only supports method=:median.

using Plots
cdfs, ts, pred_err, target = get_cdfs(
    imp,
    class,
    instance_idx,
    impute_sites,
    # method; # Only supports method=:median
)
xvals = imp.x_guess_range.xvals[1:10:end]
plot(xvals, cdfs[1][1:10:end]; legend=:none)
p = last([plot!(xvals, cdfs[i][1:10:end]) for i in eachindex(cdfs)])
ylabel!("cdf(x)")
xlabel!("x_t")
title!("CDF at each time point.")

Docstrings

MPSTime.init_imputation_problemMethod
init_imputation_problem(W::TrainedMPS, X_test::AbstractMatrix, y_test::AbstractArray=zeros(Int, size(X_test,1)), [custom_encoding::MPSTime.Encoding]; <keyword arguments>) -> imp::ImputationProblem
init_imputation_problem(W::TrainedMPS, X_test::AbstractMatrix, [custom_encoding::MPSTime.Encoding]); <keyword arguments>) -> imp::ImputationProblem

Initialise an imputation problem using a trained MPS and relevent test data.

This involves a lot of pre-computation, which can be quite time intensive for data-driven bases. For unclassed/unsupervised data y_test may be omitted. If the MPS was trained with a custom encoding, then this encoding must be passed to init_imputation_problem.

Keyword Arguments

  • guess_range::Union{Nothing, Tuple{<:Real,<:Real}}=nothing: The range of values that guesses are allowed to take. This range is applied to normalised, encoding-adjusted time-series data. To allow any guess, leave as nothing, or set to encoding.range (e.g. [(-1., 1.) for the legendre encoding]).
  • dx::Float64 = 1e-4: The spacing between possible guesses in normalised, encoding-adjusted units. When imputing missing data with an MPS method, the imputed values will be selected from range(guess_range...; step=dx)
  • verbosity::Integer=1: The verbosity of the initialisation process. Useful for debugging, set to -1 to completely suppress output.
  • test_encoding::Bool=true: Whether to double check the encoding and scaling options are correct. This is strongly recommended but has a slight performance cost, so may be disabled.
  • static_xvecs::Bool=true: Whether to store encoded xvalues as StaticVectors. Usually improved performance
source
MPSTime.MPS_imputeFunction
MPS_impute(
    imp::ImputationProblem, 
    [class::Any,] 
    instance::Integer, 
    missing_sites::AbstractVector{<:Integer}, 
    [method::Symbol=:median]; 
    <keyword arguments>
) -> (imputed_instances::Vector, errors::Vector, targets::Vector, stats::Dict, plots::Vector{Plots.Plot})

Impute the missing_sites using an MPS-based approach, selecting the trajectory from the conditioned distribution with method.

The imputed time series, the imputation errors, the original time series, statistics about goodness of fit, and any plots generated by MPS_impute are returned by imputed_instances, errors, targets, stats, and plots respectively. These will always be length one vectors, unless the ITS or kNearestNeighbour method are used, which may cause MPS_impute to return multiple potential imputations with the associated errors, stats, and plots. Goodness of fit is always measured with respect to unscaled

See init_imputation_problem for constructing an ImputationProblem instance out of a trained MPS. The instance number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.

Imputation Methods

  • :median: For each missing value, compute the probability density function of the possible outcomes from the MPS, and choose the median. This method is the most robust to outliers. Keywords:

    • get_wmad::Bool=true: Whether to return an 'error' vector that computes the Weighted Median Absolute Deviation (WMAD) of each imputed value.
  • :mean: For each missing value, compute the probability density function of the possible outcomes from the MPS, and choose the expected value. Keywords:

    • get_std::Bool=true: Whether to return an 'error' vector that computes standard deviation of each imputed value.
  • :mode: For each missing value, choose the most likely outcome predicted by the MPS. Keywords:

    • max_jump::Union{Number,Nothing}=nothing: The largest jump allowed between two adjacent imputations. Leave as nothing to allow any jump. Helpful to suppress 'spikes' caused by poor support near the encoding domain edges.
  • :ITS: For each missing value, choose a value at random with probability weighted by the probability density function of the possible outcomes. Keywords:

    • rseed::Integer=1: Random seed for producing the trajectories.
    • num_trajectories::Integer=1: Number of trajectories to compute.
    • rejection_threshold::Union{Float64, Symbol}=:none: Number of WMADs allowed between adjacent points. Setting this low helps suppress rapidly varying trajectories that occur by bad luck.
    • max_trials::Integer=10: Number of attempts allowed to make guesses conform to rejection_threshold before giving up.
  • :kNearestNeighbour: Select the k nearest neighbours in the training set using Euclidean distance to the known data. Keyword:

    • k: Number of nearest neighbours to return. See kNN_impute
  • flatBaseline: Predict the missing values are just the mean of the training set.

Keyword Arguments

  • impute_order::Symbol=:forwards: Whether to impute the missing values :forwards (left to right) or :backwards (right to left)
  • NN_baseline::Bool=true: Whether to also impute the missing data with a k-Nearest Neighbour baseline.
  • n_baselines::Integer=1: How many nearest neighbour baselines to compute.
  • plot_fits::Bool=true: Whether to make a plot showing the target timeseries, the missing values, and the imputed region. If false, then p will be an empty vector. The plot will show the NN_baseline (if it was computed), as well as every trajectory if using the :ITS method.
  • get_metrics::Bool=true: Whether to compute imputation metrics, if false, then stats, will be empty.
  • full_metrics::Bool=false: Whether to compute every metric (MAPE, SMAPE, MAE, MSE, RMSE) or just MAE and MAPE.
  • print_metric_table::Bool=false: Whether to print the stats as a table.
  • invert_transform::Bool=true:, # Whether to undo the sigmoid transform/minmax normalisation before returning the imputed points. If this is false, imputed_instance, errors, target timeseries, stats, and plot y-axis will all be scaled by the data preprocessing / normalisation and fit to the encoding domain.
  • kwargs...: Extra keywords passed to MPS_impute are forwarded to the imputation method. See the Imputation Methods section.
source
MPSTime.get_cdfsFunction
get_cdfs(imp::ImputationProblem, 
           class::Any, 
           instance::Integer, 
           missing_sites::AbstractVector{<:Integer}, 
           [method::Symbol=:median]; 
           <keyword arguments>) -> (cdfs::Vector{Vector}, ts::Vector, pred_err::Vector, target_timeseries_full::Vector)

Impute the missing_sites using an MPS-based approach, selecting the trajectory from the conditioned distribution with method, and returns the cumulative distribution function used to infer each missing value.

See MPS_impute for a list of imputation methods and keyword arguments (does not support plotting, stats, or kNN baselines). See init_imputation_problem for constructing an ImputationProblem instance. The instance number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.

source
MPSTime.state_spaceFunction
state_space(T::Int, n::Int, s::Int=2; sigma::Float64=0.3, rng::AbstractRNG}=Random.GLOBAL_RNG) -> Matrix{Float64}

Generate n time series of length T each from a state space model with residual terms drawn from a normal distribution N(0, sigma) and lag order s. Time series are generated from the following model:

\[x_t = \mu_t + \theta_t + \eta_t \mu_t = \mu_{t-1} + \lambda_{t-1} + \xi_t \lambda_t = \lambda_{t-1} + \zeta_{t} \theta_t = \sum_{j=1}^{s-1} - \theta_{t-j} + \omega_t\]

where $x_t$ is the $t$-th value in the time series, and the residual terms $\eta_t$, $\xi_t$, $\zeta_t$ and $\omega_t$ are randomly drawn from a normal distribution $\mathcal{N}(0, \sigma)$.

Arguments

  • T – Time series length.
  • n – Number of time-series instances.

Keyword Arguments

  • s – Lag order (optional, default: 2).
  • sigma – Noise standard deviation (optional, default: 0.3).
  • rng – Random number generator of type AbstractRNG (optional, default: Random.GLOBAL_RNG).

Returns

A Matrix{Float64} of shape (n, T) containing the simulated time-series instances.

source

Internal imputation methods:

Internal imputation methods

MPSTime.impute_medianFunction

impute missing data points using the median of the conditional distribution (single site rdm ρ).

Arguments

  • class_mps::MPS:
  • opts::Options: MPS parameters.
  • enc_args::AbstractVector
  • x_guess_range::EncodedDataRange
  • timeseries::AbstractVector{<:Number}: The input time series data that will be imputed.
  • timeseries_enc::MPS: The encoded version of the time series represented as a product state.
  • imputation_sites::Vector{Int}: Indices in the time series where imputation is to be performed.
  • get_wmad::Bool: Whether to compute the weighted median absolute deviation (WMAD) during imputation (default is false).

Returns

A tuple containing:

  • median_values::Vector{Float64}: The imputed median values at the specified imputation sites.
  • wmad_value::Union{Nothing, Float64}: The weighted median absolute deviation if get_wmad is true; otherwise, nothing.
source
MPSTime.kNN_imputeFunction
kNN_impute(
    imp::ImputationProblem, 
    [class::Any,] 
    instance::Integer, 
    missing_sites::AbstractVector{<:Integer}; 
    k::Integer=1
) -> [neighbour1::Vector, neighbour2::Vector, ...]

Impute missing_sites using the k nearest neighbours in the test set, based on Euclidean distance.

See init_imputation_problem for constructing an ImputationProblem instance. The instance number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.

source