Imputation
Overview
Imputation Scenarios
MPSTime supports univariate time-series imputation with three key patterns of missing data:
- Individual missing points (e.g., values missing at $t = 10, 20, 80$)
- Single contiguous blocks (e.g., values missing from $t = 25-70$)
- Multiple contiguous blocks (e.g., values missing from $t = 5-10$, $t = 25-50$ and $t = 80-90$)
MPSTime can also handle any combination of these patterns. For instance, you might need to impute a single contiguous block from $t = 10-30$, plus individual missing points at $t = 50$ and $t=80$.
Setup
The first step is to train an MPS. Here, we'll train an MPS in an unsupervised manner (no class labels) using an adapted version of the noisy trendy sinusoid dataset from the Classification
tutorial.
using MPSTime, Random
rng = Xoshiro(1); # fix rng seed
ntimepoints = 100; # specify number of samples per instance.
ntrain_instances = 300; # specify num training instances
ntest_instances = 200; # specify num test instances
X_train = trendy_sine(ntimepoints, ntrain_instances; sigma=0.2, slope=3, period=15, rng=rng)[1];
X_test = trendy_sine(ntimepoints, ntest_instances; sigma=0.2, slope=3, period=15, rng=rng)[1];
# hyper parameters and training
opts = MPSOptions(
d=12,
chi_max=40,
sigmoid_transform=false # disabling preprocessing generally improves imputation performance.
)
mps, info, test_states = fitMPS(X_train, opts);
Next, we initialize an imputation problem. This does a lot of necessary pre-computation:
imp = init_imputation_problem(mps, X_test)
Initialising train states.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Summary:
- Dataset has 300 training samples and 200 testing samples.
Slicing MPS into individual states...
- 1 class(es) were detected.
- Time independent encoding - Legendre - detected.
- d = 12, chi_max = 40
Re-encoding the training data to get the encoding arguments...
Created 1 ImputationProblem struct(s) containing class-wise mps and test samples.
A summary of the imputation problem setup is printed to verify the model parameters and dataset information. For multi-class data, you can pass y_test
to init_imputation_problem
in order to exploit the labels / class information while doing imputation.
Imputing missing values
Depending on the dataset, the results of fitMPS
can be noticeably affected by what machine it is running on. If you're trying to replicate these tutorials, expect an average uncertainty of 1-2%. In rare cases, it can go up to 5%-10% for single imputation windows in particularly large datasets. We strongly recommend either using higher precision computing (pass dtype=BigFloat
or Complex{BigFloat}
to MPSOptions
), or the evaluate
function to resample your data and average the result. This is generally not significant for scientific computing applications as for real word datasets, the floating point error of up to a few percent is much less than the resampling error caused by choosing different train/test splits.
Single-block Imputation
Now, decide how you want to impute the missing data. The necessary options are:
class::Integer
: The class of the time-series instance we are going to impute, leave as zero for "unlabelled" data (i.e., all data belong to the same class).impute_sites
: The MPS sites (time points) that are missing (inclusive).instance_idx
: The time-series instance from the chosen class in the test set.method
: The imputation method to use. Can be trajectories (ITS), median, mode, mean, etc...
In this example, we will consider a single block of contiguous missing values, simulated from a missing-at-random mechanism (MAR). We will use the median to impute the missing values, as well as computing a 1-Nearest Neighbor Imputation (1-NNI) benchmark for comparison:
rng = Xoshiro(42) # Fix RNG
class = 0
pm = 0.8 # 80% missing data
instance_idx = 1 # pick a time series instance in test set
_, impute_sites = mar(X_test[instance_idx, :], pm; rng=rng) # simulate MAR mechanism
method = :median
imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
imp,
class,
instance_idx,
impute_sites,
method;
NN_baseline=true, # whether to also do a baseline imputation using the (first) Nearest Neighbour benchmark
plot_fits=true, # whether to plot the fits
get_wmad=true # when method=:median, this uses the Weighted Median Absolute Deviation (WMAD) to compute the prediction error.
);
# Pretty-print the stats
using PrettyTables
pretty_table(stats[1]; header=["Metric", "Value"], header_crayon=crayon"yellow bold", tf=tf_unicode_rounded);
╭─────────┬──────────╮
│ Metric │ Value │
├─────────┼──────────┤
│ MAE │ 0.182723 │
│ MAPE │ 0.392298 │
│ NN_MAE │ 0.229749 │
│ NN_MAPE │ 0.484372 │
╰─────────┴──────────╯
Several outputs are returned from MPS_impute
:
imputed_ts
: The imputed time-series instance, containing the original data points and the predicted values.pred_err
: The prediction error for each imputed value, given a known ground-truth.target_ts
: The original time-series instance containing missing values.stats
: A collection of statistical metrics (MAE and MAPE) evaluating imputation performance with respect to a ground truth. Includes benchmark performance whenNN_baseline=true
.plots
: Stores plot object(s) in an array for visualization whenplot_fits=true
.
The MAE and MAPE in the 'stats' table are the Mean Absolute Error and Mean Absolute Percentage Error for the MPS prediction, while the NN_ prefix corresponds to the same errors for the 1-Nearest Neighbours benchmark. In this case, MAPE is an unreliable measure of error as the data goes through zero.
To plot the imputed time series, we can call the plot function as follows:
using Plots
plots[1]
The solid orange line depicts the "ground-truth" (observed) time-series values, the dotted blue line is the MPS-imputed data points and the dotted red line is the 1-NN benchmark. The blue shading indicates the uncertainty due to encoding error.
There are a lot of other options, and many more imputation methods to choose from! See MPS_impute
for more details.
Multi-block Imputation
Building on the previous example of single-block imputation, MPSTime can also be used to impute missing values in multiple blocks of contiguous points. For example, consider missing points between $t = 10-25$, $t = 40-60$ and $t = 75-90$:
class = 0
impute_sites = vcat(collect(10:25), collect(40:60), collect(65:90))
instance_idx = 32
method = :median
imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
imp,
class,
instance_idx,
impute_sites,
method;
NN_baseline=true, # whether to also do a baseline imputation using the (first) Nearest Neighbour benchmark
plot_fits=true, # whether to plot the fits
get_wmad=true,
);
pretty_table(stats[1]; header=["Metric", "Value"], header_crayon=crayon"yellow bold", tf=tf_unicode_rounded);
╭─────────┬──────────╮
│ Metric │ Value │
├─────────┼──────────┤
│ MAE │ 0.199318 │
│ MAPE │ 0.222385 │
│ NN_MAE │ 0.276009 │
│ NN_MAPE │ 0.377745 │
╰─────────┴──────────╯
plots[1]
Individual Point Imputation
To impute individual points rather than ranges of consecutive points (blocks), we can simply pass their respective time points into the imputation function as a vector:
julia> impute_sites = [10]; # only impute t = 10
julia> impute_sites = [10, 25, 50]; # impute multiple individual points
Plotting Trajectories
To plot individual trajectories from the conditional distribution, use method=:ITS
. Here, we'll plot 10 randomly selected trajectories for the missing points by setting the num_trajectories
keyword:
class = 0
impute_sites = collect(10:90)
instance_idx = 59
method = :ITS
imputed_ts, pred_err, target_ts, stats, plots = MPS_impute(
imp,
class,
instance_idx,
impute_sites,
method;
NN_baseline=false, # whether to also do a baseline imputation using 1-NN
plot_fits=true, # whether to plot the fits
num_trajectories=10, # number of trajectories to plot
rejection_threshold=2.5 # limits how unlikely we allow the random trajectories to be.
# there are more options! see [`MPS_impute`](@ref)
);
stats
10-element Vector{Any}:
Dict(:MAE => 0.3379187129486158, :MAPE => 0.5417349387056579)
Dict(:MAE => 0.28142164786903906, :MAPE => 0.5284433919968549)
Dict(:MAE => 0.32508625997795637, :MAPE => 0.6677156834479838)
Dict(:MAE => 0.2590505415150819, :MAPE => 0.5175741141323031)
Dict(:MAE => 0.38308884647433666, :MAPE => 0.3649478432250414)
Dict(:MAE => 0.2520764281706289, :MAPE => 0.5122443865478925)
Dict(:MAE => 0.29016616725543537, :MAPE => 0.4919483590734182)
Dict(:MAE => 0.26135393274535457, :MAPE => 0.48943630062743604)
Dict(:MAE => 0.29409516181993045, :MAPE => 0.5403912798238135)
Dict(:MAE => 0.2789132922226536, :MAPE => 0.4387091550037073)
plots[1]
Plotting cumulative distribution functions
It can be interesting to inspect the probability distribution being sampled from at each missing time point. To enable this, we provide the get_cdfs
function, which works very similarly to MPS_impute
, only it returns the CDF at each missing time point in the encoding domain. Currently only supports method=:median
.
using Plots
cdfs, ts, pred_err, target = get_cdfs(
imp,
class,
instance_idx,
impute_sites,
# method; # Only supports method=:median
)
xvals = imp.x_guess_range.xvals[1:10:end]
plot(xvals, cdfs[1][1:10:end]; legend=:none)
p = last([plot!(xvals, cdfs[i][1:10:end]) for i in eachindex(cdfs)])
ylabel!("cdf(x)")
xlabel!("x_t")
title!("CDF at each time point.")
Docstrings
MPSTime.init_imputation_problem
— Methodinit_imputation_problem(W::TrainedMPS, X_test::AbstractMatrix, y_test::AbstractArray=zeros(Int, size(X_test,1)), [custom_encoding::MPSTime.Encoding]; <keyword arguments>) -> imp::ImputationProblem
init_imputation_problem(W::TrainedMPS, X_test::AbstractMatrix, [custom_encoding::MPSTime.Encoding]); <keyword arguments>) -> imp::ImputationProblem
Initialise an imputation problem using a trained MPS and relevent test data.
This involves a lot of pre-computation, which can be quite time intensive for data-driven bases. For unclassed/unsupervised data y_test
may be omitted. If the MPS was trained with a custom encoding, then this encoding must be passed to init_imputation_problem
.
Keyword Arguments
guess_range::Union{Nothing, Tuple{<:Real,<:Real}}=nothing
: The range of values that guesses are allowed to take. This range is applied to normalised, encoding-adjusted time-series data. To allow any guess, leave as nothing, or set to encoding.range (e.g. [(-1., 1.) for the legendre encoding]).dx::Float64 = 1e-4
: The spacing between possible guesses in normalised, encoding-adjusted units. When imputing missing data with an MPS method, the imputed values will be selected from range(guess_range...; step=dx)verbosity::Integer=1
: The verbosity of the initialisation process. Useful for debugging, set to -1 to completely suppress output.test_encoding::Bool=true
: Whether to double check the encoding and scaling options are correct. This is strongly recommended but has a slight performance cost, so may be disabled.static_xvecs::Bool=true
: Whether to store encoded xvalues as StaticVectors. Usually improved performance
MPSTime.MPS_impute
— FunctionMPS_impute(
imp::ImputationProblem,
[class::Any,]
instance::Integer,
missing_sites::AbstractVector{<:Integer},
[method::Symbol=:median];
<keyword arguments>
) -> (imputed_instances::Vector, errors::Vector, targets::Vector, stats::Dict, plots::Vector{Plots.Plot})
Impute the missing_sites
using an MPS-based approach, selecting the trajectory from the conditioned distribution with method
.
The imputed time series, the imputation errors, the original time series, statistics about goodness of fit, and any plots generated by MPS_impute
are returned by imputed_instances
, errors
, targets
, stats
, and plots
respectively. These will always be length one vectors, unless the ITS
or kNearestNeighbour
method are used, which may cause MPS_impute
to return multiple potential imputations with the associated errors, stats, and plots. Goodness of fit is always measured with respect to unscaled
See init_imputation_problem
for constructing an ImputationProblem
instance out of a trained MPS. The instance
number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.
Imputation Methods
:median
: For each missing value, compute the probability density function of the possible outcomes from the MPS, and choose the median. This method is the most robust to outliers. Keywords:get_wmad::Bool=true
: Whether to return an 'error' vector that computes the Weighted Median Absolute Deviation (WMAD) of each imputed value.
:mean
: For each missing value, compute the probability density function of the possible outcomes from the MPS, and choose the expected value. Keywords:get_std::Bool=true
: Whether to return an 'error' vector that computes standard deviation of each imputed value.
:mode
: For each missing value, choose the most likely outcome predicted by the MPS. Keywords:max_jump::Union{Number,Nothing}=nothing
: The largest jump allowed between two adjacent imputations. Leave asnothing
to allow any jump. Helpful to suppress 'spikes' caused by poor support near the encoding domain edges.
:ITS
: For each missing value, choose a value at random with probability weighted by the probability density function of the possible outcomes. Keywords:rseed::Integer=1
: Random seed for producing the trajectories.num_trajectories::Integer=1
: Number of trajectories to compute.rejection_threshold::Union{Float64, Symbol}=:none
: Number of WMADs allowed between adjacent points. Setting this low helps suppress rapidly varying trajectories that occur by bad luck.max_trials::Integer=10
: Number of attempts allowed to make guesses conform to rejection_threshold before giving up.
:kNearestNeighbour
: Select thek
nearest neighbours in the training set using Euclidean distance to the known data. Keyword:k
: Number of nearest neighbours to return. SeekNN_impute
flatBaseline:
Predict the missing values are just the mean of the training set.
Keyword Arguments
impute_order::Symbol=:forwards
: Whether to impute the missing values:forwards
(left to right) or:backwards
(right to left)NN_baseline::Bool=true
: Whether to also impute the missing data with a k-Nearest Neighbour baseline.n_baselines::Integer=1
: How many nearest neighbour baselines to compute.plot_fits::Bool=true
: Whether to make a plot showing the target timeseries, the missing values, and the imputed region. If false, then p will be an empty vector. The plot will show the NN_baseline (if it was computed), as well as every trajectory if using the :ITS method.get_metrics::Bool=true
: Whether to compute imputation metrics, if false, thenstats
, will be empty.full_metrics::Bool=false
: Whether to compute every metric (MAPE, SMAPE, MAE, MSE, RMSE) or just MAE and MAPE.print_metric_table::Bool=false
: Whether to print thestats
as a table.invert_transform::Bool=true
:, # Whether to undo the sigmoid transform/minmax normalisation before returning the imputed points. If this is false, imputed_instance, errors, target timeseries, stats, and plot y-axis will all be scaled by the data preprocessing / normalisation and fit to the encoding domain.kwargs...
: Extra keywords passed toMPS_impute
are forwarded to the imputation method. See the Imputation Methods section.
MPSTime.get_cdfs
— Functionget_cdfs(imp::ImputationProblem,
class::Any,
instance::Integer,
missing_sites::AbstractVector{<:Integer},
[method::Symbol=:median];
<keyword arguments>) -> (cdfs::Vector{Vector}, ts::Vector, pred_err::Vector, target_timeseries_full::Vector)
Impute the missing_sites
using an MPS-based approach, selecting the trajectory from the conditioned distribution with method
, and returns the cumulative distribution function used to infer each missing value.
See MPS_impute
for a list of imputation methods and keyword arguments (does not support plotting, stats, or kNN baselines). See init_imputation_problem
for constructing an ImputationProblem instance. The instance
number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.
MPSTime.state_space
— Functionstate_space(T::Int, n::Int, s::Int=2; sigma::Float64=0.3, rng::AbstractRNG}=Random.GLOBAL_RNG) -> Matrix{Float64}
Generate n
time series of length T
each from a state space model with residual terms drawn from a normal distribution N(0, sigma
) and lag order s
. Time series are generated from the following model:
\[x_t = \mu_t + \theta_t + \eta_t \mu_t = \mu_{t-1} + \lambda_{t-1} + \xi_t \lambda_t = \lambda_{t-1} + \zeta_{t} \theta_t = \sum_{j=1}^{s-1} - \theta_{t-j} + \omega_t\]
where $x_t$ is the $t$-th value in the time series, and the residual terms $\eta_t$, $\xi_t$, $\zeta_t$ and $\omega_t$ are randomly drawn from a normal distribution $\mathcal{N}(0, \sigma)$.
Arguments
T
– Time series length.n
– Number of time-series instances.
Keyword Arguments
s
– Lag order (optional, default:2
).sigma
– Noise standard deviation (optional, default:0.3
).rng
– Random number generator of typeAbstractRNG
(optional, default:Random.GLOBAL_RNG
).
Returns
A Matrix{Float64}
of shape (n, T) containing the simulated time-series instances.
Internal imputation methods:
Internal imputation methods
MPSTime.impute_median
— Functionimpute missing data points using the median of the conditional distribution (single site rdm ρ).
Arguments
class_mps::MPS
:opts::Options
: MPS parameters.enc_args::AbstractVector
x_guess_range::EncodedDataRange
timeseries::AbstractVector{<:Number}
: The input time series data that will be imputed.timeseries_enc::MPS
: The encoded version of the time series represented as a product state.imputation_sites::Vector{Int}
: Indices in the time series where imputation is to be performed.get_wmad::Bool
: Whether to compute the weighted median absolute deviation (WMAD) during imputation (default isfalse
).
Returns
A tuple containing:
median_values::Vector{Float64}
: The imputed median values at the specified imputation sites.wmad_value::Union{Nothing, Float64}
: The weighted median absolute deviation ifget_wmad
is true; otherwise,nothing
.
MPSTime.impute_ITS
— FunctionImpute a SINGLE trajectory using inverse transform sampling (ITS).
MPSTime.kNN_impute
— FunctionkNN_impute(
imp::ImputationProblem,
[class::Any,]
instance::Integer,
missing_sites::AbstractVector{<:Integer};
k::Integer=1
) -> [neighbour1::Vector, neighbour2::Vector, ...]
Impute missing_sites
using the k
nearest neighbours in the test set, based on Euclidean distance.
See init_imputation_problem
for constructing an ImputationProblem instance. The instance
number is relative to the class, so class 1, instance 2 would be distinct from class 2, instance 2.