=================
wildboar tutorial
=================
Machine learning
================
In general, the machine learning problem setting consists of n samples of data an the goal is to predict properties of unknown data.
``wildboar``, in particular consider machine learning problems in which the samples are data series, e.g., time series or
otherwise temporally or logically ordered data.
.. note::
For solving general machine learning problems with Python, consider using `scikit-learn `_
Similar to general machine learning problems, temporal machine learning consider problems that fall into different categories
* Supervised learning, in which the data series are labeled with additional information. The additional information can
be either numerical or nominal
* In classification problems the time series belong to one of two or more labels and the goal is to learn a function that can label unlabeled time series.
* In regression problems the time series are labeled with a numerical attribute and the task is to assigned a new numerical value to an unlabeled time series.
Loading an example dataset
==========================
Wildboar bundles a few `standard datasets (no https) `_ from the time series community.
In the example, we load the dataset ``synthetic_control`` and the ``TwoLeadECG`` dataset.
.. code-block:: python
>>> from wildboar.datasets import load_synthetic_control, load_two_lead_ecg
>>> x, y = load_synthetic_control()
>>> x_train, x_test, y_train, y_test = load_two_lead_ecg(merge_train_test=False)
The datasets are Numpy ``ndarray`` with ``x.ndim==2`` and ``y.ndim==1``. We can get the number of samples and time points.
.. code-block:: python
>>> n_samples, n_timestep = x.shape
.. note::
By setting ``merge_train_test`` to `False`, the original training and testing splits from the UCR repository are preserved.
A more robust and reliable method for splitting the datasets into training and testing partitions is to use the model selection functions from scikit-learn.
.. code-block:: python
>> from sklearn.model_selection import train_test_split
>> x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
Learning and predicting
=======================
All estimators in wildboar implements the same interface as all estimators of scikit-learn. We can `fit` an estimator to an input dataset and `predict` the label of a new sample.
An example of a temporal estimator is the ``wildboar.ensemble.ShapeletForestClassifier`` which implements a random shapelet forest classifier.
.. code-block:: python
>>> from wildboar.ensemble import ShapeletForestClassifier
>>> clf = ShapeletForestClassifier()
>>> clf.fit(x_train, y_train)
The classifier (``clf``) is fitted using the training samples, i.e., a model is inferred.
.. code-block:: python
>>> clf.predict(x_test[-1:, :])
array([6.])
.. note::
The predict function expects an ``ndarray`` of shape ``(n_samples, n_timestep)``, where ``n_timestep`` is the size
of training timestep.
Model persistence
=================
All `wildboar` models can be persisted to disk using `pickle `_
.. code-block:: python
>>> import pickle
>>> repr = pickle.dumps(clf) # clf fitted earlier
>>> clf_ = pickle.loads(repr)
>>> clf_.predict(x_test[-1:, :])
array([6.])
.. note::
Models persisted using an older versions of wildboar is not guaranteed to work when using a newer version (or vice versa).
.. warning::
`The pickle module is not secure. Only unpickle data you trust. `_