Tidymodels is a collection of packages designed to fit different types of models in a tidyverse-like way. It is particularly useful for implementing machine learning (ML) algorithms, as well as for related tasks such as splitting your data into training and test sets. My particular interest was in using this package to train models and then use those models to make predictions from raster data.
For this procedure you will need some additional packages besides tidymodels. sf contains tools for working with spatial information stored as vectors. yardstick provides several functions to compute evaluation metrics for the models. raster is a package for working with rasters that is being superseded by terra. Additionally, fasterize is useful for converting vectors to rasters, and finally, doParallel will help with parallel processing in R.
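A minimal setup chunk could look like the following (this assumes all of the packages are already installed):

```r
# Load the packages used throughout the workflow
library(tidymodels)  # modeling framework (includes yardstick, rsample, recipes, ...)
library(sf)          # vector spatial data
library(raster)      # raster data (being superseded by terra)
library(fasterize)   # fast vector-to-raster conversion
library(doParallel)  # parallel backend
```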
Load data and do some preprocessing
The first thing is to load the labeled data on which we will train and test our model. In this example we are going to work with spatial information: rasters and vectors. We generated a dataset of disturbance and non-disturbance areas using visual interpretation. For the sake of this model we will treat the data as spatially independent; however, keep in mind that there are other types of models that can take spatial dependency into account. Here we will use the BFAST components and the type of forest (tropical dry forest, TDF, or temperate forest, TF) as independent variables, while the only dependent variable will be whether an area corresponds to disturbance or not. The main idea of the script is to compare a baseline model, based on a magnitude threshold of 0.2, against a more complex model that might include several other independent variables.
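Since the original data are not included here, the sketch below is only illustrative: the file names (labels.gpkg, the bfast/ folder), the layer names, and the columns disturbance and forest_type are all hypothetical stand-ins.

```r
# Read the labeled polygons (hypothetical file and column names)
labels_sf <- st_read("labels.gpkg")  # columns: disturbance, forest_type

# Stack the BFAST-component rasters and name the layers
bfast_stack <- stack(list.files("bfast/", pattern = "\\.tif$", full.names = TRUE))
names(bfast_stack) <- c("magnitude", "trend", "amplitude")

# Extract the mean raster values within each labeled polygon
vals <- raster::extract(bfast_stack, as(labels_sf, "Spatial"),
                        fun = mean, na.rm = TRUE)

# Combine labels and predictors into one modeling data frame
model_df <- bind_cols(st_drop_geometry(labels_sf), as_tibble(vals)) %>%
  mutate(disturbance = factor(disturbance),
         forest_type = factor(forest_type))  # TDF or TF
```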
Divide dataset into training and test sets
After that short procedure you need to divide the complete dataset into training and test sets. This can easily be done using tidymodels. Additionally, I decided to use a separate v-fold cross-validation (CV) set to test the performance of different model architectures and to obtain a standard error of the mean accuracy, which helps decide which architecture can be considered the best. After deciding on the best model architecture, the selected model will be trained and tested on the training and test datasets, respectively.
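As a sketch, assuming the assembled data frame is called model_df and the outcome column is disturbance:

```r
set.seed(123)  # reproducible splits

# Stratified 70/30 split into training and test sets
data_split <- initial_split(model_df, prop = 0.7, strata = disturbance)
train_df   <- training(data_split)
test_df    <- testing(data_split)

# 10-fold cross-validation set built from the training data
cv_folds <- vfold_cv(train_df, v = 10, strata = disturbance)
```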
Once you have the training and test sets, as well as the v-fold CV set, the next step is to specify the models you are going to train. This can be done by specifying different recipes.
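Here is one possible specification; the predictor names (magnitude, trend, amplitude) are the hypothetical BFAST components from above, and the random forest engine (ranger) is an assumption, chosen to match the rf suffix in the model names used later:

```r
# Baseline recipe: magnitude only (the counterpart of the 0.2-threshold model)
rec_baseline <- recipe(disturbance ~ magnitude, data = train_df)

# More complex recipes adding other independent variables
rec_pred2 <- recipe(disturbance ~ magnitude + forest_type, data = train_df)
rec_pred3 <- recipe(disturbance ~ magnitude + forest_type + trend + amplitude,
                    data = train_df)

# A random forest specification shared by all candidate workflows
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Bundle each recipe with the model into a workflow (shown for one candidate)
wf_pred3 <- workflow() %>%
  add_recipe(rec_pred3) %>%
  add_model(rf_spec)
```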
Once all the recipes and workflows have been defined, the next step is to fit, or train, the models. Remember that we will first evaluate the models on the v-fold CV set, then train the selected one using the training dataset and evaluate it on the test set.
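A sketch of the resampling step for one candidate (the others follow the same pattern); registering the doParallel backend lets the fold fits run in parallel:

```r
# Register a parallel backend so the resample fits can run on several cores
registerDoParallel(cores = 4)

# Metrics to compute on each fold
cv_metrics <- metric_set(accuracy, roc_auc)

# Fit and evaluate the workflow on the cross-validation folds
pred3_rf_notlast <- fit_resamples(
  wf_pred3,
  resamples = cv_folds,
  metrics   = cv_metrics,
  control   = control_resamples(save_pred = TRUE)
)

# Mean accuracy and its standard error across folds
collect_metrics(pred3_rf_notlast)
```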
After fitting all the models, you should choose the best architecture. In this case, I chose the one whose accuracy was not significantly different from the highest achieved and that included the fewest independent variables. You can make a plot to inspect the results graphically.
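One way to build that plot, assuming each fit_resamples result follows the *_notlast naming used below (baseline_rf_notlast and pred2_rf_notlast are hypothetical names):

```r
# Gather the CV metrics of every candidate into one data frame
cv_results <- bind_rows(
  collect_metrics(baseline_rf_notlast) %>% mutate(model = "baseline_rf"),
  collect_metrics(pred2_rf_notlast)    %>% mutate(model = "pred2_rf"),
  collect_metrics(pred3_rf_notlast)    %>% mutate(model = "pred3_rf")
) %>%
  filter(.metric == "accuracy")

# Mean accuracy with +/- one standard error per model architecture
ggplot(cv_results, aes(x = model, y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err), width = 0.2) +
  labs(x = NULL, y = "Mean CV accuracy")
```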
After evaluating the results obtained on the v-fold CV set, I chose the best model: “pred3_rf”. We will use that name as the model name for the rest of the process. These names are automatically created when training the models.
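With the architecture chosen, last_fit() trains the workflow on the full training set and evaluates it once on the test set (a sketch reusing data_split and cv_metrics from earlier):

```r
# Final fit: train on the training set, evaluate on the test set
pred3_rf_last_fit <- last_fit(
  wf_pred3,
  split   = data_split,
  metrics = cv_metrics
)
```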
Now we have the trained models (*_notlast) and the models trained and evaluated on the test set (*_last_fit). So the next step is to collect the predictions from the latter to calculate different evaluation metrics.
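For example (a sketch; the metric choice is mine, and "disturbance" is assumed to be the event level of the outcome factor):

```r
# Test-set predictions from the final fit
test_preds <- collect_predictions(pred3_rf_last_fit)

# Compute several evaluation metrics on the test set
eval_metrics <- metric_set(accuracy, sens, spec, f_meas)
eval_metrics(test_preds, truth = disturbance, estimate = .pred_class)

# Confusion matrix for a quick overview
conf_mat(test_preds, truth = disturbance, estimate = .pred_class)
```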
Additionally, you might be interested in obtaining the ROC curves of each model, which you can do using the following script. In this step we will create a data frame containing the ROC curves, which can afterwards be used to construct a plot showing these results.
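A sketch of that data frame and plot, assuming the predicted-probability column is named .pred_disturbance and that a hypothetical baseline_rf_last_fit object exists alongside pred3_rf_last_fit:

```r
# ROC curve points for each model, stacked into one data frame
roc_df <- bind_rows(
  collect_predictions(baseline_rf_last_fit) %>%
    roc_curve(truth = disturbance, .pred_disturbance) %>%
    mutate(model = "baseline_rf"),
  collect_predictions(pred3_rf_last_fit) %>%
    roc_curve(truth = disturbance, .pred_disturbance) %>%
    mutate(model = "pred3_rf")
)

# Plot both ROC curves against the 1:1 line
ggplot(roc_df, aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_path() +
  geom_abline(lty = 2) +
  coord_equal()
```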
The last step of the workflow is to spatialize the model, i.e., apply it to the rasters to obtain a final disturbance / non-disturbance map.
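A hedged sketch of this last step: the raster layer names must match the predictor names used in the recipe, forest_sf and its forest_id field are hypothetical, and the factor-level coding (1 = TDF, 2 = TF) is an assumption:

```r
# Recover the fitted workflow from the last_fit object
final_wf <- extract_workflow(pred3_rf_last_fit)

# Rasterize the forest-type polygons so every predictor exists as a layer
forest_r <- fasterize(forest_sf, bfast_stack[[1]], field = "forest_id")
pred_stack <- stack(bfast_stack, forest_r)
names(pred_stack) <- c("magnitude", "trend", "amplitude", "forest_type")

# Turn the pixels into a data frame, keeping coordinates and dropping NAs
pred_df <- as.data.frame(pred_stack, xy = TRUE, na.rm = TRUE)
pred_df$forest_type <- factor(pred_df$forest_type,
                              levels = c(1, 2), labels = c("TDF", "TF"))

# Predict a class for every pixel and encode it as an integer
pred_df$class <- as.integer(predict(final_wf, new_data = pred_df)$.pred_class)

# Rebuild a raster from the predicted pixels: the final map
out_map <- rasterFromXYZ(pred_df[, c("x", "y", "class")], crs = crs(pred_stack))
writeRaster(out_map, "disturbance_map.tif", overwrite = TRUE)
```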