HOW BIOMASS IS MODELLED

Pasture Monitor follows a procedure similar to that of Otgonbayar et al. (2019) and uses random forest regression to model pasture biomass. Random forest is a machine learning algorithm that uses an ensemble technique to perform either classification or regression. The algorithm builds multiple decision trees, each trained on a different random sample of the data drawn with replacement (a bootstrap sample). Because the trees are built independently of one another, they can be trained in parallel. For regression, the predictions of all the trees are averaged to produce the final result. This technique is known as bootstrap aggregation (bagging).
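The bagging idea described above can be illustrated with a minimal sketch. This is not Pasture Monitor's implementation: it uses one-split trees ("stumps") on a single made-up predictor, and it leaves out the random feature selection that a full random forest also performs. The function names and the sample values are assumptions made for illustration only.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree (a 'stump') on a single predictor."""
    best = None
    # Try every unique value except the smallest as a split threshold,
    # so that both sides of the split are never empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        # Sum of squared errors when each side predicts its own mean.
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All predictor values are identical: predict the overall mean.
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def fit_bagged_ensemble(xs, ys, n_trees=50, seed=0):
    """Train each tree on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n rows with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    """Aggregate: average the predictions of all trees."""
    return mean(tree(x) for tree in trees)

# Hypothetical data: a vegetation index per paddock and measured biomass (kg/ha).
index_values = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
biomass = [800, 1100, 1500, 1900, 2400, 2800, 3300]

trees = fit_bagged_ensemble(index_values, biomass)
low_estimate = predict(trees, 0.25)
high_estimate = predict(trees, 0.75)
```

Because each tree sees a slightly different sample, individual trees disagree, and averaging them smooths out those differences. In practice a library implementation with full-depth trees would normally be used rather than hand-written stumps.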

The use of multiple decision trees increases stability and reduces variance. This stability makes the random forest algorithm a common choice for projects with large datasets. However, as with any model, there will always be a degree of uncertainty. Two metrics that help to quantify this uncertainty are the R² score and the RMSE (see below for more detail).

RANDOM FOREST REGRESSION FOR BIOMASS MODELLING

Pasture Monitor models pasture biomass using a five-step procedure.

Step 1: Collect training data

Pasture Monitor's biomass model is trained using a reference dataset in which each sample represents the average measured biomass of a paddock. Sample collection started in 2021 and continues on a weekly basis. In contrast to Otgonbayar et al. (2019), who used 553 biomass samples, we use a much larger reference dataset: at the time of writing, it contained more than 270,000 biomass samples.

Step 2: Collect satellite data

Another difference between our methodology and that of Otgonbayar et al. (2019) is that we use Sentinel-2 and PlanetScope imagery instead of Landsat-8 imagery. The advantage of Sentinel-2 and PlanetScope imagery is that it has a much higher spatial resolution (3-10 m) than Landsat-8 imagery (30 m) and a shorter revisit time (1-5 days instead of Landsat's 16 days). A full set of Sentinel-2 (13 bands) and PlanetScope (8 bands) data is acquired and processed for each paddock in which reference samples are collected. Samples are discarded if no images are available on the day of their measurement in the field. Images contaminated by clouds are excluded from consideration. Pasture Monitor has developed a technique for transforming the Sentinel-2 and PlanetScope spectral bands into the per-paddock predictor variables that are most effective for biomass estimation.
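The sample-filtering rule described above (keep a field sample only if a cloud-free image exists for the same paddock on the same day) can be sketched as follows. The record layout, field names and example values are hypothetical and are not taken from Pasture Monitor's actual data model.

```python
from datetime import date

def usable_samples(samples, images):
    """Keep samples that have a cloud-free image for the same paddock and day."""
    # Build a lookup of (paddock, date) pairs that have at least one clear image.
    clear = {(img["paddock_id"], img["date"]) for img in images if not img["cloudy"]}
    return [s for s in samples if (s["paddock_id"], s["date"]) in clear]

# Hypothetical field samples (biomass in kg/ha).
samples = [
    {"paddock_id": "A", "date": date(2023, 5, 1), "biomass_kg_ha": 2100},
    {"paddock_id": "B", "date": date(2023, 5, 1), "biomass_kg_ha": 1750},
    {"paddock_id": "C", "date": date(2023, 5, 2), "biomass_kg_ha": 2600},
]

# Hypothetical image catalogue: B's image is cloudy, C has no image that day.
images = [
    {"paddock_id": "A", "date": date(2023, 5, 1), "cloudy": False},
    {"paddock_id": "B", "date": date(2023, 5, 1), "cloudy": True},
    {"paddock_id": "C", "date": date(2023, 5, 1), "cloudy": False},
]

kept = usable_samples(samples, images)
print([s["paddock_id"] for s in kept])  # ['A']
```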

Step 3: Build the model

The biomass measurements are randomly split into two separate sets of data, one for training the model (80% of the measurements) and one for testing the model (20% of the measurements). The 80% training subset is used to train the random forest regression model.
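A random 80/20 split of this kind can be sketched with the standard library alone. The function name, the fixed seed and the sample records below are assumptions made for illustration; the source does not state how the split is implemented.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle the samples and split them into training and testing subsets."""
    shuffled = list(samples)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical biomass measurements, one record per sample.
samples = [{"sample_id": i, "biomass_kg_ha": 1000 + 10 * i} for i in range(100)]

train, test = train_test_split(samples)
print(len(train), len(test))  # 80 20
```

Splitting at random, rather than taking the first 80% of the records, helps to ensure that both subsets cover the same range of paddocks, seasons and biomass levels.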

Step 4: Test the model

Once trained, the model's accuracy is quantified by comparing the predicted biomass values to the independent set of testing measurements (20%). Two measures are used, namely the coefficient of determination (R²) and the root mean square error (RMSE). R² is a statistical measure of how well the model approximates the actual data. It is calculated as the proportion of the variance in the dependent variable (measured biomass) that is explained by the independent variables (the predictors derived from satellite imagery). R² typically ranges from 0 to 1, where a value of 0 means that the model does not explain any of the variation in the dependent variable and a value of 1 means that the model fits the data perfectly. On independent test data, R² can also be negative if the model performs worse than simply predicting the mean. In general, a higher R² value is better. R² values can be interpreted as follows:

< 0.20. Very low: This indicates that the model does not explain much of the variation in the dependent variable. The model is not a good fit for the data.
0.20 to 0.40. Low: This indicates that the model explains some of the variation in the dependent variable, but not a lot. The model is a fair fit for the data.
0.40 to 0.60. Moderate: This indicates that the model explains a moderate amount of the variation in the dependent variable. The model is a good fit for the data.
0.60 to 0.80. High: This indicates that the model explains a large amount of the variation in the dependent variable. The model is a very good fit for the data.
> 0.80. Very high: This indicates that the model explains nearly all of the variation in the dependent variable. This is rare and may indicate that the model is overfitting the data.
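As a worked illustration of how R² is computed, the sketch below uses the standard definition (one minus the ratio of the residual sum of squares to the total sum of squares). The four measured and predicted values are made up for the example.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical measured and predicted biomass values (kg/ha).
actual = [1000, 2000, 3000, 4000]
predicted = [1100, 1900, 3200, 3900]

print(round(r_squared(actual, predicted), 3))  # 0.986
print(r_squared(actual, actual))               # 1.0 (perfect predictions)
print(r_squared(actual, [2500] * 4))           # 0.0 (always predicting the mean)
```

A model that predicts worse than the mean of the measured values would give a negative result with this formula.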

RMSE is a popular measure of model performance because it is easy to understand and interpret. RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$

Where N is the number of samples, i is the ith sample used in the assessment, y_i is the measured value for that sample and \hat{y}_i is the corresponding predicted value. A low RMSE value indicates that the model is performing well. An RMSE value of 0 indicates that the model is perfect and there is no error between the predicted and actual values. The advantage of RMSE over R² is that RMSE reports accuracy in the units of the variable that is being predicted. For instance, for pasture biomass modelling, an RMSE of 200 indicates that the model's predictions typically differ from the measured values by about 200 kg/ha. Or interpreted differently, the model's uncertainty is approximately 200 kg/ha.
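The RMSE calculation can be sketched in a few lines. The measured and predicted values below are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean square error, in the same units as the inputs."""
    n = len(actual)
    squared_errors = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_errors / n)

# Hypothetical measured and predicted biomass values (kg/ha).
actual = [1000, 2000, 3000, 4000]
predicted = [1100, 1900, 3200, 3900]

# Errors are 100, -100, 200 and -100 kg/ha, so the mean squared error is 17500.
print(round(rmse(actual, predicted), 1))  # 132.3
```

Because the errors are squared before averaging, the single 200 kg/ha miss pulls the RMSE (132.3 kg/ha) above the mean absolute error of these four samples (125 kg/ha).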

Step 5: Apply the model

The final step of Pasture Monitor's biomass modelling involves collecting satellite data for all paddocks in Pasture Monitor's database and applying the biomass model to those paddocks. The assumption is that the accuracy of the predicted biomass is the same as the accuracy that was quantified in the previous step. This is not always the case, so the uncertainty of the model (represented by the RMSE) should be taken into consideration whenever the predicted biomass data is used. This uncertainty is reported on the Biomass Per Paddock Graph and updated daily. The next section explains the model uncertainty in more detail.
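One simple way to carry the model uncertainty along with a prediction is to report the predicted value together with a band of plus or minus one RMSE. The sketch below is an assumption about how this could be done, not a description of how the Biomass Per Paddock Graph is produced, and the band is a rough indication rather than a formal confidence interval. The predicted value of 2400 kg/ha is made up; the RMSE of 188.3 kg/ha is the value reported for the example model in this document.

```python
def biomass_with_uncertainty(predicted_kg_ha, rmse_kg_ha):
    """Return a (lower, upper) band of plus or minus one RMSE around a prediction."""
    lower = max(0.0, predicted_kg_ha - rmse_kg_ha)  # biomass cannot be negative
    upper = predicted_kg_ha + rmse_kg_ha
    return lower, upper

low, high = biomass_with_uncertainty(2400.0, 188.3)
print(f"{low:.1f} to {high:.1f} kg/ha")  # 2211.7 to 2588.3 kg/ha
```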

MODEL PERFORMANCE

The biomass modelling steps (see previous section) are applied every day, which means that the model performance varies from day to day. For example, the scatter plot below shows the results for a model that was trained using 80% of the available samples and applied to the remaining 20% of the samples, which were not used in the training process. In other words, the actual dry biomass (kg/ha) values on the x-axis of the graph are independent of the data used to build the model. The y-axis of the graph shows the biomass values that were predicted using the model.

The R² value of this model is 0.87 and the RMSE is 188.3 kg/ha.

The graph illustrates that, on occasion, the model overestimated or underestimated biomass. In particular, overestimation tended to occur in the 1300 to 2000 kg/ha range, whereas underestimation tended to occur in the 2000 to 4000 kg/ha range. However, these over- and underestimations are uncommon. Based on a random sample of 30,000 cases, underestimations (by more than 200 kg/ha) occurred 1,361 times (4.5%), while overestimations (by more than 200 kg/ha) occurred 1,402 times (4.7%).

Phrased differently, in 90.8% of the cases, the model predicted biomass to within 200 kg/ha of the actual biomass.
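The breakdown into underestimations, overestimations and predictions within tolerance can be computed as sketched below. The function name and the five example values are hypothetical; only the 200 kg/ha tolerance comes from the text above.

```python
def error_breakdown(actual, predicted, tolerance=200):
    """Share of predictions that are under, over, or within the tolerance (kg/ha)."""
    n = len(actual)
    under = sum(1 for a, p in zip(actual, predicted) if p - a < -tolerance)
    over = sum(1 for a, p in zip(actual, predicted) if p - a > tolerance)
    within = n - under - over
    return {"under": under / n, "over": over / n, "within": within / n}

# Hypothetical measured and predicted biomass values (kg/ha).
actual = [1500, 1800, 2500, 3200, 3900]
predicted = [1750, 1850, 2450, 2900, 3800]

# Errors are +250, +50, -50, -300 and -100 kg/ha.
print(error_breakdown(actual, predicted))
# {'under': 0.2, 'over': 0.2, 'within': 0.6}
```

The three shares always sum to 1, so the "within tolerance" share can also be obtained by subtracting the under- and overestimation counts from the total.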

The accuracy of the biomass modelling changes continuously as new biomass measurements are added to the training dataset. Generally, the addition of new data should reduce error (because the machine learning model has more examples from which to learn), but that is not always the case because more data can also introduce more variation. However, as more variation is introduced, the model's ability to handle new (unseen) data also improves. Data scientists refer to this as the model's robustness or transferability. Pasture Monitor's model is therefore expected to become more robust and accurate over time as new pasture measurement data is added to the training dataset.

REFERENCES

Munkhdulam Otgonbayar, Clement Atzberger, Jonathan Chambers & Amarsaikhan Damdinsuren (2019) Mapping pasture biomass in Mongolia using Partial Least Squares, Random Forest regression and Landsat 8 imagery, International Journal of Remote Sensing, 40:8, 3204-3226, DOI: 10.1080/01431161.2018.1541110