rainforest.ml package

This submodule deals with the training and evaluation of machine learning QPE methods. It also allows trained models to be read from pickle files stored in the rf_models subfolder.

rainforest.ml.rf : main module used to train RF regressors

rainforest.ml.rfdefinitions : reference module that contains definitions of the RF regressors and allows to load them from files

rainforest.ml.rf_train : command-line utility to train RF models and prepare input features

rainforest.ml.utils : small utilities used in this module only (for example for vertical aggregation)

rainforest.ml.rf module

Main module to train RF regressors

class rainforest.ml.rf.RFTraining(db_location, input_location=None, force_regenerate_input=False, logmlflow='none', cv=0)

Bases: object

This is the main class for preparing data for random forest training, training random forests and performing cross-validation of trained models.

Initializes the class and, if needed, prepares the input data for the training.

Note that when calling this constructor the input data is only generated for the central pixel (NX = NY = 0 = location of the gauge). To regenerate the inputs for all neighbour pixels, call self.prepare_input(only_center=False).

Parameters:

db_location (str) – Location of the main directory of the database (with subfolders ‘reference’, ‘gauge’ and ‘radar’ on the filesystem)
input_location (str) – Location of the prepared input data, if this data cannot be found in this folder, it will be computed here, default is a subfolder called rf_input_data within db_location
force_regenerate_input (bool) – if True the input parquet files will always be regenerated from the database even if already present in the input_location folder
logmlflow (str, default='none') – Whether to log training metrics to MLFlow. Can be ‘none’ to not log anything, ‘metrics’ to only log metrics, or ‘all’ to log metrics and the trained model.

feature_selection(features_dic, featuresel_configfile, output_folder, K=5, tstart=None, tend=None)

Assesses the relative importance of all available input variables aggregated to the ground and selects the most important ones. For this purpose, an approach from Han et al. (2016) was adapted for regression. See Wolfensberger et al. (2021) for further information.

Parameters:

features_dic (dict) – A dictionary with all eligible features to test
featuresel_configfile (str) – yaml file with the setup
output_folder (str) – Path to where to store the scores
tstart (str (YYYYMMDDHHMM)) – A date to define a starting time for the input data
tend (str (YYYYMMDDHHMM)) – A date to define the end of the input data
K (int or None) – Number of splits to perform in the K-fold cross-validation

fit_models(config_file, features_dic, tstart=None, tend=None, output_folder=None, cv=0)

Fits new RF models that can be used to compute QPE realizations and saves them to disk in pickle format

Parameters:

config_file (str) – Location of the RF training configuration file. If not provided, the default one in the ml submodule will be used
features_dic (dict) – A dictionary whose keys are the names of the models you want to create (a string) and the values are lists of features you want to use. For example {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean']} will train a model with all these features that will then be stored under the name RF_dualpol_BC_<type of BC>.p in the ml/rf_models dir
tstart (datetime) – The starting time of the training time interval. Default is to start at the beginning of the time interval covered by the database
tend (datetime) – The end time of the training time interval. Default is to end at the end of the time interval covered by the database
output_folder (str) – Location where to store the trained models in pickle format. If not provided, they will be stored in the standard location <library_path>/ml/rf_models
cv (int, default=0) – Number of folds for cross-validation when running the fit function. If set to 0, will not perform cross-validation (i.e. no test error)

model_intercomparison(features_dic, intercomparison_configfile, output_folder, reference_products=['CPCH', 'RZC'], bounds10=[0, 2, 10, 100], bounds60=[0, 2, 10, 100], cross_val_type='years', K=5, years=None, tstart=None, tend=None, station_scores=False, save_model=False)

Performs an intercomparison (cross-validation) of different RF models and reference products (RZC, CPC, …) and produces the performance plots

Parameters:

features_dic (dict) – A dictionary whose keys are the names of the models you want to compare (a string) and the values are lists of features you want to use. For example {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean'], 'RF_hpol': ['RADAR', 'zh_VISIB_mean', 'T', 'HEIGHT', 'VISIB_mean']} will compare an RF model with polarimetric info to a model with only horizontal polarization
output_folder (str) – Location where to store the output plots
intercomparison_configfile (str) – Location of the intercomparison configuration file, which is a yaml file that gives, for every model key of features_dic, which parameters of the training you want to use (see the file intercomparison_config_example.yml in this module for an example)
reference_products (list of str) – Names of the reference products to which the RF will be compared. They need to be in the reference table of the database
bounds10 (list of float) – List of precipitation bounds for which to compute scores separately at 10 min time resolution. [0,2,10,100] will give scores in the ranges [0-2], [2-10] and [10-100]
bounds60 (list of float) – List of precipitation bounds for which to compute scores separately at hourly time resolution. [0,1,10,100] will give scores in the ranges [0-1], [1-10] and [10-100]
cross_val_type (str) – Defines how the split of events is done. Options are “random events”, “years” and “seasons” (TODO)
K (int or None) – Number of splits to perform in the K-fold cross-validation
years (list or None) – List with the years that should be used in cross-validation. Default is [2016,2017,2018,2019,2020,2021]
tstart (str (YYYYMMDDHHMM)) – A date to define a starting time for the input data
tend (str (YYYYMMDDHHMM)) – A date to define the end of the input data
station_scores (bool) – If True, performance scores for all stations will be calculated. If False, only the scores across Switzerland are calculated
save_model (bool) – If True, all models of the cross-validation are saved into a pickle file. This is useful for reproducibility
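As a rough illustration of how bounds10 and bounds60 partition the data, the sketch below assigns precipitation values to the consecutive ranges defined by a list of bounds. The helper name group_by_bounds and the use of half-open intervals [lower, upper) are assumptions made for this example; they are not taken from the rainforest code.

```python
from bisect import bisect_right


def group_by_bounds(values, bounds):
    """Group values into the consecutive ranges defined by bounds.

    Assumption (not from the library): ranges are half-open,
    [lower, upper), and values outside [bounds[0], bounds[-1]) are ignored.
    """
    ranges = list(zip(bounds[:-1], bounds[1:]))
    groups = {r: [] for r in ranges}
    for v in values:
        if bounds[0] <= v < bounds[-1]:
            # Index of the range whose lower bound is the last one <= v
            idx = bisect_right(bounds, v) - 1
            groups[ranges[idx]].append(v)
    return groups


# Hypothetical precipitation values in mm/h
groups = group_by_bounds([0.5, 1.9, 2.0, 7.3, 42.0, 150.0], [0, 2, 10, 100])
```

Scores would then be computed separately on each group, which is what allows weak and intense precipitation to be evaluated independently.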

prepare_input(only_center=True, foldername_radar='radar')

Reads the data from the database in db_location, processes it to create easy-to-use parquet input files for the ML training and stores them in input_location. The processing involves the following steps.

For every neighbour of the station (i.e. from (NX, NY) = (-1, -1) to (+1, +1)):

Replace missing flags by nans
Filter out timesteps which are not present in the three tables (gauge, reference and radar)
Filter out incomplete hours (i.e. where fewer than six 10 min timesteps are available)
Add height above ground and height of iso0 to radar data
Save a separate parquet file for radar, gauge and reference data
Save a grouping_idx pickle file containing grp_vertical index (groups all radar rows with same timestep and station), grp_hourly (groups all timesteps with same hours) and tstamp_unique (list of all unique timestamps)
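Two of the filtering steps above can be sketched with plain Python, as a minimal illustration only. The function names are hypothetical, timestamps are assumed to be unique UNIX times in seconds, and an hour is assumed to be the clock hour obtained by integer division by 3600; the actual implementation works on the database tables and may define hours differently.

```python
from collections import Counter


def common_timesteps(gauge, reference, radar):
    """Keep only the timestamps present in all three tables."""
    return sorted(set(gauge) & set(reference) & set(radar))


def complete_hours_only(timestamps, steps_per_hour=6):
    """Drop timestamps that belong to hours with fewer than six 10 min steps.

    Assumption: an hour is identified by timestamp // 3600.
    """
    counts = Counter(t // 3600 for t in timestamps)
    return [t for t in timestamps if counts[t // 3600] == steps_per_hour]


# Hypothetical data: a complete first hour and a partial second hour
gauge = list(range(0, 3600, 600)) + [3600, 4200, 4800]
reference = list(gauge)
radar = [t for t in gauge if t != 4800]  # radar misses one timestep

common = common_timesteps(gauge, reference, radar)
kept = complete_hours_only(common)
```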

Parameters:

only_center (bool) – If set to True, only the input data for the central neighbour, i.e. NX = NY = 0 (the location of the gauge), will be recomputed. This takes much less time and is the default option, since until now the neighbour values are not used in the training of the RF QPE
foldername_radar (str) – Name of the folder to use for the radar data. Default name is ‘radar’

rainforest.ml.rf_train module

Command line script to prepare input features and train RF models

see rf_train

rainforest.ml.rf_train.main()

rainforest.ml.rfdefinitions module

Class declarations and reading functions required to unpickle trained RandomForest models

Daniel Wolfensberger MeteoSwiss/EPFL daniel.wolfensberger@epfl.ch December 2019

class rainforest.ml.rfdefinitions.MyCustomUnpickler(file, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=())

Bases: Unpickler

This is an extension of the pickle Unpickler that handles the bookkeeping of references to the RandomForestRegressorBC class

find_class(module, name)

Return an object from a specified module.

If necessary, the module will be imported. Subclasses may override this method (e.g. to restrict unpickling of arbitrary classes and functions).

This method is called whenever a class or a function object is needed. Both arguments passed are str objects.
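The general technique of overriding find_class can be sketched with the standard library alone. The example below is not the rainforest implementation: the class Model, the module name legacy_module and the redirection rule are all hypothetical. It only shows how an Unpickler subclass can resolve a class reference that points to a module which no longer exists.

```python
import io
import pickle
import sys
import types


class Model:
    """Stand-in for a class whose pickles reference an old module path."""

    def __init__(self, value):
        self.value = value


class RedirectingUnpickler(pickle.Unpickler):
    """Resolve references to 'Model' locally, whatever module they name."""

    def find_class(self, module, name):
        if name == "Model":
            return Model
        return super().find_class(module, name)


def make_legacy_pickle():
    """Create a pickle that refers to legacy_module.Model (hypothetical)."""
    legacy = types.ModuleType("legacy_module")
    legacy.Model = Model
    sys.modules["legacy_module"] = legacy
    original_module = Model.__module__
    Model.__module__ = "legacy_module"
    try:
        return pickle.dumps(Model(42))
    finally:
        # Undo the temporary setup so that legacy_module no longer exists
        Model.__module__ = original_module
        del sys.modules["legacy_module"]


payload = make_legacy_pickle()
restored = RedirectingUnpickler(io.BytesIO(payload)).load()
```

A plain pickle.loads(payload) would fail here because legacy_module cannot be imported, whereas the subclass redirects the lookup to the class that is currently defined.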

class rainforest.ml.rfdefinitions.RandomForestRegressorBC(variables, beta, visib_weighting, degree=1, bctype='cdf', metadata={}, n_estimators=100, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)

Bases: RandomForestRegressor

Extended RandomForestRegressor with optional bias correction, on-the-fly rounding, metadata, and optional cross-validation.

fit(X, y, sample_weight=None, logmlflow='none', cv=0)

Fit both the estimator and the a-posteriori bias correction, with optional cross-validation.

Parameters:

X (array-like or sparse matrix, shape=(n_samples, n_features)) – The input samples.
y (array-like, shape=(n_samples,)) – The target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
logmlflow (str, default='none') – Whether to log training metrics to MLFlow. Can be ‘none’ to not log anything, ‘metrics’ to only log metrics, or ‘all’ to log metrics and the trained model.
cv (int, default=0) – Number of folds for cross-validation. If set to 0, will not perform cross-validation (i.e. no test error)

Returns:

self

Return type:

object

fit_bias_correction(y, y_pred)

predict(X, round_func=None, bc=True)

Predict regression target for X. The predicted regression target of an input sample is computed as the mean predicted regression targets of the trees in the forest.

Parameters:

X – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
round_func (lambda function) – Optional function to apply to outputs (for example to discretize them using MCH lookup tables). If not provided f(x) = x will be applied (i.e. no function)
bc (bool) – if True the bias correction function will be applied

Returns:

y – The predicted values.

Return type:

array-like of shape (n_samples,) or (n_samples, n_outputs)
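To give an idea of what a round_func could look like, the sketch below snaps every value to the nearest entry of a sorted lookup table. The table values and the function name snap_to_table are hypothetical and do not correspond to the MCH lookup tables. The example works on plain Python sequences; since predict() works with arrays, an array-based version would probably be preferable in practice.

```python
from bisect import bisect_left

# Hypothetical discretization levels (mm/h), sorted in ascending order
LOOKUP = [0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0]


def snap_to_table(values, table=LOOKUP):
    """Replace every value by the nearest entry of a sorted table."""
    snapped = []
    for v in values:
        i = bisect_left(table, v)
        # Only the neighbours on either side of v can be the nearest entry
        candidates = table[max(i - 1, 0):i + 1]
        snapped.append(min(candidates, key=lambda t: abs(t - v)))
    return snapped


discretized = snap_to_table([0.34, 12, -1, 1.0])
```

Values outside the table are clipped to its first or last entry, which is a design choice of this sketch rather than documented behaviour.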

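set_fit_request(*, cv: bool | None | str = '$UNCHANGED$', logmlflow: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') → RandomForestRegressorBC

Configure whether metadata should be requested to be passed to the fit method.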

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to fit.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

cv (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cv parameter in fit.
logmlflow (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for logmlflow parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, bc: bool | None | str = '$UNCHANGED$', round_func: bool | None | str = '$UNCHANGED$') → RandomForestRegressorBC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to predict.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

bc (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for bc parameter in predict.
round_func (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for round_func parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → RandomForestRegressorBC

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to score.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

rainforest.ml.rfdefinitions.read_rf(rf_name='', filepath='', mlflow_runid=None)

Reads a random forest model from the RF models folder using pickle. All custom classes and functions used in the construction of these pickled models must be defined in the module ml/rfdefinitions.py

Parameters:

rf_name (str) – Name of the random forest model. It must be stored in the folder /ml/rf_models and computed with the rf.RFTraining.fit_models method
filepath (str) – Path to the model files, if not in default folder
mlflow_runid (str) – If the model needs to be downloaded from mlflow, this variable indicates the run ID that contains the model to use. If this value is not None, rf_name and filepath are ignored. The env variable MLFLOW_TRACKING_URI needs to be set.

Returns:

A trained sklearn random forest instance with a predict() method
that predicts precipitation intensities for new points

rainforest.ml.utils module

Utility functions for the ML submodule

rainforest.ml.utils.nesteddictvalues(d)

rainforest.ml.utils.split_event(timestamps, n=5, threshold_hr=12, random_state=None)

Splits the dataset into n subsets by separating the observations into separate precipitation events and attributing these events randomly to the subsets

Parameters:

timestamps (int array) – array containing the UNIX timestamps of the precipitation observations
n (int) – number of subsets to create
threshold_hr (int) – threshold in hours to distinguish precip events. Two timestamps are considered to belong to different events if there are at least threshold_hr hours of no observations (no rain) between them.
random_state (None or int) – Reproducibility of event assignment

Returns:

split_idx – array containing the subset grouping, with values from 0 to n - 1

Return type:

int array
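The event-based split described above can be sketched as follows. This is only an illustration with the standard library: the function name split_event_sketch is hypothetical, and drawing an independent random subset for every event is an assumption, since the actual function may balance the subsets differently.

```python
import random


def split_event_sketch(timestamps, n=5, threshold_hr=12, random_state=None):
    """Assign every timestamp to one of n subsets, event by event."""
    if len(timestamps) == 0:
        return []
    # Process the observations in chronological order
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    event_id = [0] * len(timestamps)
    current = 0
    for prev, cur in zip(order[:-1], order[1:]):
        # A gap of at least threshold_hr hours starts a new event
        if timestamps[cur] - timestamps[prev] >= threshold_hr * 3600:
            current += 1
        event_id[cur] = current
    # Assumption: each event is attributed to a uniformly random subset
    rng = random.Random(random_state)
    subset_of_event = [rng.randrange(n) for _ in range(current + 1)]
    return [subset_of_event[e] for e in event_id]


# Hypothetical UNIX timestamps forming three events
stamps = [0, 600, 1200, 100000, 100600, 200000]
split_idx = split_event_sketch(stamps, n=5, threshold_hr=12, random_state=1)
```

Keeping all timesteps of an event in the same subset avoids strongly correlated observations ending up in both the training and the test data.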

rainforest.ml.utils.split_years(timestamps, years=[2016, 2017, 2018, 2019, 2020, 2021])

Splits the dataset into n subsets by separating the observations into separate years

Parameters:

timestamps (int array) – array containing the UNIX timestamps of the precipitation observations
years (int list) – all years to split into

Returns:

split_idx – array containing the subset grouping, with values from 0 to len(years) - 1

Return type:

int array
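A year-based split can be sketched in a few lines. The function name split_years_sketch is hypothetical, timestamps are interpreted in UTC, and returning -1 for years that are not in the list is an assumption of this example, not documented behaviour.

```python
from datetime import datetime, timezone


def split_years_sketch(timestamps, years=(2016, 2017, 2018, 2019, 2020, 2021)):
    """Map every UNIX timestamp to the index of its year in years."""
    index_of_year = {year: i for i, year in enumerate(years)}
    split_idx = []
    for t in timestamps:
        year = datetime.fromtimestamp(t, tz=timezone.utc).year
        # Assumption: timestamps from unlisted years are flagged with -1
        split_idx.append(index_of_year.get(year, -1))
    return split_idx


# Hypothetical timestamps in 2016, 2018 and 2015
stamps = [
    datetime(2016, 6, 1, tzinfo=timezone.utc).timestamp(),
    datetime(2018, 1, 15, tzinfo=timezone.utc).timestamp(),
    datetime(2015, 12, 31, tzinfo=timezone.utc).timestamp(),
]
year_idx = split_years_sketch(stamps)
```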

rainforest.ml.utils.vert_aggregation(radar_data, vert_weights, grp_vertical, visib_weight=True, visib=None)

Performs vertical aggregation of radar observations aloft to the ground using a weighted average. Categorical variables such as ‘RADAR’, ‘HYDRO’ and ‘TCOUNT’ are converted to dummy variables, which are then aggregated. This results in columns such as RADAR_propA, giving the weighted proportion of radar observations aloft that were obtained with the Albis radar

Parameters:

radar_data (Pandas DataFrame) – A Pandas DataFrame containing all required input features aloft as explained in the rf.py module
vert_weights (np.array of float) – vertical weights to use for every observation in radar, must have the same len as radar_data
grp_vertical (np.array of int) – grouping index for the vertical aggregation. It must have the same len as radar_data. All observations corresponding to the same timestep must have the same label
visib_weight (bool) – if True the input features will be weighted by the visibility when doing the vertical aggregation to the ground
visib (np array) – visibility of every observation, required only if visib_weight = True
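The weighted vertical average can be illustrated for a single numerical variable with plain Python. The function name vert_aggregation_sketch is hypothetical, and multiplying the vertical weight by the visibility is an assumption about how the visibility weighting is combined with vert_weights; the real function operates on a DataFrame and also handles categorical variables.

```python
from collections import defaultdict


def vert_aggregation_sketch(values, vert_weights, grp_vertical, visib=None):
    """Weighted average of values within every vertical group."""
    num = defaultdict(float)
    den = defaultdict(float)
    for i, (v, w, g) in enumerate(zip(values, vert_weights, grp_vertical)):
        if visib is not None:
            # Assumption: visibility simply scales the vertical weight
            w = w * visib[i]
        num[g] += w * v
        den[g] += w
    # Groups with zero total weight are left out
    return {g: num[g] / den[g] for g in num if den[g] > 0}


# Hypothetical observations aloft for two (timestep, station) groups
values = [10, 20, 30, 40]
weights = [1, 1, 1, 3]
groups = [0, 0, 1, 1]

plain = vert_aggregation_sketch(values, weights, groups)
weighted = vert_aggregation_sketch(values, weights, groups, visib=[100, 0, 50, 50])
```

In the second call the fully blocked observation (visibility 0) no longer contributes to the ground estimate of group 0.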