rainforest.ml package

This submodule deals with the training and evaluation of machine learning QPE methods. It also allows reading trained models from pickle files stored in the rf_models subfolder.

rainforest.ml.rf : main module used to train RF regressors

rainforest.ml.rfdefinitions : reference module that contains definitions of the RF regressors and allows to load them from files

rainforest.ml.rf_train : command-line utility to train RF models and prepare input features

rainforest.ml.utils : small utilities used in this module only (for example for vertical aggregation)

rainforest.ml.rf module

Main module used to prepare input data, train RF regressors and cross-validate trained models.

class rainforest.ml.rf.RFTraining(db_location, input_location=None, force_regenerate_input=False, logmlflow='none', cv=0)

Bases: object

This is the main class used to prepare data for random forest training, train random forests and perform cross-validation of trained models

Initializes the class and, if needed, prepares input data for the training

Note that when calling this constructor the input data is only generated for the central pixel (NX = NY = 0, i.e. the location of the gauge); if you want to regenerate the inputs for all neighbour pixels, please call the function self.prepare_input(only_center=False)

Parameters:
  • db_location (str) – Location of the main directory of the database (with subfolders ‘reference’, ‘gauge’ and ‘radar’ on the filesystem)

  • input_location (str) – Location of the prepared input data; if this data cannot be found in this folder, it will be computed there. Default is a subfolder called rf_input_data within db_location

  • force_regenerate_input (bool) – if True the input parquet files will always be regenerated from the database even if already present in the input_location folder

  • logmlflow (str, default='none') – Whether to log training metrics to MLFlow. Can be ‘none’ to not log anything, ‘metrics’ to only log metrics, or ‘all’ to log metrics and the trained model.

  • cv (int, default=0) – Number of folds for cross-validation when fitting models. If set to 0, no cross-validation is performed (i.e. no test error)
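
Example (a minimal usage sketch; all paths are placeholders, not part of the library):

    from rainforest.ml.rf import RFTraining

    # The database directory must contain the 'reference', 'gauge'
    # and 'radar' subfolders
    rf_trainer = RFTraining('/path/to/rf_database',
                            input_location='/path/to/rf_database/rf_input_data',
                            force_regenerate_input=False,
                            logmlflow='none')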

feature_selection(features_dic, featuresel_configfile, output_folder, K=5, tstart=None, tend=None)

To assess the relative importance of all available input variables aggregated to the ground and to choose the most important ones, an approach from Han et al. (2016) was adapted for regression. See Wolfensberger et al. (2021) for further information.

Parameters:
  • features_dic (dict) – A dictionary with all eligible features to test

  • featuresel_configfile (str) – Path to the YAML file with the feature selection setup

  • output_folder (str) – Path to where to store the scores

  • tstart (str (YYYYMMDDHHMM)) – A date to define a starting time for the input data

  • tend (str (YYYYMMDDHHMM)) – A date to define the end of the input data

  • K (int or None) – Number of splits (folds) to perform in the K-fold cross-validation
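
Example (an illustrative call continuing the RFTraining sketch above; the YAML file, output path and time bounds are placeholders):

    features = {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean',
                               'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT',
                               'VISIB_mean']}
    rf_trainer.feature_selection(features, 'featuresel_config.yml',
                                 '/path/to/scores', K=5,
                                 tstart='201801010000', tend='201912312350')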

fit_models(config_file, features_dic, tstart=None, tend=None, output_folder=None, cv=0)

Fits new RF models that can be used to compute QPE realizations and saves them to disk in pickle format

Parameters:
  • config_file (str) – Location of the RF training configuration file, if not provided the default one in the ml submodule will be used

  • features_dic (dict) – A dictionary whose keys are the names of the models you want to create (a string) and the values are lists of features you want to use. For example {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean']} will train a model with all these features that will then be stored under the name RF_dualpol_BC_<type of BC>.p in the ml/rf_models dir

  • tstart (datetime) – the starting time of the training time interval, default is to start at the beginning of the time interval covered by the database

  • tend (datetime) – the end time of the training time interval, default is to end at the end of the time interval covered by the database

  • output_folder (str) – Location where to store the trained models in pickle format, if not provided it will store them in the standard location <library_path>/ml/rf_models

  • cv (int, default=0) – Number of folds for cross-validation, when running fit function. If set to 0, will not perform cross-validation (i.e. no test error)
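
Example (a sketch continuing from the RFTraining instance above; the configuration file and output folder are placeholders, and note that tstart/tend are datetime objects here):

    from datetime import datetime

    features_dic = {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean',
                                   'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT',
                                   'VISIB_mean']}
    rf_trainer.fit_models('rf_config.yml', features_dic,
                          tstart=datetime(2018, 1, 1),
                          tend=datetime(2019, 12, 31),
                          output_folder='/path/to/rf_models',
                          cv=5)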

model_intercomparison(features_dic, intercomparison_configfile, output_folder, reference_products=['CPCH', 'RZC'], bounds10=[0, 2, 10, 100], bounds60=[0, 2, 10, 100], cross_val_type='years', K=5, years=None, tstart=None, tend=None, station_scores=False, save_model=False)

Performs an intercomparison (cross-validation) of different RF models and reference products (RZC, CPC, …) and generates performance plots

Parameters:
  • features_dic (dict) – A dictionary whose keys are the names of the models you want to compare (a string) and the values are lists of features you want to use. For example {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean'], 'RF_hpol': ['RADAR', 'zh_VISIB_mean', 'T', 'HEIGHT', 'VISIB_mean']} will compare a model of RF with polarimetric info to a model with only horizontal polarization

  • output_folder (str) – Location where to store the output plots

  • intercomparison_configfile (str) – Location of the intercomparison configuration file, a yaml file that specifies, for every model key of features_dic, which training parameters to use (see the file intercomparison_config_example.yml in this module for an example)

  • reference_products (list of str) – Name of the reference products to which the RF will be compared they need to be in the reference table of the database

  • bounds10 (list of float) – List of precipitation bounds for which to compute scores separately at 10 min time resolution. For example, [0, 2, 10, 100] will give scores in the ranges [0-2], [2-10] and [10-100]

  • bounds60 (list of float) – List of precipitation bounds for which to compute scores separately at hourly time resolution. For example, [0, 1, 10, 100] will give scores in the ranges [0-1], [1-10] and [10-100]

  • cross_val_type (str) – Defines how the split of events is done. Options are “random events”, “years” and “seasons” (TODO)

  • K (int or None) – Number of splits (folds) to perform in the K-fold cross-validation

  • years (list or None) – List of the years that should be used in cross-validation. Default is [2016, 2017, 2018, 2019, 2020, 2021]

  • tstart (str (YYYYMMDDHHMM)) – A date to define a starting time for the input data

  • tend (str (YYYYMMDDHHMM)) – A date to define the end of the input data

  • station_scores (bool) – If True, performance scores for all stations will be calculated. If False, only the scores across Switzerland are calculated

  • save_model (bool) – If True, all models of the cross-validation are saved into a pickle file. This is useful for reproducibility
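
Example (an illustrative comparison of a dual-polarization and a horizontal-polarization model; paths and file names are placeholders):

    features_dic = {
        'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean',
                       'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean'],
        'RF_hpol': ['RADAR', 'zh_VISIB_mean', 'T', 'HEIGHT', 'VISIB_mean'],
    }
    rf_trainer.model_intercomparison(features_dic, 'intercomparison_config.yml',
                                     '/path/to/plots',
                                     reference_products=['CPCH', 'RZC'],
                                     cross_val_type='years',
                                     years=[2016, 2017, 2018, 2019, 2020, 2021])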

prepare_input(only_center=True, foldername_radar='radar')

Reads the data from the database in db_location and processes it to create easy-to-use parquet input files for the ML training, storing them in input_location. The processing involves the following steps.

For every neighbour of the station (i.e. offsets from (-1, -1) to (+1, +1)):

  • Replace missing flags by nans

  • Filter out timesteps which are not present in the three tables (gauge, reference and radar)

  • Filter out incomplete hours (i.e. where fewer than six 10 min timesteps are available)

  • Add height above ground and height of iso0 to radar data

  • Save a separate parquet file for radar, gauge and reference data

  • Save a grouping_idx pickle file containing grp_vertical (an index that groups all radar rows with the same timestep and station), grp_hourly (groups all timesteps within the same hour) and tstamp_unique (a list of all unique timestamps)

Parameters:
  • only_center (bool) – If set to True, only the input data for the central neighbour, i.e. NX = NY = 0 (the location of the gauge), will be recomputed. This takes much less time and is the default option, since until now the neighbour values are not used in the training of the RF QPE

  • foldername_radar (str) – Name of the folder to use for the radar data. Default name is ‘radar’
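
Example (regenerating the input data for all neighbour pixels rather than only the central one, continuing the sketch above):

    rf_trainer.prepare_input(only_center=False)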

rainforest.ml.rf_train module

Command line script to prepare input features and train RF models

see rf_train

rainforest.ml.rf_train.main()

rainforest.ml.rfdefinitions module

Class declarations and reading functions required to unpickle trained RandomForest models

Daniel Wolfensberger MeteoSwiss/EPFL daniel.wolfensberger@epfl.ch December 2019

class rainforest.ml.rfdefinitions.MyCustomUnpickler(file, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=())

Bases: Unpickler

This is an extension of the pickle Unpickler that handles the bookkeeping of references to the RandomForestRegressorBC class

find_class(module, name)

Return an object from a specified module.

If necessary, the module will be imported. Subclasses may override this method (e.g. to restrict unpickling of arbitrary classes and functions).

This method is called whenever a class or a function object is needed. Both arguments passed are str objects.
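
The following is a minimal sketch of the general pattern such an unpickler follows; it is not the library's exact implementation, and the model file name is a placeholder:

    import pickle

    class _SketchUnpickler(pickle.Unpickler):
        # Redirect lookups of the custom regressor class to the local definition
        def find_class(self, module, name):
            if name == 'RandomForestRegressorBC':
                from rainforest.ml.rfdefinitions import RandomForestRegressorBC
                return RandomForestRegressorBC
            return super().find_class(module, name)

    with open('RF_dualpol_BC_cdf.p', 'rb') as f:
        model = _SketchUnpickler(f).load()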

class rainforest.ml.rfdefinitions.RandomForestRegressorBC(variables, beta, visib_weighting, degree=1, bctype='cdf', metadata={}, n_estimators=100, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)

Bases: RandomForestRegressor

Extended RandomForestRegressor with optional bias correction, on-the-fly rounding, metadata, and optional cross-validation.

fit(X, y, sample_weight=None, logmlflow='none', cv=0)

Fit both estimator and a-posteriori bias correction with optional cross-validation.

Parameters:
  • X (array-like or sparse matrix, shape=(n_samples, n_features)) – The input samples.

  • y (array-like, shape=(n_samples,)) – The target values.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • logmlflow (str, default='none') – Whether to log training metrics to MLFlow. Can be ‘none’ to not log anything, ‘metrics’ to only log metrics, or ‘all’ to log metrics and the trained model.

  • cv (int, default=0) – Number of folds for cross-validation. If set to 0, will not perform cross-validation (i.e. no test error)

Returns:

self

Return type:

object

fit_bias_correction(y, y_pred)
predict(X, round_func=None, bc=True)

Predict regression target for X. The predicted regression target of an input sample is computed as the mean predicted regression targets of the trees in the forest.

Parameters:
  • X (array-like or sparse matrix of shape (n_samples, n_features)) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

  • round_func (lambda function) – Optional function to apply to outputs (for example to discretize them using MCH lookup tables). If not provided f(x) = x will be applied (i.e. no function)

  • bc (bool) – if True the bias correction function will be applied

Returns:

y – The predicted values.

Return type:

array-like of shape (n_samples,) or (n_samples, n_outputs)
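
Example (a minimal sketch with synthetic data; the constructor values are illustrative, not recommended settings):

    import numpy as np
    from rainforest.ml.rfdefinitions import RandomForestRegressorBC

    X = np.random.rand(500, 3)            # three input features
    y = np.random.gamma(2.0, 1.0, 500)    # synthetic precipitation intensities

    rf = RandomForestRegressorBC(['zh_VISIB_mean', 'T', 'HEIGHT'],
                                 beta=-0.5, visib_weighting=False,
                                 bctype='cdf', n_estimators=50)
    rf.fit(X, y, cv=0)
    y_pred = rf.predict(X, bc=True)       # apply the a-posteriori bias correction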

set_fit_request(*, cv: bool | None | str = '$UNCHANGED$', logmlflow: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') RandomForestRegressorBC

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:
  • cv (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the cv parameter in fit.

  • logmlflow (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the logmlflow parameter in fit.

  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object
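
A short sketch of how this might be used (assuming scikit-learn >= 1.3 with metadata routing enabled; only relevant when the estimator is used inside a meta-estimator):

    import sklearn

    sklearn.set_config(enable_metadata_routing=True)
    # Request that sample_weight passed to a meta-estimator be routed to fit
    rf = rf.set_fit_request(sample_weight=True)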

set_predict_request(*, bc: bool | None | str = '$UNCHANGED$', round_func: bool | None | str = '$UNCHANGED$') RandomForestRegressorBC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:
  • bc (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the bc parameter in predict.

  • round_func (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the round_func parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') RandomForestRegressorBC

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

rainforest.ml.rfdefinitions.read_rf(rf_name='', filepath='', mlflow_runid=None)

Reads a RandomForest model from the RF models folder using pickle. All custom classes and functions used in the construction of these pickled models must be defined in the module ml/rfdefinitions.py

Parameters:
  • rf_name (str) – Name of the RandomForest model; it must be stored in the folder /ml/rf_models and computed with the rf.RFTraining.fit_models function

  • filepath (str) – Path to the model files, if not in default folder

  • mlflow_runid (str) – If the model needs to be downloaded from mlflow, this variable indicates the run ID that contains the model to use. If this value is not None, rf_name and filepath are ignored. The env variable MLFLOW_TRACKING_URI needs to be set.

Returns:

A trained sklearn RandomForest instance that has the predict() method, which allows predicting precipitation intensities for new points
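
Example (a typical loading sketch; the model file name is a placeholder and X_new stands for input features prepared as described in the rf module):

    from rainforest.ml.rfdefinitions import read_rf

    model = read_rf('RF_dualpol_BC_cdf.p')
    precip = model.predict(X_new)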

rainforest.ml.utils module

Utility functions for the ML submodule

rainforest.ml.utils.nesteddictvalues(d)
rainforest.ml.utils.split_event(timestamps, n=5, threshold_hr=12, random_state=None)

Splits the dataset into n subsets by separating the observations into distinct precipitation events and randomly assigning these events to the subsets

Parameters:
  • timestamps (int array) – array containing the UNIX timestamps of the precipitation observations

  • n (int) – number of subsets to create

  • threshold_hr (int) – threshold in hours to distinguish precipitation events. Two timestamps are considered to belong to different events if there are at least threshold_hr hours without observations (no rain) between them.

  • random_state (None or int) – Reproducibility of event assignment

Returns:

split_idx – array containing the subset grouping, with values from 0 to n - 1

Return type:

int array
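
Example (a small synthetic check: two clusters of 10 min timestamps separated by more than threshold_hr hours are treated as two events and assigned randomly to the subsets):

    import numpy as np
    from rainforest.ml.utils import split_event

    event1 = 1577836800 + np.arange(6) * 600   # six 10 min steps
    event2 = event1 + 24 * 3600                # a second event one day later
    timestamps = np.concatenate([event1, event2])

    split_idx = split_event(timestamps, n=2, threshold_hr=12, random_state=0)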

rainforest.ml.utils.split_years(timestamps, years=[2016, 2017, 2018, 2019, 2020, 2021])

Splits the dataset into subsets by separating the observations into separate years

Parameters:
  • timestamps (int array) – array containing the UNIX timestamps of the precipitation observations

  • years (int list) – all years to split into

Returns:

split_idx – array containing the subset grouping, with values from 0 to len(years) - 1

Return type:

int array

rainforest.ml.utils.vert_aggregation(radar_data, vert_weights, grp_vertical, visib_weight=True, visib=None)

Performs vertical aggregation of radar observations aloft to the ground using a weighted average. Categorical variables such as ‘RADAR’, ‘HYDRO’ and ‘TCOUNT’ will be converted to dummy variables, and these dummy variables will be aggregated, resulting in columns such as RADAR_propA, which gives the weighted proportion of radar observations aloft that were obtained with the Albis radar

Parameters:
  • radar_data (Pandas DataFrame) – A Pandas DataFrame containing all required input features aloft as explained in the rf.py module

  • vert_weights (np.array of float) – vertical weights to use for every observation in radar_data; must have the same length as radar_data

  • grp_vertical (np.array of int) – grouping index for the vertical aggregation. It must have the same length as radar_data. All observations corresponding to the same timestep must have the same label

  • visib_weight (bool) – if True the input features will be weighted by the visibility when doing the vertical aggregation to the ground

  • visib (np.array) – visibility of every observation, required only if visib_weight = True
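
The following is a conceptual sketch of the weighted aggregation, not the library code: observations are averaged to the ground within each grp_vertical group using vert_weights (the column names and values are made up):

    import numpy as np
    import pandas as pd

    radar_data = pd.DataFrame({'zh_VISIB': [30.0, 32.0, 28.0],
                               'KDP': [0.10, 0.20, 0.15]})
    vert_weights = np.array([0.5, 0.3, 0.2])   # e.g. decreasing with height
    grp_vertical = np.array([0, 0, 0])         # all rows: one timestep/station

    w = pd.Series(vert_weights, index=radar_data.index)
    wsum = w.groupby(grp_vertical).sum()
    agg = radar_data.mul(w, axis=0).groupby(grp_vertical).sum().div(wsum, axis=0)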