rainforest.ml package

This submodule deals with the training and evaluation of machine learning QPE methods. It also provides functions to read trained models from pickle files stored in the rf_models subfolder.

rainforest.ml.rf : main module used to train RF regressors

rainforest.ml.rfdefinitions : reference module that contains the definitions of the RF regressors and allows them to be loaded from files

rainforest.ml.rf_train : command-line utility to train RF models and prepare input features

rainforest.ml.utils : small utilities used in this module only (for example for vertical aggregation)

rainforest.ml.rf module

rainforest.ml.rf_train module

Command line script to prepare input features and train RF models

see rf_train

rainforest.ml.rf_train.main()

rainforest.ml.rfdefinitions module

Class declarations and reading functions required to unpickle trained RandomForest models

Daniel Wolfensberger MeteoSwiss/EPFL daniel.wolfensberger@epfl.ch December 2019

class rainforest.ml.rfdefinitions.MyCustomUnpickler(file, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=())

Bases: Unpickler

This is an extension of the pickle Unpickler that handles the bookkeeping of references to the RandomForestRegressorBC class

find_class(module, name)

Return an object from a specified module.

If necessary, the module will be imported. Subclasses may override this method (e.g. to restrict unpickling of arbitrary classes and functions).

This method is called whenever a class or a function object is needed. Both arguments passed are str objects.
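Overriding find_class is the standard way to keep old pickles loadable after a class has moved. The sketch below shows the general pattern; the module mapping used here (old_pkg.models) is purely hypothetical, and the actual remapping done by MyCustomUnpickler is internal to rainforest.

```python
import io
import pickle
from collections import OrderedDict

class RemappingUnpickler(pickle.Unpickler):
    """Illustrative Unpickler that remaps stale module paths on load.

    This mirrors the general pattern used by MyCustomUnpickler; the
    actual mapping in rainforest is internal and may differ.
    """
    # hypothetical mapping: old pickled module path -> current path
    MODULE_MAP = {"old_pkg.models": "collections"}

    def find_class(self, module, name):
        # redirect lookups for moved modules, then defer to the default
        module = self.MODULE_MAP.get(module, module)
        return super().find_class(module, name)

# round-trip demo with a standard class
payload = pickle.dumps(OrderedDict(a=1))
obj = RemappingUnpickler(io.BytesIO(payload)).load()
```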

class rainforest.ml.rfdefinitions.RandomForestRegressorBC(variables, beta, visib_weighting, degree=1, bctype='cdf', metadata={}, n_estimators=100, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)

Bases: RandomForestRegressor

Extended RandomForestRegressor with optional bias correction, on-the-fly rounding, metadata, and optional cross-validation.

fit(X, y, sample_weight=None, logmlflow='none', cv=0)

Fit both the estimator and the a-posteriori bias correction, with optional cross-validation.

Parameters:
  • X (array-like or sparse matrix, shape=(n_samples, n_features)) – The input samples.

  • y (array-like, shape=(n_samples,)) – The target values.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

  • logmlflow (str, default='none') – Whether to log training metrics to MLFlow. Can be ‘none’ to not log anything, ‘metrics’ to only log metrics, or ‘all’ to log metrics and the trained model.

  • cv (int, default=0) – Number of folds for cross-validation. If set to 0, no cross-validation is performed (i.e. no test error is computed)

Returns:

self

Return type:

object
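The bctype='cdf' constructor option suggests a quantile-mapping (CDF matching) style of a-posteriori correction. The NumPy sketch below only illustrates that general idea; it is an assumption about the mechanism, not the actual RandomForestRegressorBC code (which, for instance, also takes a degree parameter for polynomial corrections).

```python
import numpy as np

def cdf_bias_correction(y_true, y_pred):
    """Illustrative a-posteriori bias correction by CDF (quantile) matching.

    Returns a function mapping raw predictions onto the observed
    distribution. Sketch only: the real implementation may differ.
    """
    qs = np.linspace(0, 1, 101)
    pred_q = np.quantile(y_pred, qs)   # quantiles of the raw predictions
    true_q = np.quantile(y_true, qs)   # quantiles of the observations
    # map a new prediction to its quantile, then to the observed value there
    return lambda y: np.interp(y, pred_q, true_q)

# toy example: predictions systematically underestimate by a factor 2
rng = np.random.default_rng(0)
y_true = rng.gamma(2.0, 1.0, 1000)
y_pred = 0.5 * y_true
correct = cdf_bias_correction(y_true, y_pred)
corrected = correct(y_pred)
```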

fit_bias_correction(y, y_pred)
predict(X, round_func=None, bc=True)

Predict the regression target for X. The predicted regression target of an input sample is computed as the mean of the predicted regression targets of the trees in the forest.

Parameters:
  • X (array-like or sparse matrix of shape (n_samples, n_features)) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

  • round_func (callable) – Optional function to apply to the outputs (for example to discretize them using MCH lookup tables). If not provided, the identity f(x) = x is applied (i.e. no rounding)

  • bc (bool) – if True the bias correction function will be applied

Returns:

y – The predicted values.

Return type:

array-like of shape (n_samples,) or (n_samples, n_outputs)
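A round_func is just a callable applied to the raw forest outputs. The snippet below mimics model.predict(X, round_func=...) on a plain array; the discretization levels are invented stand-ins for the MCH lookup tables mentioned above, which are not reproduced here.

```python
import numpy as np

# raw predictions as they might come out of the forest (stand-in array)
raw_pred = np.array([0.07, 0.23, 1.46, 3.91])

# example round_func: snap outputs to a fixed set of allowed intensity
# levels (hypothetical values, NOT the actual MCH lookup tables)
levels = np.array([0.0, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0])
round_func = lambda y: levels[np.argmin(np.abs(levels[:, None] - y), axis=0)]

discretized = round_func(raw_pred)  # each value snapped to its nearest level
```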

set_fit_request(*, cv: bool | None | str = '$UNCHANGED$', logmlflow: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') → RandomForestRegressorBC

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:
  • cv (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cv parameter in fit.

  • logmlflow (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for logmlflow parameter in fit.

  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, bc: bool | None | str = '$UNCHANGED$', round_func: bool | None | str = '$UNCHANGED$') → RandomForestRegressorBC

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:
  • bc (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for bc parameter in predict.

  • round_func (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for round_func parameter in predict.

Returns:

self – The updated object.

Return type:

object

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → RandomForestRegressorBC

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

rainforest.ml.rfdefinitions.read_rf(rf_name='', filepath='', mlflow_runid=None)

Reads a RandomForest model from the RF models folder using pickle. All custom classes and functions used in the construction of these pickled models must be defined in the module ml/rfdefinitions.py

Parameters:
  • rf_name (str) – Name of the RandomForest model; it must be stored in the folder $RAINFOREST_DATAPATH/rf_models and have been trained with the rf.RFTraining.fit_model function

  • filepath (str) – Path to the model files, if not in default folder

  • mlflow_runid (str) – If the model needs to be downloaded from mlflow, this variable indicates the run ID that contains the model to use. If this value is not None, rf_name and filepath are ignored. The env variable MLFLOW_TRACKING_URI needs to be set.

Returns:

A trained sklearn RandomForest instance with a predict() method that can be used to predict precipitation intensities for new points

rainforest.ml.utils module

Utility functions for the ML submodule

rainforest.ml.utils.make_run_id()
rainforest.ml.utils.nesteddictvalues(d)
rainforest.ml.utils.split_event(timestamps, n=5, threshold_hr=12, random_state=None)

Splits the dataset into n subsets by grouping the observations into separate precipitation events and assigning these events randomly to the subsets

Parameters:
  • timestamps (int array) – array containing the UNIX timestamps of the precipitation observations

  • n (int) – number of subsets to create

  • threshold_hr (int) – threshold in hours to distinguish precipitation events. Two timestamps are considered to belong to different events if there are at least threshold_hr hours without observations (no rain) between them.

  • random_state (None or int) – random seed for reproducible event assignment

Returns:

split_idx – array containing the subset grouping, with values from 0 to n - 1

Return type:

int array
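The event-splitting logic described above can be sketched in a few lines of NumPy. This is a simplified re-implementation for illustration, not the library's split_event itself, which may differ in detail (e.g. how it balances subset sizes).

```python
import numpy as np

def split_event_sketch(timestamps, n=5, threshold_hr=12, random_state=None):
    """Simplified sketch of the event-based splitting described above."""
    timestamps = np.sort(np.asarray(timestamps))
    # a new event starts whenever the gap to the previous observation
    # is at least threshold_hr hours
    gaps = np.diff(timestamps)
    event_id = np.concatenate([[0], np.cumsum(gaps >= threshold_hr * 3600)])
    # assign each event randomly to one of the n subsets
    rng = np.random.default_rng(random_state)
    n_events = event_id[-1] + 1
    event_to_subset = rng.integers(0, n, size=n_events)
    return event_to_subset[event_id]

# three clearly separated events (gaps of more than 24 h between them)
ts = [0, 600, 1200, 100000, 100600, 200000]
split_idx = split_event_sketch(ts, n=2, threshold_hr=12, random_state=1)
```

All observations of the same event end up in the same subset, so no single event leaks across a train/test boundary.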

rainforest.ml.utils.split_years(timestamps, years=[2016, 2017, 2018, 2019, 2020, 2021])

Splits the dataset into subsets by separating the observations by year

Parameters:
  • timestamps (int array) – array containing the UNIX timestamps of the precipitation observations

  • years (int list) – all years to split into

Returns:

split_idx – array containing the subset grouping, with values from 0 to len(years) - 1

Return type:

int array
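A by-year split reduces to mapping every UNIX timestamp to the index of its year in the years list. A minimal sketch of that behaviour (illustration only, assuming UTC timestamps; the library function may handle time zones differently):

```python
import numpy as np
from datetime import datetime, timezone

def split_years_sketch(timestamps, years):
    """Map each UNIX timestamp to the index of its year in `years`."""
    yrs = [datetime.fromtimestamp(t, tz=timezone.utc).year for t in timestamps]
    return np.array([years.index(y) for y in yrs])

# one observation early in 2016 and one early in 2017 (UTC)
ts = [1451606400 + 10, 1483228800 + 10]
idx = split_years_sketch(ts, years=[2016, 2017, 2018])
```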

rainforest.ml.utils.time_station_aggregation(df, aggregation_min, time_res_min=None)

Creates the aggregation array for every station for a period of time

Parameters:
  • df (dataframe) – dataframe that contains the STATION and TIMESTAMP columns

  • aggregation_min (int) – aggregation time to use, in minutes

  • time_res_min (int) – time in minutes between two consecutive timestamps; if not provided it will be inferred from the data

Returns:

agg – the aggregation array which contains values such as ABO1619866800 (station followed by aggregated UNIX timestamp)

Return type:

numpy str array
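Labels like ABO1619866800 can be built by flooring every timestamp to its aggregation window and prepending the station name. The sketch below illustrates that construction on plain arrays (the library operates on a DataFrame with STATION and TIMESTAMP columns):

```python
import numpy as np

def agg_labels(stations, timestamps, aggregation_min):
    """Sketch: station name + timestamp floored to the aggregation window."""
    window_s = aggregation_min * 60
    floored = (np.asarray(timestamps) // window_s) * window_s
    return np.array([s + str(t) for s, t in zip(stations, floored)])

# two ABO observations in the same 10-min window, one GVE observation
labels = agg_labels(["ABO", "ABO", "GVE"],
                    [1619866810, 1619866950, 1619866810],
                    aggregation_min=10)
```

Observations sharing a label can then be aggregated together, e.g. with a groupby on the label array.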

rainforest.ml.utils.vert_aggregation(radar_data, vert_weights, grp_vertical, visib_weight=True, visib=None)

Performs vertical aggregation of radar observations aloft to the ground using a weighted average. Categorical variables such as ‘RADAR’, ‘HYDRO’ and ‘TCOUNT’ will be converted to dummy variables, and these dummy variables will be aggregated, resulting in columns such as RADAR_propA, giving the weighted proportion of radar observations aloft that were obtained with the Albis radar

Parameters:
  • radar_data (Pandas DataFrame) – A Pandas DataFrame containing all required input features aloft as explained in the rf.py module

  • vert_weights (np.array of float) – vertical weights to use for every observation in radar_data; must have the same length as radar_data

  • grp_vertical (np.array of int) – grouping index for the vertical aggregation; must have the same length as radar_data. All observations corresponding to the same timestep must have the same label

  • visib_weight (bool) – if True, the input features will be weighted by the visibility when doing the vertical aggregation to the ground

  • visib (np.array) – visibility of every observation; required only if visib_weight = True
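The core of the aggregation is a weighted average within each vertical group. The sketch below shows that step for a single numerical feature; column handling and the dummy-variable treatment of categorical features described above are omitted for brevity.

```python
import numpy as np

def vert_agg_sketch(values, weights, grp_vertical):
    """Weighted average of `values` within each vertical group (sketch)."""
    groups = np.unique(grp_vertical)
    out = np.empty(len(groups))
    for i, g in enumerate(groups):
        m = grp_vertical == g
        out[i] = np.average(values[m], weights=weights[m])
    return out

# two ground timesteps, each with observations at two heights aloft
values = np.array([10.0, 20.0, 30.0, 40.0])
weights = np.array([0.5, 0.5, 0.25, 0.75])   # e.g. vertical x visibility weights
grp = np.array([0, 0, 1, 1])
ground = vert_agg_sketch(values, weights, grp)
```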