rainforest.ml package
This submodule deals with the training and evaluation of machine learning QPE methods. It also allows reading them from pickle files stored in the rf_models subfolder.
rainforest.ml.rf : main module used to train RF regressors
rainforest.ml.rfdefinitions : reference module that contains the definitions of the RF regressors and allows loading them from files
rainforest.ml.rf_train : command-line utility to train RF models and prepare input features
rainforest.ml.utils : small utilities used in this module only (for example for vertical aggregation)
rainforest.ml.rf module
rainforest.ml.rf_train module
Command line script to prepare input features and train RF models
see rf_train
- rainforest.ml.rf_train.main()
rainforest.ml.rfdefinitions module
Class declarations and reading functions required to unpickle trained RandomForest models
Daniel Wolfensberger, MeteoSwiss/EPFL, daniel.wolfensberger@epfl.ch, December 2019
- class rainforest.ml.rfdefinitions.MyCustomUnpickler(file, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=())
Bases: Unpickler
This is an extension of the pickle Unpickler that handles the bookkeeping of references to the RandomForestRegressorBC class.
- find_class(module, name)
Return an object from a specified module.
If necessary, the module will be imported. Subclasses may override this method (e.g. to restrict unpickling of arbitrary classes and functions).
This method is called whenever a class or a function object is needed. Both arguments passed are str objects.
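A minimal sketch of using this unpickler directly; the file path below is hypothetical, and read_rf (documented below) is the usual higher-level way to load models:

    # Minimal sketch, assuming a locally stored pickle file (hypothetical path).
    # MyCustomUnpickler redirects class lookups so that RandomForestRegressorBC
    # resolves to rainforest.ml.rfdefinitions when the pickle is loaded.
    from rainforest.ml.rfdefinitions import MyCustomUnpickler

    with open('/path/to/some_rf_model.p', 'rb') as f:  # hypothetical path
        model = MyCustomUnpickler(f).load()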
- class rainforest.ml.rfdefinitions.RandomForestRegressorBC(variables, beta, visib_weighting, degree=1, bctype='cdf', metadata={}, n_estimators=100, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)
Bases: RandomForestRegressor
Extended RandomForestRegressor with optional bias correction, on-the-fly rounding, metadata, and optional cross-validation.
- fit(X, y, sample_weight=None, logmlflow='none', cv=0)
Fit both the estimator and the a-posteriori bias correction, with optional cross-validation.
- Parameters:
X (array-like or sparse matrix, shape=(n_samples, n_features)) – The input samples.
y (array-like, shape=(n_samples,)) – The target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
logmlflow (str, default='none') – Whether to log training metrics to MLFlow. Can be 'none' to not log anything, 'metrics' to only log metrics, or 'all' to log metrics and the trained model.
cv (int, default=0) – Number of folds for cross-validation. If set to 0, no cross-validation is performed (i.e. no test error).
- Returns:
self
- Return type:
object
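A hedged sketch of constructing and fitting the estimator on synthetic data; the feature names and the beta and visib_weighting values below are purely illustrative, not recommended settings:

    import numpy as np
    from rainforest.ml.rfdefinitions import RandomForestRegressorBC

    rng = np.random.default_rng(0)
    X = rng.random((500, 3))              # synthetic input features
    y = 10 * X[:, 0] + rng.random(500)    # synthetic target values

    model = RandomForestRegressorBC(
        variables=['RADAR_prop', 'ZH_mean', 'VISIB'],  # hypothetical feature names
        beta=-0.5,                                     # illustrative value
        visib_weighting=True,
        n_estimators=50,
    )
    model.fit(X, y, logmlflow='none', cv=0)  # no MLFlow logging, no cross-validation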
- fit_bias_correction(y, y_pred)
- predict(X, round_func=None, bc=True)
Predict regression target for X. The predicted regression target of an input sample is computed as the mean of the predicted regression targets of the trees in the forest.
- Parameters:
X (array-like or sparse matrix) – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
round_func (lambda function) – Optional function to apply to the outputs (for example to discretize them using MCH lookup tables). If not provided, the identity f(x) = x is applied (i.e. no rounding).
bc (bool) – If True, the bias correction function will be applied.
- Returns:
y – The predicted values.
- Return type:
array-like of shape (n_samples,) or (n_samples, n_outputs)
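Continuing the sketch above, a prediction with bias correction and a simple rounding function; the 0.1 discretization is only an example, not the MCH lookup table:

    # bc=True applies the fitted bias correction; round_func is optional.
    y_hat = model.predict(X, bc=True, round_func=lambda x: np.round(x, 1))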
- set_fit_request(*, cv: bool | None | str = '$UNCHANGED$', logmlflow: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') RandomForestRegressorBC
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cv (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the cv parameter in fit.
logmlflow (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the logmlflow parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
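For illustration only, a minimal example of the standard scikit-learn metadata-routing pattern that these set_*_request methods follow (requires scikit-learn >= 1.3); the same pattern applies to set_predict_request and set_score_request below:

    import sklearn

    # Enable metadata routing globally, then request that sample_weight be
    # routed to fit() when this estimator is wrapped by a meta-estimator.
    sklearn.set_config(enable_metadata_routing=True)
    model = model.set_fit_request(sample_weight=True)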
- set_predict_request(*, bc: bool | None | str = '$UNCHANGED$', round_func: bool | None | str = '$UNCHANGED$') RandomForestRegressorBC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
bc (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the bc parameter in predict.
round_func (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the round_func parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') RandomForestRegressorBC
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object
- rainforest.ml.rfdefinitions.read_rf(rf_name='', filepath='', mlflow_runid=None)
Reads a RandomForest model from the RF models folder using pickle. All custom classes and functions used in the construction of these pickled models must be defined in the module ml/rfdefinitions.py
- Parameters:
rf_name (str) – Name of the RandomForest model; it must be stored in the folder $RAINFOREST_DATAPATH/rf_models and computed with the rf:RFTraining.fit_model function
filepath (str) – Path to the model files, if not in the default folder
mlflow_runid (str) – If the model needs to be downloaded from mlflow, this variable indicates the run ID that contains the model to use. If this value is not None, rf_name and filepath are ignored. The env variable MLFLOW_TRACKING_URI needs to be set.
- Returns:
A trained sklearn RandomForest instance that has the predict() method, which can be used to predict precipitation intensities at new points
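A hedged sketch of loading a trained model and applying it; the model file name below is hypothetical, and the actual models live under $RAINFOREST_DATAPATH/rf_models:

    import numpy as np
    from rainforest.ml.rfdefinitions import read_rf

    model = read_rf(rf_name='RF_dualpol.p')  # hypothetical model file name
    X_new = np.random.rand(10, 3)            # must match the features the model was trained on
    R = model.predict(X_new)                 # predicted precipitation intensities for new points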
rainforest.ml.utils module
Utility functions for the ML submodule
- rainforest.ml.utils.make_run_id()
- rainforest.ml.utils.nesteddictvalues(d)
- rainforest.ml.utils.split_event(timestamps, n=5, threshold_hr=12, random_state=None)
Splits the dataset into n subsets by separating the observations into distinct precipitation events and assigning these events randomly to the subsets
- Parameters:
timestamps (int array) – array containing the UNIX timestamps of the precipitation observations
n (int) – number of subsets to create
threshold_hr (int) – threshold in hours to distinguish precipitation events. Two timestamps are considered to belong to different events if there are at least threshold_hr hours with no observations (no rain) between them.
random_state (None or int) – Seed for reproducibility of the random event assignment
- Returns:
split_idx – array containing the subset grouping, with values from 0 to n - 1
- Return type:
int array
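An illustrative call on synthetic timestamps: two groups of hourly observations separated by more than threshold_hr hours form two events, which are then randomly assigned to the n subsets:

    import numpy as np
    from rainforest.ml.utils import split_event

    ts = np.concatenate([
        np.arange(0, 6 * 3600, 3600),           # event 1: six hourly observations
        np.arange(24 * 3600, 30 * 3600, 3600),  # event 2: starts 19 h after event 1 ends
    ])
    split_idx = split_event(ts, n=2, threshold_hr=12, random_state=0)
    # split_idx has values in {0, 1} and is constant within each event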
- rainforest.ml.utils.split_years(timestamps, years=[2016, 2017, 2018, 2019, 2020, 2021])
Splits the dataset into subsets by separating the observations into the individual years
- Parameters:
timestamps (int array) – array containing the UNIX timestamps of the precipitation observations
years (int list) – all years to split into
- Returns:
split_idx – array containing the subset grouping, with values from 0 to len(years) - 1
- Return type:
int array
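A short example, assuming UNIX timestamps in seconds; observations are grouped according to the calendar year they fall into:

    import numpy as np
    from rainforest.ml.utils import split_years

    ts = np.array([1451649600, 1483272000])          # 2016-01-01 12:00 UTC, 2017-01-01 12:00 UTC
    split_idx = split_years(ts, years=[2016, 2017])  # one group index per year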
- rainforest.ml.utils.time_station_aggregation(df, aggregation_min, time_res_min=None)
Creates the aggregation array for every station for a given aggregation period
- Parameters:
df (dataframe) – dataframe that contains the STATION and TIMESTAMP columns
aggregation_min (int) – aggregation time to use, in minutes
time_res_min (int) – time in minutes between two consecutive timestamps; if not provided it will be inferred from the data
- Returns:
agg – the aggregation array which contains values such as ABO1619866800 (station followed by aggregated UNIX timestamp)
- Return type:
numpy str array
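A minimal sketch with a single station at 10 min resolution, aggregated to 60 min blocks; the column names follow the documented STATION / TIMESTAMP requirement and the data values are made up:

    import numpy as np
    import pandas as pd
    from rainforest.ml.utils import time_station_aggregation

    df = pd.DataFrame({
        'STATION': ['ABO'] * 6,
        'TIMESTAMP': 1619866800 + 600 * np.arange(6),  # six consecutive 10 min timestamps
    })
    agg = time_station_aggregation(df, aggregation_min=60, time_res_min=10)
    # agg contains strings such as 'ABO1619866800' (station + aggregated timestamp)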
- rainforest.ml.utils.vert_aggregation(radar_data, vert_weights, grp_vertical, visib_weight=True, visib=None)
Performs vertical aggregation of radar observations aloft to the ground using a weighted average. Categorical variables such as 'RADAR', 'HYDRO' and 'TCOUNT' will be converted to dummy variables, and these dummy variables will be aggregated, resulting in columns such as RADAR_propA, which gives the weighted proportion of radar observations aloft that were obtained with the Albis radar.
- Parameters:
radar_data (Pandas DataFrame) – A Pandas DataFrame containing all required input features aloft as explained in the rf.py module
vert_weights (np.array of float) – vertical weights to use for every observation in radar_data, must have the same length as radar_data
grp_vertical (np.array of int) – grouping index for the vertical aggregation. It must have the same length as radar_data. All observations corresponding to the same timestep must have the same label
visib_weight (bool) – If True, the input features will be weighted by the visibility during the vertical aggregation to the ground
visib (np.array) – visibility of every observation, required only if visib_weight = True
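An illustrative call with hypothetical column names and two observations aloft per ground timestep; the feature values, weights and visibilities are made up:

    import numpy as np
    import pandas as pd
    from rainforest.ml.utils import vert_aggregation

    radar_data = pd.DataFrame({
        'ZH_mean': [30.0, 28.0, 25.0, 27.0],  # hypothetical feature aloft
        'RADAR': ['A', 'D', 'A', 'L'],        # categorical, turned into dummy columns
    })
    vert_weights = np.array([0.7, 0.3, 0.6, 0.4])  # one weight per observation aloft
    grp_vertical = np.array([0, 0, 1, 1])          # two ground timesteps
    visib = np.array([100.0, 80.0, 90.0, 100.0])

    agg = vert_aggregation(radar_data, vert_weights, grp_vertical,
                           visib_weight=True, visib=visib)
    # agg is expected to have one row per vertical group, with columns such as RADAR_propA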