rainforest.ml package
This submodule handles the training and evaluation of machine learning QPE methods. It can also read trained models from pickle files stored in the rf_models subfolder.
rainforest.ml.rf
: main module used to train RF regressors
rainforest.ml.rfdefinitions
: reference module that contains the definitions of the RF regressors and loads them from files
rainforest.ml.rf_train
: command-line utility to train RF models and prepare input features
rainforest.ml.utils
: small utilities used in this module only (for example for vertical aggregation)
rainforest.ml.rf module
Main module used to train RF regressors
- class rainforest.ml.rf.RFTraining(db_location, input_location=None, force_regenerate_input=False, logmlflow='none', cv=0)
Bases:
object
This is the main class used to prepare data for random forest training, train random forests and perform cross-validation of the trained models.
Initializes the class and, if needed, prepares the input data for the training.
Note that when calling this constructor the input data is only generated for the central pixel (NX = NY = 0, the location of the gauge). To regenerate the inputs for all neighbour pixels, call self.prepare_input(only_center=False).
- Parameters:
db_location (str) – Location of the main directory of the database (with subfolders ‘reference’, ‘gauge’ and ‘radar’ on the filesystem)
input_location (str) – Location of the prepared input data. If the data cannot be found in this folder, it will be computed and stored there. Default is a subfolder called rf_input_data within db_location
force_regenerate_input (bool) – If True, the input parquet files will always be regenerated from the database, even if they are already present in the input_location folder
logmlflow (str, default='none') – Whether to log training metrics to MLFlow. Can be ‘none’ to not log anything, ‘metrics’ to only log metrics, or ‘all’ to log metrics and the trained model.
- feature_selection(features_dic, featuresel_configfile, output_folder, K=5, tstart=None, tend=None)
Assesses the relative importance of all available input variables aggregated to the ground and selects the most important ones. To do so, an approach from Han et al. (2016) was adapted for regression. See Wolfensberger et al. (2021) for further information.
- Parameters:
features_dic (dict) – A dictionary with all eligible features to test
featuresel_configfile (str) – YAML file with the setup
output_folder (str) – Path to where to store the scores
tstart (str (YYYYMMDDHHMM)) – A date to define a starting time for the input data
tend (str (YYYYMMDDHHMM)) – A date to define the end of the input data
K (int or None) – Number of splits (iterations) to perform in the K-fold cross-validation
- fit_models(config_file, features_dic, tstart=None, tend=None, output_folder=None, cv=0)
Fits new RF models that can be used to compute QPE realizations and saves them to disk in pickle format
- Parameters:
config_file (str) – Location of the RF training configuration file. If not provided, the default one in the ml submodule will be used
features_dic (dict) – A dictionary whose keys are the names of the models you want to create (a string) and whose values are lists of the features you want to use. For example {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean']} will train a model with all these features, which will then be stored under the name RF_dualpol_BC_<type of BC>.p in the ml/rf_models dir
tstart (datetime) – the starting time of the training time interval, default is to start at the beginning of the time interval covered by the database
tend (datetime) – the end time of the training time interval, default is to end at the end of the time interval covered by the database
output_folder (str) – Location where to store the trained models in pickle format. If not provided, they are stored in the standard location <library_path>/ml/rf_models
cv (int, default=0) – Number of folds for cross-validation, when running fit function. If set to 0, will not perform cross-validation (i.e. no test error)
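As an illustration of the features_dic layout described above, the following minimal sketch builds such a dictionary and checks its structure before training. The helper check_features_dic is hypothetical and not part of rainforest; the feature names are the ones quoted in the parameter description.

```python
# Model names map to the list of input features each model should use.
features_dic = {
    "RF_dualpol": ["RADAR", "zh_VISIB_mean", "zv_VISIB_mean", "KDP_mean",
                   "RHOHV_mean", "T", "HEIGHT", "VISIB_mean"],
    "RF_hpol": ["RADAR", "zh_VISIB_mean", "T", "HEIGHT", "VISIB_mean"],
}


def check_features_dic(features_dic):
    """Hypothetical helper: validate the layout and return the model names."""
    for name, features in features_dic.items():
        if not isinstance(name, str):
            raise TypeError(f"model name {name!r} must be a string")
        if not isinstance(features, list) or not all(
            isinstance(f, str) for f in features
        ):
            raise TypeError(f"features of {name!r} must be a list of strings")
        if len(set(features)) != len(features):
            raise ValueError(f"duplicate feature in {name!r}")
    return sorted(features_dic)


model_names = check_features_dic(features_dic)
```

A dictionary of this shape is what fit_models and model_intercomparison expect as their features_dic argument.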
- model_intercomparison(features_dic, intercomparison_configfile, output_folder, reference_products=['CPCH', 'RZC'], bounds10=[0, 2, 10, 100], bounds60=[0, 2, 10, 100], cross_val_type='years', K=5, years=None, tstart=None, tend=None, station_scores=False, save_model=False)
Runs an intercomparison (cross-validation) of different RF models and reference products (RZC, CPC, …) and produces the performance plots
- Parameters:
features_dic (dict) – A dictionary whose keys are the names of the models you want to compare (a string) and whose values are lists of the features you want to use. For example {'RF_dualpol': ['RADAR', 'zh_VISIB_mean', 'zv_VISIB_mean', 'KDP_mean', 'RHOHV_mean', 'T', 'HEIGHT', 'VISIB_mean'], 'RF_hpol': ['RADAR', 'zh_VISIB_mean', 'T', 'HEIGHT', 'VISIB_mean']} will compare an RF model with polarimetric information to a model with only horizontal polarization
output_folder (str) – Location where to store the output plots
intercomparison_configfile (str) – Location of the intercomparison configuration file, a YAML file that gives, for every model key of features_dic, the training parameters you want to use (see the file intercomparison_config_example.yml in this module for an example)
reference_products (list of str) – Names of the reference products to which the RF will be compared. They need to be in the reference table of the database
bounds10 (list of float) – List of precipitation bounds for which to compute scores separately at 10 min time resolution; [0,2,10,100] will give scores in the ranges [0-2], [2-10] and [10-100]
bounds60 (list of float) – List of precipitation bounds for which to compute scores separately at hourly time resolution; [0,1,10,100] will give scores in the ranges [0-1], [1-10] and [10-100]
cross_val_type (str) – Defines how the events are split. Options are “random events”, “years” and “seasons” (TODO)
K (int or None) – Number of splits (iterations) to perform in the K-fold cross-validation
years (list or None) – List of the years to use in the cross-validation. Default is [2016,2017,2018,2019,2020,2021]
tstart (str (YYYYMMDDHHMM)) – A date to define a starting time for the input data
tend (str (YYYYMMDDHHMM)) – A date to define the end of the input data
station_scores (bool) – If True, performance scores are calculated for all stations. If False, only the scores across Switzerland are calculated
save_model (bool) – If True, all models of the cross-validation are saved into a pickle file. This is useful for reproducibility
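To make the role of bounds10 and bounds60 concrete, here is a minimal sketch of computing a score separately per precipitation range. It rests on several assumptions that the documentation does not confirm: samples are binned by their reference value, ranges are half-open [lo, hi), and the mean error stands in for whichever scores the library actually computes. The function name scores_by_range is hypothetical.

```python
from bisect import bisect_right


def scores_by_range(reference, estimate, bounds):
    """Mean error (estimate - reference) per reference range [lo, hi)."""
    sums = {}
    for ref, est in zip(reference, estimate):
        # Index of the range whose lower bound is the last one <= ref.
        i = bisect_right(bounds, ref) - 1
        if 0 <= i < len(bounds) - 1:  # ignore values outside all ranges
            key = (bounds[i], bounds[i + 1])
            total, count = sums.get(key, (0.0, 0))
            sums[key] = (total + (est - ref), count + 1)
    return {key: total / count for key, (total, count) in sums.items()}


reference = [1.0, 3.0, 5.0, 20.0]
estimate = [1.5, 2.0, 6.0, 18.0]
scores = scores_by_range(reference, estimate, bounds=[0, 2, 10, 100])
```

With bounds [0, 2, 10, 100], the four samples fall into the ranges [0-2], [2-10], [2-10] and [10-100], so each range receives its own score.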
- prepare_input(only_center=True, foldername_radar='radar')
Reads the data from the database in db_location, processes it to create easy-to-use parquet input files for the ML training, and stores them in input_location. The processing involves the following steps.
For every neighbour of the station (i.e. for NX and NY from -1 to +1):
Replace missing flags by NaNs
Filter out timesteps which are not present in all three tables (gauge, reference and radar)
Filter out incomplete hours (i.e. where fewer than six 10 min timesteps are available)
Add height above ground and height of iso0 to radar data
Save a separate parquet file for radar, gauge and reference data
Save a grouping_idx pickle file containing grp_vertical index (groups all radar rows with same timestep and station), grp_hourly (groups all timesteps with same hours) and tstamp_unique (list of all unique timestamps)
- Parameters:
only_center (bool) – If set to True, only the input data for the central neighbour, i.e. NX = NY = 0 (the location of the gauge), will be recomputed. This takes much less time and is the default option, since the neighbour values are so far not used in the training of the RF QPE
foldername_radar (str) – Name of the folder to use for the radar data. Default name is ‘radar’
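The "filter out incomplete hours" step listed above can be sketched as follows. This is only an illustration, not the library's code: it assumes UNIX timestamps in seconds and assigns each timestep to the clock hour obtained by integer division by 3600, which may differ from the convention used in prepare_input. The function name is hypothetical.

```python
from collections import Counter


def filter_incomplete_hours(timestamps, steps_per_hour=6):
    """Keep only timesteps whose hour has all its 10 min timesteps."""
    unique = sorted(set(timestamps))
    # Count how many distinct timesteps fall into each hour.
    counts = Counter(t // 3600 for t in unique)
    return [t for t in unique if counts[t // 3600] >= steps_per_hour]


complete_hour = list(range(0, 3600, 600))  # six 10 min timesteps
partial_hour = [3600, 4200, 4800]          # only three timesteps
kept = filter_incomplete_hours(complete_hour + partial_hour)
```

Here the second hour is dropped entirely because only three of its six timesteps are available.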
rainforest.ml.rf_train module
Command line script to prepare input features and train RF models
see rf_train
- rainforest.ml.rf_train.main()
rainforest.ml.rfdefinitions module
Class declarations and reading functions required to unpickle trained RandomForest models
Daniel Wolfensberger MeteoSwiss/EPFL daniel.wolfensberger@epfl.ch December 2019
- class rainforest.ml.rfdefinitions.MyCustomUnpickler(file, *, fix_imports=True, encoding='ASCII', errors='strict', buffers=())
Bases:
Unpickler
This is an extension of the pickle Unpickler that handles the bookkeeping references to the RandomForestRegressorBC class
- find_class(module, name)
Return an object from a specified module.
If necessary, the module will be imported. Subclasses may override this method (e.g. to restrict unpickling of arbitrary classes and functions).
This method is called whenever a class or a function object is needed. Both arguments passed are str objects.
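The general technique of overriding find_class can be sketched with the standard library alone. The example below is not the implementation of MyCustomUnpickler; it only shows how an Unpickler subclass can resolve a class name locally when the pickle refers to a module path that cannot be imported. The class Model, the module name old_package.models and the hand-written pickle payload are all hypothetical.

```python
import io
import pickle


class Model:
    """Stand-in for a custom class referenced by a pickle (hypothetical)."""

    def __init__(self, name):
        self.name = name


class RedirectingUnpickler(pickle.Unpickler):
    """Resolve selected class names locally, whatever module was recorded."""

    redirects = {"Model": Model}

    def find_class(self, module, name):
        if name in self.redirects:
            return self.redirects[name]
        # Fall back to the default lookup for everything else.
        return super().find_class(module, name)


# Hand-written protocol 0 pickle that refers to a module which does not exist.
payload = (
    b"c" b"old_package.models\n" b"Model\n"  # GLOBAL: push the class
    b"("                                     # MARK
    b"VRF_dualpol\n"                         # push the string argument
    b"t"                                     # TUPLE: build the argument tuple
    b"R"                                     # REDUCE: call Model("RF_dualpol")
    b"."                                     # STOP
)
restored = RedirectingUnpickler(io.BytesIO(payload)).load()
```

Loading the same payload with pickle.loads would fail with an import error, since old_package.models cannot be imported.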
- class rainforest.ml.rfdefinitions.RandomForestRegressorBC(variables, beta, visib_weighting, degree=1, bctype='cdf', metadata={}, n_estimators=100, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False)
Bases:
RandomForestRegressor
Extended RandomForestRegressor with optional bias correction, on-the-fly rounding, metadata, and optional cross-validation.
- fit(X, y, sample_weight=None, logmlflow='none', cv=0)
Fit both estimator and a-posteriori bias correction with optional cross-validation.
- Parameters:
X (array-like or sparse matrix, shape=(n_samples, n_features)) – The input samples.
y (array-like, shape=(n_samples,)) – The target values.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
logmlflow (str, default='none') – Whether to log training metrics to MLFlow. Can be ‘none’ to not log anything, ‘metrics’ to only log metrics, or ‘all’ to log metrics and the trained model.
cv (int, default=0) – Number of folds for cross-validation. If set to 0, will not perform cross-validation (i.e. no test error)
- Returns:
self
- Return type:
object
- fit_bias_correction(y, y_pred)
- predict(X, round_func=None, bc=True)
Predict regression target for X. The predicted regression target of an input sample is computed as the mean predicted regression targets of the trees in the forest.
- Parameters:
X – The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
round_func (lambda function) – Optional function to apply to outputs (for example to discretize them using MCH lookup tables). If not provided f(x) = x will be applied (i.e. no function)
bc (bool) – if True the bias correction function will be applied
- Returns:
y – The predicted values.
- Return type:
array-like of shape (n_samples,) or (n_samples, n_outputs)
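As a sketch of what a round_func could look like, the function below snaps each value to the nearest entry of a lookup table. The levels are invented for illustration and are not the MCH lookup tables; the sketch also works on plain Python sequences, whereas predict would presumably hand the function an array, so an array-based variant may be needed in practice.

```python
from bisect import bisect_left


def make_round_func(levels):
    """Build a function that maps each value to the nearest table level."""
    levels = sorted(levels)

    def round_func(values):
        rounded = []
        for v in values:
            i = bisect_left(levels, v)
            if i == 0:
                rounded.append(levels[0])       # below the table
            elif i == len(levels):
                rounded.append(levels[-1])      # above the table
            else:
                lo, hi = levels[i - 1], levels[i]
                # Ties go to the lower level.
                rounded.append(lo if v - lo <= hi - v else hi)
        return rounded

    return round_func


round_func = make_round_func([0.0, 0.5, 1.0, 2.0, 5.0])  # hypothetical levels
discretized = round_func([0.2, 0.8, 1.4, 7.0, -1.0])
```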
- set_fit_request(*, cv: bool | None | str = '$UNCHANGED$', logmlflow: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') → RandomForestRegressorBC
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
cv (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cv parameter in fit.
logmlflow (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for logmlflow parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_predict_request(*, bc: bool | None | str = '$UNCHANGED$', round_func: bool | None | str = '$UNCHANGED$') → RandomForestRegressorBC
Configure whether metadata should be requested to be passed to the predict method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
bc (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for bc parameter in predict.
round_func (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for round_func parameter in predict.
- Returns:
self – The updated object.
- Return type:
object
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → RandomForestRegressorBC
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object
- rainforest.ml.rfdefinitions.read_rf(rf_name='', filepath='', mlflow_runid=None)
Reads a random forest model from the RF models folder using pickle. All custom classes and functions used in the construction of these pickled models must be defined in the script ml/rfdefinitions.py
- Parameters:
rf_name (str) – Name of the random forest model. It must be stored in the folder /ml/rf_models and have been computed with the rf.RFTraining.fit_models method
filepath (str) – Path to the model files, if not in default folder
mlflow_runid (str) – If the model needs to be downloaded from mlflow, this variable indicates the run ID that contains the model to use. If this value is not None, rf_name and filepath are ignored. The env variable MLFLOW_TRACKING_URI needs to be set.
- Returns:
A trained sklearn random forest instance with a predict() method that can be used to predict precipitation intensities for new points
rainforest.ml.utils module
Utility functions for the ML submodule
- rainforest.ml.utils.nesteddictvalues(d)
- rainforest.ml.utils.split_event(timestamps, n=5, threshold_hr=12, random_state=None)
Splits the dataset into n subsets by separating the observations into separate precipitation events and attributing these events randomly to the subsets
- Parameters:
timestamps (int array) – array containing the UNIX timestamps of the precipitation observations
n (int) – number of subsets to create
threshold_hr (int) – Threshold in hours used to distinguish precipitation events. Two timestamps are considered to belong to different events if there are at least threshold_hr hours of no observations (no rain) between them.
random_state (None or int) – Reproducibility of event assignment
- Returns:
split_idx – array containing the subset grouping, with values from 0 to n - 1
- Return type:
int array
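The event-based split described above can be sketched in pure Python as follows. This is an illustrative reimplementation under assumptions, not the library's code: timestamps are UNIX seconds, a gap of at least threshold_hr hours starts a new event, and each event is drawn into a subset uniformly at random. The name split_event_sketch is hypothetical.

```python
import random


def split_event_sketch(timestamps, n=5, threshold_hr=12, random_state=None):
    """Assign each timestamp to one of n subsets, event by event."""
    rng = random.Random(random_state)
    # Visit the observations in chronological order, remembering positions.
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    split_idx = [0] * len(timestamps)
    previous, subset = None, 0
    for i in order:
        t = timestamps[i]
        # A long enough gap starts a new event, drawn into a random subset.
        if previous is None or t - previous >= threshold_hr * 3600:
            subset = rng.randrange(n)
        split_idx[i] = subset
        previous = t
    return split_idx


timestamps = [0, 600, 1200, 100000, 100600]  # two events, over 12 h apart
split_idx = split_event_sketch(timestamps, n=5, random_state=0)
```

Keeping all timesteps of one event in the same subset avoids having strongly correlated observations on both sides of a train/test split.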
- rainforest.ml.utils.split_years(timestamps, years=[2016, 2017, 2018, 2019, 2020, 2021])
Splits the dataset into subsets by year, with one subset for each entry of years
- Parameters:
timestamps (int array) – array containing the UNIX timestamps of the precipitation observations
years (int list) – all years to split into
- Returns:
split_idx – array containing the subset grouping, with values from 0 to len(years) - 1
- Return type:
int array
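A year-based split of this kind can be sketched as below. The sketch makes assumptions that the documentation does not state: years are evaluated in UTC, and timestamps whose year is not listed receive -1. The name split_years_sketch is hypothetical.

```python
from datetime import datetime, timezone


def split_years_sketch(timestamps, years):
    """Map each UNIX timestamp to the position of its year in `years`."""
    position = {year: i for i, year in enumerate(years)}
    return [
        # -1 marks timestamps outside the listed years (an assumption).
        position.get(datetime.fromtimestamp(t, tz=timezone.utc).year, -1)
        for t in timestamps
    ]


timestamps = [
    datetime(2016, 6, 1, tzinfo=timezone.utc).timestamp(),
    datetime(2018, 1, 1, tzinfo=timezone.utc).timestamp(),
    datetime(2015, 3, 1, tzinfo=timezone.utc).timestamp(),
]
split_idx = split_years_sketch(timestamps, years=[2016, 2017, 2018])
```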
- rainforest.ml.utils.vert_aggregation(radar_data, vert_weights, grp_vertical, visib_weight=True, visib=None)
Performs vertical aggregation of radar observations aloft to the ground using a weighted average. Categorical variables such as ‘RADAR’, ‘HYDRO’, ‘TCOUNT’, will be assigned dummy variables and these dummy variables will be aggregated, resulting in columns such as RADAR_propA giving the weighted proportion of radar observation aloft that were obtained with the Albis radar
- Parameters:
radar_data (Pandas DataFrame) – A Pandas DataFrame containing all required input features aloft as explained in the rf.py module
vert_weights (np.array of float) – vertical weights to use for every observation in radar, must have the same len as radar_data
grp_vertical (np.array of int) – grouping index for the vertical aggregation. It must have the same len as radar_data. All observations corresponding to the same timestep must have the same label
visib_weight (bool) – If True, the input features will be weighted by the visibility when doing the vertical aggregation to the ground
visib (np array) – Visibility of every observation, required only if visib_weight = True
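The weighted vertical aggregation can be illustrated with a small pure-Python sketch. It is not the library's implementation: it assumes that the visibility, when given, simply multiplies the vertical weight, and it uses plain lists instead of a DataFrame. The function name weighted_group_mean and the radar label "D" are hypothetical.

```python
def weighted_group_mean(values, vert_weights, grp_vertical, visib=None):
    """Weighted average of `values` within each vertical group."""
    numerator, denominator = {}, {}
    for k, (x, w, g) in enumerate(zip(values, vert_weights, grp_vertical)):
        if visib is not None:
            w = w * visib[k]  # assumption: visibility scales the weight
        numerator[g] = numerator.get(g, 0.0) + w * x
        denominator[g] = denominator.get(g, 0.0) + w
    # Groups whose total weight is zero are left out.
    return {g: numerator[g] / denominator[g]
            for g in numerator if denominator[g] > 0}


# Continuous variable: two observations aloft for group 0, one for group 1.
zh_ground = weighted_group_mean([10.0, 20.0, 30.0], [1.0, 3.0, 2.0], [0, 0, 1])

# Categorical variable: a dummy (0/1) column per category is aggregated,
# giving the weighted proportion of observations from radar "A".
radar = ["A", "D", "A"]
is_a = [1.0 if r == "A" else 0.0 for r in radar]
prop_a = weighted_group_mean(is_a, [1.0, 1.0, 2.0], [0, 0, 0])
```

The second call mirrors how a column such as RADAR_propA arises: the dummy variable is averaged with the same weights as any other feature.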