Machine-learning command-line tool
The ml submodule has only a command-line tool to update the input data for the RF QPE algorithm and for training new RF models. For more sophisticated procedures please use ml_module.
rf_train
Updates any of the three tables of the database gauge, radar and reference with new data.
rf_train [options]
- Options:
- -h, --help
show this help message and exit
- -o OUTPUT, --outputfolder=OUTPUT
Path of the output folder, default is the ml/rf_models folder in the current library
- -d DBFOLDER, --dbfolder=DBFOLDER
Path of the database main folder, default is /store/msrad/radar/radar_database/
- -i DBFOLDER, --inputfolder=DBFOLDER
Path where the homogeneized input files for the RF algorithm are stored, default is the subfolder ‘rf_input_data’ within the database folder
- -s START, --start=START
Specify the start time in the format YYYYddmmHHMM, if not provided the first timestamp in the database will be used
- -e END, --end=END
Specify the end time in the format YYYYddmmHHMM, if not provided the last timestamp in the database will be used
- -c CONFIG, --config=CONFIG
Path of the config file, the default will be default_config.yml in the ml module
- -m MODELS, --models=MODELS
Specify which models you want to use in the form of a json line of a dict, the keys are names you give to the models, the values the input features they require, for example ‘{“RF_dualpol”: [“RADAR”, “zh_visib_mean”, “zv_visib_mean”,”KDP_mean”,”RHOHV_mean”,”T”, “HEIGHT”,”VISIB_mean”]}’, please note the double and single quotes, which are requiredIMPORTANT: if no model is provided only the ml input data will be recomputed from the database, but no model will be computedTo simplify three aliases are proposed: “dualpol_default” = ‘{“RF_dualpol”: [“RADAR”, “zh_visib_mean”, “zv_visib_mean”,”KDP_mean”,”RHOHV_mean”,”T”, “HEIGHT”,”VISIB_mean”]}‘“vpol_default” = ‘{“RF_vpol”: [“RADAR”, “zv_visib_mean”,”T”, “HEIGHT”,”VISIB_mean”]}‘“hpol_default” = ‘{“RF_hpol”: [“RADAR”, “zh_visib_mean”,”T”, “HEIGHT”,”VISIB_mean”]}’You can combine them for example “vpol_default, hpol_default, dualpol_default, will compute all three”
- -g MODELS, --generate_inputs=MODELS
If set to 1 (default), the input parquet files (homogeneized tables) for the ml routines will be recomputed from the current database rowsThis takes a bit of time but is needed if you updated the database and want to use the new data in the training
The configuration file must be written in YAML, the default file has the following structure:
FILTERING: # conditions to remove some observations
STA_TO_REMOVE : ['TIT','GSB','GRH','PIL','SAE','AUB']
CONSTRAINT_MIN_ZH : [0.5,20] # min 20 dBZ if R > 0.5 mm/h
CONSTRAINT_MAX_ZH : [0,20] # max 20 dBZ if R = 0 mm/h
RANDOMFORESTREGRESSOR_PARAMS: # parameters to sklearn's class
max_depth : 20
n_estimators : 10
VERTAGG_PARAMS:
BETA : -0.5 # weighting factor to use in the exponential weighting
VISIB_WEIGHTING : 1 # whether to weigh or not observations by their visib
BIAS_CORR : 'raw' # type of bias correction 'raw', 'cdf' or 'spline'
The parameters are the following
FILTERING : a set of parameters used to filter the input data on which the algorithm is trained
STA_TO_REMOVE : list of problematic stations to remove
CONSTRAINT_MIN_ZH : constraint on minimum reflectivity, the first value if the precip. intensity, the second the minimum value required value of ZH. For example for [0.5,20] all rows where ZH < 20 dBZ if R >= 0.5 mm/h will be removed. This is to reduce the effect of large spatial and temporal offset between radar and gauge.
CONSTRAINT_MAX_ZH : constraint on maximum reflectivity, the first value if the precip. intensity, the second the minimum value required value of ZH.
RANDOMFORESTREGRESSOR_PARAMS : set of parameters for the sklearn random forest regressor . You can add as many as you want, as long as they are valid parameters for this class
max_depth : max depth of the threes
n_estimators : number of trees
VERTAGG_PARAMS : set of parameters for the vertical aggregation of radar data to the ground
BETA : the parameter used in the exponential weighting \(\exp(-\beta \cdot h)\), where h is the height of every observation. BETA should be negative, since lower observation should have a larger weight.
VISIB_WEIGHTING : if set to 1, the observations will also be weighted proportionally to their visibility
BIAS_CORR : type of bias-correction to be applied a-posteriori. It can be either ‘raw’ in which case a simple linear regression of prediction vs observation is used, ‘cdf’ in which a simple linear regression on sorted prediction vs sorted observation is used and ‘spline’ which is the same as ‘cdf’ except that a 1D spline is used instead.