CatBoost: when tuned parameters perform worse than the defaults

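CatBoost is a GBDT framework that uses symmetric decision trees (oblivious trees) as base learners. It has relatively few parameters, supports categorical variables, and achieves high accuracy. The main pain point it addresses is handling categorical features efficiently and sensibly, as its name suggests: CatBoost combines "Categorical" and "Boosting". CatBoost also addresses gradient bias and prediction shift, which reduces overfitting and improves the accuracy and generalization of the algorithm.
Compared with XGBoost and LightGBM, CatBoost introduces the following innovations: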

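It embeds an algorithm that automatically converts categorical features into numerical features. It first computes statistics on the categorical features, namely how often each category occurs, and then adds hyperparameters to generate new numerical features.
It uses combinations of categorical features, which exploits the relationships between features and greatly enriches the feature space.
It uses ordered boosting to counter noisy points in the training set, which avoids biased gradient estimates and in turn addresses prediction shift.
It uses fully symmetric trees as the base model.

Summary of how CatBoost handles categorical features:

First, it computes some statistics on the data: it counts how often a category occurs, adds hyperparameters, and generates new numerical features. This strategy requires that rows with the same label are not grouped together (that is, not all of one label first and then all of another), so the dataset must be shuffled before training.
Second, it uses different permutations of the data (actually … of them). Before building each tree, it rolls a die to decide which permutation is used to generate that tree.
Third, it considers different combinations of categorical features. For example, color and species can be combined into a feature such as "blue dog". When the number of categorical features to combine grows, CatBoost only considers a subset of the combinations. When choosing the first node, it considers a single feature, for example A. When generating the second node, it considers the combination of A with any other categorical feature and picks the best one. Combinations are generated greedily in this way.
Fourth, unless the feature has very low cardinality, such as gender, generating one-hot vectors yourself is not recommended; it is better to leave this to the algorithm.

Parameters: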

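loss_function: Loss function. Supported values include RMSE, Logloss, MAE, CrossEntropy, Quantile, LogLinQuantile, MultiClass, MultiClassOneVsAll, MAPE, and Poisson. Default RMSE (Logloss for CatBoostClassifier).
custom_metric: Metric values printed during training. These metrics are not optimized and are shown for informational purposes only. Default None.
eval_metric: Metric used for overfitting detection (if enabled) and best-model selection (if enabled). It is evaluated rather than optimized; the optimized objective is set by loss_function. Supported values include RMSE, Logloss, MAE, CrossEntropy, Recall, Precision, F1, Accuracy, AUC, and R2.
iterations: Maximum number of trees. Default 1000.
learning_rate: Learning rate. Default 0.03.
random_seed: Random seed used for training.
l2_leaf_reg: Coefficient of the L2 regularization term. Default 3.
bootstrap_type: Defines how object weights are sampled. Possible values: Poisson (supported for GPU only), Bayesian, Bernoulli, No. Default Bayesian.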
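bagging_temperature: Controls the intensity of Bayesian bagging. Values are non-negative (the documented range is [0, inf)); larger values make the bagging more aggressive. Default 1.
subsample: Sample rate for bagging, used when bootstrap_type is Poisson or Bernoulli. Default 0.66.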
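sampling_frequency: How often sampling is performed when building trees. Possible values: PerTree, PerTreeLevel. Default PerTreeLevel.
random_strength: Multiplier for the standard deviation of the random noise added to split scores. Default 1.
use_best_model: When this parameter is set, a validation dataset must be provided; the number of trees kept in the final model is the one that gives the best metric value on that dataset. Default False.
best_model_min_trees: Minimum number of trees the best model should have.
depth: Tree depth. The maximum is 16; values between 1 and 10 are recommended. Default 6.
ignored_features: Features in the dataset to ignore. Default None.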
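one_hot_max_size: Categorical features with at most this many distinct values are one-hot encoded; features with more distinct values than the specified value are converted to numerical (float) features. Default False.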
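has_time: Use the order of the input data when converting categorical features to numerical features and when choosing the tree structure. Default False (a random permutation is used).
rsm: Random subspace method, that is, the fraction of features considered at each split selection. Default 1.
nan_mode: How missing values in the input data are handled: Forbidden (missing values are not allowed), Min (treated as the minimum value for the feature), Max (treated as the maximum value for the feature). Default Min.
fold_permutation_block_size: Objects in the dataset are grouped into blocks before the random permutation; this parameter defines the block size. Smaller values make training slower. Larger values may degrade quality.
leaf_estimation_method: Method used to compute leaf values: Newton or Gradient. Default Gradient.
leaf_estimation_iterations: Number of gradient steps when computing leaf values.
leaf_estimation_backtracking: Type of backtracking to use during gradient descent.
fold_len_multiplier: Coefficient for the length of the folds. The value must be greater than 1; the best results are obtained with small values. Default 2.
approx_on_full_history: How approximations are computed. False: use only a 1/fold_len_multiplier fraction of the preceding rows in the fold. True: use all preceding rows in the fold. Default False.
class_weights: Class weights. Default None.
scale_pos_weight: Weight of class 1 in binary classification. The value is used as a multiplier for the weights of objects from class 1.
boosting_type: Boosting scheme (Ordered or Plain).
allow_const_label: Allows training a model on a dataset in which all objects have the same label value. Default False.
od_type: Type of overfitting detector to use. Possible values: 'IncToDec', 'Iter'.
od_wait: Number of iterations to continue training after the iteration with the best metric value. The purpose of this parameter depends on the selected overfitting detector type: 1. IncToDec: ignore the overfitting detector when the threshold is reached and continue learning for the specified number of iterations after the iteration with the best metric value. 2. Iter: consider the model overfitted and stop training once the specified number of iterations has passed since the iteration with the best metric value.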
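grow_policy: Tree growing policy. Defines how greedy tree construction is performed. Possible values:

SymmetricTree: The tree is built level by level until the specified depth is reached. On each iteration, all leaves of the last tree level are split with the same condition. The resulting tree structure is always symmetric.
Depthwise: The tree is built level by level until the specified depth is reached. On each iteration, all non-terminal leaves of the last tree level are split. Each leaf is split by the condition that gives the best loss improvement.
Lossguide: The tree is built leaf by leaf until the specified maximum number of leaves is reached. On each iteration, the non-terminal leaf with the best loss improvement is split.
Note: models grown with the Depthwise and Lossguide policies cannot be analyzed with the PredictionDiff feature importance, and they can only be exported to json and cbm.
task_type=CPU: Device used for training (CPU or GPU).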
devices=None: IDs of the GPU devices used for training.

Signature of the Pool class:

    class Pool(data, label=None, cat_features=None, column_description=None,
               pairs=None, delimiter='\t', has_header=False, weight=None,
               group_id=None, group_weight=None, subgroup_id=None,
               pairs_weight=None, baseline=None, feature_names=None,
               thread_count=-1)

Simple classification:

    from catboost import CatBoostClassifier, Pool

    train_data = Pool(data=[[1, 4, 5, 6], [4, 5, 6, 7], [30, 40, 50, 60]],
                      label=[1, 1, -1],
                      weight=[0.1, 0.2, 0.3])
    train_data  # <catboost.core.Pool at 0x1a22af06d0>

    model = CatBoostClassifier(iterations=10)
    model.fit(train_data)
    preds_class = model.predict(train_data)

Several kinds of prediction:

    # Get predicted classes
    preds_class = model.predict(test_data)
    # Get predicted probabilities for each class
    preds_proba = model.predict_proba(test_data)
    # Get predicted RawFormulaVal
    preds_raw = model.predict(test_data, prediction_type='RawFormulaVal')

use_best_model: output the best model

    from sklearn.metrics import accuracy_score

    # For data preparation, see the libraries and dataset preparation section
    params = {
        'iterations': 500,
        'learning_rate': 0.1,
        'eval_metric': 'Accuracy',
        'random_seed': 666,
        'logging_level': 'Silent',
        'use_best_model': False,
    }
    # train
    train_pool = Pool(X_train, y_train, cat_features=categorical_features_indices)
    # validation
    validate_pool = Pool(X_validation, y_validation, cat_features=categorical_features_indices)

    # train with 'use_best_model': False
    model = CatBoostClassifier(**params)
    model.fit(train_pool, eval_set=validate_pool)

    # train with 'use_best_model': True
    best_model_params = params.copy()
    best_model_params.update({'use_best_model': True})
    best_model = CatBoostClassifier(**best_model_params)
    best_model.fit(train_pool, eval_set=validate_pool)

    # show result
    print('Simple model validation accuracy: {:.4}, and the number of trees: {}'.format(
        accuracy_score(y_validation, model.predict(X_validation)), model.tree_count_))
    print('')
    print('Best model validation accuracy: {:.4}, and the number of trees: {}'.format(
        accuracy_score(y_validation, best_model.predict(X_validation)), best_model.tree_count_))

Early stopping:

    # earlystop_model_1 is assumed to be a CatBoostClassifier created beforehand;
    # its construction is not shown here
    earlystop_model_1.fit(train_pool, eval_set=validate_pool,
                          early_stopping_rounds=200, verbose=20)

Showing feature importance:

    feature_importances = model.get_feature_importance(train_pool)
    feature_names = X_train.columns
    for score, name in sorted(zip(feature_importances, feature_names), reverse=True):
        print('{}: {}'.format(name, score))

Continuing training on top of a previously trained model:

    model = CatBoostClassifier(**current_params).fit(
        X_train, y_train, cat_features=categorical_features_indices)
    # Get baseline (only with prediction_type='RawFormulaVal')
    baseline = model.predict(X_train, prediction_type='RawFormulaVal')
    # Fit new model
    model.fit(X_train, y_train, cat_features=categorical_features_indices, baseline=baseline)

Letting the model handle categorical features:

    import numpy as np

    # np.float was removed in NumPy 1.24; the built-in float works the same way here
    categorical_features_indices = np.where(X.dtypes != float)[0]
    # categorical_features_indices can be passed to fit, or it can be set on the Pool
    # If the data passed to fit is a Pool, cat_features in fit must be None
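
To make the categorical-feature summary earlier in this post more concrete, here is a minimal pure-Python sketch of an ordered target statistic: each row is encoded using only the rows that come before it in a permutation, with a prior mixed in. This is an illustration under assumptions, not CatBoost's internal implementation; the function name, the `prior` and `prior_weight` parameters, and the exact smoothing formula are chosen here for the example.

```python
import random


def ordered_target_statistic(categories, labels, prior=0.5, prior_weight=1.0,
                             order=None, seed=0):
    """Encode one categorical column with a smoothed, order-dependent target mean.

    Hypothetical helper for illustration only. A row's encoding uses the labels
    of rows with the same category that appear earlier in `order`, so a row
    never sees its own label.
    """
    if order is None:
        # Shuffle so that rows with the same label are not grouped together.
        order = list(range(len(categories)))
        random.Random(seed).shuffle(order)

    label_sums = {}   # running sum of labels per category
    row_counts = {}   # running number of rows seen per category
    encoded = [0.0] * len(categories)

    for idx in order:
        cat = categories[idx]
        seen_sum = label_sums.get(cat, 0.0)
        seen_count = row_counts.get(cat, 0)
        # Smoothed mean of the labels seen so far for this category.
        encoded[idx] = (seen_sum + prior_weight * prior) / (seen_count + prior_weight)
        # Only now does the current row become visible to later rows.
        label_sums[cat] = seen_sum + labels[idx]
        row_counts[cat] = seen_count + 1
    return encoded


# A feature combination such as "blue dog" is just a new categorical column.
colors = ["blue", "red", "blue", "blue"]
kinds = ["dog", "dog", "cat", "dog"]
combined = [f"{color} {kind}" for color, kind in zip(colors, kinds)]

targets = [1, 0, 1, 1]
print(ordered_target_statistic(combined, targets))
```

A category seen for the first time falls back to the prior, which is why a single fixed ordering gives noisy encodings for early rows; using several permutations, as described above, spreads that noise out.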
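
The behaviour described for od_wait with the Iter detector (stop once a given number of iterations has passed since the best metric value) can be sketched without CatBoost at all. The function below is a hypothetical illustration of that rule; the name `iter_detector`, its return value, and the exact off-by-one convention are assumptions and may differ from what the library does internally.

```python
def iter_detector(metric_values, od_wait, lower_is_better=True):
    """Return (stop_iteration, best_iteration) for an Iter-style detector.

    Hypothetical helper for illustration only. Training is considered
    overfitted once `od_wait` iterations have passed without improving
    on the best metric value seen so far.
    """
    best_value = None
    best_iter = -1
    for i, value in enumerate(metric_values):
        if best_value is None:
            improved = True
        elif lower_is_better:
            improved = value < best_value
        else:
            improved = value > best_value

        if improved:
            best_value, best_iter = value, i
        elif i - best_iter >= od_wait:
            # od_wait iterations since the best one: stop here.
            return i, best_iter
    # The detector never fired; training ran through all iterations.
    return len(metric_values) - 1, best_iter


# Validation loss per iteration (made-up numbers for the example).
losses = [0.9, 0.8, 0.7, 0.75, 0.72, 0.71, 0.74]
stop_at, best_at = iter_detector(losses, od_wait=3)
print(stop_at, best_at)
```

With use_best_model enabled, the trees after `best_at` would be the ones dropped from the final model, which is why the two settings are usually combined.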