
The LightGBM Algorithm: Tuning LGBM Parameters

Posted: 2023-05-05 18:02:25 · Reads: 190794 · Author: 3867

I. Understanding LGBM Parameters

LightGBM (LGBM) is Microsoft's light gradient boosting machine: a regression and classification tree model whose main selling point is speed. Before using LGBM, start by looking up what its parameters mean:
The official documentation on Microsoft's GitHub:
https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst#early_stopping_round
The LGBM Chinese manual:
http://lightgbm.apachecn.org/#/docs/2

This gives a rough picture of the parameter settings:
1. Core parameters:
"objective": "regression",
"boosting": "gbdt",
"learning_rate": 0.05,
"num_iterations": 100,
"num_leaves": 31,
"num_threads": -1,
2. Learning-control parameters:
"max_depth": 10,
"feature_fraction": 1,
"bagging_fraction": 0.8,
"bagging_freq": 8,
"lambda_l1": 0,
"lambda_l2": 0,
"min_data_in_leaf": 20,
"min_sum_hessian_in_leaf": 1e-3,
"early_stopping_round"
3. Metric:
"metric": "rmse" (many options are available; pick one that matches the problem)

The nature of the problem determines "metric" and "objective" (here "regression"), so the parameters that actually need tuning are: "learning_rate", "num_iterations", max_depth and num_leaves, min_data_in_leaf and min_sum_hessian_in_leaf, feature_fraction and bagging_fraction, and the regularization parameters lambda_l1 (reg_alpha) and lambda_l2 (reg_lambda).

II. The Tuning Process

Code found online can look a bit messy because different posts use different tools; here is a consolidated summary.
References:
A Jianshu post (which also covers XGBoost along the way):
https://www.jianshu.com/p/1100e333fcab
A CSDN post:
https://blog.csdn.net/ssswill/article/details/85235074
An IMOOC article:
https://www.imooc.com/article/43784?block_id=tuijian_wz

1. Using sklearn's GridSearchCV

An example first:

import pandas as pd
import lightgbm as lgb
# grid_search was removed in modern scikit-learn; GridSearchCV lives in model_selection
from sklearn.model_selection import GridSearchCV  # for performing grid search
from sklearn.model_selection import train_test_split

train_data = pd.read_csv('train.csv')  # read the data
y = train_data.pop('30').values  # pop the label column out as the training target; '30' is the label column name
col = train_data.columns
x = train_data[col].values  # the remaining columns become the training features
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.333, random_state=0)  # split into training and validation sets
train = lgb.Dataset(train_x, train_y)
valid = lgb.Dataset(valid_x, valid_y, reference=train)

parameters = {
    'max_depth': [15, 20, 25, 30, 35],
    'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
    'feature_fraction': [0.6, 0.7, 0.8, 0.9, 0.95],
    'bagging_fraction': [0.6, 0.7, 0.8, 0.9, 0.95],
    'bagging_freq': [2, 4, 5, 6, 8],
    'lambda_l1': [0, 0.1, 0.4, 0.5, 0.6],
    'lambda_l2': [0, 10, 15, 35, 40],
    'cat_smooth': [1, 10, 15, 20, 35]
}
gbm = lgb.LGBMClassifier(boosting_type='gbdt',
                         objective='binary',
                         metric='auc',
                         verbose=0,
                         learning_rate=0.01,
                         num_leaves=35,
                         feature_fraction=0.8,
                         bagging_fraction=0.9,
                         bagging_freq=8,
                         lambda_l1=0.6,
                         lambda_l2=0)
# With GridSearchCV we do not need to call fit on the estimator ourselves
gsearch = GridSearchCV(gbm, param_grid=parameters, scoring='accuracy', cv=3)
gsearch.fit(train_x, train_y)

print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters set:")
best_parameters = gsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

The parameters show that this code addresses a binary classification problem (objective='binary') scored by AUC; verbose controls how many iterations pass between log lines.
Next, the key piece: the GridSearchCV() function.
Official documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
Only a few parameters really matter:
GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score='warn')
(this is an older signature; recent scikit-learn versions have removed fit_params and iid)
estimator: the estimator object
param_grid: a dict mapping parameter names to lists of candidate values
scoring: the evaluation metric; multiple metrics can be passed as a list or dict
n_jobs: number of parallel jobs; -1 uses all cores
cv: number of cross-validation folds; the (train_x, train_y) passed to gsearch.fit(train_x, train_y) is split into cv equal parts, with one part held out for validation in turn

For scoring, check which metrics GridSearchCV supports:
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
To use another metric, import the corresponding function and pass the matching string for it to scoring. One thing must be said here: every scoring parameter in sklearn's model evaluation follows the convention that higher return values are better than lower return values.
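For metrics without a predefined string name, sklearn's make_scorer can wrap a plain metric function into something GridSearchCV accepts as scoring (a sketch; the RMSE helper below is illustrative, not from the article):

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Because sklearn scorers must follow "higher is better",
# greater_is_better=False makes make_scorer negate the value internally,
# so best_score_ will come out as a negative RMSE.
rmse_scorer = make_scorer(rmse, greater_is_better=False)
# Usage: GridSearchCV(estimator, param_grid, scoring=rmse_scorer, cv=3)
```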

Drawback: GridSearchCV cannot use the estimator's built-in early stopping, so the parameters it finds may well overfit. Setting conservative parameter ranges and using cross-validation mitigates this, but only partially.
