CNN convolutional neural network diagrams, and a random forest algorithm application example

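Overview of the random forest algorithm:

Random forest (RF) is a flexible machine learning algorithm.

At its core, it is a classifier that uses many trees to train on and predict samples. It is widely used across many areas of machine learning.

For example, it is used to build models for disease surveillance and for predicting patient susceptibility. The author has personally applied the RF algorithm to the detection and analysis of cancerous cells.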

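How the random forest algorithm works:

In essence, a random forest is an algorithm that combines multiple decision trees through ensemble learning.

So let us first take a brief look at ensemble learning.

Ensemble learning builds and combines multiple learners to complete a learning task; it is also known as a multi-classifier system.

Ensemble learning methods fall into two groups: sequential ensemble methods and parallel ensemble methods. Random forest is a typical representative of the latter.

Parallel ensemble methods exploit the independence between base learners: averaging their results lowers the probability of error.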
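To make the averaging argument concrete, here is a minimal sketch under an idealized assumption: every base classifier is correct independently with the same probability `p`. Real trees in a forest are only partly independent, so the numbers are optimistic. The function name `majority_vote_accuracy` is hypothetical and chosen only for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_learners: int, p: float) -> float:
    """Probability that a strict majority of independent learners is correct.

    Assumes n_learners is odd (so there are no ties) and that each learner
    is correct independently with probability p.
    """
    need = n_learners // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_learners, k) * p**k * (1 - p) ** (n_learners - k)
        for k in range(need, n_learners + 1)
    )


if __name__ == "__main__":
    # With p > 0.5 the ensemble accuracy grows as learners are added.
    for n in (1, 5, 21, 101):
        print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

Under these assumptions the benefit only appears when each learner is better than chance (`p > 0.5`); for `p < 0.5` the majority vote gets worse as learners are added.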

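Let me briefly explain how a decision tree works.

A decision tree is a method for partitioning a dataset.

The data is split into subsets with similar values, and these splits build up a complete tree. Each non-leaf node of a decision tree is a test on the feature attribute assigned to it.

Testing a feature attribute produces several branches. Each branch corresponds to the subset of the data for which the tested attribute falls within a certain range of values.

Each leaf node of the decision tree holds an output result.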
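As a small illustration of this structure, the sketch below hand-builds a tiny tree as nested dictionaries and walks a sample from the root down to a leaf. The two features (height in cm, weight in kg), the thresholds, and the labels are made up for illustration, and the names `tree` and `classify` are hypothetical.

```python
# A non-leaf node tests one feature against a threshold; a leaf is a plain label.
tree = {
    "index": 0,        # test feature 0 (height)
    "value": 170.0,    # rows with height < 170 go left, the rest go right
    "left": "small",   # leaf
    "right": {
        "index": 1,    # test feature 1 (weight)
        "value": 80.0,
        "left": "medium",
        "right": "large",
    },
}


def classify(node, row):
    """Follow the feature tests from the root until a leaf label is reached."""
    while isinstance(node, dict):
        branch = "left" if row[node["index"]] < node["value"] else "right"
        node = node[branch]
    return node


if __name__ == "__main__":
    print(classify(tree, [160.0, 90.0]))
    print(classify(tree, [180.0, 70.0]))
    print(classify(tree, [180.0, 85.0]))
```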

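Putting ensemble learning and decision trees together, we can conclude that a random forest is a machine learning algorithm built on the ensemble learning idea and made up of many largely uncorrelated decision trees. Its accuracy is typically higher than that of a single decision tree.

Steps for building a random forest:

1. Let N denote the number of training instances (samples) and M the number of features.

2. Choose the number of features m used to decide the split at a node of a decision tree, where m should be much smaller than M.

3. Sample N times with replacement from the N training instances to form a training set (that is, bootstrap sampling), and use the instances that were not drawn to make predictions and estimate the error.

4. At each node, randomly select m features and decide the split at that node from these features alone, computing the best split among them. Every tree is grown fully without pruning (pruning is something that might be applied after building an ordinary tree classifier).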
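The two sources of randomness in the steps above, row sampling with replacement and a random feature subset per node, can be sketched as follows. This is a minimal illustration using only the standard library; the function names `bootstrap_sample` and `random_feature_subset` are hypothetical, and working with indices rather than the rows themselves is a choice made for this sketch.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement; also return the rows never drawn."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


def random_feature_subset(n_features, m, rng):
    """Pick m distinct feature indices to consider when splitting one node."""
    return rng.sample(range(n_features), m)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so that repeated runs agree
    in_bag, oob = bootstrap_sample(1000, rng)
    print("in-bag draws:", len(in_bag))
    print("out-of-bag fraction:", len(oob) / 1000)
    print("features for one node:", random_feature_subset(10, 3, rng))
```

For large sample counts, a given row is missed by all draws with probability close to 1/e, so roughly 37% of the rows end up out of bag for each tree and can serve as that tree's held-out evaluation set.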

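Advantages of random forests: the trees are independent of one another, so the algorithm parallelizes naturally and trains quickly; it can be used to assess interactions between different features; and it is resistant to overfitting.

Disadvantages of random forests: random forests have been shown to overfit on some noisy classification and regression problems.

Python implementation: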

    """
    Principle: a random forest is a classifier made up of multiple decision
    trees; its output class is the class predicted most often by the
    individual trees.
    """
    from csv import reader
    from random import seed
    from random import randint
    import numpy as np


    # CART tree: split the dataset on one feature and one threshold
    def data_split(index, value, dataset):
        left, right = list(), list()
        for row in dataset:
            if row[index] < value:
                left.append(row)
            else:
                right.append(row)
        return left, right


    # Gini index of a candidate split (lower is better)
    def calc_gini(groups, class_values):
        gini = 0.0
        total_size = 0
        for group in groups:
            total_size += len(group)
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            for class_value in class_values:
                proportion = [row[-1] for row in group].count(class_value) / float(size)
                gini += (size / float(total_size)) * (proportion * (1.0 - proportion))
        return gini


    # Find the best split point
    def get_split(dataset, n_features):
        class_values = list(set(row[-1] for row in dataset))
        b_index, b_value, b_score, b_groups = 999, 999, 999, None
        features = list()
        # Add n_features distinct, randomly chosen feature indices to features
        while len(features) < n_features:
            index = randint(0, len(dataset[0]) - 2)
            if index not in features:
                features.append(index)
        for index in features:
            for row in dataset:
                groups = data_split(index, row[index], dataset)
                gini = calc_gini(groups, class_values)
                if gini < b_score:
                    b_index, b_value, b_score, b_groups = index, row[index], gini, groups
        # Each node is a dictionary
        return {'index': b_index, 'value': b_value, 'groups': b_groups}


    # Majority vote
    def to_terminal(group):
        outcomes = [row[-1] for row in group]
        return max(set(outcomes), key=outcomes.count)


    # Split a node recursively
    def split(node, max_depth, min_size, n_features, depth):
        left, right = node['groups']
        del (node['groups'])
        if not left or not right:
            node['left'] = node['right'] = to_terminal(left + right)
            return
        if depth >= max_depth:
            node['left'], node['right'] = to_terminal(left), to_terminal(right)
            return
        if len(left) <= min_size:
            node['left'] = to_terminal(left)
        else:
            node['left'] = get_split(left, n_features)
            split(node['left'], max_depth, min_size, n_features, depth + 1)
        if len(right) <= min_size:
            node['right'] = to_terminal(right)
        else:
            node['right'] = get_split(right, n_features)
            split(node['right'], max_depth, min_size, n_features, depth + 1)


    # Build one tree
    def build_one_tree(train, max_depth, min_size, n_features):
        root = get_split(train, n_features)
        split(root, max_depth, min_size, n_features, 1)
        return root


    # Predict with one tree
    def predict(node, row):
        if row[node['index']] < node['value']:
            if isinstance(node['left'], dict):
                return predict(node['left'], row)
            else:
                return node['left']
        else:
            if isinstance(node['right'], dict):
                return predict(node['right'], row)
            else:
                return node['right']


    # Random forest class
    class randomForest:
        def __init__(self, trees_num, max_depth, leaf_min_size, sample_ratio, feature_ratio):
            self.trees_num = trees_num  # number of trees in the forest
            self.max_depth = max_depth  # maximum tree depth
            self.leaf_min_size = leaf_min_size  # stop splitting at or below this many samples
            self.samples_split_ratio = sample_ratio  # fraction of rows drawn per tree (row sampling)
            self.feature_ratio = feature_ratio  # fraction of features per split (column sampling)
            self.trees = list()  # the forest

        # Sample with replacement to create a data subset
        def sample_split(self, dataset):
            sample = list()
            n_sample = round(len(dataset) * self.samples_split_ratio)
            while len(sample) < n_sample:
                # randint is inclusive at both ends, so the last row is len(dataset) - 1
                index = randint(0, len(dataset) - 1)
                sample.append(dataset[index])
            return sample

        # Build the random forest
        def build_randomforest(self, train):
            max_depth = self.max_depth
            min_size = self.leaf_min_size
            n_trees = self.trees_num
            # Column sampling: choose m of the M features (m much smaller than M),
            # but always at least one
            n_features = max(1, int(self.feature_ratio * (len(train[0]) - 1)))
            for i in range(n_trees):
                sample = self.sample_split(train)
                tree = build_one_tree(sample, max_depth, min_size, n_features)
                self.trees.append(tree)
            return self.trees

        # Majority vote over the predictions of all trees
        def bagging_predict(self, onetestdata):
            predictions = [predict(tree, onetestdata) for tree in self.trees]
            return max(set(predictions), key=predictions.count)

        # Accuracy of the forest, in percent
        def accuracy_metric(self, testdata):
            correct = 0
            for i in range(len(testdata)):
                predicted = self.bagging_predict(testdata[i])
                if testdata[i][-1] == predicted:
                    correct += 1
            return correct / float(len(testdata)) * 100.0


    # Data loading
    def load_csv(filename):
        dataset = list()
        with open(filename, 'r') as file:
            csv_reader = reader(file)
            for row in csv_reader:
                if not row:
                    continue
                dataset.append(row)
        return dataset


    # Split into training and test data; by default 20% is held out for testing
    def split_train_test(dataset, ratio=0.2):
        num = len(dataset)
        train_num = int((1 - ratio) * num)
        dataset_copy = list(dataset)
        traindata = list()
        while len(traindata) < train_num:
            index = randint(0, len(dataset_copy) - 1)
            traindata.append(dataset_copy.pop(index))
        testdata = dataset_copy
        return traindata, testdata


    # Test
    if __name__ == '__main__':
        train_feat = np.load("train_feat.npy")
        train_label = np.load("train_label.npy")
        test_feat = np.load("test_feat.npy")
        test_label = np.load("test_label.npy")
        train_label = np.expand_dims(train_label, 1)
        test_label = np.expand_dims(test_label, 1)
        # Append the label as the last column of each row
        traindata = np.concatenate((train_feat, train_label), axis=1)
        testdata = np.concatenate((test_feat, test_label), axis=1)

        # The trees should not be too deep, otherwise they overfit easily
        max_depth = 20
        min_size = 1
        sample_ratio = 1
        trees_num = 20
        feature_ratio = 0.3

        RF = randomForest(trees_num, max_depth, min_size, sample_ratio, feature_ratio)
        RF.build_randomforest(traindata)
        acc = RF.accuracy_metric(testdata)
        print('Model accuracy:', acc, '%')