
A Python Implementation of the C4.5 Algorithm

Date: 2023-11-21 01:30:33  Views: 307940  Author: PPUF

This article walks through how to implement the C4.5 algorithm in Python.

1. Overview of the C4.5 Algorithm

C4.5 is a decision tree learning algorithm that builds a tree model by recursively partitioning the training data. Like ID3, it selects splitting attributes by how much they reduce entropy, but raw information gain is biased toward attributes with many distinct values. C4.5 corrects for this by dividing the information gain by the attribute's split information, and selects the attribute with the highest resulting gain ratio.
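To make the gain-ratio idea concrete, here is a minimal sketch on a tiny made-up dataset (the labels and branch split below are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A hypothetical 8-sample dataset: 5 positives, 3 negatives.
labels = ["yes"] * 5 + ["no"] * 3
print(round(entropy(labels), 4))  # ≈ 0.9544

# Suppose an attribute splits the samples into two branches of 4:
branch_a = ["yes", "yes", "yes", "no"]
branch_b = ["yes", "yes", "no", "no"]
gain = entropy(labels) - (4/8) * entropy(branch_a) - (4/8) * entropy(branch_b)

# Split information is the entropy of the branch sizes themselves;
# it penalizes attributes that shatter the data into many small branches.
split_info = -(4/8) * math.log2(4/8) - (4/8) * math.log2(4/8)  # = 1.0
gain_ratio = gain / split_info
print(round(gain_ratio, 4))  # ≈ 0.0488
```

With an even two-way split the penalty is exactly 1 bit, so gain and gain ratio coincide; an attribute with many rare values would have a much larger split information and a correspondingly smaller gain ratio.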

The following Python code implements the core of the algorithm:

import math

def entropy(data):
    """Shannon entropy (in bits) of the class labels; the label is the last column."""
    n = len(data)
    class_counts = {}
    for row in data:
        label = row[-1]
        class_counts[label] = class_counts.get(label, 0) + 1
    result = 0.0
    for count in class_counts.values():
        p = count / n
        result -= p * math.log2(p)
    return result

def gain_ratio(data, attribute_index):
    """C4.5 gain ratio: information gain divided by split information."""
    original_entropy = entropy(data)
    n = len(data)
    attribute_values = set(row[attribute_index] for row in data)
    gain = original_entropy
    split_info = 0.0
    for value in attribute_values:
        subset = [row for row in data if row[attribute_index] == value]
        p = len(subset) / n
        gain -= p * entropy(subset)
        split_info -= p * math.log2(p)
    # Guard against an attribute with a single value (split_info == 0).
    return gain / split_info if split_info > 0 else 0.0

def choose_best_attribute(data, attributes):
    """Return the attribute with the highest gain ratio, or None if no split helps."""
    best_gain = 0.0
    best_attribute = None
    for i, attribute in enumerate(attributes):
        gain = gain_ratio(data, i)
        if gain > best_gain:
            best_gain = gain
            best_attribute = attribute
    return best_attribute

def majority_class(data):
    """Most frequent class label in the dataset."""
    class_counts = {}
    for row in data:
        label = row[-1]
        class_counts[label] = class_counts.get(label, 0) + 1
    return max(class_counts, key=class_counts.get)

def create_decision_tree(data, attributes):
    class_labels = set(row[-1] for row in data)
    # All samples share one class: return a leaf.
    if len(class_labels) == 1:
        return class_labels.pop()
    # No attributes left: return the majority class.
    if len(attributes) == 0:
        return majority_class(data)
    best_attribute = choose_best_attribute(data, attributes)
    if best_attribute is None:  # no attribute yields positive gain
        return majority_class(data)
    col = attributes.index(best_attribute)
    decision_tree = {best_attribute: {}}
    new_attributes = attributes[:col] + attributes[col + 1:]
    for value in set(row[col] for row in data):
        # Drop the used column so row indices stay aligned with new_attributes.
        subset = [row[:col] + row[col + 1:] for row in data if row[col] == value]
        decision_tree[best_attribute][value] = create_decision_tree(subset, new_attributes)
    return decision_tree

2. Steps of the C4.5 Algorithm

1. Compute the entropy of the dataset.

2. For each candidate attribute, compute its information gain and its split information, and from them the gain ratio.

3. Select the attribute with the highest gain ratio as the splitting attribute.

4. Partition the dataset by the values of the selected attribute.

5. Recurse on each subset until a stopping condition is met (all samples belong to one class, or no attributes remain).
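Steps 1–3 can be sketched end-to-end on a small hypothetical weather dataset (the attribute names and rows below are invented for illustration):

```python
import math
from collections import Counter

def entropy(rows):
    """Entropy of the class labels (last column) of a row list."""
    n = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(rows, col):
    """Information gain of splitting on column `col`, divided by split info."""
    base = entropy(rows)
    n = len(rows)
    gain, split_info = base, 0.0
    for value in {row[col] for row in rows}:
        subset = [row for row in rows if row[col] == value]
        p = len(subset) / n
        gain -= p * entropy(subset)
        split_info -= p * math.log2(p)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical rows: [outlook, windy, play?]
rows = [
    ["sunny",    "yes", "no"],
    ["sunny",    "no",  "no"],
    ["overcast", "no",  "yes"],
    ["rain",     "no",  "yes"],
    ["rain",     "yes", "no"],
    ["overcast", "yes", "yes"],
]
attributes = ["outlook", "windy"]
best = max(range(len(attributes)), key=lambda i: gain_ratio(rows, i))
print(attributes[best])  # -> outlook
```

"outlook" wins because its three branches are far purer (two of them perfectly pure) than the mixed branches produced by "windy", even after the three-way split is penalized by its larger split information.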

3. A C4.5 Example

The following example applies the implementation above to iris classification:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

attributes = ['sepal length', 'sepal width', 'petal length', 'petal width']

# This implementation handles categorical attributes only, so the
# continuous iris features are first discretized to the nearest 0.5 cm.
def discretize(row):
    return [round(value * 2) / 2 for value in row]

data = [discretize(row) + [target] for row, target in zip(X_train, y_train)]

decision_tree = create_decision_tree(data, attributes)

predictions = []
for sample in X_test:
    sample = discretize(sample)
    node = decision_tree
    while isinstance(node, dict):
        attribute = list(node.keys())[0]
        value = sample[attributes.index(attribute)]
        branches = node[attribute]
        if value not in branches:
            # Feature value never seen in training: follow an arbitrary branch.
            value = next(iter(branches))
        node = branches[value]
    predictions.append(node)

accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
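As a sanity check on the same train/test split, scikit-learn's tree learner can serve as a baseline. Note that `DecisionTreeClassifier` implements CART with binary splits, not C4.5; setting `criterion="entropy"` only makes its split measure comparable:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

# CART with the entropy criterion -- not C4.5, but a close library baseline
# that needs no discretization of the continuous features.
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_train, y_train)
print("Baseline accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

If the hand-rolled tree lags far behind this baseline, the discretization granularity is the first thing to revisit.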

4. Summary

This article presented a Python implementation of the C4.5 algorithm, explained its principle and steps, and demonstrated its use on an iris classification example. C4.5 is a classic decision tree learning algorithm that remains effective in practice.
