pandas data types: converting data types in pandas

pandas category data type

In practical work with pandas you will often meet categorical data. It is usually displayed as strings: colors (red, green, blue), sizes (large, medium, small), geographic information (country, province), and so on. Handling such data tends to raise all sorts of problems. Both pandas and scikit-learn can convert categorical data into a suitable numeric format. This post shows how to use the two packages to turn categorical data into numeric data, which is the process known as encoding.

The data comes from the UCI Machine Learning Repository. This dataset contains many categorical columns; the meaning of each field is described on the dataset's page in the repository.

Start by importing the packages we need:

    import numpy as np
    import pandas as pd

    # Define the names of the data columns
    headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
               "num_doors", "body_style", "drive_wheels", "engine_location",
               "wheel_base", "length", "width", "height", "curb_weight",
               "engine_type", "num_cylinders", "engine_size", "fuel_system",
               "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
               "city_mpg", "highway_mpg", "price"]

    # Read the CSV file with pandas, marking "?" as a missing value (NaN)
    df = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                     header=None, names=headers, na_values="?")
    df.head()

       symboling  normalized_losses         make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  wheel_base  ...  \
    0          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6  ...
    1          3                NaN  alfa-romero        gas         std        two  convertible           rwd            front        88.6  ...
    2          1                NaN  alfa-romero        gas         std        two    hatchback           rwd            front        94.5  ...
    3          2              164.0         audi        gas         std       four        sedan           fwd            front        99.8  ...
    4          2              164.0         audi        gas         std       four        sedan           4wd            front        99.4  ...

       engine_size  fuel_system  bore  stroke  compression_ratio  horsepower  peak_rpm  city_mpg  highway_mpg    price
    0          130         mpfi  3.47    2.68                9.0       111.0    5000.0        21           27  13495.0
    1          130         mpfi  3.47    2.68                9.0       111.0    5000.0        21           27  16500.0
    2          152         mpfi  2.68    3.47                9.0       154.0    5000.0        19           26  16500.0
    3          109         mpfi  3.19    3.40               10.0       102.0    5500.0        24           30  13950.0
    4          136         mpfi  3.19    3.40                8.0       115.0    5500.0        18           22  17450.0

5 rows × 26 columns
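
The effect of `na_values="?"` can be checked without downloading anything. The snippet below is a minimal sketch: the three-row `sample` string and the shortened column list are made up for illustration and are not rows from the real file.

```python
import io

import pandas as pd

# Hypothetical excerpt in the same comma-separated layout as imports-85.data,
# trimmed to four columns for brevity (not actual rows from the dataset).
sample = io.StringIO(
    "3,?,alfa-romero,two\n"
    "2,164,audi,four\n"
    "1,?,dodge,?\n"
)
cols = ["symboling", "normalized_losses", "make", "num_doors"]

# "?" is parsed as NaN, so normalized_losses becomes a float column
sample_df = pd.read_csv(sample, header=None, names=cols, na_values="?")

# Count the missing values in each column
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)
```

Without `na_values="?"`, the `normalized_losses` column would be read as strings, because the question marks cannot be parsed as numbers.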

    df.dtypes

    symboling              int64
    normalized_losses    float64
    make                  object
    fuel_type             object
    aspiration            object
    num_doors             object
    body_style            object
    drive_wheels          object
    engine_location       object
    wheel_base           float64
    length               float64
    width                float64
    height               float64
    curb_weight            int64
    engine_type           object
    num_cylinders         object
    engine_size            int64
    fuel_system           object
    bore                 float64
    stroke               float64
    compression_ratio    float64
    horsepower           float64
    peak_rpm             float64
    city_mpg               int64
    highway_mpg            int64
    price                float64
    dtype: object

    # If we only care about the categorical data there is no need to keep
    # every column: select the object-typed columns and analyze those.
    obj_df = df.select_dtypes(include=['object']).copy()
    obj_df.head()

              make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
    0  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
    1  alfa-romero        gas         std        two  convertible           rwd            front         dohc           four         mpfi
    2  alfa-romero        gas         std        two    hatchback           rwd            front         ohcv            six         mpfi
    3         audi        gas         std       four        sedan           fwd            front          ohc           four         mpfi
    4         audi        gas         std       four        sedan           4wd            front          ohc           five         mpfi

    # Before going further, the missing values have to be dealt with.
    # any(axis=1) checks across the columns of each row, so this selects
    # every row that contains at least one missing value.
    obj_df[obj_df.isnull().any(axis=1)]

               make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
    27        dodge        gas       turbo        NaN        sedan           fwd            front          ohc           four         mpfi
    63        mazda     diesel         std        NaN        sedan           fwd            front          ohc           four          idi

    # There are many ways to handle missing values: depending on the project,
    # you either fill them in or drop the sample. Here the missing values
    # are filled with the mode of the column.
    obj_df.num_doors.value_counts()

    four    114
    two      89
    Name: num_doors, dtype: int64

    obj_df = obj_df.fillna({"num_doors": "four"})

With the missing values handled, the categorical data can be encoded in the following ways:

Find and Replace
Label encoding
One Hot encoding
Custom Binary encoding
sklearn
Advanced Approaches

    # The pandas documentation for replace is extensive: the method takes
    # many parameters and is correspondingly powerful.
    # Here we build a mapping dictionary and let replace clean up the
    # columns that need it, for example:
    cleanup_nums = {
        "num_doors": {"four": 4, "two": 2},
        "num_cylinders": {
            "four": 4, "six": 6, "five": 5, "eight": 8,
            "two": 2, "twelve": 12, "three": 3
        }
    }
    obj_df = obj_df.replace(cleanup_nums)
    obj_df.head()
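              make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system
    0  alfa-romero        gas         std          2  convertible           rwd            front         dohc              4         mpfi
    1  alfa-romero        gas         std          2  convertible           rwd            front         dohc              4         mpfi
    2  alfa-romero        gas         std          2    hatchback           rwd            front         ohcv              6         mpfi
    3         audi        gas         std          4        sedan           fwd            front          ohc              4         mpfi
    4         audi        gas         std          4        sedan           4wd            front          ohc              5         mpfi

Label encoding

Label encoding turns a set of values that have no natural order or magnitude into numbers. The body_style field, for example, contains several distinct values, which this method converts as follows:

convertible > 0
hardtop > 1
hatchback > 2
sedan > 3
wagon > 4

It works rather like a cipher. I find the comparison amusing; it reminds me of a line from a movie I once watched, about two people being "as close as a pair of thieves".

    # With the pandas category dtype it is easy to obtain this encoding
    obj_df["body_style"] = obj_df["body_style"].astype("category")
    obj_df.dtypes

    make                 object
    fuel_type            object
    aspiration           object
    num_doors             int64
    body_style         category
    drive_wheels         object
    engine_location      object
    engine_type          object
    num_cylinders         int64
    fuel_system          object
    dtype: object

    # Assign the codes to a new column to keep them.
    # This keeps the data tidy for later analysis and cleanup.
    obj_df["body_style_code"] = obj_df["body_style"].cat.codes
    obj_df.head()

              make  fuel_type  aspiration  num_doors   body_style  drive_wheels  engine_location  engine_type  num_cylinders  fuel_system  body_style_code
    0  alfa-romero        gas         std          2  convertible           rwd            front         dohc              4         mpfi                0
    1  alfa-romero        gas         std          2  convertible           rwd            front         dohc              4         mpfi                0
    2  alfa-romero        gas         std          2    hatchback           rwd            front         ohcv              6         mpfi                2
    3         audi        gas         std          4        sedan           fwd            front          ohc              4         mpfi                3
    4         audi        gas         std          4        sedan           4wd            front          ohc              5         mpfi                3

One hot encoding

Label encoding maps wagon to 4 and convertible to 0, which seems to imply that one is larger than the other; a model may read that as a real ordering. One hot encoding avoids this by turning each value of the feature into its own column holding 0 or 1. That increases the number of columns, but removes the misleading sense of magnitude that label encoding introduces. pandas provides the get_dummies function, which converts the values of the chosen columns into 0/1 codes.

    # The new DataFrame contains three newly generated columns:
    #   drive_wheels_4wd
    #   drive_wheels_fwd
    #   drive_wheels_rwd
    # Note: since pandas 2.0 the new columns are boolean by default;
    # pass dtype=int to get the 0/1 values shown here.
    pd.get_dummies(obj_df, columns=["drive_wheels"]).head()

              make  fuel_type  aspiration  num_doors   body_style  engine_location  engine_type  num_cylinders  fuel_system  \
    0  alfa-romero        gas         std          2  convertible            front         dohc              4         mpfi
    1  alfa-romero        gas         std          2  convertible            front         dohc              4         mpfi
    2  alfa-romero        gas         std          2    hatchback            front         ohcv              6         mpfi
    3         audi        gas         std          4        sedan            front          ohc              4         mpfi
    4         audi        gas         std          4        sedan            front          ohc              5         mpfi

       body_style_code  drive_wheels_4wd  drive_wheels_fwd  drive_wheels_rwd
    0                0                 0                 0                 1
    1                0                 0                 0                 1
    2                2                 0                 0                 1
    3                3                 0                 1                 0
    4                3                 1                 0                 0

    # What makes this function powerful is that it can handle several
    # categorical columns at once, with a prefix chosen for each of them.
    # The resulting DataFrame contains all of the data.
    pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"],
                   prefix=["body", "drive"]).head()

              make  fuel_type  aspiration  num_doors  engine_location  engine_type  num_cylinders  fuel_system  body_style_code  \
    0  alfa-romero        gas         std          2            front         dohc              4         mpfi                0
    1  alfa-romero        gas         std          2            front         dohc              4         mpfi                0
    2  alfa-romero        gas         std          2            front         ohcv              6         mpfi                2
    3         audi        gas         std          4            front          ohc              4         mpfi                3
    4         audi        gas         std          4            front          ohc              5         mpfi                3

       body_convertible  body_hardtop  body_hatchback  body_sedan  body_wagon  drive_4wd  drive_fwd  drive_rwd
    0                 1             0               0           0           0          0          0          1
    1                 1             0               0           0           0          0          0          1
    2                 0             0               1           0           0          0          0          1
    3                 0             0               0           1           0          0          1          0
    4                 0             0               0           1           0          1          0          0

Custom 0/1 encoding

Sometimes the business problem calls for a binary encoding that combines ideas from label encoding and one hot encoding.

    obj_df["engine_type"].value_counts()

    ohc      148
    ohcf      15
    ohcv      13
    dohc      12
    l         12
    rotor      4
    dohcv      1
    Name: engine_type, dtype: int64

    # To tell whether an engine_type uses OHC technology, the column can
    # be binarized: 1 if the value contains "ohc", 0 otherwise.
    # This also shows how domain knowledge can solve a problem in the
    # most effective way.
    obj_df["engine_type_code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
    obj_df[["make", "engine_type", "engine_type_code"]].head()

              make  engine_type  engine_type_code
    0  alfa-romero         dohc                 1
    1  alfa-romero         dohc                 1
    2  alfa-romero         ohcv                 1
    3         audi          ohc                 1
    4         audi          ohc                 1

Data conversion in scikit-learn

The sklearn.preprocessing module offers many convenient tools for data conversion and missing-value handling (Imputer; from scikit-learn 0.22 onward this role is filled by SimpleImputer in sklearn.impute). LabelEncoder and LabelBinarizer can be imported directly from the module. It also provides min-max scaling to the [0, 1] range, Normalizer (L1/L2 normalization, not used very often), standardization, nonlinear transformations, and polynomial feature generation (PolynomialFeatures), as well as ways to scale every feature to the same range or distribution.

See the official documentation of the sklearn preprocessing module and of the category_encoders package.

That covers the essentials of data preprocessing and categorical data conversion.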
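
As a rough illustration of the scikit-learn encoders named above, here is a minimal sketch using LabelEncoder and LabelBinarizer. The small `toy` DataFrame is a made-up stand-in for the body_style column, not the real dataset, and the variable names are chosen only for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Hypothetical stand-in for the body_style column (not the real data)
toy = pd.DataFrame(
    {"body_style": ["convertible", "hatchback", "sedan", "sedan", "wagon"]}
)

# Label encoding: each distinct value gets an integer code,
# assigned in sorted order of the class names
le = LabelEncoder()
toy["body_style_code"] = le.fit_transform(toy["body_style"])

# One hot style encoding: one 0/1 column per distinct value
lb = LabelBinarizer()
onehot = pd.DataFrame(lb.fit_transform(toy["body_style"]), columns=lb.classes_)

print(toy)
print(onehot)
```

The fitted encoders keep the mapping in `classes_`, so the same codes can be applied to new data with `transform` and turned back into labels with `inverse_transform`. Note that the scikit-learn documentation describes both classes as meant for target labels; for input features it points to OrdinalEncoder and OneHotEncoder instead.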

Posted on 2018-08-02 15:53 by 多一点