
Representation: Mapping Raw Data to Features

Date: 2023-05-03 18:58:03 · Views: 191665 · Author: 2153

Contents: Mapping Raw Data to Features · Mapping Numeric Values · Mapping Categorical Values · Sparse Representation
In traditional programming, the focus is on code. In machine learning projects, the focus shifts to representation. That is, one way developers hone a model is by adding and improving its features.


Mapping Raw Data to Features

The left side of Figure 1 illustrates raw data from an input data source; the right side illustrates a feature vector, which is the set of floating-point values comprising the examples in your data set. Feature engineering means transforming raw data into a feature vector. Expect to spend significant time doing feature engineering.


Many machine learning models must represent the features as real-numbered vectors since the feature values must be multiplied by the model weights.

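As a concrete sketch of why features must be real numbers: a linear model's prediction is just a weighted sum of the feature vector (all names and values below are illustrative):

```python
# A linear model multiplies each feature by a learned weight and sums
# the results, so every feature must be representable as a number.
features = [6.0, 1.0, 0.0]   # real-valued feature vector (illustrative)
weights = [0.5, 2.0, -1.0]   # learned model weights (illustrative)
bias = 0.1

# Weighted sum: bias + sum of feature_i * weight_i
prediction = bias + sum(f * w for f, w in zip(features, weights))
print(prediction)  # 5.1
```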

Mapping Numeric Values

Integer and floating-point data don’t need a special encoding because they can be multiplied by a numeric weight. As suggested in Figure 2, converting the raw integer value 6 to the feature value 6.0 is trivial:


Mapping Categorical Values

Categorical features have a discrete set of possible values. For example, there might be a feature called street_name with options that include:


{'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'}

Since models cannot multiply strings by the learned weights, we use feature engineering to convert strings to numeric values.


We can accomplish this by defining a mapping from the feature values, which we’ll refer to as the vocabulary of possible values, to integers. Since not every street in the world will appear in our dataset, we can group all other streets into a catch-all “other” category, known as an OOV (out-of-vocabulary) bucket.


Using this approach, here’s how we can map our street names to numbers:

map Charleston Road to 0
map North Shoreline Boulevard to 1
map Shorebird Way to 2
map Rengstorff Avenue to 3
map everything else (OOV) to 4
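A minimal sketch of such a vocabulary mapping in Python (the helper name `encode_street` and the unseen street name are illustrative):

```python
# Vocabulary mapping from street names to integer indices, with a
# catch-all OOV (out-of-vocabulary) bucket for unseen streets.
vocabulary = [
    'Charleston Road',
    'North Shoreline Boulevard',
    'Shorebird Way',
    'Rengstorff Avenue',
]
street_to_index = {name: i for i, name in enumerate(vocabulary)}
OOV_INDEX = len(vocabulary)  # 4: the catch-all "other" bucket

def encode_street(name):
    """Map a street name to its vocabulary index, or to the OOV bucket."""
    return street_to_index.get(name, OOV_INDEX)

print(encode_street('Shorebird Way'))  # 2
print(encode_street('Castro Street'))  # 4 (OOV)
```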

However, if we incorporate these index numbers directly into our model, it will impose some constraints that might be problematic:


We’ll be learning a single weight that applies to all streets. For example, if we learn a weight of 6 for street_name, then we will multiply it by 0 for Charleston Road, by 1 for North Shoreline Boulevard, 2 for Shorebird Way and so on. Consider a model that predicts house prices using street_name as a feature. It is unlikely that there is a linear adjustment of price based on the street name, and furthermore this would assume you have ordered the streets based on their average house price. Our model needs the flexibility of learning different weights for each street that will be added to the price estimated using the other features.

We aren’t accounting for cases where street_name may take multiple values. For example, many houses are located at the corner of two streets, and there’s no way to encode that information in the street_name value if it contains a single index.



To remove both these constraints, we can instead create a binary vector for each categorical feature in our model that represents values as follows:


For values that apply to the example, set corresponding vector elements to 1.
Set all other elements to 0.
The length of this vector is equal to the number of elements in the vocabulary. This representation is called a one-hot encoding when a single value is 1, and a multi-hot encoding when multiple values are 1.

Figure 3 illustrates a one-hot encoding of a particular street: Shorebird Way. The element in the binary vector for Shorebird Way has a value of 1, while the elements for all other streets have values of 0.


This approach effectively creates a Boolean variable for every feature value (e.g., street name). Here, if a house is on Shorebird Way then the binary value is 1 only for Shorebird Way. Thus, the model uses only the weight for Shorebird Way.


Similarly, if a house is at the corner of two streets, then two binary values are set to 1, and the model uses both their respective weights.

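Both the one-hot and multi-hot cases above can be sketched as follows (the `multi_hot` helper is illustrative; the last vector element serves as the OOV bucket):

```python
# One-hot / multi-hot encoding over a street vocabulary.
vocabulary = [
    'Charleston Road',
    'North Shoreline Boulevard',
    'Shorebird Way',
    'Rengstorff Avenue',
]

def multi_hot(streets):
    """Binary vector with a 1 for each street the example lies on.

    The final element is the OOV bucket for streets not in the vocabulary.
    """
    vec = [0] * (len(vocabulary) + 1)
    for name in streets:
        idx = vocabulary.index(name) if name in vocabulary else len(vocabulary)
        vec[idx] = 1
    return vec

# A house on a single street -> one-hot encoding
print(multi_hot(['Shorebird Way']))  # [0, 0, 1, 0, 0]

# A house at the corner of two streets -> multi-hot encoding
print(multi_hot(['Shorebird Way', 'Rengstorff Avenue']))  # [0, 0, 1, 1, 0]
```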

One-hot encoding extends to numeric data that you do not want to directly multiply by a weight, such as a postal code.

Sparse Representation

Suppose that you had 1,000,000 different street names in your data set that you wanted to include as values for street_name. Explicitly creating a binary vector of 1,000,000 elements where only 1 or 2 elements are true is a very inefficient representation in terms of both storage and computation time when processing these vectors. In this situation, a common approach is to use a sparse representation in which only nonzero values are stored. In sparse representations, an independent model weight is still learned for each feature value, as described above.

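A minimal sketch of a sparse representation: store only the set of nonzero indices, and compute the weighted sum directly from them (all indices and weight values below are illustrative):

```python
# Sparse representation: instead of a dense 1,000,000-element binary
# vector with only one or two 1s, store just the nonzero indices.
VOCAB_SIZE = 1_000_000

# A house at the corner of two streets: only two indices are nonzero.
sparse_example = {2, 714_285}  # illustrative street indices

def sparse_dot(nonzero_indices, weights):
    """Dot product of a binary sparse vector with a dense weight vector.

    Because the stored entries are all 1, this is just the sum of the
    weights at the nonzero positions.
    """
    return sum(weights[i] for i in nonzero_indices)

# Each feature value still gets its own independent learned weight.
weights = [0.0] * VOCAB_SIZE
weights[2] = 1.5
weights[714_285] = -0.25
print(sparse_dot(sparse_example, weights))  # 1.25
```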
