Preprocessing of numeric features
Machine learning models can be roughly divided into two groups:
- Tree-based models
- Non-tree-based models
For tree-based models, such as a decision tree classifier, scaling the numeric features does not change the relative positions of the split points, so it does not affect the structure of the tree. In principle, such models therefore need no preprocessing of numeric features.
For non-tree-based models, such as linear models, KNN, and neural networks, model quality does depend on feature scale. Below are the most common preprocessing methods for numeric features.
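The scale-invariance of tree models can be checked directly. A minimal sketch on synthetic data (the features and labels are made up for illustration): a decision tree trained on rescaled features produces the same predictions as one trained on the raw features.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic labels

# Fit the same tree on the raw features and on a copy scaled by 1000.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X * 1000, y)

# The split thresholds move, but the predictions stay identical,
# because scaling preserves the ordering of feature values.
same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X * 1000))
```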
Feature scaling
The most common scaling methods:
- MinMaxScaler: X = (X - X.min()) / (X.max() - X.min())
- StandardScaler: X = (X - X.mean()) / X.std()
Why scaling matters:
- the impact of regularization turns out to be proportional to feature scale;
- gradient descent methods can go crazy without proper scaling;
- different feature scalings result in different model quality.
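The two scalers above can be sketched with scikit-learn on a toy column (the numbers are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

# MinMaxScaler: (X - X.min()) / (X.max() - X.min()) -> range [0, 1]
mm = MinMaxScaler().fit_transform(X)

# StandardScaler: (X - X.mean()) / X.std() -> zero mean, unit variance
ss = StandardScaler().fit_transform(X)
```

Note that MinMaxScaler squeezes everything into [0, 1], so a single outlier (here the 10.0) compresses the useful range of the remaining values.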
outliers
- Outliers may appear in feature values as well as in target values.
- An effective treatment: clip feature values between two chosen lower and upper bounds, e.g. some percentiles of that feature (also known as winsorization).
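Percentile clipping can be sketched with numpy (the array and the 1st/99th percentile bounds are illustrative choices, not prescribed by the notes):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an outlier

# Choose lower/upper bounds as the 1st and 99th percentiles,
# then clip everything outside that range.
lo, hi = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lo, hi)
```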
rank transformation
- Can be a better option than MinMaxScaler if we have outliers, because the rank transformation moves the outliers closer to the other objects.
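A minimal sketch of the rank transformation using `scipy.stats.rankdata` (the sample values are made up): after ranking, the extreme value is only one step away from its nearest neighbour.

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([10.0, 20.0, 30.0, 10000.0])  # 10000.0 is an outlier

# Replace each value by its rank; ties get averaged ranks.
ranks = rankdata(x)  # -> [1. 2. 3. 4.]
```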
log transformation
- Drives too-big values closer to the feature's average value.
- Common variants: np.log(1 + x), np.sqrt(x + 2/3)
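The log variant can be sketched with `np.log1p`, which computes log(1 + x) and is numerically safe for x = 0 (the sample values are illustrative):

```python
import numpy as np

x = np.array([1.0, 10.0, 1000.0])

# log(1 + x): large values are compressed far more than small ones,
# so the ratio between extreme and typical values shrinks.
x_log = np.log1p(x)
```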
Data blending
- Concatenate feature sets produced by different preprocessings;
- or train models on differently preprocessed data and mix them.
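The first option, concatenating differently preprocessed copies of the features, can be sketched like this (toy data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [50.0]])

# Stack a min-max-scaled copy next to a standardized copy so a single
# model can use whichever representation works best.
X_concat = np.hstack([MinMaxScaler().fit_transform(X),
                      StandardScaler().fit_transform(X)])
# X_concat has one column per preprocessing, i.e. shape (3, 2).
```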
Finally, a note on feature generation
- Definition: creating new features using knowledge about the features and the task.
- Effective feature generation relies on creativity and data understanding.
- Main sources: 1. prior knowledge, 2. EDA
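A minimal sketch of prior-knowledge-driven feature generation, on a hypothetical housing table (the column names and numbers are invented for illustration): domain knowledge suggests price per square metre is often more informative than price and area separately.

```python
import pandas as pd

# Hypothetical data: raw price and area columns.
df = pd.DataFrame({"price": [300000.0, 450000.0],
                   "area_m2": [100.0, 150.0]})

# New feature derived from prior knowledge about real estate.
df["price_per_m2"] = df["price"] / df["area_m2"]
```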