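Resource overview
Complete code for the data preprocessing part of the Kaggle House Prices competition, with very detailed comments. It follows a classic data-mining preprocessing workflow.
Code snippet and file information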
# preprocessing for training & test data
# @2016.11.08
import pandas as pd
# step 1: reading csv data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# train.head()   # take a brief look at training data
all_data = pd.concat((train.loc[:, 'MSSubClass':'SaleCondition'],
                      test.loc[:, 'MSSubClass':'SaleCondition']))  # concat training & test data
import numpy as np
from scipy.stats import skew
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
# step 2: log transform for training data (including the labels)
'''  a png for labels' distribution
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price": train["SalePrice"], "log(price + 1)": np.log1p(train["SalePrice"])})
prices.hist()
plt.savefig('label_dist.png', dpi=150)
'''
train["SalePrice"] = np.log1p(train["SalePrice"])  # log transform the target
# log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index   # get the index of all the n
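
The snippet is cut off at this point, so the rest of the skew-handling step is not shown. As a minimal sketch of how such a step is commonly written (not necessarily the author's code), the example below measures the skewness of each numeric column and applies log1p to the strongly skewed ones. The helper name log_transform_skewed, the 0.75 threshold, and the toy columns are assumptions made for illustration. It selects numeric columns with select_dtypes, which has the same intent as the dtype filter above.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew


def log_transform_skewed(df, threshold=0.75):
    """Return (copy of df with log1p applied to skewed numeric columns, their names).

    threshold is a hypothetical cutoff, not a value taken from the original code.
    """
    out = df.copy()
    # Numeric columns only; text columns are left untouched
    numeric_cols = out.select_dtypes(include=[np.number]).columns
    # Sample skewness per column, ignoring missing values
    skewness = out[numeric_cols].apply(lambda col: skew(col.dropna()))
    skewed_cols = skewness[skewness > threshold].index
    # log1p(x) = log(1 + x), which is defined at x = 0
    out[skewed_cols] = np.log1p(out[skewed_cols])
    return out, list(skewed_cols)


# Toy data standing in for all_data (values are made up)
demo = pd.DataFrame({
    "LotArea": [1.0, 2.0, 3.0, 4.0, 1000.0],                 # strongly right-skewed
    "YearBuilt": [1990.0, 1995.0, 2000.0, 2005.0, 2010.0],   # symmetric
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave"],      # non-numeric
})
transformed, skewed_cols = log_transform_skewed(demo)
print(skewed_cols)
```

Because the target was transformed with np.log1p, predictions made on that scale can be mapped back to prices with np.expm1.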