Resource overview
Xidian University data mining course project: retail store sales data analysis.

Code snippet and file information
# -*- coding: utf-8 -*-
"""
Created on Sat Aug 25 13:45:40 2018
@author: Pratik
"""
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
# Combine the train and test data so feature engineering is applied to both
train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train, test], ignore_index=True)
print('--------------------------------------------------------------')
print(train.shape, test.shape, data.shape)
print('--------------------------------------------------------------\n')
# The problem is already defined: we need to predict sales by store
data.info()
data.describe()
# Some observations
# 1. Item_Visibility has a minimum value of 0, which is implausible
# 2. Outlet_Establishment_Year would be more useful if converted to the age of the outlet
# Let's check how many unique values each column has
data.apply(lambda x: len(x.unique()))
# Let us have a look at the object datatype columns
for i in train.columns:
    if train[i].dtype == 'object':
        print(train[i].value_counts())
        print('--------------------------------------------\n')
        print('--------------------------------------------')
# The output gives us the following observations:
# Item_Fat_Content: Some 'Low Fat' values are mis-coded as 'low fat' and 'LF'. Also, some 'Regular' values appear as 'regular'.
# Item_Type: Not all categories have substantial counts. Combining some of them may give better results.
# Outlet_Type: Supermarket Type2 and Type3 could be combined, but we should check whether that is a good idea before doing it.
# Missing value percentage
# Item_Weight and Outlet_Size have some missing values
print('--------------------------------------------')
print('missing value percentage:')
print((data[data['Item_Weight'].isnull()].shape[0] / data.shape[0]) * 100)
print((data[data['Outlet_Size'].isnull()].shape[0] / data.shape[0]) * 100)
print('--------------------------------------------\n')
# Impute missing values: mean for Item_Weight, most frequent value for Outlet_Size
data['Item_Weight'] = data['Item_Weight'].fillna(data['Item_Weight'].mean())
# data['Outlet_Size'] = data['Outlet_Size'].fillna(data['Outlet_Size'].mode())
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
data['Outlet_Size'] = data['Outlet_Size'].fillna(data['Outlet_Size'].mode()[0])
# Replace Item_Visibility values of 0 with the column mean, since 0 is implausible
data['Item_Visibility'] = data['Item_Visibility'].replace(
    0, data['Item_Visibility'].mean())
# We will calculate meanRatio for each object's visibility
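
The cleaning steps noted in the snippet's comments (inconsistent Item_Fat_Content labels, missing Item_Weight values, zero Item_Visibility) can be tried on a small hand-made frame. The following is a minimal sketch, not the code from BigMart.py: the toy values, the helper names clean_bigmart_frame and missing_percentage, and the label mapping FAT_CONTENT_MAP are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping, based only on the mis-coded labels named in the comments.
FAT_CONTENT_MAP = {"LF": "Low Fat", "low fat": "Low Fat", "regular": "Regular"}


def clean_bigmart_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df; the input frame is left untouched."""
    out = df.copy()
    # Unify the mis-coded fat-content labels.
    out["Item_Fat_Content"] = out["Item_Fat_Content"].replace(FAT_CONTENT_MAP)
    # Fill missing weights with the column mean, as the snippet does.
    out["Item_Weight"] = out["Item_Weight"].fillna(out["Item_Weight"].mean())
    # Replace a visibility of exactly 0 with the column mean, as the snippet does.
    out["Item_Visibility"] = out["Item_Visibility"].replace(
        0, out["Item_Visibility"].mean())
    return out


def missing_percentage(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column."""
    return df.isnull().mean() * 100


# Toy data (assumed values, not taken from Train.csv or Test.csv).
toy = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "LF", "low fat", "Regular", "regular"],
    "Item_Weight": [10.0, np.nan, 14.0, 12.0, np.nan],
    "Item_Visibility": [0.05, 0.0, 0.10, 0.0, 0.05],
})
cleaned = clean_bigmart_frame(toy)
print(missing_percentage(toy))
print(cleaned)
```

Note that the mean used for Item_Visibility still includes the zero entries, which mirrors the snippet; computing the mean over non-zero rows only would be a different design choice.
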
 Attribute        Size  Date        Time   Name
 ---------  ----------  ----------  -----  ----
 Directory           0  2018-10-17  20:59  bigmart-master\
 File             1203  2018-09-07  02:49  bigmart-master\.gitignore
 File           181844  2018-10-18  22:29  bigmart-master\alg0.csv
 File           112910  2018-10-17  21:32  bigmart-master\alg1.csv
 File           179127  2018-10-17  21:32  bigmart-master\alg2.csv
 File           177867  2018-10-17  21:32  bigmart-master\alg3.csv
 File           178794  2018-10-17  21:32  bigmart-master\alg6.csv
 File             9337  2018-10-18  22:29  bigmart-master\BigMart.py
 File               37  2018-09-07  02:49  bigmart-master\README.md
 File           527709  2018-09-07  02:49  bigmart-master\Test.csv
 File           965049  2018-10-18  22:29  bigmart-master\test_modified.csv
 File           869537  2018-09-07  02:49  bigmart-master\Train.csv
 File          1534109  2018-10-18  22:29  bigmart-master\train_modified.csv
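
The listing contains train_modified.csv and test_modified.csv alongside the raw Train.csv and Test.csv, and the snippet tags rows with a source column before concatenating them. The part of BigMart.py that writes the modified files is not shown, so the following is only a sketch of how a tagged frame can be split back into its two parts. The helper names combine_with_source and split_by_source, the toy values, and the stand-in target column Sales are assumptions.

```python
import pandas as pd


def combine_with_source(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Tag each frame with its origin and stack them into one frame."""
    tagged_train = train.assign(source="train")
    tagged_test = test.assign(source="test")
    return pd.concat([tagged_train, tagged_test], ignore_index=True)


def split_by_source(data: pd.DataFrame):
    """Undo combine_with_source using the 'source' tag."""
    is_train = data["source"] == "train"
    train = data[is_train].drop(columns="source").reset_index(drop=True)
    test = data[~is_train].drop(columns="source").reset_index(drop=True)
    return train, test


# Toy frames (assumed values); 'Sales' stands in for the target column,
# which only the training data has.
toy_train = pd.DataFrame({"Item_Weight": [9.3, 5.9], "Sales": [3735.1, 443.4]})
toy_test = pd.DataFrame({"Item_Weight": [20.75]})

combined = combine_with_source(toy_train, toy_test)
train_back, test_back = split_by_source(combined)
# After concatenation the target is NaN for every test row, so drop it there.
test_back = test_back.drop(columns="Sales")

print(combined.shape, train_back.shape, test_back.shape)
```
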