- Size: 19.8MB
- File type: .zip
- Coins: 1
- Downloads: 0
- Published: 2023-07-03
- Language: Python
- Tags: data mining
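Resource description
This archive is a PyCharm project. The movie folder contains the MovieLens dataset (100k ratings). The code targets Python 3.6 and is commented in detail. Feel free to study it together.
Code snippet and file information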
import os
import pandas as pd
from collections import defaultdict
from operator import itemgetter
import sys
data_folder = os.path.join(os.path.expanduser("~"), "Data", "movie")
ratings_filename = os.path.join(data_folder, "u.data")
# The first line of the file is already data, so set the column names manually
all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None,
                          names=["UserID", "MovieID", "Rating", "Datetime"])
print(all_ratings[:5])
# Create a new feature: a user is considered to like a movie if the rating is greater than 3
all_ratings["Favorable"] = all_ratings["Rating"] > 3
print(all_ratings[10:15])
# Take the ratings from users with a UserID below 200 (IDs start at 1, so this selects 199 users)
ratings = all_ratings[all_ratings["UserID"].isin(range(200))]
favorable_ratings = ratings[ratings["Favorable"]]
# The set of movies each user likes
favorable_reviews_by_users = dict((k, frozenset(v.values)) for k, v in favorable_ratings.groupby("UserID")["MovieID"])
# The number of users who like each movie
num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()
# The five most popular movies
print(num_favorable_by_movie.sort_values(by="Favorable", ascending=False).head())
# Implement the Apriori algorithm
# Frequent itemsets, keyed by itemset length
frequent_itemsets = {}
# Minimum support an itemset needs to count as frequent
min_support = 50
# Build a one-movie itemset for each movie and keep it if it is frequent
frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
                            for movie_id, row in num_favorable_by_movie.iterrows()
                            if row["Favorable"] > min_support)
print("There are {} movies with more than {} favorable reviews".format(len(frequent_itemsets[1]), min_support))
sys.stdout.flush()
# Take the newly found frequent itemsets, build their supersets, and test how frequent those are
def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    # Iterate over all users and their favorable reviews
    for user, reviews in favorable_reviews_by_users.items():
        # Collect this user's supersets in a set so that each one is counted once per user
        user_supersets = set()
        # Check whether each previously found itemset is a subset of the user's reviews; if so, the user likes every movie in it
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                # Extend the itemset with each movie the user likes that is not already in it
                for other_reviewed_movie in reviews - itemset:
                    user_supersets.add(itemset | frozenset((other_reviewed_movie,)))
        for current_superset in user_supersets:
            counts[current_superset] += 1
    # Return only the itemsets whose count reaches the minimum support
    return dict([(itemset, frequency) for itemset, frequency in counts.items() if frequency >= min_support])
# Run the Apriori algorithm
for k in range(2, 20):
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users, frequent_itemsets[k - 1],
                                                   min_support)
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()  # Flush the buffer so the output appears immediately while the code is still running
        break
    else:
        print("Found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets
# Delete the itemsets of length 1
del frequent_itemsets[1]
# Generate rules from each itemset: if a user likes all the movies in the premise, they will also like the movie in the conclusion
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        # Iterate over each
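The Apriori steps in the snippet above can be tried without the MovieLens files. What follows is a minimal sketch, not the author's code: the toy ratings, the threshold of 3, and the helper rule_confidence are all hypothetical. Because the rule-generation loop is cut off in the source, the rule part here only illustrates one common approach (each movie in a frequent itemset takes a turn as the conclusion, and confidence is the share of users matching the premise who also like the conclusion). Supersets are collected per user in a set so that a user adds at most 1 to the support of any itemset.

```python
from collections import defaultdict

# Hypothetical toy data: user ID -> set of movie IDs the user liked
favorable_reviews_by_users = {
    1: frozenset({10, 20, 30}),
    2: frozenset({10, 20}),
    3: frozenset({10, 30}),
    4: frozenset({20, 30}),
    5: frozenset({10, 20, 30}),
}
min_support = 3  # assumed threshold for the toy data


def find_frequent_itemsets(reviews_by_user, k_1_itemsets, min_support):
    """Extend each frequent (k-1)-itemset by one movie and keep the frequent results."""
    counts = defaultdict(int)
    for reviews in reviews_by_user.values():
        # A set ensures each superset is counted once per user
        supersets = set()
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_movie in reviews - itemset:
                    supersets.add(itemset | frozenset((other_movie,)))
        for superset in supersets:
            counts[superset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def rule_confidence(premise, conclusion, reviews_by_user):
    """Share of users liking every premise movie who also like the conclusion."""
    applies = [r for r in reviews_by_user.values() if premise.issubset(r)]
    if not applies:
        return 0.0
    return sum(1 for r in applies if conclusion in r) / len(applies)


# Level 1: one-movie itemsets
single_counts = defaultdict(int)
for reviews in favorable_reviews_by_users.values():
    for movie in reviews:
        single_counts[frozenset((movie,))] += 1
frequent_itemsets = {1: {s: c for s, c in single_counts.items() if c >= min_support}}

# Level 2 and up: stop when a level yields no frequent itemsets
k = 2
while True:
    found = find_frequent_itemsets(favorable_reviews_by_users,
                                   frequent_itemsets[k - 1], min_support)
    if not found:
        break
    frequent_itemsets[k] = found
    k += 1

# Candidate rules: each movie of an itemset (length >= 2) becomes the conclusion once
candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    if itemset_length < 2:
        continue
    for itemset in itemset_counts:
        for conclusion in itemset:
            candidate_rules.append((itemset - {conclusion}, conclusion))

for premise, conclusion in candidate_rules:
    confidence = rule_confidence(premise, conclusion, favorable_reviews_by_users)
    print("{} -> {}: confidence {:.2f}".format(sorted(premise), conclusion, confidence))
```

On this toy data every pair of movies is liked by exactly three users, so all three pairs are frequent, while the triple is liked by only two users and is dropped.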
  Attribute      Size  Date       Time   Name
----------- ---------  ---------- -----  ----
  Directory         0  2018-01-11 15:54  movie_pro\
  Directory         0  2018-01-11 15:53  movie_pro\.idea\
  Directory         0  2018-01-11 10:32  movie_pro\.idea\libraries\
       File       128  2018-01-11 10:32  movie_pro\.idea\libraries\R_User_Library.xm
       File       185  2018-01-11 10:34  movie_pro\.idea\misc.xm
       File       270  2018-01-11 10:30  movie_pro\.idea\modules.xm
       File       700  2018-01-11 10:34  movie_pro\.idea\movie_pro.iml
       File     12020  2018-01-11 15:53  movie_pro\.idea\workspace.xm
  Directory         0  2018-01-11 15:54  movie_pro\movie\
       File      7729  2018-01-11 15:49  movie_pro\movie.py
       File       716  2000-07-20 05:09  movie_pro\movie\allbut.pl
       File       643  2000-07-20 05:09  movie_pro\movie\mku.sh
       File      6750  2016-01-30 04:26  movie_pro\movie\README
       File   1979173  2000-07-20 05:09  movie_pro\movie\u.data
       File       202  2000-07-20 05:09  movie_pro\movie\u.genre
       File        36  2000-07-20 05:09  movie_pro\movie\u.info
       File    236344  2000-07-20 05:09  movie_pro\movie\u.item
       File       193  2000-07-20 05:09  movie_pro\movie\u.occupation
       File     22628  2000-07-20 05:09  movie_pro\movie\u.user
       File   1586544  2001-03-09 02:33  movie_pro\movie\u1.ba
       File    392629  2001-03-09 02:32  movie_pro\movie\u1.test
       File   1583948  2001-03-09 02:33  movie_pro\movie\u2.ba
       File    395225  2001-03-09 02:33  movie_pro\movie\u2.test
       File   1582546  2001-03-09 02:33  movie_pro\movie\u3.ba
       File    396627  2001-03-09 02:33  movie_pro\movie\u3.test
       File   1581878  2001-03-09 02:33  movie_pro\movie\u4.ba
       File    397295  2001-03-09 02:33  movie_pro\movie\u4.test
       File   1581776  2001-03-09 02:34  movie_pro\movie\u5.ba
       File    397397  2001-03-09 02:33  movie_pro\movie\u5.test
       File   1792501  2001-03-09 02:34  movie_pro\movie\ua.ba
       File    186672  2001-03-09 02:34  movie_pro\movie\ua.test
............ 696 more file entries omitted here