- Size: 1.54 MB  File type: .zip  Coins: 2  Downloads: 1  Published: 2023-09-02
- Language: Database
- Tags: DataFrame, spark, sql, python
Resource description
(1) Create an RDD.
(2) Convert the RDD to a DataFrame.
(3) Call registerTempTable to register it as a table named tb_book.
(4) Use an SQL statement to query the first 15 rows.
(5) Do a fuzzy query for books whose title contains "微积分" (calculus).
(6) Output the name and price fields of the first 10 books.
(7) Count the books whose title contains "微积分".
(8) Query books with a rating above 9, showing only the first 10 rows.
(9) Compute the average rating of all books whose title contains "微积分".
(10) Sort the books by rating from high to low, showing only the first 15 rows.
(11) Group the books by publisher and count the number of books per publisher.
(12) Save the records of books whose title contains "微积分" to local disk or HDFS in CSV format, with the file named 学号.csv (student ID.csv).
(13) Then load that CSV file, create a DataFrame from it, and query and display the result.
(A DataFrame-API sketch of several of these steps follows this list, as an alternative to the SQL approach used in the snippet below.)
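For reference, the same operations can also be written with the DataFrame API instead of SQL strings. The sketch below assumes the DataFrame df and the column names (书名, 价格, 评分, 出版社) from the code snippet further down; it is an illustrative alternative, not part of the uploaded code, and since every column is loaded as a string the numeric comparisons rely on Spark's implicit casts.

from pyspark.sql import functions as F

# df is assumed to be the DataFrame built in step (2) of the snippet below
calc = df.filter(F.col("书名").contains("微积分"))  # (5) titles containing "微积分"
df.select("书名", "价格").show(10)                   # (6) name and price, first 10 rows
print(calc.count())                                   # (7) number of matching books
df.filter(F.col("评分") > 9).show(10)                 # (8) rating above 9, first 10 rows
calc.agg(F.avg("评分")).show()                        # (9) average rating of matching books
df.orderBy(F.col("评分").desc()).show(15)             # (10) top 15 books by rating
df.groupBy("出版社").count().show()                   # (11) book count per publisher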

Code snippet and file information
from pyspark.shell import sc
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# (1) Create an RDD from the raw text file
rdd = sc.textFile("xxxxx.txt")
# NOTE: the field delimiter was garbled in the upload; a comma is assumed below
header = rdd.first().split(",")
header1 = rdd.first()
print(header)

# Build a schema from the six header fields, all typed as strings
schema = StructType([
    StructField(header[0], StringType(), True),
    StructField(header[1], StringType(), True),
    StructField(header[2], StringType(), True),
    StructField(header[3], StringType(), True),
    StructField(header[4], StringType(), True),
    StructField(header[5], StringType(), True),
])

def filter_line(line):
    # Keep every line except the header row
    return line != header1

# (2) Convert the RDD to a DataFrame: drop the header, split each line into a tuple
rdd1 = rdd.filter(lambda line: line != header1).map(lambda line: line.split(",")).map(lambda x: tuple(x))
df = rdd1.toDF(schema)
# df.show()

# (3) Register the DataFrame as a temporary table named tb_book
df.registerTempTable("tb_book")

# (4) First 15 rows
# spark.sql("select * from tb_book").show(15)
# (5) Books whose title contains "微积分"
# spark.sql("select * from tb_book where `书名` like '%微积分%'").show()
# (6) Name and price of the first 10 books
# spark.sql("select `书名`, `价格` from tb_book").show(10)
# (7) Number of books whose title contains "微积分"
# number = spark.sql("select * from tb_book where `书名` like '%微积分%'").count()
# print(number)
# (8) Books rated above 9, first 10 rows
# spark.sql("select * from tb_book where `评分` > 9").show(10)
# (9) Average rating of books whose title contains "微积分"
# spark.sql("select avg(`评分`) from tb_book where `书名` like '%微积分%'").show()
# (10) Books sorted by rating, high to low, first 15 rows
# spark.sql("select * from tb_book order by `评分` desc").show(15)
# (11) Number of books per publisher
# group = spark.sql("select `出版社`, count(`出版社`) from tb_book group by `出版社`").collect()
# print(group)
# (12) Save the "微积分" records as CSV (a Parquet write is also kept for reference)
# ddf = spark.sql("select * from tb_book where `书名` like '%微积分%'")
# ddf.write.parquet("/home/zhuang/138/tb_book")
# ddf.write.format("csv").save("/home/zhuang/138/16034460138.csv")
# (13) Reload the saved CSV, rebuild a DataFrame, and query it
# userDF = spark.read.format("csv").load("/home/zhuang/138/16034460138.csv/part-00000-e2c9db96-961e-45d7-8221-c2ff5e90d174-c000.csv")
# userDF.printSchema()
# userDF.show()
# rdd_1 = sc.textFile("/home/zhuang/138/16034460138.csv/part-00000-e2c9db96-961e-45d7-8221-c2ff5e90d174-c000.csv")
# rdd_2 = rdd_1.map(lambda line: line.split(",")).map(lambda x: tuple(x))
# dfff = rdd_2.toDF(schema)
# dfff.select("书名").show()
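The reload in the last commented block above goes through sc.textFile on a hard-coded part file. A simpler route, sketched here under the assumption that ddf and spark are the objects from the snippet (the path is the one from the snippet; the options are standard Spark CSV options, not taken from the upload), is to write the CSV with a header row and read the whole output directory back:

# Write the filtered records with a header, overwriting any previous output,
# then read the directory back without naming a specific part file
ddf.write.mode("overwrite").option("header", True).csv("/home/zhuang/138/16034460138.csv")
reloaded = spark.read.option("header", True).csv("/home/zhuang/138/16034460138.csv")
reloaded.printSchema()
reloaded.select("书名").show()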
 Attribute        Size        Date     Time   Name
-----------  ----------  ----------  -----  ----
      File      5526130  2019-05-15  02:46  book.txt
      File         2191  2019-05-16  00:54  test2.py