Spark SQL text processing statements
Splitting long text on spaces
A DataFrame clauses_df with 100 rows is provided. It has a clause column and a row id; each clause is a string of one or more words separated by spaces.
Split the clause column into a column called words, containing an array of individual words.
from pyspark.sql.functions import split, explode

split_df = clauses_df.select(split('clause', ' ').alias('words'))
Explode the words column into a column called word.
exploded_df = split_df.select(explode('words').alias('word'))
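To see the two steps end to end, here is a minimal, self-contained sketch; demo_df and its two clauses are made-up stand-ins for clauses_df:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.getOrCreate()

# Two made-up clauses standing in for clauses_df
demo_df = spark.createDataFrame(
    [(0, 'to be or not'), (1, 'that is the question')],
    ['id', 'clause'])

# split -> array of words, explode -> one row per word
demo_df.select(explode(split('clause', ' ')).alias('word')).show()
# yields one row each for: to, be, or, not, that, is, the, question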
Repartition: changing the program's parallelism
# 12 partitions, hashing rows by the chapter column so that rows from
# the same chapter land in the same partition
repart_df = text_df.repartition(12, 'chapter')
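The SQL snippets below run against a temp view named text. How that view is created is not shown in the original; the following sketch is one plausible wiring, and the assumption that repart_df carries the word, id, part, and chapter columns the queries reference is mine:

# Confirm the new parallelism after repartition(12, 'chapter')
print(repart_df.rdd.getNumPartitions())  # 12

# Hypothetical wiring: expose the repartitioned frame to Spark SQL
repart_df.createOrReplaceTempView('text')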
Counting tuple frequencies and sorting from largest to smallest
Our objective is to build a dataset in which each row corresponds to a 5-tuple of words, with a count of how many times that tuple occurs in the dataset.
query = """ SELECT w1, w2, w3, w4, w5, COUNT(*) AS count FROM ( SELECT word AS w1, LEAD(word,2) OVER(partition by part order by id ) AS w2, LEAD(word,1) OVER(partition by part order by id )AS w3, LAG(word,1) OVER(partition by part order by id ) AS w4, LAG(word,2) OVER(partition by part order by id ) AS w5 FROM text ) GROUP BY w1, w2, w3, w4, w5 ORDER BY count DESC LIMIT 10 """ df = spark.sql(query) df.show()
Retrieving the row with the highest count per group, with a cross-referenced subquery
# Count 3-tuple frequencies and sort them within each chapter
subquery = """
SELECT chapter, w1, w2, w3, COUNT(*) AS count
FROM (
    SELECT chapter,
           word AS w1,
           LEAD(word, 1) OVER (PARTITION BY chapter ORDER BY id) AS w2,
           LEAD(word, 2) OVER (PARTITION BY chapter ORDER BY id) AS w3
    FROM text
)
GROUP BY chapter, w1, w2, w3
ORDER BY chapter, count DESC
"""
# Retrieve the record with the highest count within each group:
# the most frequent 3-tuple per chapter. The subquery above is
# cross-referenced into the outer query via %s.
query = """
SELECT chapter, w1, w2, w3, count
FROM (
    SELECT chapter,
           ROW_NUMBER() OVER (PARTITION BY chapter ORDER BY count DESC) AS row,
           w1, w2, w3, count
    FROM ( %s )
)
WHERE row = 1
ORDER BY chapter ASC
""" % subquery

spark.sql(query).show(3)
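For comparison, a sketch of the same top-1-per-group pattern in the DataFrame API; counts_df here is simply the result of running subquery:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

counts_df = spark.sql(subquery)  # chapter, w1, w2, w3, count

# Rank rows within each chapter by descending count, keep rank 1
rank_w = Window.partitionBy('chapter').orderBy(F.desc('count'))
top_df = (counts_df
          .withColumn('row', F.row_number().over(rank_w))
          .where(F.col('row') == 1)
          .drop('row')
          .orderBy('chapter'))
top_df.show(3)

Note that ROW_NUMBER (and row_number here) breaks ties arbitrarily; if all tied tuples should be kept, RANK is the usual substitute.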