Spark SQL text-processing statements

Splitting long text on spaces

A DataFrame clauses_df with 100 rows is provided. It has a clause column and a row id column. Each clause is a string containing one or more words separated by spaces.
Split the clause column into a column called words, containing an array of the individual words.

from pyspark.sql.functions import split, explode

# Split each clause string into an array of words
split_df = clauses_df.select(split('clause', ' ').alias('words'))

Explode the words column into a column called word.

# Explode the array so that each word becomes its own row
exploded_df = split_df.select(explode('words').alias('word'))
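
Since the window queries further down order words by id, it helps to carry the id column through the split-and-explode step. A minimal sketch, assuming the same clauses_df as above (words_with_id_df is a name introduced here for illustration):

from pyspark.sql.functions import explode, split

# Keep the row id alongside each exploded word so later windows can ORDER BY id
words_with_id_df = clauses_df.select(
    'id',
    explode(split('clause', ' ')).alias('word'))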

Repartition: adjusting the job's parallelism

# Repartition into 12 partitions, hashing on the chapter column
repart_df = text_df.repartition(12, 'chapter')
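
As a quick sanity check, the resulting number of partitions can be inspected on the underlying RDD:

repart_df.rdd.getNumPartitions()  # expected to return 12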

Counting tuple frequencies and sorting from highest to lowest

Our objective is to create a dataset where each row corresponds to a 5-tuple of consecutive words, with a count indicating how many times that tuple occurs in the dataset.
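
The SQL below reads from a temporary view named text with a word column, an id column, and a grouping column (part here, chapter in the later query). As a sketch, assuming words_df (a hypothetical name) is a DataFrame holding those columns, the view can be registered like this:

# Register the exploded-words DataFrame under the name used in the SQL queries
words_df.createOrReplaceTempView('text')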

query = """
SELECT w1, w2, w3, w4, w5, COUNT(*) AS count FROM (
   SELECT word AS w1,
   LEAD(word,2) OVER(partition by part order by id ) AS w2,
   LEAD(word,1) OVER(partition by part order by id )AS w3,
   LAG(word,1) OVER(partition by part order by id ) AS w4,
   LAG(word,2) OVER(partition by part order by id ) AS w5
   FROM text
)
GROUP BY w1, w2, w3, w4, w5
ORDER BY count DESC
LIMIT 10 """
df = spark.sql(query)
df.show()
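
For comparison, a roughly equivalent formulation with the DataFrame API and window functions might look like the following sketch (assuming a words_df with word, id, and part columns; this is not part of the original snippet):

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, lead

# Window over each part, in reading order
w = Window.partitionBy('part').orderBy('id')

# Build the 5-tuple of consecutive words around each word
tuple_df = words_df.select(
    lag('word', 2).over(w).alias('w1'),
    lag('word', 1).over(w).alias('w2'),
    col('word').alias('w3'),
    lead('word', 1).over(w).alias('w4'),
    lead('word', 2).over(w).alias('w5'))

(tuple_df.groupBy('w1', 'w2', 'w3', 'w4', 'w5')
         .count()                      # adds a 'count' column
         .orderBy(col('count').desc())
         .show(10))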

Retrieving the single highest-count record per group by nesting one query inside another

# Count 3-tuple frequencies and sort within each chapter
subquery = """
SELECT chapter, w1, w2, w3, COUNT(*) as count
FROM
(
    SELECT
    chapter,
    word AS w1,
    LEAD(word, 1) OVER(PARTITION BY chapter ORDER BY id ) AS w2,
    LEAD(word, 2) OVER(PARTITION BY chapter ORDER BY id ) AS w3
    FROM text
)
GROUP BY chapter, w1, w2, w3
ORDER BY chapter, count DESC
"""
# Get the highest-count record within each group
#   Most frequent 3-tuple per chapter
query = """
SELECT chapter, w1, w2, w3, count FROM
(
  SELECT
  chapter,
  ROW_NUMBER() OVER (PARTITION BY chapter ORDER BY count DESC) AS row,
  w1, w2, w3, count
  FROM ( %s )
)
WHERE row = 1
ORDER BY chapter ASC
""" % subquery #交叉引用

spark.sql(query).show(3)
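
The same "top record per chapter" selection can also be written with the DataFrame API instead of nesting SQL strings; a sketch, reusing the subquery defined above:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

per_chapter = Window.partitionBy('chapter').orderBy(col('count').desc())

top_per_chapter_df = (spark.sql(subquery)
    .withColumn('row', row_number().over(per_chapter))  # rank tuples within each chapter
    .filter(col('row') == 1)                            # keep only the most frequent tuple
    .drop('row')
    .orderBy('chapter'))

top_per_chapter_df.show(3)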