SparkSQL文本处理语句

将长文本按空格进行划分

A dataframe clauses_df with 100 rows is provided. It has a column clause and a row id. Each clause is a string containing one or more words separated by spaces.
Split the clause column into a column called words, containing an array of individual words.

split_df = clauses_df.select(split('clause', ' ').alias('words'))

Explode the words column into a column called word.

exploded_df = split_df.select(explode('words').alias('word'))

Repartition 修改程序并行度

repart_df = text_df.repartition(12, 'chapter')

词频统计并从大到小排序

Our objective is to create a dataset where each row corresponds to a 5-tuple, having a count indicating how many times the tuple occurred in the dataset.

query = """
SELECT w1, w2, w3, w4, w5, COUNT(*) AS count FROM (
   SELECT word AS w1,
   LEAD(word,2) OVER(partition by part order by id ) AS w2,
   LEAD(word,1) OVER(partition by part order by id )AS w3,
   LAG(word,1) OVER(partition by part order by id ) AS w4,
   LAG(word,2) OVER(partition by part order by id ) AS w5
   FROM text
)
GROUP BY w1, w2, w3, w4, w5
ORDER BY count DESC
LIMIT 10 """
df = spark.sql(query)
df.show()

获取词频数最高的一行记录、交叉引用

#统计词频数并在组内进行排序
subquery = """
SELECT chapter, w1, w2, w3, COUNT(*) as count
FROM
(
    SELECT
    chapter,
    word AS w1,
    LEAD(word, 1) OVER(PARTITION BY chapter ORDER BY id ) AS w2,
    LEAD(word, 2) OVER(PARTITION BY chapter ORDER BY id ) AS w3
    FROM text
)
GROUP BY chapter, w1, w2, w3
ORDER BY chapter, count DESC
"""
#获取每组内词频数最大的对应的记录
#   Most frequent 3-tuple per chapter
query = """
SELECT chapter, w1, w2, w3, count FROM
(
  SELECT
  chapter,
  ROW_NUMBER() OVER (PARTITION BY chapter ORDER BY count DESC) AS row,
  w1, w2, w3, count
  FROM ( %s )
)
WHERE row = 1
ORDER BY chapter ASC
""" % subquery #交叉引用

spark.sql(query).show(3)
全部评论

相关推荐

找工作勤劳小蜜蜂:自我描述部分太差,完全看不出想从事什么行业什么岗位,也看不出想在哪个地区发展,这样 会让HR很犹豫,从而把你简历否决掉。现在企业都很注重员工稳定性和专注性,特别对于热爱本行业的员工。 你实习的工作又太传统的it开发(老旧),这部分公司已经趋于被淘汰,新兴的互联网服务业,比如物流,电商,新传媒,游戏开发和传统的It开发有天然区别。不是说传统It开发不行,而是就业岗位太少,基本趋于饱和,很多老骨头还能坚持,不需要新血液。 工作区域(比如长三角,珠三角,成渝)等也是HR考虑的因素之一,也是要你有个坚定的决心。否则去几天,人跑了,HR会被用人单位骂死。
点赞 评论 收藏
分享
评论
点赞
收藏
分享

创作者周榜

更多
牛客网
牛客网在线编程
牛客网题解
牛客企业服务