What is MapReduce
A plain-language explanation
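Suppose you want to count how many cards are in a deck. The simplest way is to count them one by one yourself.
The MapReduce approach is to deal the cards out to a group of people, have everyone count their share at the same time, and then add the counts together.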
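The card analogy can be sketched in a few lines. This is only an illustration: the 52-card deck, the four people, and the round-robin dealing are assumptions made for the example.

```python
# Hypothetical setup: a 52-card deck dealt round-robin to 4 people.
deck = [f"card-{i}" for i in range(52)]
num_people = 4

# Deal the cards: person p gets every num_people-th card, starting at index p.
hands = [deck[p::num_people] for p in range(num_people)]

# Each person counts only their own hand; no count depends on another.
partial_counts = [len(hand) for hand in hands]

# Adding the partial counts gives the size of the whole deck.
total = sum(partial_counts)
print(partial_counts, total)
```

Each person's count is a partial result, and the final sum plays the role of the reduce step.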
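Divide and conquer
This applies when the computation can be split horizontally into pieces that do not depend on one another.
It does not fit tasks with dependencies, for example when event b can only run after event a has finished (an upstream/downstream relationship).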
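A rough sketch of the difference between independent and dependent work. Threads stand in for cluster machines here, and `count_words`, `step_a`, and `step_b` are hypothetical helpers invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(line: str) -> int:
    """Independent task: the result depends only on this one line."""
    return len(line.split())

lines = [
    "the weather is good",
    "today is good",
    "good weather is good",
    "today has good weather",
]

# Independent tasks can be handed to separate workers and combined afterwards.
with ThreadPoolExecutor(max_workers=4) as pool:
    per_line = list(pool.map(count_words, lines))
total_words = sum(per_line)

# Dependent chain: step_b needs the output of step_a, so the two steps
# must run one after the other and cannot be spread across workers.
def step_a(x: int) -> int:
    return x + 1

def step_b(a_result: int) -> int:
    return a_result * 2

chained = step_b(step_a(3))
```

The per-line counts could be computed in any order, while `step_b` has to wait for `step_a`.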
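What are Map and Reduce
Map: the work is distributed across the machines in the cluster, and the same operation is applied to every input item.
Reduce: the partial results are combined into the final result.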
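The two ideas have small-scale counterparts in Python's built-in `map` and `functools.reduce`. The sketch below runs on one machine and only illustrates the concepts; the list of numbers is made up for the example.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Map: apply the same operation (squaring) to every element independently.
squared = list(map(lambda n: n * n, numbers))

# Reduce: fold the partial results into a single value (here, their sum).
total = reduce(lambda acc, n: acc + n, squared, 0)
print(squared, total)
```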
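- The file is divided into multiple splits, each handed to a Mapper task
- The mapper output is shuffled by key, so that all pairs with the same key end up together
- The grouped pairs are passed to the Reducer tasks
- The results are written to HDFS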
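The four steps above can be sketched as a single-process pipeline. This is not how Hadoop is implemented: `run_job`, `word_mapper`, and `sum_reducer` are hypothetical names, each input line is treated as its own split, and a local temporary file stands in for HDFS.

```python
from collections import defaultdict
import os
import tempfile

def run_job(lines, mapper, reducer, out_path):
    # 1. Split: assumption for this sketch, one split per input line.
    splits = list(enumerate(lines))
    # 2. Map: each split is turned into (key, value) pairs independently.
    mapped = [pair for index, line in splits for pair in mapper(index, line)]
    # 3. Shuffle: group the values so that identical keys end up together.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # 4. Reduce: collapse each key's list of values into one result.
    results = {key: reducer(key, values) for key, values in sorted(groups.items())}
    # 5. Write the output (a local file stands in for HDFS here).
    with open(out_path, "w", encoding="utf-8") as f:
        for key, value in results.items():
            f.write(f"{key}\t{value}\n")
    return results

def word_mapper(index, line):
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    return sum(values)

out_path = os.path.join(tempfile.mkdtemp(), "output.txt")
results = run_job(["a b a", "b c"], word_mapper, sum_reducer, out_path)
print(results)
```

Passing the mapper and reducer in as functions keeps the split, shuffle, and write steps generic, which mirrors the idea that the framework handles those parts while the user supplies only the map and reduce logic.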
Example
Text
the weather is good
today is good
good weather is good
today has good weather
Splitting the input
Split-0: [0, "the weather is good"]
Split-1: [1, "today is good"]
Split-2: [2, "good weather is good"]
Split-3: [3, "today has good weather"]
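A minimal sketch of this step, assuming the input has one sentence per line and that each line becomes its own split keyed by its line index. Treating a line as a split is a simplification made for this example.

```python
text = """the weather is good
today is good
good weather is good
today has good weather"""

# One split per line, keyed by the line index.
splits = list(enumerate(text.splitlines()))

for index, line in splits:
    print(f'Split-{index}: [{index}, "{line}"]')
```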
Mapper output
Mapper-0: ["the", 1], ["weather", 1], ["is", 1], ["good", 1]
Mapper-1: ["today", 1], ["is", 1], ["good", 1]
Mapper-2: ["good", 2], ["weather", 1], ["is", 1]
Mapper-3: ["today", 1], ["has", 1], ["good", 1], ["weather", 1]
Note that Mapper-2 has already added up the two occurrences of "good" in its own split.
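A sketch of a mapper that produces pairs like these, assuming each mapper counts repeated words within its own split before emitting them. The function name `mapper` is hypothetical, and `collections.Counter` is used only as a convenient way to do the local counting.

```python
from collections import Counter

def mapper(line):
    # Count the words of this split only. Counter keeps first-seen order,
    # so the pairs come out in the order the words appear in the line.
    return list(Counter(line.split()).items())

lines = [
    "the weather is good",
    "today is good",
    "good weather is good",
    "today has good weather",
]

mapped = [mapper(line) for line in lines]
for i, pairs in enumerate(mapped):
    print(f"Mapper-{i}: {pairs}")
```

A mapper without local counting would instead emit `("good", 1)` twice for the third line; the final totals come out the same either way.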
Shuffle
["good", {1, 1, 2, 1}]
["has", {1}]
["is", {1, 1, 1}]
["the", {1}]
["today", {1, 1}]
["weather", {1, 1, 1}]
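The grouping can be sketched as follows, taking the four mapper outputs as hard-coded input. Sorting the keys alphabetically is an assumption made here so that the order matches the listing.

```python
from collections import defaultdict

mapper_outputs = [
    [("the", 1), ("weather", 1), ("is", 1), ("good", 1)],
    [("today", 1), ("is", 1), ("good", 1)],
    [("good", 2), ("weather", 1), ("is", 1)],
    [("today", 1), ("has", 1), ("good", 1), ("weather", 1)],
]

# Collect every value under its key, regardless of which mapper emitted it.
groups = defaultdict(list)
for output in mapper_outputs:
    for key, value in output:
        groups[key].append(value)

# Sort by key so that the grouped pairs come out in a stable order.
shuffled = sorted(groups.items())
for key, values in shuffled:
    print(key, values)
```

Each key collects one value per mapper that emitted it, so "today" ends up with two values and "weather" with three.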
Reducer
Reducer-0: ["good", 5]
Reducer-1: ["has", 1]
Reducer-2: ["is", 3]
Reducer-3: ["the", 1]
Reducer-4: ["today", 2]
Reducer-5: ["weather", 3]
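The reduce step for this word count is just a sum per key. A minimal sketch, assuming one reducer per key and taking the grouped values as hard-coded input:

```python
shuffled = [
    ("good", [1, 1, 2, 1]),
    ("has", [1]),
    ("is", [1, 1, 1]),
    ("the", [1]),
    ("today", [1, 1]),
    ("weather", [1, 1, 1]),
]

# Each reducer adds up the values collected for its key.
reduced = [(key, sum(values)) for key, values in shuffled]

for i, (key, total) in enumerate(reduced):
    print(f'Reducer-{i}: ["{key}", {total}]')
```

The totals add up to 15, which is the number of words in the original text.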