Chapter 1 Spark Cluster Mode Overview

References

[2] A brief introduction to how Spark runs in cluster mode (application submission guide)

Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program (the driver program).

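SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster. Executors are processes that run computations and store data for the application. Spark then sends the application code (defined by the JAR) to the executors, and finally SparkContext sends tasks to the executors to run.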

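A few points about this architecture are worth noting:

1. Each application gets its own executor processes, which stay up for the whole lifetime of the application and run tasks in multiple threads. This isolates applications from one another, but it also means data cannot be shared across applications unless it is written to an external storage system.

2. Spark does not depend strongly on the underlying cluster manager. As long as it can acquire executor processes, it can run on a cluster manager that also serves other applications.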

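3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime, so it must be network addressable from the worker nodes (in client mode the driver runs in the submitting client process, so that process must stay up for as long as the application runs).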

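4. Because the driver program schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same local area network. To send requests to the cluster from far away, it is better to open an RPC to the driver and let it submit operations from nearby than to run the driver far from the worker nodes.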

Cluster Manager Types

Three cluster managers are covered here:

Standalone – a simple cluster manager included with Spark.
Apache Mesos – a general-purpose cluster manager.
Hadoop YARN – the resource manager in Hadoop 2.
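
As a rough illustration of how the choice of cluster manager shows up in application code, the sketch below builds the master URL that is handed to SparkContext. The helper names master_url and make_context and the host name master-host are hypothetical; the default ports (7077 for standalone, 5050 for Mesos) and the plain "yarn" master string (Spark 2.0 or later) are stated here as assumptions about a typical setup, not something taken from these notes.

```python
def master_url(manager, host=None, port=None):
    """Return the master URL string for a given cluster manager type."""
    if manager == "standalone":
        # Assumption: 7077 is the usual standalone master port.
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        # Assumption: 5050 is the usual Mesos master port.
        return f"mesos://{host}:{port or 5050}"
    if manager == "yarn":
        # The cluster location is read from the Hadoop configuration.
        return "yarn"
    raise ValueError(f"unknown cluster manager: {manager}")


def make_context(app_name, manager, host=None, port=None):
    """Create a SparkContext for the chosen manager (needs pyspark installed)."""
    from pyspark import SparkConf, SparkContext  # imported lazily on purpose

    conf = SparkConf().setAppName(app_name).setMaster(master_url(manager, host, port))
    return SparkContext(conf=conf)


print(master_url("standalone", "master-host"))
print(master_url("yarn"))
```

Only the master URL changes between managers; the rest of the application code stays the same, which matches the point that Spark does not depend strongly on the cluster manager.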

Submitting Applications

Applications can be submitted to a cluster of any type with the spark-submit script. The application submission guide describes how to do this.

Monitoring

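Each driver program has a web UI, typically on port 4040, that shows information about running tasks, executors, and storage usage. Go to http://<driver-node>:4040 in a web browser to access this UI. The monitoring guide also describes other monitoring options.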
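
The same driver port also serves Spark's monitoring REST API under /api/v1, which can be handy for scripted checks. The following is a minimal sketch, assuming a driver is reachable on port 4040; the function names ui_url and list_applications and the host name driver-host are hypothetical.

```python
import json
from urllib.request import urlopen


def ui_url(driver_node, port=4040):
    """Base URL of the driver's web UI."""
    return f"http://{driver_node}:{port}"


def list_applications(driver_node, port=4040, timeout=5):
    """Fetch the applications known to this driver as parsed JSON.

    Requires a running driver; raises URLError if nothing is listening.
    """
    url = f"{ui_url(driver_node, port)}/api/v1/applications"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


print(ui_url("driver-host"))
```

If several SparkContexts run on the same host, later ones bind to the next free port (4041, 4042, and so on), so the port argument may need adjusting.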

Job Scheduling

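Spark gives control over resource allocation both across applications (at the level of the cluster manager) and within an application (when several computations run on the same SparkContext). The job scheduling overview describes this in more detail.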

Glossary

Some terms used for cluster concepts:

Term	Meaning
Application	User program built on Spark; consists of a driver program and executors on the cluster
Application jar	A jar containing the user's Spark application; it should not include Hadoop or Spark libraries (these are added at runtime)
Driver program	The process that runs the application's main() function and creates the SparkContext
Cluster manager	An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN)
Deploy mode	Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster (on YARN, inside the application master), so the client can disconnect after submitting. In "client" mode, the submitter launches the driver outside of the cluster, and the client process communicates with the executors directly.
Worker node	Any node that can run application code in the cluster
Executor	A process launched for an application on a worker node; it runs tasks and stores data. Executors are not shared between applications.
Task	A unit of work that will be sent to one executor
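Job	A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action. A job contains multiple stages (a stage can contain multiple tasks, and one task usually handles one partition).
Stage	Each job is divided into smaller sets of tasks called stages, with the boundaries drawn at shuffles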

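Job, stage, task

(1) Physical view:

Worker Node: a physical node on which executor processes run.

Executor: a process that a worker node launches for a particular application; it runs multiple tasks.

(2) Software view: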

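Job: triggering an action creates a job. The job is submitted to the DAGScheduler, which breaks it down into stages.
Stage: the DAGScheduler divides a job into stages at shuffle boundaries. A stage contains multiple tasks, all of which have the same shuffle dependencies. There are two kinds of stage: shuffle map stage and result stage.
Task: the unit of work sent to an executor. Put simply, a task is a single data processing pipeline applied to one data partition.
shuffle map stage: a stage whose tasks' results are input for other stage(s).
result stage: a stage whose tasks directly compute a Spark action (e.g. count(), save(), etc.) by running a function on an RDD. The stage boundary therefore falls between the shuffle input and the final result.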

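Summary:

An action triggers one job.

------stage1 (multiple tasks with the same shuffle dependencies)------[map--shuffle]------- stage2---- [result--shuffle]-----

A task corresponds to the data processing pipeline on one partition.

Worked example: how jobs, stages, and tasks are divided in Spark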
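
The job/stage/task split summarized above can be modelled in a few lines of plain Python. This is a simplified sketch and not Spark's DAGScheduler: it assumes a linear chain of operations, a fixed partition count, and one task per partition per stage. The Op class and both functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    shuffle: bool  # True if this operation needs a shuffle of its input


def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting before each shuffle."""
    stages = [[]]
    for op in ops:
        if op.shuffle and stages[-1]:
            stages.append([])  # a shuffle dependency starts a new stage
        stages[-1].append(op.name)
    return stages


def task_count(stages, num_partitions):
    """One task per partition per stage (assumes the partition count never changes)."""
    return len(stages) * num_partitions


# A word-count style job; the action at the end triggers the whole chain.
job = [
    Op("textFile", False),
    Op("flatMap", False),
    Op("map", False),
    Op("reduceByKey", True),  # shuffle boundary
    Op("collect", False),     # the action, computed by the result stage
]
stages = split_into_stages(job)
print(stages)
print(task_count(stages, num_partitions=4))
```

Here the first stage plays the role of the shuffle map stage and the second the result stage; with 4 partitions the model yields 8 tasks for the job.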

Deploy mode

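In yarn-cluster mode, the driver runs inside the AM (Application Master), which requests resources from YARN and monitors the job's progress. Once the user has submitted the job, the client can be shut down and the job keeps running on YARN.

yarn-cluster mode is not suitable for interactive jobs. In yarn-client mode, the Application Master only requests executors from YARN; the client communicates with the allocated containers to schedule their work, which means the client cannot go away. The figure below illustrates the difference between the two.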
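
To make the two deploy modes concrete, the sketch below builds the spark-submit command line for each. It assumes the Spark 2.0+ spelling, where the master is "yarn" and the mode is chosen with --deploy-mode (older releases used the yarn-cluster and yarn-client master strings named above). The function name submit_command and the file name my_app.py are hypothetical.

```python
import shlex


def submit_command(app, deploy_mode):
    """Build a spark-submit argument list for YARN in the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode, app]


# cluster mode: the driver runs in the Application Master, so the client may exit
print(shlex.join(submit_command("my_app.py", "cluster")))
# client mode: the driver runs in this client process, which must stay alive
print(shlex.join(submit_command("my_app.py", "client")))
```

The argument list can be passed to subprocess.run on a machine where spark-submit is on the PATH; interactive shells are the typical case for client mode.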
