Spark Environment Setup

Learn how to set up Spark in both single-machine and cluster environments, including installation, configuration, startup, and management.

1. Preparing the Environment

Before setting up a Spark environment, you need the right hardware and software in place. The basic requirements are as follows:

1.1 Hardware Requirements

  • Single-machine environment: at least 4 GB of memory, 2 CPU cores, and 50 GB of disk space
  • Cluster environment:
    • Master node: at least 8 GB of memory, 4 CPU cores, and 100 GB of disk space
    • Worker nodes: at least 4 GB of memory, 2 CPU cores, and 50 GB of disk space
    • Network: all nodes need a stable network connection; gigabit Ethernet is recommended

1.2 Software Requirements

  • Operating system: Linux (recommended), Windows, or macOS
  • Java: JDK 8 or later (Spark 3.5 supports Java 8, 11, and 17)
  • Hadoop (optional): Hadoop 2.7 or later, for HDFS and YARN integration
  • Python (optional): Python 3.6 or later, for PySpark (Spark 3.5 itself requires Python 3.8 or later)
  • Scala (optional): Scala 2.12 or later, for Scala development
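
As a quick sanity check on the requirements above, version strings can be compared numerically. The sketch below is a hypothetical helper (not part of Spark) that handles simple dotted versions such as `3.6` or `2.12.15`:

```python
# Illustrative helper (not part of Spark): compare simple dotted version strings.
# Handles plain numeric versions like "3.6" or "2.12.15"; suffixed versions
# such as "1.8.0_292" would need extra parsing.
def version_tuple(v):
    return tuple(int(part) for part in v.split("."))

def meets_minimum(installed, minimum):
    return version_tuple(installed) >= version_tuple(minimum)

print(meets_minimum("3.9.2", "3.6"))    # Python 3.9.2 satisfies the 3.6 minimum
print(meets_minimum("2.11.8", "2.12"))  # Scala 2.11.8 does not satisfy 2.12
```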

1.3 Downloading Spark

Download the latest Spark release from the Apache Spark website. The pre-built package with Hadoop support is recommended, since it simplifies integration with Hadoop.

# Download the Spark release (using Spark 3.5.0 as an example)
$ wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Extract the archive
$ tar -xzvf spark-3.5.0-bin-hadoop3.tgz

# Move it to the target directory
$ mv spark-3.5.0-bin-hadoop3 /opt/spark

2. Single-Machine Environment Setup

A single-machine environment is the basic setup for learning Spark and developing Spark applications. It is simple to configure and well suited to beginners.

2.1 Configuring Environment Variables

Edit /etc/profile or the user's .bashrc file and add Spark's environment variables:

# Open .bashrc
$ vi ~/.bashrc

# Add the following lines
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3

# Apply the changes
$ source ~/.bashrc

2.2 Configuring Spark

Spark's configuration files live in the $SPARK_HOME/conf directory. Copy the default templates and edit them:

# Enter the configuration directory
$ cd $SPARK_HOME/conf

# Copy the default configuration templates (Spark 3.3+ uses Log4j 2)
$ cp spark-env.sh.template spark-env.sh
$ cp log4j2.properties.template log4j2.properties

# Edit spark-env.sh and add the following
$ vi spark-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Set the log level to WARN to reduce log output
$ vi log4j2.properties
# Change rootLogger.level from info to warn
rootLogger.level = warn

2.3 Starting the Spark Shell

The Spark Shell is an interactive Spark environment, useful for learning and experimenting with the Spark API.

# Start the Spark Shell (Scala)
$ spark-shell

# Start PySpark (Python)
$ pyspark

# Start the Spark SQL shell
$ spark-sql

2.4 Testing the Spark Shell

Run a few simple operations in the Spark Shell to verify that the environment works:

// Scala example
// Create an RDD
scala> val rdd = sc.parallelize(1 to 10)

// Compute the sum of the RDD's elements
scala> rdd.sum()
res0: Double = 55.0

# Python example
>>> rdd = sc.parallelize(range(1, 11))
>>> rdd.sum()
55

-- Spark SQL example
spark-sql> CREATE TABLE test_table (id INT, name STRING);
spark-sql> INSERT INTO test_table VALUES (1, 'test');
spark-sql> SELECT * FROM test_table;

3. Cluster Environment Setup

A cluster environment suits production use and large-scale data processing, with multiple machines working together. The steps below set up a Spark Standalone cluster:

3.1 Cluster Planning

Assume we have three machines, planned as follows:

  • Master node: spark-master (IP: 192.168.1.100)
  • Worker node 1: spark-worker1 (IP: 192.168.1.101)
  • Worker node 2: spark-worker2 (IP: 192.168.1.102)
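
For the nodes to reach each other by these names, every machine must be able to resolve the hostnames. One minimal approach, assuming the IP plan above, is to add matching entries to /etc/hosts on each node:

```
# /etc/hosts (identical on every node)
192.168.1.100  spark-master
192.168.1.101  spark-worker1
192.168.1.102  spark-worker2
```

In a managed environment, DNS records serve the same purpose.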

3.2 Configuring Passwordless SSH Login

Configure passwordless SSH login between all nodes so that Spark can communicate across them:

# Generate an SSH key pair on the master node
$ ssh-keygen -t rsa -P ""

# Copy the public key to all worker nodes
$ ssh-copy-id spark-worker1
$ ssh-copy-id spark-worker2

# Verify passwordless SSH login
$ ssh spark-worker1

3.3 Configuring the Master Node

Configure the Spark environment on the master node:

# Edit spark-env.sh
$ vi $SPARK_HOME/conf/spark-env.sh

# Add the following
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_MASTER_HOST=spark-master
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_WEBUI_PORT=8081

# Create the workers file and list the hostnames of all worker nodes
$ vi $SPARK_HOME/conf/workers
spark-worker1
spark-worker2

3.4 Configuring the Worker Nodes

Apply the same configuration on all worker nodes:

# Copy the master node's configuration to all worker nodes
$ scp -r $SPARK_HOME/conf spark-worker1:$SPARK_HOME/
$ scp -r $SPARK_HOME/conf spark-worker2:$SPARK_HOME/

3.5 Starting the Spark Cluster

Start the Spark cluster from the master node:

# Start the Spark cluster
$ $SPARK_HOME/sbin/start-all.sh

# Run the SparkPi example to verify that the cluster works
$ $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://spark-master:7077 $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100

3.6 Accessing the Spark Web UI

Open the Spark Web UI in a browser to check the cluster status:

  • Master Web UI: http://spark-master:8080
  • Worker Web UI: http://spark-worker1:8081, http://spark-worker2:8081
  • Application Web UI: http://spark-master:4040 (available while an application is running)

3.7 Stopping the Spark Cluster

Stop the Spark cluster from the master node:

# Stop the whole Spark cluster
$ $SPARK_HOME/sbin/stop-all.sh

# Stop only the master node
$ $SPARK_HOME/sbin/stop-master.sh

# Stop a single worker node (run on that worker)
$ $SPARK_HOME/sbin/stop-worker.sh

4. Deploying Spark on YARN

Besides Standalone mode, Spark can also run on YARN. The steps below deploy Spark on YARN:

4.1 Configuring the Hadoop Environment

Make sure Hadoop is configured correctly and the YARN services are running:

# Check the status of the YARN nodes
$ yarn node -list

# If YARN is not running, start the Hadoop services
$ start-dfs.sh
$ start-yarn.sh

4.2 Configuring Spark

Add the Hadoop configuration directory to spark-env.sh:

# Edit spark-env.sh
$ vi $SPARK_HOME/conf/spark-env.sh

# Add HADOOP_CONF_DIR
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

4.3 Submitting Spark Jobs to YARN

Submit jobs to YARN with the spark-submit command:

# Submit a job in YARN client mode
$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100

# Submit a job in YARN cluster mode
$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100

5. Deploying Spark on Kubernetes

Spark can also run on a Kubernetes cluster. The basic steps are:

5.1 Preparing the Kubernetes Environment

Make sure the Kubernetes cluster is configured correctly and the kubectl command works:

# Check the Kubernetes cluster status
$ kubectl cluster-info

# Check the node status
$ kubectl get nodes

5.2 Building the Spark Docker Image

Spark ships with a script for building Docker images:

# Build the Docker image
$ cd $SPARK_HOME
$ ./bin/docker-image-tool.sh -r myregistry -t v3.5.0 build

5.3 Submitting Spark Jobs to Kubernetes

Submit jobs to Kubernetes with the spark-submit command:

# Submit a job to Kubernetes
$ spark-submit --master k8s://https://kubernetes-master:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=myregistry/spark:v3.5.0 \
  --conf spark.kubernetes.namespace=spark \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar 100

6. Spark Environment Variables and Configuration

Spark can be configured through environment variables, configuration files, or command-line parameters. Some commonly used settings:

6.1 Environment Variables

  • JAVA_HOME: Java installation directory
  • SPARK_HOME: Spark installation directory
  • SPARK_MASTER_HOST: hostname of the master node
  • SPARK_MASTER_PORT: port of the master node, 7077 by default
  • SPARK_WORKER_CORES: number of cores available to a worker node
  • SPARK_WORKER_MEMORY: memory available to a worker node
  • HADOOP_CONF_DIR: Hadoop configuration directory

6.2 Configuration Files

  • spark-env.sh: environment variables and startup parameters
  • spark-defaults.conf: default Spark configuration settings
  • workers: list of worker nodes
  • log4j2.properties: log level and format

6.3 Common Configuration Settings

# Add the following to spark-defaults.conf
spark.master                     yarn
spark.driver.memory              1g
spark.executor.memory            2g
spark.executor.cores             2
spark.sql.shuffle.partitions     200
spark.default.parallelism        100
spark.serializer                 org.apache.spark.serializer.KryoSerializer
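
Entries in spark-defaults.conf are whitespace-separated key/value pairs, and `#` starts a comment. As an illustration of the format (the parser below is a hypothetical helper, not a Spark API), such a file can be read like this:

```python
# Illustrative sketch (not a Spark API): parse spark-defaults.conf-style text.
# Each non-comment line holds a key and a value separated by whitespace.
def parse_spark_defaults(text):
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf

sample = """
spark.master            yarn
# memory settings
spark.executor.memory   2g
"""
conf = parse_spark_defaults(sample)
print(conf["spark.master"])          # yarn
print(conf["spark.executor.memory"]) # 2g
```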

Hands-On Exercises

Exercise 1: Single-Machine Environment Setup

  1. Download and install Spark on a single machine
  2. Configure the environment variables
  3. Start the Spark Shell and run a simple operation
  4. Start PySpark and run a simple operation

Exercise 2: Cluster Environment Setup

  1. Prepare three virtual machines or servers
  2. Configure passwordless SSH login
  3. Configure the Spark cluster
  4. Start the Spark cluster
  5. Access the Spark Web UI
  6. Submit a Spark job to the cluster

Exercise 3: YARN Mode Deployment

  1. Install and configure the Hadoop environment
  2. Configure Spark's integration with YARN
  3. Submit a job in YARN client mode
  4. Submit a job in YARN cluster mode

Exercise 4: Configuration Tuning

  1. Adjust Spark's log level
  2. Configure Spark's memory and CPU resources
  3. Configure Spark's serialization
  4. Configure the number of Spark SQL shuffle partitions

7. Common Issues and Solutions

You may run into some common issues while setting up a Spark environment. Here are their solutions:

7.1 Incompatible Java Version

Issue: Spark reports an incompatible Java version at startup.
Solution: Make sure the installed Java version meets Spark's requirements; JDK 8 or a later supported version is recommended.

7.2 Port Already in Use

Issue: Spark reports that a port is already in use at startup.
Solution: Change the port settings in spark-env.sh to use unoccupied ports.
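
Before editing the configuration, it helps to confirm whether a port such as the master's 7077 or the Web UI's 8080 is actually occupied. A minimal sketch (a hypothetical helper, not part of Spark):

```python
import socket

# Illustrative check (not part of Spark): a port is free if we can bind to it.
def port_is_free(port, host="127.0.0.1"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# e.g. check Spark's default master port before starting the cluster
print(port_is_free(7077))
```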

7.3 Worker Nodes Cannot Connect to the Master

Issue: The worker nodes cannot connect to the master node.
Solution:

  • Check that the master node's hostname and port are configured correctly
  • Check that the network connection is working
  • Check that the firewall allows the relevant ports

7.4 Out of Memory

Issue: A Spark application reports insufficient memory at startup.
Solution:

  • Add more physical memory to the machine
  • Adjust Spark's memory settings, reducing the driver or executor memory allocation
  • Reduce the parallelism to lower memory usage
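
When sizing executor memory on YARN or Kubernetes, remember that Spark requests extra off-heap overhead per executor: by default spark.executor.memoryOverhead is the larger of 384 MB and 10% of the executor memory. A sketch of the resulting per-container request (an illustrative calculation, not a Spark API):

```python
# Illustrative sketch: total memory a YARN/Kubernetes container requests
# per executor, using Spark's default overhead of max(384 MB, 10% of
# executor memory) from spark.executor.memoryOverhead.
def container_memory_mb(executor_memory_mb):
    overhead = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead

print(container_memory_mb(2048))   # a 2g executor requests 2432 MB in total
print(container_memory_mb(10240))  # a 10g executor requests 11264 MB in total
```

This is why a container can be rejected even when the cluster seems to have exactly `spark.executor.memory` available.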

8. Summary

This tutorial covered the different ways to set up a Spark environment: single-machine, Standalone cluster, YARN mode, and Kubernetes mode. Each deployment mode has its own use cases; choose the one that fits your actual requirements.

When setting up a Spark environment, keep the following in mind:

  • Make sure the Java version meets the requirements
  • Configure the environment variables correctly
  • Allocate memory and CPU resources sensibly
  • Watch out for port conflicts
  • Set an appropriate log level

With this tutorial, you should be able to set up a Spark environment and start using Spark for big data processing. In the tutorials that follow, we will dive into Spark's core concepts and advanced features, and master its various application scenarios and best practices.