Spark Environment Setup

Learn how to set up Spark in both single-machine and cluster environments, including installation, configuration, startup, and management.

1. Preparing the Environment

Before setting up a Spark environment, you need the right hardware and software in place. The basic requirements are as follows:

1.1 Hardware Requirements

  • Single-machine environment: at least 4 GB of memory, 2 CPU cores, and 50 GB of disk space
  • Cluster environment:
    • Master node: at least 8 GB of memory, 4 CPU cores, and 100 GB of disk space
    • Worker nodes: at least 4 GB of memory, 2 CPU cores, and 50 GB of disk space
    • Network: all nodes need a stable network connection; gigabit Ethernet is recommended

1.2 Software Requirements

  • Operating system: Linux (recommended), Windows, or macOS
  • Java: JDK 8 or later (Spark 3.5 supports Java 8, 11, and 17)
  • Hadoop (optional): Hadoop 2.7 or later, for HDFS and YARN integration
  • Python (optional): Python 3.6 or later, for PySpark (Spark 3.5 itself requires Python 3.8 or later)
  • Scala (optional): Scala 2.12 or later, for Scala development
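
As a quick sanity check on the requirements above, version strings can be compared numerically. The sketch below is a hypothetical helper (not part of Spark) that handles simple dotted versions such as `3.6` or `2.12.15`:

```python
# Illustrative helper (not part of Spark): compare simple dotted version strings.
# Handles plain numeric versions like "3.6" or "2.12.15"; suffixed versions
# such as "1.8.0_292" would need extra parsing.
def version_tuple(v):
    return tuple(int(part) for part in v.split("."))

def meets_minimum(installed, minimum):
    return version_tuple(installed) >= version_tuple(minimum)

print(meets_minimum("3.9.2", "3.6"))    # Python 3.9.2 satisfies the 3.6 minimum
print(meets_minimum("2.11.8", "2.12"))  # Scala 2.11.8 does not satisfy 2.12
```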

1.3 Downloading Spark

Download the latest Spark release from the Apache Spark website. The pre-built package with Hadoop support is recommended, since it simplifies integration with Hadoop.

# Download the Spark release (using Spark 3.5.0 as an example)
$ wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Extract the archive
$ tar -xzvf spark-3.5.0-bin-hadoop3.tgz

# Move it to the target directory
$ mv spark-3.5.0-bin-hadoop3 /opt/spark

2. Single-Machine Environment Setup

A single-machine environment is the basic setup for learning Spark and developing Spark applications. It is simple to configure and well suited to beginners.

2.1 Configuring Environment Variables

Edit /etc/profile or the user's .bashrc file and add Spark's environment variables:

# Open .bashrc
$ vi ~/.bashrc

# Add the following lines
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3

# Apply the changes
$ source ~/.bashrc

2.2 Configuring Spark

Spark's configuration files live in the $SPARK_HOME/conf directory. Copy the default templates and edit them:

# Enter the configuration directory
$ cd $SPARK_HOME/conf

# Copy the default configuration templates (Spark 3.3+ uses Log4j 2)
$ cp spark-env.sh.template spark-env.sh
$ cp log4j2.properties.template log4j2.properties

# Edit spark-env.sh and add the following
$ vi spark-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Set the log level to WARN to reduce log output
$ vi log4j2.properties
# Change rootLogger.level from info to warn
rootLogger.level = warn

2.3 Starting the Spark Shell

The Spark Shell is an interactive Spark environment, useful for learning and experimenting with the Spark API.

# Start the Spark Shell (Scala)
$ spark-shell

# Start PySpark (Python)
$ pyspark

# Start the Spark SQL shell
$ spark-sql

2.4 Testing the Spark Shell

Run a few simple operations in the Spark Shell to verify that the environment works:

// Scala example
// Create an RDD
scala> val rdd = sc.parallelize(1 to 10)

// Compute the sum of the RDD's elements
scala> rdd.sum()
res0: Double = 55.0

# Python example
>>> rdd = sc.parallelize(range(1, 11))
>>> rdd.sum()
55

-- Spark SQL example
spark-sql> CREATE TABLE test_table (id INT, name STRING);
spark-sql> INSERT INTO test_table VALUES (1, 'test');
spark-sql> SELECT * FROM test_table;

3. Cluster Environment Setup

A cluster environment suits production use and large-scale data processing, with multiple machines working together. The steps below set up a Spark Standalone cluster:

3.1 Cluster Planning

Assume we have three machines, planned as follows:

  • Master node: spark-master (IP: 192.168.1.100)
  • Worker node 1: spark-worker1 (IP: 192.168.1.101)
  • Worker node 2: spark-worker2 (IP: 192.168.1.102)
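
For the nodes to reach each other by these names, every machine must be able to resolve the hostnames. One minimal approach, assuming the IP plan above, is to add matching entries to /etc/hosts on each node:

```
# /etc/hosts (identical on every node)
192.168.1.100  spark-master
192.168.1.101  spark-worker1
192.168.1.102  spark-worker2
```

In a managed environment, DNS records serve the same purpose.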

3.2 Configuring Passwordless SSH Login

Configure passwordless SSH login between all nodes so that Spark can communicate across them:

# Generate an SSH key pair on the master node
$ ssh-keygen -t rsa -P ""

# Copy the public key to all worker nodes
$ ssh-copy-id spark-worker1
$ ssh-copy-id spark-worker2

# Verify passwordless SSH login
$ ssh spark-worker1

3.3 Configuring the Master Node

Configure the Spark environment on the master node:

# Edit spark-env.sh
$ vi $SPARK_HOME/conf/spark-env.sh

# Add the following
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_MASTER_HOST=spark-master
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_WEBUI_PORT=8081

# Create the workers file and list the hostnames of all worker nodes
$ vi $SPARK_HOME/conf/workers
spark-worker1
spark-worker2

3.4 Configuring the Worker Nodes

Apply the same configuration on all worker nodes:

# Copy the master node's configuration to all worker nodes
$ scp -r $SPARK_HOME/conf spark-worker1:$SPARK_HOME/
$ scp -r $SPARK_HOME/conf spark-worker2:$SPARK_HOME/

3.5 Starting the Spark Cluster

Start the Spark cluster from the master node:

# Start the Spark cluster
$ $SPARK_HOME/sbin/start-all.sh

# Run the SparkPi example to verify that the cluster works
$ $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://spark-master:7077 $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100

3.6 Accessing the Spark Web UI

Open the Spark Web UI in a browser to check the cluster status:

  • Master Web UI: http://spark-master:8080
  • Worker Web UI: http://spark-worker1:8081, http://spark-worker2:8081
  • Application Web UI: http://spark-master:4040 (available while an application is running)

3.7 Stopping the Spark Cluster

Stop the Spark cluster from the master node:

# Stop the whole Spark cluster
$ $SPARK_HOME/sbin/stop-all.sh

# Stop only the master node
$ $SPARK_HOME/sbin/stop-master.sh

# Stop a single worker node (run on that worker)
$ $SPARK_HOME/sbin/stop-worker.sh

4. Deploying Spark on YARN

Besides Standalone mode, Spark can also run on YARN. The steps below deploy Spark on YARN:

4.1 Configuring the Hadoop Environment

Make sure Hadoop is configured correctly and the YARN services are running:

# Check the status of the YARN nodes
$ yarn node -list

# If YARN is not running, start the Hadoop services
$ start-dfs.sh
$ start-yarn.sh

4.2 Configuring Spark

Add the Hadoop configuration directory to spark-env.sh:

# Edit spark-env.sh
$ vi $SPARK_HOME/conf/spark-env.sh

# Add HADOOP_CONF_DIR
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

4.3 Submitting Spark Jobs to YARN

Submit jobs to YARN with the spark-submit command:

# Submit a job in YARN client mode
$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100

# Submit a job in YARN cluster mode
$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100

5. Deploying Spark on Kubernetes

Spark can also run on a Kubernetes cluster. The basic steps are:

5.1 Preparing the Kubernetes Environment

Make sure the Kubernetes cluster is configured correctly and the kubectl command works:

# Check the Kubernetes cluster status
$ kubectl cluster-info

# Check the node status
$ kubectl get nodes

5.2 Building the Spark Docker Image

Spark ships with a script for building Docker images:

# Build the Docker image
$ cd $SPARK_HOME
$ ./bin/docker-image-tool.sh -r myregistry -t v3.5.0 build

5.3 Submitting Spark Jobs to Kubernetes

Submit jobs to Kubernetes with the spark-submit command:

# Submit a job to Kubernetes
$ spark-submit --master k8s://https://kubernetes-master:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=myregistry/spark:v3.5.0 \
  --conf spark.kubernetes.namespace=spark \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar 100

6. Spark Environment Variables and Configuration

Spark can be configured through environment variables, configuration files, or command-line parameters. Some commonly used settings:

6.1 Environment Variables

  • JAVA_HOME: Java installation directory
  • SPARK_HOME: Spark installation directory
  • SPARK_MASTER_HOST: hostname of the master node
  • SPARK_MASTER_PORT: port of the master node, 7077 by default
  • SPARK_WORKER_CORES: number of cores available to a worker node
  • SPARK_WORKER_MEMORY: memory available to a worker node
  • HADOOP_CONF_DIR: Hadoop configuration directory

6.2 Configuration Files

  • spark-env.sh: environment variables and startup parameters
  • spark-defaults.conf: default Spark configuration settings
  • workers: list of worker nodes
  • log4j2.properties: log level and format

6.3 Common Configuration Settings

# Add the following to spark-defaults.conf
spark.master                     yarn
spark.driver.memory              1g
spark.executor.memory            2g
spark.executor.cores             2
spark.sql.shuffle.partitions     200
spark.default.parallelism        100
spark.serializer                 org.apache.spark.serializer.KryoSerializer
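
Entries in spark-defaults.conf are whitespace-separated key/value pairs, and `#` starts a comment. As an illustration of the format (the parser below is a hypothetical helper, not a Spark API), such a file can be read like this:

```python
# Illustrative sketch (not a Spark API): parse spark-defaults.conf-style text.
# Each non-comment line holds a key and a value separated by whitespace.
def parse_spark_defaults(text):
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf

sample = """
spark.master            yarn
# memory settings
spark.executor.memory   2g
"""
conf = parse_spark_defaults(sample)
print(conf["spark.master"])          # yarn
print(conf["spark.executor.memory"]) # 2g
```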

Hands-On Exercises

Exercise 1: Single-Machine Environment Setup

  1. Download and install Spark on a single machine
  2. Configure the environment variables
  3. Start the Spark Shell and run a simple operation
  4. Start PySpark and run a simple operation

Exercise 2: Cluster Environment Setup

  1. Prepare three virtual machines or servers
  2. Configure passwordless SSH login
  3. Configure the Spark cluster
  4. Start the Spark cluster
  5. Access the Spark Web UI
  6. Submit a Spark job to the cluster

Exercise 3: YARN Mode Deployment

  1. Install and configure the Hadoop environment
  2. Configure Spark's integration with YARN
  3. Submit a job in YARN client mode
  4. Submit a job in YARN cluster mode

Exercise 4: Configuration Tuning

  1. Adjust Spark's log level
  2. Configure Spark's memory and CPU resources
  3. Configure Spark's serialization
  4. Configure the number of Spark SQL shuffle partitions

7. Common Issues and Solutions

You may run into some common issues while setting up a Spark environment. Here are their solutions:

7.1 Incompatible Java Version

Issue: Spark reports an incompatible Java version at startup.
Solution: Make sure the installed Java version meets Spark's requirements; JDK 8 or a later supported version is recommended.

7.2 Port Already in Use

Issue: Spark reports that a port is already in use at startup.
Solution: Change the port settings in spark-env.sh to use unoccupied ports.
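
Before editing the configuration, it helps to confirm whether a port such as the master's 7077 or the Web UI's 8080 is actually occupied. A minimal sketch (a hypothetical helper, not part of Spark):

```python
import socket

# Illustrative check (not part of Spark): a port is free if we can bind to it.
def port_is_free(port, host="127.0.0.1"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# e.g. check Spark's default master port before starting the cluster
print(port_is_free(7077))
```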

7.3 Worker Nodes Cannot Connect to the Master

Issue: The worker nodes cannot connect to the master node.
Solution:

  • Check that the master node's hostname and port are configured correctly
  • Check that the network connection is working
  • Check that the firewall allows the relevant ports

7.4 Out of Memory

Issue: A Spark application reports insufficient memory at startup.
Solution:

  • Add more physical memory to the machine
  • Adjust Spark's memory settings, reducing the driver or executor memory allocation
  • Reduce the parallelism to lower memory usage
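
When sizing executor memory on YARN or Kubernetes, remember that Spark requests extra off-heap overhead per executor: by default spark.executor.memoryOverhead is the larger of 384 MB and 10% of the executor memory. A sketch of the resulting per-container request (an illustrative calculation, not a Spark API):

```python
# Illustrative sketch: total memory a YARN/Kubernetes container requests
# per executor, using Spark's default overhead of max(384 MB, 10% of
# executor memory) from spark.executor.memoryOverhead.
def container_memory_mb(executor_memory_mb):
    overhead = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + overhead

print(container_memory_mb(2048))   # a 2g executor requests 2432 MB in total
print(container_memory_mb(10240))  # a 10g executor requests 11264 MB in total
```

This is why a container can be rejected even when the cluster seems to have exactly `spark.executor.memory` available.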

8. Summary

This tutorial covered the different ways to set up a Spark environment: single-machine, Standalone cluster, YARN mode, and Kubernetes mode. Each deployment mode has its own use cases; choose the one that fits your actual requirements.

When setting up a Spark environment, keep the following in mind:

  • Make sure the Java version meets the requirements
  • Configure the environment variables correctly
  • Allocate memory and CPU resources sensibly
  • Watch out for port conflicts
  • Set an appropriate log level

With this tutorial, you should be able to set up a Spark environment and start using Spark for big data processing. In the tutorials that follow, we will dive into Spark's core concepts and advanced features, and master its various application scenarios and best practices.