Spark Environment Setup
1. Preparation Before Setup
Before setting up a Spark environment, prepare the required hardware and software. The basic requirements are as follows:
1.1 Hardware Requirements
- Single-machine environment: at least 4 GB of memory, 2 CPU cores, and 50 GB of disk space
- Cluster environment:
  - Master node: at least 8 GB of memory, 4 CPU cores, and 100 GB of disk space
  - Worker node: at least 4 GB of memory, 2 CPU cores, and 50 GB of disk space
- Network: all nodes need a stable network connection; gigabit Ethernet is recommended
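As a quick aid, the minimums above can be encoded in a small checker. This is an illustrative sketch, not a Spark tool; the `meets_minimum` helper and the threshold table simply restate the numbers from the list:

```python
# Minimal sketch: check a machine's specs against the minimums listed above.
# The helper and the threshold table are illustrative, not part of Spark.

MINIMUMS = {
    "single": {"memory_gb": 4, "cores": 2, "disk_gb": 50},
    "master": {"memory_gb": 8, "cores": 4, "disk_gb": 100},
    "worker": {"memory_gb": 4, "cores": 2, "disk_gb": 50},
}

def meets_minimum(role, memory_gb, cores, disk_gb):
    """Return True if the given specs satisfy the minimums for the role."""
    need = MINIMUMS[role]
    return (memory_gb >= need["memory_gb"]
            and cores >= need["cores"]
            and disk_gb >= need["disk_gb"])

print(meets_minimum("master", memory_gb=16, cores=8, disk_gb=200))  # True
print(meets_minimum("worker", memory_gb=2, cores=2, disk_gb=50))    # False
```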
1.2 Software Requirements
- Operating system: Linux (recommended), Windows, or macOS
- Java: JDK 8 or later
- Hadoop (optional): Hadoop 2.7 or later, for HDFS and YARN integration
- Python (optional): Python 3.6 or later, for PySpark
- Scala (optional): Scala 2.12 or later, for Scala development
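Checking the Java requirement by hand is easy to get wrong because Java changed its version scheme: JDK 8 reports itself as "1.8.0_xxx", while JDK 9 and later report "11.0.2"-style strings. A small sketch of the check (the parsing helper is hypothetical, not a Spark utility; the version strings are examples):

```python
# Illustrative sketch: decide whether a Java version string satisfies
# Spark's JDK 8+ requirement. Handles both the old "1.8.0_292" scheme
# and the new "11.0.2" scheme.

def java_major_version(version_string):
    """Return the Java major version number from a version string."""
    parts = version_string.split(".")
    if parts[0] == "1":          # old scheme: 1.6, 1.7, 1.8
        return int(parts[1])
    return int(parts[0])         # new scheme: 9, 10, 11, 17, ...

def java_ok_for_spark(version_string, minimum=8):
    """True if the version string meets the minimum major version."""
    return java_major_version(version_string) >= minimum

print(java_ok_for_spark("1.8.0_292"))  # True  (JDK 8)
print(java_ok_for_spark("11.0.2"))     # True  (JDK 11)
print(java_ok_for_spark("1.7.0_80"))   # False (JDK 7)
```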
1.3 Downloading Spark
Download the latest Spark release from the Apache Spark website. A build pre-compiled with Hadoop support is recommended, as it simplifies integration with Hadoop.

# Download the Spark package (Spark 3.5.0 as an example)
$ wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
# Extract it
$ tar -xzvf spark-3.5.0-bin-hadoop3.tgz
# Move it to the install directory
$ mv spark-3.5.0-bin-hadoop3 /opt/spark
2. Single-Machine Environment Setup
A single-machine environment is the basic environment for learning Spark and developing Spark applications. It is simple to configure and well suited to beginners.
2.1 Configuring Environment Variables
Edit /etc/profile or the user's .bashrc and add Spark's environment variables:

# Open .bashrc
$ vi ~/.bashrc
# Add the following lines
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3
# Apply the changes
$ source ~/.bashrc
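A common setup mistake is forgetting to put $SPARK_HOME/bin and $SPARK_HOME/sbin on PATH. The sketch below validates an environment mapping against the variables configured above; the `check_spark_env` helper is illustrative (not part of Spark), and on a real machine you could pass it `os.environ`:

```python
# A small sanity check for the environment variables configured above.
# Illustrative helper: it inspects any mapping, e.g. os.environ.

def check_spark_env(env):
    """Return a list of problems found in the given environment mapping."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    else:
        path_dirs = env.get("PATH", "").split(":")
        for sub in ("bin", "sbin"):
            d = spark_home.rstrip("/") + "/" + sub
            if d not in path_dirs:
                problems.append(d + " is not on PATH")
    return problems

sample = {
    "SPARK_HOME": "/opt/spark",
    "PATH": "/usr/bin:/opt/spark/bin:/opt/spark/sbin",
}
print(check_spark_env(sample))  # [] -> everything looks fine
print(check_spark_env({}))      # ['SPARK_HOME is not set']
```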
2.2 Configuring Spark
Spark's configuration files live in $SPARK_HOME/conf. Copy the default templates and edit them:

# Enter the configuration directory
$ cd $SPARK_HOME/conf
# Copy the default templates
# (Spark 3.3+ ships log4j2.properties.template; older releases use log4j.properties.template)
$ cp spark-env.sh.template spark-env.sh
$ cp log4j2.properties.template log4j2.properties
# Edit spark-env.sh and add the following line
$ vi spark-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Set the root log level to warn to reduce log output
$ vi log4j2.properties
# Change rootLogger.level from info to warn
rootLogger.level = warn
2.3 Starting the Spark Shell
The Spark Shell is an interactive Spark environment, useful for learning and testing the Spark API.

# Start the Spark Shell (Scala)
$ spark-shell
# Start PySpark (Python)
$ pyspark
# Start the Spark SQL shell
$ spark-sql
2.4 Testing the Spark Shell
Run a few simple operations in the Spark Shell to verify that the environment works:

# Scala example
// Create an RDD
scala> val rdd = sc.parallelize(1 to 10)
// Sum its elements
scala> rdd.sum()
res0: Double = 55.0

# Python example
>>> rdd = sc.parallelize(range(1, 11))
>>> rdd.sum()
55

# Spark SQL example
spark-sql> CREATE TABLE test_table (id INT, name STRING);
spark-sql> INSERT INTO test_table VALUES (1, 'test');
spark-sql> SELECT * FROM test_table;
3. Cluster Environment Setup
A cluster environment is suited to production use and large-scale data processing, with multiple machines working together. The steps below set up a Spark Standalone cluster:
3.1 Cluster Planning
Suppose we have three machines, planned as follows:
- Master node: spark-master (IP: 192.168.1.100)
- Worker node 1: spark-worker1 (IP: 192.168.1.101)
- Worker node 2: spark-worker2 (IP: 192.168.1.102)
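For the hostnames above to resolve, each machine's /etc/hosts (or your DNS) needs matching entries. A small sketch that renders those entries from the plan; the `hosts_entries` helper is illustrative, not a Spark tool:

```python
# Render /etc/hosts lines ("IP  hostname") from the cluster plan above.
# Illustrative only: append the output to /etc/hosts on every node.

CLUSTER_PLAN = {
    "spark-master":  "192.168.1.100",
    "spark-worker1": "192.168.1.101",
    "spark-worker2": "192.168.1.102",
}

def hosts_entries(plan):
    """Turn a hostname -> IP plan into /etc/hosts lines."""
    return ["%s  %s" % (ip, host) for host, ip in plan.items()]

for line in hosts_entries(CLUSTER_PLAN):
    print(line)
# 192.168.1.100  spark-master
# 192.168.1.101  spark-worker1
# 192.168.1.102  spark-worker2
```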
3.2 Configuring Passwordless SSH Login
Configure passwordless SSH login between all nodes so that Spark can communicate across them:

# Generate an SSH key pair on the Master node
$ ssh-keygen -t rsa -P ""
# Copy the public key to all Worker nodes
$ ssh-copy-id spark-worker1
$ ssh-copy-id spark-worker2
# Verify passwordless login
$ ssh spark-worker1
3.3 Configuring the Master Node
Configure the Spark environment on the Master node:

# Edit spark-env.sh
$ vi $SPARK_HOME/conf/spark-env.sh
# Add the following lines
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_MASTER_HOST=spark-master
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_WEBUI_PORT=8081
# Create the workers file listing the hostnames of all Worker nodes
$ vi $SPARK_HOME/conf/workers
spark-worker1
spark-worker2
3.4 Configuring the Worker Nodes
Apply the same configuration on all Worker nodes:

# Copy the Master node's configuration to all Worker nodes
$ scp -r $SPARK_HOME/conf spark-worker1:$SPARK_HOME/
$ scp -r $SPARK_HOME/conf spark-worker2:$SPARK_HOME/
3.5 Starting the Spark Cluster
Start the Spark cluster from the Master node:

# Start the Spark cluster
$ $SPARK_HOME/sbin/start-all.sh
# Submit an example job to verify the cluster
$ $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master spark://spark-master:7077 \
    $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100
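The SparkPi example estimates π by Monte Carlo sampling: it throws random points into the unit square and counts the fraction that land inside the unit circle. A plain-Python sketch of the same idea (no Spark needed; on the cluster, the sampling is simply spread across partitions):

```python
import random

def estimate_pi(n, seed=42):
    """Estimate pi by sampling n random points in the unit square and
    counting the fraction that fall inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

print(estimate_pi(100000))  # roughly 3.14
```

SparkPi prints a similarly rough estimate; the point of running it is not the value of π but confirming that the cluster can schedule and complete a job.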
3.6 Accessing the Spark Web UI
Open the Spark Web UI in a browser to check the cluster's status:
- Master Web UI: http://spark-master:8080
- Worker Web UI: http://spark-worker1:8081, http://spark-worker2:8081
- Application Web UI: http://spark-master:4040 (available while an application is running)
3.7 Stopping the Spark Cluster
Stop the Spark cluster from the Master node:

# Stop the Spark cluster
$ $SPARK_HOME/sbin/stop-all.sh
# Stop only the Master node
$ $SPARK_HOME/sbin/stop-master.sh
# Stop only a Worker node
$ $SPARK_HOME/sbin/stop-worker.sh
4. Deploying Spark on YARN
Besides Standalone mode, Spark can also be deployed on YARN. The steps are:
4.1 Configuring the Hadoop Environment
Make sure Hadoop is correctly configured and the YARN services are running:

# Check YARN status
$ yarn node -list
# If YARN is not running, start the Hadoop services
$ start-dfs.sh
$ start-yarn.sh
4.2 Configuring Spark
Add the Hadoop configuration directory to spark-env.sh:

# Edit spark-env.sh
$ vi $SPARK_HOME/conf/spark-env.sh
# Add HADOOP_CONF_DIR
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
4.3 Submitting Spark Jobs to YARN
Use spark-submit to submit jobs to YARN:

# Submit a job in YARN client mode
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode client \
    $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100
# Submit a job in YARN cluster mode
$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster \
    $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100
5. Deploying Spark on Kubernetes
Spark can also be deployed on a Kubernetes cluster. The basic steps are:
5.1 Preparing the Kubernetes Environment
Make sure the Kubernetes cluster is correctly configured and kubectl works:

# Check Kubernetes cluster status
$ kubectl cluster-info
# Check node status
$ kubectl get nodes
5.2 Building the Spark Docker Image
Spark ships with a script for building Docker images:

# Build the Docker image
$ cd $SPARK_HOME
$ ./bin/docker-image-tool.sh -r myregistry -t v3.5.0 build
5.3 Submitting Spark Jobs to Kubernetes
Use spark-submit to submit jobs to Kubernetes:

# Submit a job to Kubernetes
$ spark-submit --master k8s://https://kubernetes-master:6443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=myregistry/spark:v3.5.0 \
    --conf spark.kubernetes.namespace=spark \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar 100
6. Spark Environment Variables and Configuration
Spark can be configured through environment variables, configuration files, or command-line options. Some commonly used settings:
6.1 Environment Variables
- JAVA_HOME: Java installation directory
- SPARK_HOME: Spark installation directory
- SPARK_MASTER_HOST: Master node hostname
- SPARK_MASTER_PORT: Master node port (default 7077)
- SPARK_WORKER_CORES: number of cores available to a Worker node
- SPARK_WORKER_MEMORY: amount of memory available to a Worker node
- HADOOP_CONF_DIR: Hadoop configuration directory
6.2 Configuration Files
- spark-env.sh: environment variables and startup options
- spark-defaults.conf: default Spark configuration properties
- workers: list of Worker nodes
- log4j2.properties (log4j.properties before Spark 3.3): log level and format
6.3 Common Configuration Properties

# Add the following to spark-defaults.conf
spark.master                  yarn
spark.driver.memory           1g
spark.executor.memory         2g
spark.executor.cores          2
spark.sql.shuffle.partitions  200
spark.default.parallelism     100
spark.serializer              org.apache.spark.serializer.KryoSerializer
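The spark-defaults.conf format is simply one "property value" pair per line, separated by whitespace, with `#` starting a comment. A minimal reader for that format, useful for scripting sanity checks over your configuration (illustrative sketch; Spark has its own loader, and this version ignores edge cases like tab separators and quoted values):

```python
# Minimal parser for the spark-defaults.conf "key  value" format.
# Illustrative only; not Spark's own configuration loader.

def parse_spark_defaults(text):
    """Parse spark-defaults.conf text into a dict of property -> value."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip blanks and comments
        key, _, value = line.partition(" ")   # split at first space
        props[key] = value.strip()
    return props

conf = """
# example configuration
spark.master            yarn
spark.executor.memory   2g
"""
print(parse_spark_defaults(conf))
# {'spark.master': 'yarn', 'spark.executor.memory': '2g'}
```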
Hands-On Exercises
Exercise 1: Single-Machine Environment Setup
- Download and install Spark in single-machine mode
- Configure the environment variables
- Start the Spark Shell and run a few simple operations
- Start PySpark and run a few simple operations
Exercise 2: Cluster Environment Setup
- Prepare three virtual machines or servers
- Configure passwordless SSH login
- Configure the Spark cluster
- Start the Spark cluster
- Access the Spark Web UI
- Submit a Spark job to the cluster
Exercise 3: YARN Deployment
- Install and configure the Hadoop environment
- Configure Spark's YARN integration
- Submit a job in YARN client mode
- Submit a job in YARN cluster mode
Exercise 4: Configuration Tuning
- Adjust Spark's log level
- Configure Spark's memory and CPU resources
- Configure Spark's serializer
- Configure the number of Spark SQL shuffle partitions
7. Common Problems and Solutions
You may run into some common problems while setting up a Spark environment. Here are their solutions:
7.1 Incompatible Java Version
Problem: Spark reports an incompatible Java version at startup.
Solution: Make sure the installed Java version meets Spark's requirements; JDK 8 or later is recommended.
7.2 Port Already in Use
Problem: Spark reports that a port is already in use at startup.
Solution: Change the port settings in spark-env.sh to use ports that are not occupied.
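One quick way to find out whether a port (say the Master's 7077 or the Web UI's 8080) is already taken is to try binding to it. An illustrative helper, not part of Spark:

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind to the port, i.e. nothing is using it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

# Demo: occupy a port, then check it.
blocker = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
blocker.bind(("127.0.0.1", 0))      # port 0 lets the OS pick a free port
taken_port = blocker.getsockname()[1]
blocker.listen(1)
print(port_is_free(taken_port))      # False: the port is in use
blocker.close()
```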
7.3 Worker Nodes Cannot Connect to the Master
Problem: Worker nodes cannot connect to the Master node.
Solution:
- Check that the Master node's hostname and port are configured correctly
- Check that the network connection works
- Check that the firewall allows the relevant ports
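The checks above boil down to one question: can the Worker open a TCP connection to the Master's RPC port (7077 by default)? A small probe you could run from a Worker node (illustrative helper; the hostnames come from the plan in section 3.1):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# On a Worker node you would run, for example:
#   can_connect("spark-master", 7077)
# False here points at hostname resolution, the network, or the firewall.
```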
7.4 Out of Memory
Problem: A Spark application reports insufficient memory at startup.
Solution:
- Add more physical memory to the machine
- Adjust Spark's memory configuration, reducing the driver or executor memory allocation
- Reduce the parallelism to lower memory usage
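When sizing executor memory for YARN, remember that each executor container requests more than spark.executor.memory: Spark adds an off-heap overhead of max(384 MiB, 10% of executor memory) by default. A sketch of that arithmetic (the helper is illustrative; the 384 MiB floor and 10% factor are Spark's documented defaults for spark.executor.memoryOverhead):

```python
# Total memory one executor container asks YARN for:
# spark.executor.memory + max(384 MiB, 0.10 * spark.executor.memory).
# Illustrative helper using Spark's default overhead settings.

def yarn_container_request_mib(executor_memory_mib,
                               overhead_factor=0.10,
                               overhead_floor_mib=384):
    """Return the total MiB requested from YARN for one executor."""
    overhead = max(overhead_floor_mib,
                   int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead

print(yarn_container_request_mib(2048))  # 2432 (2048 + 384 floor)
print(yarn_container_request_mib(8192))  # 9011 (8192 + 819 = 10%)
```

This is why a job with spark.executor.memory=2g can be rejected on a node with exactly 2 GB free: the container actually needs about 2.4 GB.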
8. Summary
This tutorial walked through the various ways to set up a Spark environment: single-machine mode, a Standalone cluster, YARN mode, and Kubernetes mode. Each deployment mode has its own use cases; choose the one that fits your actual needs.
Keep the following points in mind when setting up a Spark environment:
- Make sure the Java version meets the requirements
- Configure the environment variables correctly
- Allocate memory and CPU resources sensibly
- Watch out for port conflicts
- Set an appropriate log level
After working through this tutorial, you should be able to set up a Spark environment and start using Spark for big data processing. Later tutorials will dig into Spark's core concepts and advanced features, covering Spark's main application scenarios and best practices.