Spark Tuning and Cluster Sizing

  1. Decide whether dynamic allocation is on; even with dynamic allocation, executor size still needs to be determined
  2. Some core settings
    • spark.driver.memory
    • spark.executor.memory
    • spark.executor.cores
    • driver cores = 1 in client mode; spark.driver.cores can be set in cluster mode
  3. Memory Overhead
    • executor overhead = spark.yarn.executor.memoryOverhead
    • driver (cluster mode) = spark.yarn.driver.memoryOverhead
    • driver (client mode) = spark.yarn.am.memoryOverhead
  4. If a lot of data is collected to the driver or large local computations run there, also increase spark.driver.maxResultSize
  5. 5 cores per executor seems to be optimal
  6. Within executor memory, around 25% is reserved for Spark internal metadata and user data structures; M = spark.executor.memory * spark.memory.fraction is used for caching and execution
    • All cached data must fit in R = spark.executor.memory * spark.memory.fraction * spark.memory.storageFraction
  7. number of tasks that can run at the same time = # of executors * # of cores per executor (see the sketch below for how these settings fit together)
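
The sketch below pulls these settings together in Scala. The values are illustrative assumptions, not recommendations; note that spark.memory.fraction defaulted to 0.75 in Spark 1.6 and to 0.6 in later versions, and newer Spark renames the spark.yarn.* overhead keys to spark.executor.memoryOverhead / spark.driver.memoryOverhead.

    // Sketch: executor sizing via SparkConf (values are illustrative).
    // In client mode, driver memory must be set at submit time
    // (e.g. spark-submit --driver-memory 4g), not from code.
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .setAppName("tuning-sketch")
      .set("spark.executor.memory", "8g")
      .set("spark.executor.cores", "5")                // ~5 cores per executor (item 5)
      .set("spark.yarn.executor.memoryOverhead", "1g") // YARN off-heap overhead (item 3)
      .set("spark.driver.maxResultSize", "2g")         // guard large collects (item 4)

    val spark = SparkSession.builder().config(conf).getOrCreate()

    // Memory arithmetic from item 6, assuming spark.memory.fraction = 0.75
    // and spark.memory.storageFraction = 0.5:
    //   M = 8g * 0.75 = 6g   (shared by execution and caching)
    //   R = 6g * 0.5  = 3g   (all cached data must fit here)
    // Concurrent tasks = # of executors * spark.executor.cores (item 7).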

Serialization Options

  1. RDDs: Java serialization (the default) or Kryo serialization (see the sketch below)
  2. DataFrames/Datasets: Tungsten-based serialization
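
A minimal sketch of enabling Kryo for RDD serialization; the Point case class is a hypothetical example of an application type worth registering:

    // Sketch: switch RDD serialization to Kryo and register classes,
    // so Kryo can write a compact id instead of the full class name
    import org.apache.spark.SparkConf

    case class Point(x: Double, y: Double)

    val conf = new SparkConf()
      .setAppName("kryo-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Point]))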

Additional Debugging Techniques

  1. Interactive mode: force evaluation with take(1) or count() to surface errors early
  2. Understand whether the driver log or the executor logs are the right place to look
  3. Try using explicit named functions instead of anonymous inner functions, so stack traces point to readable names (see the sketch after this list)
  4. Driver out-of-memory exceptions: be wary of collect statements, countByKey, and some models in Spark ML or MLlib that bring data back to the driver
  5. Configure and access logging for more information
  6. Run Spark in local mode and attach debuggers
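
A sketch combining two of these techniques on a toy parsing job: a named function whose name will appear in stack traces, run in local mode where a debugger can attach to the single JVM. Parsers and parseRecord are hypothetical names.

    import org.apache.spark.sql.SparkSession

    // Named function: its name shows up in executor stack traces,
    // unlike an anonymous inner function
    object Parsers {
      def parseRecord(line: String): Int = line.trim.toInt
    }

    val spark = SparkSession.builder()
      .master("local[2]")   // local mode: one JVM, easy to attach a debugger
      .appName("debug-sketch")
      .getOrCreate()

    val parsed = spark.sparkContext
      .parallelize(Seq("1", "2", "3"))
      .map(Parsers.parseRecord)        // instead of .map(_.trim.toInt)

    println(parsed.collect().toList)   // take(1) or count() also force evaluation
    spark.stop()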

Testing and Validation

  1. General Spark Unit Testing
    • Factor code for testability: avoid using Scala’s anonymous functions
    • Regular Spark job: create an RDD with parallelize, apply the transformation, and assert on the collected result (see the test sketch after this list)
    • Streaming
  2. Getting test data: mllib’s RandomRDDs (also used in the sketch below)
  3. Verifying Performance: SparkListener, spark-perf package
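
A sketch of such a test, assuming ScalaTest; WordStats and its lengths function are hypothetical stand-ins for the code under test:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.mllib.random.RandomRDDs
    import org.scalatest.funsuite.AnyFunSuite

    // Logic under test is factored into a named, testable function
    object WordStats {
      def lengths(words: RDD[String]): RDD[Int] = words.map(_.length)
    }

    class WordStatsSuite extends AnyFunSuite {
      test("lengths maps each word to its length") {
        val spark =
          SparkSession.builder().master("local[2]").appName("test").getOrCreate()
        try {
          // Regular Spark job test: parallelize -> transform -> collect -> assert
          val input = spark.sparkContext.parallelize(Seq("a", "bb", "ccc"))
          assert(WordStats.lengths(input).collect().toSeq == Seq(1, 2, 3))

          // Generated test data: 100 standard-normal doubles from RandomRDDs
          val noise = RandomRDDs.normalRDD(spark.sparkContext, 100L)
          assert(noise.count() == 100)
        } finally {
          spark.stop()
        }
      }
    }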

Other Components

  1. Spark MLlib and Spark ML
  2. Spark Streaming: DStreams built on RDDs, and Structured Streaming built on Spark SQL/DataFrames; a long-running job needs to handle driver failure via checkpointing for HA (see the sketch below)
  3. GraphX (RDD-based); GraphFrames is the DataFrame-based alternative
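
A minimal Structured Streaming sketch with a checkpoint location so a restarted driver can recover progress; the socket source, host/port, and checkpoint path are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stream-sketch").getOrCreate()

    val lines = spark.readStream
      .format("socket")                 // toy source for illustration
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    val query = lines.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/stream-checkpoint") // driver recovery state
      .start()

    query.awaitTermination()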