If too much data is collected to the driver, or large local computations run there, also increase spark.driver.maxResultSize
5 cores per executor seems to be optimal
Within executor memory, around 25% is reserved for Spark internal metadata and user data structures; M = spark.executor.memory * spark.memory.fraction is used for caching and execution
Cached data that must be protected from eviction has to fit in R = spark.executor.memory * spark.memory.fraction * spark.memory.storageFraction
Number of tasks that can run at the same time = # of cores per executor * # of executors
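The memory and parallelism formulas above can be sketched with plain arithmetic; the configuration values below are illustrative assumptions, not recommendations:

```python
# Sketch of the unified memory model arithmetic (all values are illustrative assumptions)
executor_memory_mb = 8 * 1024   # spark.executor.memory = 8g
memory_fraction = 0.75          # spark.memory.fraction
storage_fraction = 0.5          # spark.memory.storageFraction

# M: region shared between execution and caching
M = executor_memory_mb * memory_fraction
# R: storage portion of M where cached blocks are protected from eviction
R = M * storage_fraction

# Maximum number of concurrently running tasks across the cluster
cores_per_executor = 5
num_executors = 10
max_parallel_tasks = cores_per_executor * num_executors

print(M, R, max_parallel_tasks)  # 6144.0 3072.0 50
```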
Serialization Options
RDDs: Java serialization (the default) or Kryo serialization
DataFrames/Datasets: Tungsten-based serialization
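Kryo is switched on through configuration; a minimal spark-defaults.conf sketch (KryoSerializer is Spark's serializer class; registering application classes is optional but keeps serialized output compact):

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  false
```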
Additional Debugging Techniques
Interactive mode: use take(1) or count() to force evaluation and surface errors early
Understand whether to look at the driver log or the executor logs
Try using explicit named functions instead of anonymous inner functions, for clearer stack traces
Driver out-of-memory exceptions: be wary of collect statements, countByKey, and some models in Spark ML or MLlib that pull data back to the driver
Configure and access logging for more information
Run Spark in local mode and attach debuggers
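Attaching a debugger in local mode can be sketched with the standard JVM remote-debug flags passed through spark.driver.extraJavaOptions (port 5005 is an arbitrary choice, and my_job.py is a hypothetical application):

```
spark-submit \
  --master local[4] \
  --conf "spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
  my_job.py
```

With suspend=y the JVM waits for the IDE debugger to attach before the job starts.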
Testing and Validation
General Spark Unit Testing
Factor code for testability: avoid Scala's anonymous functions so transformation logic can be unit-tested directly
Regular Spark job: create an RDD with parallelize, apply the transformations, and assert on the collected results
Spark Streaming: DStreams are RDD-based, while Structured Streaming is built on Spark SQL/DataFrames; long-running jobs need to handle driver failure via checkpointing (HA)
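One way to factor code for testability, sketched in Python: keep the per-record logic in a named pure function that can be unit-tested without starting a cluster, and pass that same function to map in the actual job (the function and data here are hypothetical):

```python
# Hypothetical transformation logic, factored out of the Spark job so it can
# be unit-tested with plain Python data, no SparkContext required.
def parse_and_score(line: str) -> tuple:
    """Parse a 'user,score' CSV line and double the score."""
    user, score = line.split(",")
    return user, int(score) * 2

# In the real job this would be passed to Spark: rdd.map(parse_and_score)
# In a unit test, call it directly on in-memory sample data:
sample = ["alice,3", "bob,5"]
results = [parse_and_score(line) for line in sample]
print(results)  # [('alice', 6), ('bob', 10)]
```

The same named function is exercised by both the test and the production job, so the test covers the logic the job actually runs.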