PySpark Serializers
One of the habits I’ve formed using PySpark is to have a look at the source code, not just to get a better feel for what’s changed in a new release but also to supplement / supplant the documentation. For instance, if you peruse the docs for serialization, you might be left with the impression that your choices are either fast with limited datatype support or always-works but slow, when there are actually quite a few serializers available. In fact, you can get the best of both worlds if you use the AutoSerializer, which defaults to MarshalSerializer for faster operation when the datatypes support it but falls back to PickleSerializer when required. I’ve had good results combining AutoSerializer with CompressedSerializer, which compresses / decompresses the data on the fly:
import pyspark
from pyspark.serializers import CompressedSerializer, AutoSerializer

# Any SparkConf will do; the serializer is passed to the context separately
config = pyspark.SparkConf().setAppName("compressed-auto")
sc = pyspark.SparkContext(conf=config, serializer=CompressedSerializer(AutoSerializer()))
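
To sanity-check the setup, a quick round trip forces the data through the serializer. This is just a minimal sketch, and the sample values are made up for illustration:

# Built-in types go through marshal; anything else falls back to pickle
rdd = sc.parallelize([1, "two", 3.0, [4, 5], {"six": 7}])
print(rdd.collect())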
As always with [un-/under-]documented API features, be aware that they could disappear without warning; that said, these serializers are used elsewhere in PySpark, so you’re probably good…for a while anyway. 🙂
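
If you’re curious how the fallback works, the logic inside AutoSerializer boils down to something like the following. This is a simplified sketch of the idea, not the actual implementation:

import marshal
import pickle

def auto_dumps(obj):
    # Fast path: marshal only handles built-in types
    try:
        return b"M" + marshal.dumps(obj)
    except ValueError:
        # Anything marshal rejects (e.g. custom class instances) goes through pickle
        return b"P" + pickle.dumps(obj)

The one-byte prefix tells the matching deserializer which module to hand the payload to.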