I’ve been working on an application with PySpark and Kafka lately, and I’ve run into issues trying to push NumPy and Pandas objects around in Spark. PySpark uses Py4J to connect Python to the underlying JVM, and Py4J doesn’t serialize NumPy arrays, dtypes, etc., for performance reasons (IIRC). There’s probably a better way to handle this, but a quick solution is to just convert the “un-serializable” objects into their vanilla Python equivalents, which is what the `to_primitive` helper below does: call `to_primitive(some_unserializable_object)` before you start tossing it around in Spark.
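Something along these lines does the trick. This is a minimal sketch that assumes the only offenders are NumPy scalars/arrays, Pandas Series/DataFrames, and standard containers nesting them; extend the type checks as needed:

```python
import numpy as np
import pandas as pd


def to_primitive(obj):
    """Recursively convert NumPy/Pandas objects into plain Python equivalents."""
    if isinstance(obj, np.generic):       # NumPy scalars: np.int64, np.float32, ...
        return obj.item()                 # .item() returns the native Python scalar
    if isinstance(obj, np.ndarray):
        return obj.tolist()               # nested Python lists of native scalars
    if isinstance(obj, pd.Series):
        return [to_primitive(v) for v in obj.tolist()]
    if isinstance(obj, pd.DataFrame):     # one plain dict per row
        return [to_primitive(row) for row in obj.to_dict(orient="records")]
    if isinstance(obj, dict):             # recurse so nested values get converted
        return {to_primitive(k): to_primitive(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set)):
        return type(obj)(to_primitive(v) for v in obj)
    return obj                            # assume anything else serializes fine
```

The heavy lifting is just `.item()` and `.tolist()`, both of which hand back native Python objects instead of NumPy ones, so whatever you build from the result is safe to ship through Py4J.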