PySpark - Development Environment with Dev Container

It’s hard to set up a Spark cluster to run PySpark jobs locally or in a CI system. The following GitHub repository shows how to use the Dev Containers feature in VS Code to set up a Spark cluster in a container environment and run PySpark jobs in unit tests.
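The full setup lives in the repository, but the idea can be sketched as a docker-compose fragment with two services whose names match the master URL and driver host used below. Everything here — the image tags, ports, and mount path — is an assumption for illustration, not the repository's actual configuration:

```yaml
# Hypothetical sketch: a Spark master ("spark") and the app container
# ("pyspark-app") on one compose network, so the SparkSession can reach
# spark://spark:7077 and advertise itself as "pyspark-app".
services:
  spark:
    image: bitnami/spark:3.5        # assumed image/tag
    environment:
      - SPARK_MODE=master
    ports:
      - "7077:7077"                 # Spark master port
  pyspark-app:
    image: python:3.11              # assumed dev image; VS Code attaches here
    volumes:
      - ./:/mounted-data            # mount the project so tests can read data
    command: sleep infinity         # keep the container alive for the IDE
```

Because both services share a network, the container host names double as the Spark endpoints, which is why the driver host in the code below is set explicitly.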

We can load a file from a mounted directory and perform any operation with Spark dataframe.

```python
from pyspark.sql import SparkSession

spark = SparkSession \
            .builder \
            .master("spark://spark:7077") \
            .config("spark.driver.host", "pyspark-app") \
            .appName('pyspark-app') \
            .getOrCreate()

df = spark.read.csv("/mounted-data/src/unittest/data/crash_catalonia.csv")
row_count = df.count()
```
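The contents of `crash_catalonia.csv` are specific to the repository, but the row-count logic a unit test would assert on can be sketched against a tiny stand-in CSV. The file, its rows, and the expected count below are hypothetical illustrations, not the repository's data:

```python
import csv
import tempfile

# Hypothetical stand-in for crash_catalonia.csv: a header plus two data rows.
rows = [
    ["Day of Week", "Crashes"],
    ["Sunday", "13664"],
    ["Monday", "17279"],
]

with tempfile.NamedTemporaryFile("w", suffix=".csv",
                                 delete=False, newline="") as f:
    csv.writer(f).writerows(rows)
    path = f.name

# spark.read.csv() without header=True treats every line as a data row,
# so df.count() on this file would equal its raw line count.
with open(path, newline="") as f:
    row_count = sum(1 for _ in csv.reader(f))

print(row_count)  # 3
```

In a real test, the same assertion would run against `df.count()` from the Spark session above instead of the plain `csv` reader.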

PySpark in Dev Container
