PySpark - Development Environment with Dev Container
It’s hard to set up a Spark cluster to run PySpark jobs locally or in a CI system. The following GitHub repository shows how we can use the Dev Containers feature in VS Code to set up a Spark cluster in a container environment and run PySpark jobs in unit tests.
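As a rough sketch of the Dev Containers setup, a `.devcontainer/devcontainer.json` pointing at a Docker Compose file could look like the following. The image, service names, and workspace path here are illustrative assumptions, not taken from the repository:

```json
{
  "name": "pyspark-app",
  "dockerComposeFile": "docker-compose.yml",
  "service": "pyspark-app",
  "workspaceFolder": "/mounted-data",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python"]
    }
  }
}
```

The Compose file would then define both the `pyspark-app` service (where VS Code attaches) and the Spark master/worker services, so the cluster starts together with the development container.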
We can load a file from a mounted directory and perform any operation on the resulting Spark DataFrame.
from pyspark.sql import SparkSession

# Connect to the Spark master running in the container network.
# spark.driver.host lets the executors reach the driver container by name.
spark = (
    SparkSession.builder
    .master("spark://spark:7077")
    .config("spark.driver.host", "pyspark-app")
    .appName("pyspark-app")
    .getOrCreate()
)

# Read a CSV from the directory mounted into the container.
df = spark.read.csv("/mounted-data/src/unittest/data/crash_catalonia.csv")
row_count = df.count()