PySpark - groupby then concatenate values to list

less than 1 minute read

PySpark example to groupby then concatenate values to list in a separated column.

# 1. Create the DF

df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]) , (2,[2]),(2,[3])]).toDF(["store","values"])

+-----+---------+
|store|   values|
+-----+---------+
|    1|[1, 2, 3]|
|    1|[4, 5, 6]|
|    2|      [2]|
|    2|      [3]|
+-----+---------+

# 2. Group by store

df = df.groupBy("store").agg(F.collect_list("values"))

+-----+--------------------+
|store|collect_list(values)|
+-----+--------------------+
|    1|[[1, 2, 3], [4, 5...|
|    2|          [[2], [3]]|
+-----+--------------------+

# 3. finally.... flat the array

df = df.withColumn("flatten_array", F.flatten("collect_list(values)"))

+-----+--------------------+------------------+
|store|collect_list(values)|     flatten_array|
+-----+--------------------+------------------+
|    1|[[1, 2, 3], [4, 5...|[1, 2, 3, 4, 5, 6]|
|    2|          [[2], [3]]|            [2, 3]|

Share on

Twitter Facebook LinkedIn

Nam Seob Seo

PySpark - groupby then concatenate values to list

Share on

You may also enjoy

Large language model from scratch

C++ - Variadic template expression

MLP, CNN and RNN

LLM - key terms