PySpark - groupby then concatenate values to list

A PySpark example that groups rows by a key and concatenates each group's array values into a single list in a separate column.
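As a reference point, the intended result can be sketched in plain Python. This is only an illustration of the semantics, not Spark code; the `rows` data mirrors the DataFrame used in this post, and the name `combined` is chosen just for this sketch.

```python
from collections import defaultdict

# Same (store, values) pairs as the DataFrame in this post.
rows = [(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]

# Group by store and concatenate every row's list into one list per store.
combined = defaultdict(list)
for store, values in rows:
    combined[store].extend(values)

combined = dict(combined)
```

Each store ends up with one flat list holding all of its values, which is what the Spark version aims to produce in a new column.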

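# 1. Create the DF
# Assumes `sc` is an active SparkContext with a SparkSession available (as in the pyspark shell).

from pyspark.sql import functions as F

df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])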

+-----+---------+
|store|   values|
+-----+---------+
|    1|[1, 2, 3]|
|    1|[4, 5, 6]|
|    2|      [2]|
|    2|      [3]|
+-----+---------+

# 2. Group by store

df = df.groupBy("store").agg(F.collect_list("values"))

+-----+--------------------+
|store|collect_list(values)|
+-----+--------------------+
|    1|[[1, 2, 3], [4, 5...|
|    2|          [[2], [3]]|
+-----+--------------------+

# 3. Finally, flatten the nested array (F.flatten requires Spark 2.4+)

df = df.withColumn("flatten_array", F.flatten("collect_list(values)"))

+-----+--------------------+------------------+
|store|collect_list(values)|     flatten_array|
+-----+--------------------+------------------+
|    1|[[1, 2, 3], [4, 5...|[1, 2, 3, 4, 5, 6]|
|    2|          [[2], [3]]|            [2, 3]|
+-----+--------------------+------------------+
