PySpark - groupby then concatenate values to list
PySpark example to groupby then concatenate values to list in a separated column.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
# 1. Create the DF
df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]) , (2,[2]),(2,[3])]).toDF(["store","values"])
+-----+---------+
|store| values|
+-----+---------+
| 1|[1, 2, 3]|
| 1|[4, 5, 6]|
| 2| [2]|
| 2| [3]|
+-----+---------+
# 2. Group by store
df = df.groupBy("store").agg(F.collect_list("values"))
+-----+--------------------+
|store|collect_list(values)|
+-----+--------------------+
| 1|[[1, 2, 3], [4, 5...|
| 2| [[2], [3]]|
+-----+--------------------+
# 3. finally.... flat the array
df = df.withColumn("flatten_array", F.flatten("collect_list(values)"))
+-----+--------------------+------------------+
|store|collect_list(values)| flatten_array|
+-----+--------------------+------------------+
| 1|[[1, 2, 3], [4, 5...|[1, 2, 3, 4, 5, 6]|
| 2| [[2], [3]]| [2, 3]|