Add column from one dataframe to another WITHOUT JOIN

Posted 2020-05-08 12:27

Referring to here, which recommends a Join to append a column from one table to another: I have indeed been using this method, but I am now hitting its limits with a huge number of tables and rows.

Let's say I have a dataframe of M features: id, salary, age, etc.

+-----+--------+-----+------+-----+
| id  | salary | age | zone | ... |
+-----+--------+-----+------+-----+

I have performed certain operations on each feature to arrive at something like this:

+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+

Each feature is processed independently, over the same list of rows:

+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+

+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+
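Schematically, the per-feature processing has this shape (the stat computations below are placeholder stand-ins for my real operations, not the actual logic):

```python
from pyspark.sql import functions as F

def enrich(base_df, feature):
    """Placeholder per-feature pipeline: start from the shared id list,
    keep the raw feature column, and add its derived stat columns."""
    return (
        base_df.select("id", feature)
        .withColumn(f"stat1_{feature}", F.lit("x"))  # stand-in stat
        .withColumn(f"stat2_{feature}", F.lit("x"))  # stand-in stat
    )

# One independent DataFrame per feature, all sharing the same ids
feature_dfs = [enrich(base_df, f) for f in ["salary", "age"]]  # ...hundreds more
```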

In the end, I would like to combine them into the final dataframe with all attributes of each feature, by joining on the unique id across effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector:

+----+--------+--------------+--------------+--------------+-----+-----------+-----------+
| id | salary | stat1_salary | stat2_salary | stat3_salary | age | stat1_age | stat2_age |
+----+--------+--------------+--------------+--------------+-----+-----------+-----------+
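For reference, my current approach is roughly the following sketch, folding the join over the feature_dfs list from the sketch above:

```python
from functools import reduce

# Current approach (the one hitting memory limits): iteratively join the
# hundreds of per-feature DataFrames on "id". Since every table has the
# same id set, an inner join keeps all rows.
feature_vector = reduce(
    lambda left, right: left.join(right, on="id", how="inner"),
    feature_dfs,
)
```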

I am hitting a memory limit that causes an Out Of Memory exception. Raising executor and driver memory seems to be only a temporary solution, and is capped by the admin.

JOIN is expensive and resource-limited in pyspark, and I wonder if it is possible to pre-sort each feature table independently, then keep that order and just APPEND the entire columns next to one another instead of performing an expensive JOIN. I can manage to keep the same list of rows in every feature table. I hope to need no join and no lookup, because my set of ids is identical across tables; what I have in mind is the sketch below.
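A minimal sketch of the idea, using RDD.zip on the two example tables above (df_salary and df_age are assumed handles to them, and zip is my assumption of how the append could be expressed):

```python
# Sketch only: zip two identically sorted tables instead of joining them.
# RDD.zip requires the SAME number of partitions AND the same number of
# rows per partition on both sides -- two independently sorted DataFrames
# are not guaranteed to satisfy this, which is the crux of my question.
sorted_salary = df_salary.orderBy("id")
sorted_age = df_age.orderBy("id")

zipped = sorted_salary.rdd.zip(sorted_age.rdd).map(
    lambda rows: tuple(rows[0]) + tuple(rows[1])[1:]  # drop the duplicate id
)
combined = zipped.toDF(sorted_salary.columns + sorted_age.columns[1:])
```

If the partition layouts ever diverge, zip either fails or silently misaligns rows, so I would need Spark to guarantee the layout.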

How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes them for storage, and retrieval (if I want to query them back to append) does not guarantee that same order.
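The closest built-in mechanism I am aware of for persisting a stable id-based layout is bucketing; a sketch with hypothetical table names and bucket count follows, though note it still goes through the join operator, just without a shuffle at join time:

```python
# Sketch: persist each feature table bucketed and sorted by id, so that a
# later sort-merge join on id needs no shuffle. Bucket count and table
# names are assumptions for illustration.
n_buckets = 200

df_salary.write.bucketBy(n_buckets, "id").sortBy("id") \
    .mode("overwrite").saveAsTable("feat_salary")
df_age.write.bucketBy(n_buckets, "id").sortBy("id") \
    .mode("overwrite").saveAsTable("feat_age")

combined = spark.table("feat_salary").join(spark.table("feat_age"), on="id")
```

But this is still one join per table, so I am not sure it escapes the cost I am trying to avoid.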
