Add column from one dataframe to another WITHOUT JOIN

Posted 2020-05-08 12:27

Referring to here, which recommends a Join to append a column from one table to another: I have indeed been using this method, but I am now hitting its limits with a huge number of tables and rows.

Let's say I have a dataframe of M features: id, salary, age, etc.

+-----+--------+-----+------+-----+
| id  | salary | age | zone | ... |
+-----+--------+-----+------+-----+

I have performed certain operations on each feature to arrive at something like this:

+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+

Each feature is processed independently, over the same list of rows:

+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+

+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+
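Schematically, the per-feature processing has this shape (the stat computations below are placeholder stand-ins for my real operations, not the actual logic):

```python
from pyspark.sql import functions as F

def enrich(base_df, feature):
    """Placeholder per-feature pipeline: start from the shared id list,
    keep the raw feature column, and add its derived stat columns."""
    return (
        base_df.select("id", feature)
        .withColumn(f"stat1_{feature}", F.lit("x"))  # stand-in stat
        .withColumn(f"stat2_{feature}", F.lit("x"))  # stand-in stat
    )

# One independent DataFrame per feature, all sharing the same ids
feature_dfs = [enrich(base_df, f) for f in ["salary", "age"]]  # ...hundreds more
```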

In the end, I would like to combine them into the final dataframe with all attributes of each feature, by joining on the unique id across effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector:

+----+--------+--------------+--------------+--------------+-----+-----------+-----------+
| id | salary | stat1_salary | stat2_salary | stat3_salary | age | stat1_age | stat2_age |
+----+--------+--------------+--------------+--------------+-----+-----------+-----------+
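For reference, my current approach is roughly the following sketch, folding the join over the feature_dfs list from the sketch above:

```python
from functools import reduce

# Current approach (the one hitting memory limits): iteratively join the
# hundreds of per-feature DataFrames on "id". Since every table has the
# same id set, an inner join keeps all rows.
feature_vector = reduce(
    lambda left, right: left.join(right, on="id", how="inner"),
    feature_dfs,
)
```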

I am hitting a memory limit that causes an Out Of Memory exception. Raising executor and driver memory seems to be only a temporary solution, and is capped by the admin.

JOIN is expensive and resource-limited in pyspark, and I wonder if it is possible to pre-sort each feature table independently, then keep that order and just APPEND the entire columns next to one another instead of performing an expensive JOIN. I can manage to keep the same list of rows in every feature table. I hope to need no join and no lookup, because my set of ids is identical across tables; what I have in mind is the sketch below.
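A minimal sketch of the idea, using RDD.zip on the two example tables above (df_salary and df_age are assumed handles to them, and zip is my assumption of how the append could be expressed):

```python
# Sketch only: zip two identically sorted tables instead of joining them.
# RDD.zip requires the SAME number of partitions AND the same number of
# rows per partition on both sides -- two independently sorted DataFrames
# are not guaranteed to satisfy this, which is the crux of my question.
sorted_salary = df_salary.orderBy("id")
sorted_age = df_age.orderBy("id")

zipped = sorted_salary.rdd.zip(sorted_age.rdd).map(
    lambda rows: tuple(rows[0]) + tuple(rows[1])[1:]  # drop the duplicate id
)
combined = zipped.toDF(sorted_salary.columns + sorted_age.columns[1:])
```

If the partition layouts ever diverge, zip either fails or silently misaligns rows, so I would need Spark to guarantee the layout.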

How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes them for storage, and retrieval (if I want to query them back to append) does not guarantee that same order.
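The closest built-in mechanism I am aware of for persisting a stable id-based layout is bucketing; a sketch with hypothetical table names and bucket count follows, though note it still goes through the join operator, just without a shuffle at join time:

```python
# Sketch: persist each feature table bucketed and sorted by id, so that a
# later sort-merge join on id needs no shuffle. Bucket count and table
# names are assumptions for illustration.
n_buckets = 200

df_salary.write.bucketBy(n_buckets, "id").sortBy("id") \
    .mode("overwrite").saveAsTable("feat_salary")
df_age.write.bucketBy(n_buckets, "id").sortBy("id") \
    .mode("overwrite").saveAsTable("feat_age")

combined = spark.table("feat_salary").join(spark.table("feat_age"), on="id")
```

But this is still one join per table, so I am not sure it escapes the cost I am trying to avoid.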
