How to find an optimized join between 2 different datasets

Posted 2020-07-30 00:44

Question:

I have two different datasets that I would like to join, but there is no easy way to do it because they don't have a common column, and a crossJoin is not a good solution for big data. I already asked the question on Stack Overflow, but I really couldn't find an optimized solution to join them. My question on Stack Overflow is: looking if String contain a sub-string in differents Dataframes

I saw the solutions below, but I didn't find a good fit for my case: Efficient string suffix detection, Efficient string matching in Apache Spark

Today, I found a funny solution :) I'm not sure if it will work, but let's try.

I add a new column to df_1 containing the line number.

Example df_1:

name    | id
----------------
abc     | 1232
----------------
azerty  | 87564
----------------
google  | 374856
----------------

new df_1:

name    | id       | new_id
----------------------------
abc     | 1232     |  1
----------------------------
azerty  | 87564    |  2
----------------------------
google  | 374856   |  3
----------------------------
explorer| 84763    |  4
----------------------------
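As a note on how such a new_id can be generated: in Spark this is usually done with rdd.zipWithIndex(), monotonically_increasing_id(), or a row_number() window function. The effect, sketched in plain Python with enumerate (using the example rows above; not actual Spark code):

```python
# Plain-Python sketch of what a row-numbering step produces.
# rows stands in for the df_1 rows from the example above.
rows = [("abc", 1232), ("azerty", 87564), ("google", 374856), ("explorer", 84763)]

# enumerate(..., start=1) mimics zipWithIndex shifted to start at 1
numbered = [(name, id_, new_id) for new_id, (name, id_) in enumerate(rows, start=1)]

print(numbered[0])   # ('abc', 1232, 1)
```

Note that zipWithIndex numbers rows in partition order, so the ids you get depend on how the data is ordered and partitioned.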

The same for df_2:

Example df_2:

adress    |
-----------
UK        |
-----------
USA       |
-----------
EUROPE    |
-----------

new df_2:

adress    | new_id
-------------------
UK        |   1
-------------------
USA       |   2
-------------------
EUROPE    |   3
-------------------

Now I have a common column between the two dataframes, so I can do a left join using new_id as the key. My questions: is this solution efficient? And how can I add the new_id column to each dataframe with the line number?
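A minimal sketch of what that left join produces, in plain Python (the rows are the example values above; in Spark this would be something like df_1.join(df_2, on='new_id', how='left')):

```python
# Plain-Python sketch of a left join on the generated new_id.
df_1 = [("abc", 1232, 1), ("azerty", 87564, 2), ("google", 374856, 3), ("explorer", 84763, 4)]
df_2 = [("UK", 1), ("USA", 2), ("EUROPE", 3)]

# Build a lookup from new_id to adress, then left-join df_1 against it.
adress_by_id = {new_id: adress for adress, new_id in df_2}
joined = [(name, id_, new_id, adress_by_id.get(new_id)) for name, id_, new_id in df_1]

print(joined[3])   # ('explorer', 84763, 4, None) -- no match, left join keeps the row
```

One caveat on efficiency: pairing rows by position assumes both dataframes have a stable, matching order, which Spark does not guarantee without an explicit sort.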

Answer 1:

Since Spark uses lazy evaluation, execution will not start until an action is triggered. So what you can do is simply call the SparkSession's createDataFrame function and pass it the selected columns from df1 and df2. It will create a new dataframe as you need.

e.g. df3 = spark.createDataFrame([df1.select(''), df2.select('')])
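If the intent here is to pair up the two selected columns row by row, the effect is like zipping two lists, sketched in plain Python below (column values taken from the examples above; in Spark such a positional pairing still needs an explicit join key, since row order is not guaranteed):

```python
# Plain-Python sketch of combining two column lists positionally.
names = ["abc", "azerty", "google"]        # stands in for df1.select("name")
adresses = ["UK", "USA", "EUROPE"]         # stands in for df2.select("adress")

# zip pairs elements by position into rows of a combined table
df3_rows = list(zip(names, adresses))

print(df3_rows[0])   # ('abc', 'UK')
```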

Upvote if it works.