I have two different datasets that I would like to join, but there is no easy way to do it because they don't share a common column, and crossJoin is not a good solution with big data. I already asked about this on Stack Overflow, but I couldn't find an optimized way to join them. My question on Stack Overflow is: looking if String contain a sub-string in differents Dataframes
I saw the solutions below, but I didn't find a good fit for my case: Efficient string suffix detection, Efficient string matching in Apache Spark.
Today, I came up with a funny solution :) I'm not sure if it will work, but let's try.
I add a new column to df_1 that contains the line numbering (a rough sketch of what I mean follows the example tables below).
Example df_1:
name     | id
-----------------
abc      | 1232
-----------------
azerty   | 87564
-----------------
google   | 374856
-----------------
explorer | 84763
-----------------
new df_1:
name     | id     | new_id
---------------------------
abc      | 1232   | 1
---------------------------
azerty   | 87564  | 2
---------------------------
google   | 374856 | 3
---------------------------
explorer | 84763  | 4
---------------------------
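To make the idea concrete, here is a rough, untested sketch of how I imagine adding the numbering column. It is in Scala, and the helper name withLineNumber is made up just for illustration; it goes through zipWithIndex on the underlying RDD, and I'm not sure this is the right or most efficient way:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: append a sequential "new_id" column (1, 2, 3, ...)
// by zipping every row with its index on the underlying RDD.
// I am not sure the resulting row order is guaranteed to be stable.
def withLineNumber(df: DataFrame): DataFrame = {
  val rowsWithId = df.rdd.zipWithIndex.map {
    case (row, idx) => Row.fromSeq(row.toSeq :+ (idx + 1L))
  }
  val schema = StructType(df.schema.fields :+ StructField("new_id", LongType, nullable = false))
  df.sparkSession.createDataFrame(rowsWithId, schema)
}

val new_df_1 = withLineNumber(df_1)   // name | id | new_id
```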
I do the same for df_2.
Example df_2:
adress |
-----------
UK |
-----------
USA |
-----------
EUROPE |
-----------
new df_2:
adress | new_id
----------------
UK     | 1
----------------
USA    | 2
----------------
EUROPE | 3
----------------
Now that I have a common column between the two dataframes, I can do a left join using new_id as the key, something like the sketch below.
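Again just a sketch, using the hypothetical withLineNumber helper from above:

```scala
// The same helper applied to df_2, then a left join on new_id so that
// every line of df_1 is kept even when df_2 has fewer lines.
val new_df_2 = withLineNumber(df_2)

val joined = new_df_1.join(new_df_2, Seq("new_id"), "left")
joined.show()
```

With the example data above, the explorer line would end up with a null adress, since df_2 only has three lines.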
My question is: is this solution efficient? And how can I add a new_id column to each dataframe with the line numbering?