Sorting column values in PySpark

Published 2019-08-18 04:04

Question:

I have the DataFrame below:

Ref ° | Indice_1 | Indice_2 |  1 |  2 | indice_from     | indice_from  | indice_to       | indice_to
------|----------|----------|----|----|-----------------|--------------|-----------------|-------------
1     |    19    |   37.1   | 32 | 62 | ["20031,10031"] | ["13,11/12"] | ["40062,30062"] | ["14A,14"]
2     |    19    |   37.1   | 44 | 12 | ["40062,30062"] | ["13,11/12"] | ["40062,30062"] | ["14A,14"]
3     |    19    |   37.1   | 22 | 64 | ["20031,10031"] | ["13,11/12"] | ["20031,10031"] | ["13,11/12"]
4     |    19    |   37.1   | 32 | 98 | ["20032,10032"] | ["13,11/12"] | ["40062,30062"] | ["13,11/12"]

I want to sort the values of the columns indice_from, indice_from, indice_to, and indice_to in ascending order, without touching the rest of the columns of my DataFrame. Note that the indice_from and indice_to columns sometimes contain a number plus a letter, such as ["14,14A"]. In that case I should always get the same structure:

For example, if the number is 15, the second value should be 15 + letter, with 15 < 15 + letter; if the first value is 9, the second value should be 9 + letter, with 9 < 9 + letter.
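
For what it's worth, a quick plain-Python check (not part of the original post, assuming the values compare as strings) shows that ordinary lexicographic order already gives this result:

# Lexicographic string order puts the bare number before number + letter,
# because the shorter string is a prefix of the longer one.
print(sorted(["14A", "14"]))    # ['14', '14A']
print(sorted(["13", "11/12"]))  # ['11/12', '13']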

New DataFrame:

Ref ° | Indice_1 | Indice_2 |  1 |  2 | indice_from     | indice_from  | indice_to       | indice_to
------|----------|----------|----|----|-----------------|--------------|-----------------|-------------
1     |    19    |   37.1   | 32 | 62 | ["10031,20031"] | ["11/12,13"] | ["30062,40062"] | ["14,14A"]
2     |    19    |   37.1   | 44 | 12 | ["30062,40062"] | ["11/12,13"] | ["30062,40062"] | ["14,14A"]
3     |    19    |   37.1   | 22 | 64 | ["10031,20031"] | ["11/12,13"] | ["10031,20031"] | ["11/12,13"]
4     |    19    |   37.1   | 32 | 98 | ["10032,20032"] | ["11/12,13"] | ["30062,40062"] | ["11/12,13"]

Can someone please help me sort the values of the columns indice_from, indice_from, indice_to, and indice_to so that I obtain a new DataFrame like the second one above? Thank you.

Answer 1:

If I understand it correctly, then the following should do the trick:

from pyspark.sql import functions as F

# Each column name only needs to appear once; applying the same
# sort twice would be redundant.
columns_to_sort = ['indice_from', 'indice_to']

for c in columns_to_sort:
    # sort_array orders the array elements ascending by default;
    # for strings the order is lexicographic, so "14" < "14A".
    df = df.withColumn(c, F.sort_array(c))

Let me know if it doesn't.
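
For reference, here is a minimal runnable sketch; the toy data and the Ref column are hypothetical stand-ins for the question's DataFrame, assuming each cell holds an array of strings:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data mirroring the question: each cell is an array of strings.
df = spark.createDataFrame(
    [(1, ["20031", "10031"], ["14A", "14"]),
     (2, ["40062", "30062"], ["13", "11/12"])],
    ["Ref", "indice_from", "indice_to"],
)

for c in ["indice_from", "indice_to"]:
    df = df.withColumn(c, F.sort_array(c))

df.show(truncate=False)
# Prints something like:
# +---+--------------+-----------+
# |Ref|indice_from   |indice_to  |
# +---+--------------+-----------+
# |1  |[10031, 20031]|[14, 14A]  |
# |2  |[30062, 40062]|[11/12, 13]|
# +---+--------------+-----------+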