How could I add a column to a DataFrame in PySpark?

Posted 2019-06-17 01:41

I have a DataFrame called 'df' like the following:

+-------+-------+-------+
|  Atr1 |  Atr2 |  Atr3 |
+-------+-------+-------+
|   A   |   A   |   A   |
+-------+-------+-------+
|   B   |   A   |   A   |
+-------+-------+-------+
|   C   |   A   |   A   |
+-------+-------+-------+

I want to add a new column to it with incremental values and get the following updated DataFrame:

+-------+-------+-------+-------+
|  Atr1 |  Atr2 |  Atr3 |  Atr4 |
+-------+-------+-------+-------+
|   A   |   A   |   A   |   1   |
+-------+-------+-------+-------+
|   B   |   A   |   A   |   2   |
+-------+-------+-------+-------+
|   C   |   A   |   A   |   3   |
+-------+-------+-------+-------+

How could I get it?

Tags: python dataframe attributes pyspark increment

1 answer

Luminary・发光体

#2 · 2019-06-17 02:36

If you only need incremental values (like an ID) and have no constraint that the numbers be consecutive, you could use monotonically_increasing_id(). The only guarantee is that the values are unique and increasing from row to row; they are not consecutive, and the values themselves can differ between executions.

from pyspark.sql.functions import monotonically_increasing_id

# withColumn returns a new DataFrame, so assign the result to keep the new column
df = df.withColumn("Atr4", monotonically_increasing_id())
