How to write / writeStream each row of a dataframe

Posted 2019-07-23 01:04

Each row of my dataframe contains CSV content.

I am struggling to save each row to a different, specific table.

I believe I need to use a foreach or UDF in order to accomplish this, but this is simply not working.

Everything I managed to find was either simple prints inside a foreach or code using .collect() (which I really don't want to use).

I also found the repartition way, but that doesn't allow me to choose where each row will go.

rows = df.count()
df.repartition(rows).write.csv('save-dir')

Can you give me a simple and working example of it?

3 answers
放荡不羁爱自由
#2 · 2019-07-23 01:18

Saving each row as a table is a costly operation and not recommended. But what you are trying to do can be achieved like this:

df.write.format("delta").partitionBy("<primary-key-column>").save("/delta/save-dir")

Now each row will be saved as a .parquet file in its own partition directory, and you can create an external table from each partition. This only works if every row has a unique value in the partition column, i.e. a primary key.
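To make the "one partition per row" layout concrete, here is a minimal sketch of where each row's files would end up. It assumes the Hive-style directory naming that Spark uses for partitionBy (`<column>=<value>` under the save path). The helper `partition_path` and the column name `id` are hypothetical, and the sketch ignores the escaping Spark applies to special characters in partition values.

```python
def partition_path(base_dir, column, value):
    """Build the directory Spark is expected to create for one partition value.

    Assumes Hive-style naming (<column>=<value>); special characters in the
    value are not escaped here, so treat this as an illustration only.
    """
    return "{}/{}={}".format(base_dir.rstrip("/"), column, value)


# Hypothetical primary-key values, one per row of the dataframe.
keys = [1, 2, 3]
locations = [partition_path("/delta/save-dir", "id", key) for key in keys]
```

Each entry in `locations` is the directory you would point an external table at for the corresponding row.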

对你真心纯属浪费
#3 · 2019-07-23 01:21

Did you try .mode("append").partitionBy("ID")? It will create a directory for each ID. Don't forget to set the mode.

你好瞎i
#4 · 2019-07-23 01:31

Well, in the end, as always, it was something very simple, but I didn't see it mentioned anywhere.

Basically, the problem shows up when you run a foreach and the dataframe you want to save is built inside the loop. Unlike the driver, the worker won't automatically set up the "/dbfs/" path when saving, so if you don't manually add "/dbfs/", it will save the data locally on the worker.

That is why my loops weren't working.
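A minimal sketch of that fix, assuming the rows are written with plain Python file APIs inside the foreach. The column names `table_name` and `csv_content`, and the helpers `to_worker_path` and `save_row`, are hypothetical and only there for illustration. The `mount` parameter defaults to "/dbfs" and exists mainly so the function can be tried outside Databricks.

```python
import os

DBFS_MOUNT = "/dbfs"  # local mount point of DBFS as seen by plain file APIs


def to_worker_path(path, mount=DBFS_MOUNT):
    """Prefix a DBFS path with the mount point unless it already has it."""
    if path.startswith("dbfs:/"):
        path = "/" + path[len("dbfs:/"):].lstrip("/")
    if path == mount or path.startswith(mount + "/"):
        return path
    return mount + "/" + path.lstrip("/")


def save_row(row, base_dir, mount=DBFS_MOUNT):
    """Write one row's CSV content to <mount>/<base_dir>/<table_name>.csv.

    `row` can be a dict or a pyspark Row; both support row["column"].
    """
    target = os.path.join(to_worker_path(base_dir, mount), row["table_name"] + ".csv")
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w", newline="") as handle:
        handle.write(row["csv_content"])
    return target


# On the cluster this would be called once per row, without collect():
# df.foreach(lambda row: save_row(row, "/save-dir"))
```

Without the prefix, the same open() call would create the file on the worker's local disk, which matches the behaviour described above.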
