I would like to do some cleanup at the start of my Spark program (PySpark). For example, I would like to delete data left in HDFS by a previous run. In Pig this can be done with commands such as
fs -copyFromLocal ....
rmf /path/to/hdfs
or locally using the sh command.
I was wondering how to do the same with PySpark.
You can execute an arbitrary shell command using, for example, subprocess.call or the sh library, so something like this should work just fine:
import subprocess

some_path = ...
# equivalent of running "hadoop fs -rm -f <path>"; requires the hadoop client on the PATH
subprocess.call(["hadoop", "fs", "-rm", "-f", some_path])
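If the path is a directory (as the output of a previous run usually is), you also need the -r flag so the delete is recursive. The sh variant is roughly equivalent; this is a minimal sketch that reuses some_path from above and assumes the hadoop binary is on the PATH:
import sh

# equivalent call via the sh library: recursive, forced delete of the old output
sh.hadoop("fs", "-rm", "-r", "-f", some_path)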
If you use Python 2.x you can try spotify/snakebite, a pure Python HDFS client that talks to the NameNode directly:
from snakebite.client import Client

# host and port of the HDFS NameNode (the RPC port, e.g. 8020)
host = ...
port = ...
client = Client(host, port)
client.delete(some_path, recurse=True)
hdfs3 is yet another library which can be used to do the same thing:
from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host=host, port=port)
# note: call rm on the hdfs instance, not on the HDFileSystem class
hdfs.rm(some_path)
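To tie this back to the original cleanup use case, you can guard the delete with an existence check so the very first run does not fail; a minimal sketch, assuming the hdfs object from above and hdfs3's exists method:
# remove the previous run's output only if it is actually there
if hdfs.exists(some_path):
    hdfs.rm(some_path)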