Is it safe to pipe the output of several parallel processes to one file using >>?

Posted 2020-02-18 04:24

I'm scraping data from the web, and I have several processes of my scraper running in parallel.

I want the output of each of these processes to end up in the same file. As long as lines of text remain intact and don't get mixed up with each other, the order of the lines does not matter. In UNIX, can I just pipe the output of each process to the same file using the >> operator?

9 answers
Answer 2 · 2020-02-18 04:56

In addition to the idea of using temporary files, you could also use some kind of aggregating process, although you would still need to make sure your writes are atomic.

Think Apache2 with piped logging (with something like spread on the other end of the pipe if you're feeling ambitious). That's the approach it takes, with multiple threads/processes sharing a single logging process.
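A minimal sketch of that aggregator idea in plain shell, assuming a hypothetical scraper.sh, a log file scraped.log, and a FIFO path of my choosing: every worker writes complete lines to one FIFO, and a single reader appends them to the file. Writes of at most PIPE_BUF bytes to a FIFO are atomic, which is what keeps the lines intact.

    mkfifo /tmp/scrape.fifo
    cat /tmp/scrape.fifo >> scraped.log &   # the lone aggregating process
    exec 3> /tmp/scrape.fifo                # hold the FIFO open so cat doesn't see EOF between workers
    ./scraper.sh site1 > /tmp/scrape.fifo &
    ./scraper.sh site2 > /tmp/scrape.fifo &
    wait                                    # let the workers finish
    exec 3>&-                               # close our handle; cat drains the FIFO and exits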

Answer 3 · Aperson · 2020-02-18 04:58

Use temporary files and concatenate them together. It's the only safe way to do what you want to do, and there will (probably) be negligible performance loss that way. If performance is really a problem, try making sure that your /tmp directory is a RAM-based filesystem and putting your temporary files there. That way the temporary files are stored in RAM instead of on a hard drive, so reading/writing them is near-instant.
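As a sketch (scraper.sh, the site names, and scraped.log are placeholders), the temporary-file approach looks like this; pointing the temp directory at a tmpfs such as /dev/shm gets you the RAM-backed variant:

    tmpdir=$(mktemp -d)                   # or: mktemp -d -p /dev/shm for a RAM-based filesystem
    for site in site1 site2 site3; do
        ./scraper.sh "$site" > "$tmpdir/$site.out" &
    done
    wait                                  # let every worker finish first
    cat "$tmpdir"/*.out >> scraped.log    # a single sequential writer, so nothing can interleave
    rm -rf "$tmpdir"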

Answer 4 · Anthone · 2020-02-18 05:00

You'll need to ensure that you're writing each whole line in a single write operation (so if you're using some form of stdio, you'll need to set it to line buffering with a buffer at least as large as the longest line you can output). Since the shell opens the file with O_APPEND for the >> redirection, all your writes will automatically append to the file with no further action on your part.
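For programs that use stdio with default buffering, GNU coreutils' stdbuf can force line buffering from the outside; a sketch, again with a hypothetical scraper.sh and scraped.log:

    # each complete line is flushed with its own write(2), and >> opens
    # the log with O_APPEND, so each write lands at the end of the file
    stdbuf -oL ./scraper.sh site1 >> scraped.log &
    stdbuf -oL ./scraper.sh site2 >> scraped.log &
    wait

This only stays safe while each line fits in a single write, and a program that overrides its own buffering will ignore stdbuf.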

Answer 5 · Juvenile、少年° · 2020-02-18 05:02

Generally, no.

On Linux this might be possible, as long as two conditions are met: each line is written in one operation, and the line is no longer than PIPE_BUF (on Linux usually the same as PAGE_SIZE, i.e. 4096 bytes). But... I wouldn't count on that; this behaviour might change.

It is better to use some kind of real logging mechanism, like syslog.
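A sketch of the syslog route using logger(1), with scraper.sh as a placeholder: the syslog daemon serializes messages from all writers, so nothing interleaves.

    ./scraper.sh site1 | logger -t scraper &
    ./scraper.sh site2 | logger -t scraper &
    wait
    # read the merged stream back with, e.g., journalctl -t scraper
    # (systemd) or grep scraper /var/log/syslog (traditional syslogd)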

Answer 6 · 孤傲高冷的网名 · 2020-02-18 05:06

One possibly interesting thing you could do is use GNU parallel: http://www.gnu.org/s/parallel/ For example, if you were spidering the sites:

stackoverflow.com, stackexchange.com, fogcreek.com 

you could do something like this

(echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k your_spider_script

The output is buffered by parallel and, because of the -k option, returned to you in the order of the site list above. A real example (basically copied from the second parallel screencast):

 ~ $ (echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k ping -c 1 {}
PING stackoverflow.com (64.34.119.12): 56 data bytes

--- stackoverflow.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING stackexchange.com (64.34.119.12): 56 data bytes

--- stackexchange.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING fogcreek.com (64.34.80.170): 56 data bytes
64 bytes from 64.34.80.170: icmp_seq=0 ttl=250 time=23.961 ms

--- fogcreek.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 23.961/23.961/23.961/0.000 ms

Anyway, ymmv

Answer 7 · 趁早两清 · 2020-02-18 05:09

Definitely not. I had a log-management script where I assumed this worked, and it did work, until I moved it to a production server under load. Not a good day... Basically, you sometimes end up with completely mixed-up lines.

If I'm trying to capture from multiple sources, it is much simpler (and easier to debug) to keep a separate 'paper trail' file per source. If I need an overall log file, I concatenate them based on timestamps (you are using timestamps, right?) or, as liori said, use syslog.
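A sketch of that merge, assuming hypothetical per-source logs named scraper-*.log whose lines begin with a sortable ISO 8601 timestamp (scraper.sh is again a placeholder):

    # inside each worker: prefix every output line with a UTC timestamp
    ./scraper.sh site1 | while IFS= read -r line; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
    done > scraper-site1.log

    # afterwards: -m merges files that are each already in timestamp order
    sort -m -k1,1 scraper-*.log > combined.log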
