Splitting large text file into smaller text files by line numbers using Python

Posted 2019-09-01 10:42

I have a text file, say really_big_file.txt, that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

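I want to write a Python script that divides really_big_file.txt into smaller files of 300 lines each. For example, small_file_300.txt would hold lines 1-300, small_file_600 would hold lines 301-600, and so on, until there are enough small files to contain all the lines from the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python.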

Answer 1:

lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()


Answer 2:

Use the itertools grouper recipe:

from itertools import zip_longest  # named izip_longest on Python 2

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

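The advantage of this method, as opposed to storing every line in a list, is that it works with iterables, line by line, so it never has to hold a whole small file in memory at once.

Note that the last file in this case will be named small_file_100200 but will only go up to line 100000. This happens because fillvalue='' means nothing is written once there are no more lines left, since the total line count does not divide evenly by the group size. You can fix this by writing to a temporary file and renaming it afterwards, instead of naming it first as I did. Here is how that can be done.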

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        # create the temp file in the current directory so os.rename never crosses filesystems
        with tempfile.NamedTemporaryFile('w', dir='.', delete=False) as fout:
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

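This time, with fillvalue=None, I go through each line checking for None. When it occurs, I know the input is exhausted, so I subtract 1 from j so the filler is not counted, and then the file is renamed after its real last line.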
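For comparison, the padding can be avoided entirely by pulling lines with itertools.islice, which simply returns a shorter final chunk. The following is a minimal sketch, not part of the answer above: the function name split_by_lines and its parameters out_dir and prefix are made up for illustration, and it assumes holding one chunk (here 300 lines) in memory is acceptable.

```python
from itertools import islice
import os


def split_by_lines(big_path, lines_per_file=300, out_dir=".", prefix="small_file_"):
    """Split big_path into files of at most lines_per_file lines.

    Each output file is named after the number of the last line it holds.
    Returns the list of paths written, in order.
    """
    written = []
    last_line = 0  # number of lines copied so far
    with open(big_path) as big_file:
        while True:
            # islice takes at most lines_per_file lines and adds no padding
            chunk = list(islice(big_file, lines_per_file))
            if not chunk:
                break  # the big file is exhausted
            last_line += len(chunk)
            out_path = os.path.join(out_dir, "{}{}.txt".format(prefix, last_line))
            with open(out_path, "w") as small_file:
                small_file.writelines(chunk)
            written.append(out_path)
    return written
```

Calling split_by_lines('really_big_file.txt') on a file whose length is not a multiple of 300 should name the final file after its true last line, with no temporary file or rename step.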



Answer 3:

import csv
import os
import re

MAX_CHUNKS = 300  # rows per output file


def writeRow(idr, row):
    # append one row to file_<idr>.csv; newline='' lets the csv module handle line endings
    with open("file_%d.csv" % idr, 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    # remove output files left over from a previous run
    for f in os.listdir("."):
        if re.match(r"file_\d+\.csv$", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='"')
        for i, x in enumerate(r):
            # rows 0-299 go to file 1, rows 300-599 to file 2, and so on
            idr = i // MAX_CHUNKS + 1
            writeRow(idr, x)

if __name__ == "__main__":
    main()


Answer 4:

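I did this in a more understandable way, using fewer shortcuts, to give you a better understanding of how and why this works. The previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the code is doing.

Because you posted no code, I decided to do it this way, since you may be unfamiliar with anything beyond basic Python syntax; the way you phrased the question made it seem as though you had not tried anything and had no idea how to approach the problem.

Here are the steps to do this in basic Python:

First, you should read your file into a list for safekeeping: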

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)

Second, you need to set up a way of creating the new files by name. I would suggest a loop along with a couple of counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"

Third, inside that loop you need some nested loops that will save the correct rows into a list:

    hold_new_lines = []
    if left <= 300:  # 300 or fewer lines remain, so this is the last file
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1

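Last thing: still inside your first loop, you need to write the new file and increment your last counter so the loop will go around again and write the next file: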

    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

Note: if the number of lines is not divisible by 300, the name of the last file will not correspond to the number of its last line.
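One way to make the last name match is to cap the number in the name at the total line count. This is a minimal sketch of the arithmetic only; chunk_file_names is a hypothetical helper that is not part of the script in this answer.

```python
def chunk_file_names(total_lines, lines_per_file=300):
    """Return one file name per chunk, each ending in the chunk's real last line number."""
    names = []
    for start in range(0, total_lines, lines_per_file):
        # the chunk ends at start + lines_per_file, or at the end of the file if that comes first
        last_line = min(start + lines_per_file, total_lines)
        names.append("small_file_" + str(last_line) + ".txt")
    return names


print(chunk_file_names(700))
# ['small_file_300.txt', 'small_file_600.txt', 'small_file_700.txt']
```

In the loop of this answer, the same effect could be had by building file_name from min(outer_count * 300, len(hold_lines)).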

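Understanding why these loops work is important. You have it set up so that on each pass of the loop the name of the file you write changes, because the name depends on a changing variable. This is a very useful scripting tool for accessing, opening, writing, and organizing files.

In case you could not follow what goes in which loop, here is the whole script: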

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left <= 300:  # 300 or fewer lines remain, so this is the last file
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)


Answer 5:

lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines on small file (each line already ends with '\n')
                small_file.writelines(lines)
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created
    # After for-loop has finished
    if lines_counter:  # There are still some lines not written on a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them on a last small file
            small_file.writelines(lines)
        created_files += 1

print('%s small files (with %s lines each) were created.' % (created_files,
                                                             lines_per_file))


Answer 6:

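I had to do the same with 650,000-line files.

Use the enumerate index and integer-divide it (//) by the chunk size.

When that number changes, close the current file and open a new one.

This is a Python 3 solution using format strings.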

chunk = 50000  # number of lines from the big file to put in small file
this_small_file = open('./a_folder/0', 'a')

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read):  # iterate lazily instead of loading every line with readlines()
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback that slows the process down a

        if file_name != this_small_file.name:
            # the chunk number changed: close the current file and open the next one
            this_small_file.close()
            this_small_file = open(file_name, 'a')
        this_small_file.write(line)

this_small_file.close()  # close the last small file
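The same idea (integer-divide the enumerate index by the chunk size and start a new file whenever the result changes) can also be expressed with itertools.groupby, which does the change detection for you. This is a sketch under assumptions rather than the answer's own code: the function name split_with_groupby and its parameters are invented for illustration, and output files are opened in 'w' mode, so existing files with the same names are overwritten.

```python
from itertools import groupby
import os


def split_with_groupby(big_path, out_dir, chunk=50000):
    """Write every line with the same (index // chunk) to the file named after that number."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(big_path) as file_to_read:
        numbered = enumerate(file_to_read)  # lazy (index, line) pairs
        # groupby starts a new group each time index // chunk changes
        for file_no, group in groupby(numbered, key=lambda pair: pair[0] // chunk):
            out_path = os.path.join(out_dir, str(file_no))
            with open(out_path, "w") as small_file:
                small_file.writelines(line for _, line in group)
            written.append(out_path)
    return written
```

For example, split_with_groupby('massive_web_log_file', './a_folder') would write files named 0, 1, 2, and so on, each with at most 50000 lines, and every file is closed by its with block.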


Source: Splitting large text file into smaller text files by line numbers using Python