finding a pattern match and concatenating the rest

Posted 2020-01-20 00:56

I have a small data set to clean. I have opened the text file in PyCharm. The data set looks like this:

Code-6667+
Name of xyz company+ 
Address +
Number+ 
Contact person+
Code-6668+
Name of abc company, Address, number, contact person+
Code-6669+
name of company, Address+
number, contact person +

I need to separate the code lines and concatenate (or paste) the remaining lines together until the next code line appears. That way I can split my data into two fields: the company code, and all the details combined in one field. The eventual output should be a table, something like this:

Code6667 - Company details 
Code6668 - Company details

Is there a way I could use a loop to do this? I tried this in R, but I am now attempting it in Python.

3 answers
戒情不戒烟
Reply #2 · 2020-01-20 01:26

I don't know what the + signs mean in your example. If they are part of the file, you'll want to deal with them as well. Here is a way to extract the data (with regex) into a dictionary with the code as the key and the info as a list. Afterwards you can format it however you want.

This assumes that entries on the same line are separated by commas, but it can be adapted for any other separator. It also relies on the fact that, in your example, every code is on its own line with no info after it.

import re

res = {}

with open('in.txt', 'r') as f:
    current = None
    for line in f:
        # a line starting with "Code-<digits>" opens a new entry
        if re.match(r"Code-\d+", line):
            current = line.strip()
            res[current] = []
            continue
        # any other line belongs to the most recent code; split it on commas
        if current:
            res[current] += line.strip().split(",")

print(res)

result:

{'Code-6667+': ['Name of xyz company+', 'Address +', 'Number+', 'Contact person+'], 'Code-6668+': ['Name of abc company', 'Address', ' number', ' contact person+'], 'Code-6669+': ['name of company ', ' Address+', 'number ', ' contact person +']}
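
As a rough sketch of the "format it however you want" step, the dictionary above could be turned into the "Code6667 - details" lines the question asks for. The helper name format_entry is hypothetical, the sample dictionary just copies part of the result shown above, and the code assumes the + signs and the hyphen are not wanted in the final output.

```python
# Sample data copied from the result above (two entries, for brevity).
res = {
    'Code-6667+': ['Name of xyz company+', 'Address +', 'Number+', 'Contact person+'],
    'Code-6668+': ['Name of abc company', 'Address', ' number', ' contact person+'],
}


def format_entry(code, details):
    """Build one 'Code6667 - details' line (hypothetical helper)."""
    # Assumption: the trailing '+' and the hyphen in the code are unwanted.
    clean_code = code.strip().rstrip('+').replace('-', '')
    # Strip whitespace and a trailing '+' from each detail, skipping empty pieces.
    clean_details = [d.strip().rstrip('+').strip() for d in details]
    clean_details = [d for d in clean_details if d]
    return "{} - {}".format(clean_code, ", ".join(clean_details))


lines = [format_entry(code, details) for code, details in res.items()]
for line in lines:
    print(line)
```

Joining with ", " is only one choice; any other separator works the same way.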
The star\"
Reply #3 · 2020-01-20 01:30

(Note: I'm not quite sure whether you want to keep the + sign. The following code assumes you do. Otherwise, it's easy to get rid of the + with a bit of string manipulation.)
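
For instance, a minimal sketch of that string manipulation, assuming the + only ever appears at the end of a line (the function name strip_plus and the sample lines are illustrative):

```python
def strip_plus(text):
    """Remove surrounding whitespace and any trailing '+' (hypothetical helper)."""
    # strip() drops the newline and spaces, rstrip('+') drops the marker,
    # and the final rstrip() removes a space left in front of the '+'.
    return text.strip().rstrip('+').rstrip()


samples = ["Code-6667+\n", "Name of xyz company+ \n", "Address +\n"]
cleaned = [strip_plus(s) for s in samples]
print(cleaned)
```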

 Input file

Here is the input file...

dat1.txt:

Code-6667+
Name of xyz company+ 
Address +
Number+ 
Contact person+
Code-6668+
Name of abc company,Address, number, contact person+
Code-6669+
name of company , Address+
number , contact person +

Code

Here is the code... (comment / uncomment the print block for Python 2.x/3.x version)

mycode.py:

import sys
print(sys.version)

# initialise our final output - a phone book
phone_book = {}

# parse text file data to phone book, in a specific format
code = ''
with open("dat1.txt", "r") as f:  # the with block closes the file for us
    for line in f:
        if line[:5] == 'Code-':
            # drop the hyphen: 'Code-6667+' becomes 'Code6667+'
            code = (line[:4] + line[5:]).strip()
            phone_book[code] = []
        elif code:
            # detail line: attach it to the most recent code
            phone_book[code].append(line.strip())


# print result to console (for ease of debugging). Comment this block if you want:
for key, value in phone_book.items():

    #python 3.x
    print("{0} - Company details: {1}".format(key, value))

    #python 2.x
    # print key + " - Company details: " + "".join(value)

# write phone_book to dat2.txt
with open("dat2.txt", "w") as f2:
    for key, value in phone_book.items():
        f2.write("{0} - Company details: {1}\n".format(key, value))

 Output

Here is what you will see in console (via print()) or dat2.txt (via f2.write())...

# Code6667+ - Company details: ['Name of xyz company+', 'Address +', 'Number+', 'Contact person+']
# Code6668+ - Company details: ['Name of abc company,Address, number, contact person+']
# Code6669+ - Company details: ['name of company , Address+', 'number , contact person +']
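
Since the question mentions that the eventual output should be a table, one possible follow-up is to write the two fields with the standard csv module instead of a formatted string. This is only a sketch: the function name write_table, the column names, and the choice to join details with a space are all assumptions, and an in-memory buffer stands in for a real output file.

```python
import csv
import io

# Sample data copied from the output above (two entries, for brevity).
phone_book = {
    'Code6667+': ['Name of xyz company+', 'Address +', 'Number+', 'Contact person+'],
    'Code6668+': ['Name of abc company,Address, number, contact person+'],
}


def write_table(book, out):
    """Write one row per code: the code, then all details in a single field."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["code", "details"])  # hypothetical column names
    for code, details in book.items():
        # csv.writer quotes the field automatically when it contains commas.
        writer.writerow([code, " ".join(details)])


buffer = io.StringIO()  # swap for open("dat2.csv", "w", newline="") to write a file
write_table(phone_book, buffer)
print(buffer.getvalue())
```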

女痞
Reply #4 · 2020-01-20 01:32

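Your question wasn't entirely clear, so here is a snippet that prints one line per company, starting with "CodeXXXX - " and followed by the other details.

FILEPATH = 'data.txt'  # placeholder: path to your input file

with open(FILEPATH, 'r') as f:
    current_line = None
    for line in f:
        line = line.strip()
        if line.startswith('Code-'):
            # new company: print the previous one first
            if current_line is not None:
                print(current_line)

            # create a line that starts with 'CodeXXXX - '
            current_line = line.replace('-', '').replace('+', '') + ' - '

        elif current_line is not None:
            # detail line: append it to the current company
            current_line += line
            current_line += ' '

    # print the last company, which no later 'Code-' line would flush
    if current_line is not None:
        print(current_line)

Output for your example data:

Code6667 - Name of xyz company+ Address + Number+ Contact person+ 
Code6668 - Name of abc company,Address, number, contact person+ 
Code6669 - name of company , Address+ number , contact person + 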