I am very new to Python and am looking to use it to parse a text file. The file has between 250-300 lines of the following format:
---- Mark Grey (mark.grey@gmail.com) changed status from Busy to Available @ 14/07/2010 16:32:36 ----
---- Silvia Pablo (spablo@gmail.com) became Available @ 14/07/2010 16:32:39 ----
I need to store the following information into another file (excel or text) for all the entries from this file
UserName/ID Previous Status New Status Date Time
So my result file should look like this for the above entried
Mark Grey/mark.grey@gmail.com Busy Available 14/07/2010 16:32:36
Silvia Pablo/spablo@gmail.com NaN Available 14/07/2010 16:32:39
Thanks in advance,
Any help would be really appreciated
Well, if i were to approach this problem, probably I'd start by splitting each entry into its own, separate string. This looks like it might be line oriented, so a
inputfile.split('\n')
is probably adequate. From there I would probably craft a regular expression to match each of the possible status changes, with subgroups wrapping each of the important fields.To get you started:
will get you an array of the parts of the match. I'd then suggest you use the
csv module
to store the results in a standard format.The two RE patterns of interest seem to be...:
so I'd do:
If the two patterns I described are not general enough, they may need to be tweaked, of course, but I think this general approach will be useful. While many Python users on Stack Overflow intensely dislike REs, I find them very useful for this kind of pragmatical ad hoc text processing.
Maybe the dislike is explained by others wanting to use REs for absurd uses such as ad hoc parsing of CSV, HTML, XML, ... -- and many other kinds of structured text formats for which perfectly good parsers exist! And also, other tasks well beyond REs' "comfort zone", and requiring instead solid general parser systems like pyparsing. Or at the other extreme super-simple tasks done perfectly well with simple strings (e.g. I remember a recent SO question which used
if re.search('something', s):
instead ofif 'something' in s:
!-).But for the reasonably broad swathe of tasks (excluding the very simplest ones at one end, and the parsing of structured or somewhat-complicated grammars at the other) for which REs are appropriate, there's really nothing wrong with using them, and I recommend to all programmers to learn at least REs' basics.
This makes assumptions about whitespace and also assumes that every line conforms to the pattern. You might want to add error checking (such as checking that
pat.match()
doesn't returnNone
) if you want to handle dirty input gracefully.thanks very much for all your comments. They were very useful. I wrote my code using the directory functionality. What it does is it reads through the file and creates an output file for each of the user with all his status updates. Here is the code pasted below.
Alex mentioned pyparsing and so here is a pyparsing approach to your same problem:
prints:
pyparsing allows you to attach names to the results fields (just like the named groups in Tom Pietzcker's RE-styled answer), plus parse-time actions to act on or manipulate the parsed actions - note the conversion of the separate date and time fields into a true datetime object, already converted and ready for processing after parsing with no additional muss nor fuss.
Here is a modified loop that just dumps out the parsed tokens and the named fields for each line:
prints:
I suspect that as you continue to work on this problem, you will find other variations on the format of the input text specifying how the user's state changed. In this case, you would just add another definition like
statechange1
orstatechange2
, and insert it intolinefmt
with the others. I feel that pyparsing's structuring of the parser definition helps developers come back to a parser after things have changed, and easily extend their parsing program.