In Python, I have just read a line from a text file, and I'd like to know how to write code that ignores comments, i.e. lines with a hash (#) at the beginning.
I think it should be something like this:
for
if line !contain #
then ...process line
else end for loop
But I'm new to Python and I don't know the syntax.
I know that this is an old thread, but this is a generator function that I use for my own purposes. It strips comments no matter where they appear in the line, as well as stripping leading/trailing whitespace and blank lines. The following source text:
will yield:
Here is documented code, which includes a demo:
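A minimal sketch of a generator along these lines, assuming '#' is the comment marker. The name strip_comments and the hosts-style sample lines are made up for illustration and are not the original code or sample text:

```python
def strip_comments(lines, marker="#"):
    """Yield each line with comments and surrounding whitespace removed.

    Blank lines, and lines that contain only a comment, are skipped.
    """
    for line in lines:
        # Keep only the text before the first comment marker, if any.
        text = line.split(marker, 1)[0].strip()
        if text:
            yield text


# Demo on made-up, hosts-style input (hypothetical sample data).
sample = [
    "# local entries\n",
    "127.0.0.1   localhost   # loopback\n",
    "\n",
    "   192.168.0.10  fileserver\n",
]
cleaned = list(strip_comments(sample))
```

Note that this simple version treats every '#' as the start of a comment, including one that appears inside a quoted string.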
The normal use case will be to strip the comments from a file (e.g., a hosts file, as in my example above). If this is the case, then the tail end of the above code would be modified to:
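As a rough sketch of the file-based variant: a file object iterates over its lines, so it can be handed straight to the generator. The strip_comments function below is a minimal stand-in with a hypothetical name, and the temporary file only exists to make the snippet runnable:

```python
import os
import tempfile


def strip_comments(lines, marker="#"):
    # Minimal stand-in for the generator described above (hypothetical name).
    for line in lines:
        text = line.split(marker, 1)[0].strip()
        if text:
            yield text


# Write a small, made-up hosts-style file so the example is runnable.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("# hosts\n127.0.0.1 localhost  # loopback\n\n")

# The open file is passed in directly; lines are read lazily, one at a time.
with open(path) as f:
    entries = list(strip_comments(f))

os.remove(path)
```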
I tend to use
This will ignore the whole line, though the answer that uses rpartition has my upvote, since it keeps any information that appears before the #.
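The whole-line approach can be sketched roughly like this. The io.StringIO object is just a stand-in for an open text file, and the sample lines are made up:

```python
import io

# Hypothetical sample input standing in for an open text file.
source = io.StringIO(
    "# a comment line\n"
    "keep this line\n"
    "   # indented comment\n"
    "value = 1  # trailing comment stays\n"
)

kept = []
for line in source:
    stripped = line.strip()
    # Skip lines whose first non-blank character is '#'.
    if not stripped.startswith("#"):
        kept.append(line.rstrip())
```

A trailing comment on a line that also carries data is left untouched, which is the limitation mentioned above.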
A more compact version of a filtering expression can also look like this:
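For example, something like the following sketch, where io.StringIO stands in for the open file and the sample lines are invented:

```python
import io

# Hypothetical sample input standing in for open(filename).
source = io.StringIO("# header\nalpha\n\n# note\nbeta\n")

# Generator expression: lines are filtered lazily while iterating.
lines = (l for l in source if not l.startswith("#"))

result = [l.rstrip("\n") for l in lines]
```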
(l for ... ) is called a "generator expression"; here it acts as a wrapping iterator that filters out all the unneeded lines from the file while iterating over it. Don't confuse it with the same thing in square brackets, [l for ... ], which is a "list comprehension": that would first read all the lines from the file into memory and only then start iterating over them.
Sometimes you might want to have it less one-liney and more readable:
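One way to spell that out is a chain of small generator expressions, each wrapping the previous one. This is a sketch with made-up variable names and sample input, not the original snippet:

```python
import io

# Hypothetical sample input standing in for an open text file.
source = io.StringIO("# header\n  alpha  \n\n   # indented note\nbeta\n")

# Each stage wraps the previous one; nothing is read until iteration starts.
stripped = (l.strip() for l in source)
non_blank = (l for l in stripped if l)
non_comment = (l for l in non_blank if not l.startswith("#"))

result = list(non_comment)
```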
All the filters will be executed on the fly in one iteration.
This is the shortest possible form:
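A sketch of what such a short form might look like. The temporary file and the processed list are only there to make the example runnable; in real code you would open your own file and do your own processing:

```python
import os
import tempfile

# Create a small sample file (hypothetical data) so the loop has input.
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"# comment\ndata line\n")
os.close(fd)

processed = []
for line in open(path):
    if line.startswith("#"):
        continue  # skip comment lines
    processed.append(line)  # stand-in for the real work on the line

os.remove(path)
```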
The startswith() method on a string returns True if the string you call it on starts with the string you passed in.
While this is okay in some circumstances like shell scripts, it has two problems. First, it doesn't specify how to open the file. The default mode for opening a file is 'r', which means 'read the file in text mode'. Since you're expecting a text file, it is clearer to say so explicitly and open it with 'rt'. In Python 2 the text/binary distinction is irrelevant on UNIX-like operating systems, but it's important on Windows (and on pre-OS X Macs).
The second problem is the open file handle. The open() function returns a file object, and it's considered good practice to close files when you're done with them. To do that, call the close() method on the object. Now, Python will probably do this for you, eventually; in CPython, objects are reference-counted, and when an object's reference count goes to zero, Python calls its destructor (a special method called __del__) and frees the object. Note that I said probably: Python has a bad habit of not actually calling the destructor on objects that are still alive when the program finishes. I guess it's in a hurry!
For short-lived programs like shell scripts, and particularly for file objects, this doesn't matter. Your operating system will automatically clean up any file handles left open when the program finishes. But if you opened the file, read the contents, then started a long computation without explicitly closing the file handle first, Python is likely to leave the file handle open during your computation. And that's bad practice.
This version will work in any 2.x version of Python, and fixes both the problems I discussed above:
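A sketch of that pattern, using try/finally so the file is closed even if processing raises an exception. The temporary file and the processed list are placeholders to keep the example runnable:

```python
import os
import tempfile

# Create a small sample file (hypothetical data).
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"# comment\ndata line\n")
os.close(fd)

processed = []
f = open(path, "rt")  # explicit text mode
try:
    for line in f:
        if line.startswith("#"):
            continue  # skip comment lines
        processed.append(line)  # stand-in for the real work
finally:
    f.close()  # always runs, even if the loop raises

os.remove(path)
```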
This is the best general form for older versions of Python.
As suggested by steveha, using the "with" statement is now considered best practice. If you're using 2.6 or above you should write it this way:
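Roughly like this (a sketch; the temporary file and the processed list are placeholders so the example runs on its own):

```python
import os
import tempfile

# Create a small sample file (hypothetical data).
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"# comment\ndata line\n")
os.close(fd)

processed = []
with open(path, "rt") as f:
    for line in f:
        if line.startswith("#"):
            continue  # skip comment lines
        processed.append(line)  # stand-in for the real work

os.remove(path)
```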
The "with" statement will clean up the file handle for you.
In your question you said "lines that start with #", so that's what I've shown you here. If you want to filter out lines that start with optional whitespace and then a '#', you should strip the whitespace before looking for the '#'. In that case, you should change this:
to this:
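A small sketch of the difference between the two checks, assuming the test is written with startswith(). The sample line is made up:

```python
line = "   # indented comment\n"

# Before: only matches a '#' in the very first column.
before = line.startswith("#")

# After: leading whitespace is skipped before looking for the '#'.
after = line.lstrip().startswith("#")
```

Here before is False and after is True, and line itself still holds the original text, leading spaces included.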
In Python, strings are immutable, so this doesn't change the value of line. The lstrip() method returns a copy of the string with all its leading whitespace removed.
You can use startswith(), e.g.:
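A tiny sketch of what startswith() returns for a few made-up sample strings:

```python
lines = ["# a comment", "name = value", "  # indented"]

# True only when '#' is the very first character of the string.
flags = [line.startswith("#") for line in lines]
```

The third string gives False because it begins with spaces, not with '#'.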
I recommend you don't ignore the whole line when you see a # character; just ignore the rest of the line. You can do that easily with a string method called partition, for example line.partition('#')[0].
partition returns a tuple: everything before the partition string, the partition string, and everything after the partition string. So, by indexing with [0] we take just the part before the partition string.
EDIT: If you are using a version of Python that doesn't have partition(), you could use something along the lines of line.split('#', 1)[0] instead.
This splits the string on a '#' character, then keeps everything before the split. The 1 argument makes the .split() method stop after one split; since we are just grabbing the 0th substring (by indexing with [0]) you would get the same answer without the 1 argument, but this might be a little bit faster. (Simplified from my original code thanks to a comment from @gnr. My original code was messier for no good reason; thanks, @gnr.)
You could also just write your own version of partition(). Here is one called part().
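A minimal sketch of such a helper, written to return the same 3-tuple as the built-in str.partition(). This is one possible implementation, not necessarily the original one:

```python
def part(s, sep):
    """Return (before, sep, after), like str.partition()."""
    i = s.find(sep)
    if i < 0:
        # Separator not found: the whole string is the 'before' part.
        return (s, "", "")
    return (s[:i], sep, s[i + len(sep):])


line = "value = 1  # trailing comment"
code_only = part(line, "#")[0]
```

For a single '#' separator, part(line, '#')[0], line.partition('#')[0] and line.split('#', 1)[0] all give the same text.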
@dalle noted that '#' can appear inside a string. It's not that easy to handle this case correctly, so I just ignored it, but I should have said something.
If your input file has simple enough rules for quoted strings, this isn't hard. It would be hard if you accepted any legal Python quoted string, because there are single-quoted, double-quoted, multiline quotes with a backslash escaping the end-of-line, triple quoted strings (using either single or double quotes), and even raw strings! The only possible way to correctly handle all that would be a complicated state machine.
But if we limit ourselves to just a simple quoted string, we can handle it with a simple state machine. We can even allow a backslash-quoted double quote inside the string.
I didn't really want to get this complicated in a question tagged "beginner" but this state machine is reasonably simple, and I hope it will be interesting.
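A sketch of such a state machine, under the simplified rules described above: only double-quoted strings are recognized, and a backslash inside a string escapes the next character. The function name strip_comment is made up for illustration, and this is not the original code:

```python
def strip_comment(line):
    """Return line up to the first '#' that is outside a double-quoted string."""
    in_quote = False  # are we currently inside "..."?
    escaped = False   # was the previous character a backslash inside a string?
    for i, ch in enumerate(line):
        if in_quote:
            if escaped:
                escaped = False      # this character is taken literally
            elif ch == "\\":
                escaped = True       # next character is escaped
            elif ch == '"':
                in_quote = False     # closing quote
        elif ch == '"':
            in_quote = True          # opening quote
        elif ch == "#":
            return line[:i]          # comment starts here
    return line                      # no comment found


plain = strip_comment('x = "a # b"  # real comment')
tricky = strip_comment(r'say "he said \"hi # there\"" # c')
```

Single-quoted, triple-quoted and raw strings are deliberately not handled; supporting them would need more states.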