The following code is supposed to create a new (modified) version of a frequency distribution (nltk.FreqDist). Both variables should then be the same length.
It works fine when a single instance of WebText is created. But when multiple WebText instances are created, then the new variable seems to be shared by all the objects.
For example:
import nltk
from operator import itemgetter

class WebText:
    freq_dist_weighted = {}

    def __init__(self, text):
        tokens = nltk.wordpunct_tokenize(text) #tokenize
        word_count = len(tokens)
        freq_dist = nltk.FreqDist(tokens)
        for word,frequency in freq_dist.iteritems():
            self.freq_dist_weighted[word] = frequency/word_count*frequency
        print len(freq_dist), len(self.freq_dist_weighted)

text1 = WebText("this is a test")
text2 = WebText("this is another test")
text3 = WebText("a final sentence")
results in
4 4
4 5
3 7
Which is incorrect. Since I am just transposing and modifying values, there should be the same numbers in each column.
If I reset the freq_dist_weighted just before the loop, it works fine:
import nltk
from operator import itemgetter

class WebText:
    freq_dist_weighted = {}

    def __init__(self, text):
        tokens = nltk.wordpunct_tokenize(text) #tokenize
        word_count = len(tokens)
        freq_dist = nltk.FreqDist(tokens)
        self.freq_dist_weighted = {}
        for word,frequency in freq_dist.iteritems():
            self.freq_dist_weighted[word] = frequency/word_count*frequency
        print len(freq_dist), len(self.freq_dist_weighted)

text1 = WebText("this is a test")
text2 = WebText("this is another test")
text3 = WebText("a final sentence")
results in (correct):
4 4
4 4
3 3
This doesn't make sense to me.
I don't see why I would have to reset it, since it's isolated within the objects. Am I doing something wrong?
Your assumption that it is isolated within the objects is wrong. Objects defined in class scope are created only once, when the class itself is created; if you want a different object per instance then you need to move it into the initializer.
class WebText:
    def __init__(self, text):
        self.freq_dist_weighted = {} #### RESET the dictionary HERE ####
        ...
Your freq_dist_weighted dictionary is a class attribute, not an instance attribute. Therefore it is shared among all instances of the class. (self.freq_dist_weighted still refers to the class attribute; since there's no instance-specific attribute of that name, Python falls back to looking on the class.)
To make it an instance attribute, set it in your class's __init__() method.
def __init__(self, text):
    self.freq_dist_weighted = {}
    ...
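The fallback from instance to class can be observed directly. Here is a minimal Python 3 sketch; `Shared` and its `add` method are hypothetical stand-ins for illustration, not part of the original code:

```python
class Shared:
    # Defined in the class body, so it belongs to the class itself.
    freq_dist_weighted = {}

    def add(self, word):
        # No instance attribute named freq_dist_weighted exists, so this
        # lookup falls back to the class attribute and mutates it in place.
        self.freq_dist_weighted[word] = 1


a = Shared()
b = Shared()
a.add("this")
b.add("another")

# Neither instance owns the attribute; both see the one class-level dict.
instance_attrs_a = vars(a)
same_object = a.freq_dist_weighted is b.freq_dist_weighted
shared_keys = sorted(Shared.freq_dist_weighted)
```

Here `vars(a)` stays empty because nothing was ever assigned on the instance, which is why every instance ends up writing into the same dictionary.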
class WebText:
    freq_dist_weighted = {}
declares freq_dist_weighted so that it is shared between all objects of type WebText; essentially, this is like a static member in C++.
If you want each WebText object to have its own freq_dist_weighted member (i.e. you can change it for one instance without changing it for another instance), you want to define it in __init__:
class WebText:
    def __init__(self):
        self.freq_dist_weighted = {}
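This also explains why resetting the dictionary inside __init__ fixed the question's second version: assigning through the instance creates a new instance attribute that shadows the class-level one. A small Python 3 sketch, where `Demo` and `items` are hypothetical names used only for illustration:

```python
class Demo:
    items = {}  # class attribute, shared much like a C++ static member


first = Demo()
second = Demo()

# Assigning through the instance creates a new instance attribute on `first`
# that shadows the class attribute; the class-level dict is left untouched.
first.items = {}
first.items["only_first"] = 1

# `second` has no instance attribute, so it still uses the class-level dict.
second.items["shared"] = 1

first_keys = sorted(first.items)
second_keys = sorted(second.items)
class_keys = sorted(Demo.items)
```

Mutating the dictionary (`obj.items[key] = ...`) affects whichever object the lookup finds, while rebinding the name (`obj.items = ...`) always attaches to the instance.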
> It works fine when a single instance of WebText is created. But when multiple WebText instances are created, then the new variable seems to be shared by all the objects.
Well, yes; of course it would work fine with a single instance when all one of them is sharing the value. ;)
The value is shared because Python follows a very simple rule: the things you define inside the class block belong to the class. I.e., they don't belong to instances. To attach something to an instance, you have to do it explicitly. This is normally done in __init__, but in normal cases (i.e. if you haven't used __slots__) can be done at any time. Assigning to an attribute of an object is just like assigning to an element of a list; there are no real protections because we're all mature adults here and are assumed to be responsible.
def __init__(self, text):
    self.freq_dist_weighted = {}
    # and proceed to modify it
Alternately:
def __init__(self, text):
    freq_dist_weighted = {}
    # prepare the dictionary contents first
    self.freq_dist_weighted = freq_dist_weighted
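Putting it together, the corrected class might look roughly like the following. This is a Python 3 sketch under stated assumptions, not the original code: `collections.Counter` stands in for nltk.FreqDist, a simple regular expression stands in for nltk.wordpunct_tokenize so the example runs without NLTK installed, and `freq_dist` is kept on the instance only so the lengths can be compared afterwards.

```python
import re
from collections import Counter


class WebText:
    def __init__(self, text):
        # Assumption: this regex is a rough stand-in for nltk.wordpunct_tokenize.
        tokens = re.findall(r"\w+|[^\w\s]+", text)
        word_count = len(tokens)
        # Assumption: Counter stands in for nltk.FreqDist; it is stored on the
        # instance here only so the two lengths can be compared below.
        self.freq_dist = Counter(tokens)
        # Build the dictionary locally, then attach it to this instance.
        freq_dist_weighted = {}
        for word, frequency in self.freq_dist.items():
            freq_dist_weighted[word] = frequency / word_count * frequency
        self.freq_dist_weighted = freq_dist_weighted


texts = [
    WebText("this is a test"),
    WebText("this is another test"),
    WebText("a final sentence"),
]
lengths = [(len(t.freq_dist), len(t.freq_dist_weighted)) for t in texts]
print(lengths)
```

Each instance now reports matching lengths. Note that in Python 3 `/` is true division; under Python 2, as in the question, dividing two ints truncates, so `word_count` would likely need converting with `float()` for the weights to be meaningful.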