I came across this question about memory management of dictionaries, which mentions the intern function. What exactly does it do, and when would it be used?
To give an example:
If I have a set called seen, that contains tuples in the form (string1,string2), which I use to check for duplicates, would storing (intern(string1),intern(string2)) improve performance w.r.t. memory or speed?
From the Python 3 documentation:
Clarification:
As the documentation suggests, the sys.intern function is intended to be used for performance optimization.

The sys.intern function maintains a table of interned strings. When you attempt to intern a string, the function looks it up in the table and:

If the string does not exist (hasn't been interned yet), the function saves it in the table and returns it from the interned strings table.
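A minimal sketch of this first case, assuming CPython 3. The string is assembled at runtime only so that it is not already a compile-time constant, and the name words is purely illustrative:

```python
import sys

# Build the string at runtime from its words (illustrative helper list).
words = ['why', 'do', 'pangolins', 'dream', 'of', 'quiche']

# First time this value is interned: it is stored in the table and returned.
a = sys.intern(' '.join(words))
print(a)
```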
In the above example, a holds the interned string. Even though it is not visible, the sys.intern function has saved the 'why do pangolins dream of quiche' string object in the interned strings table.

If the string exists (has been interned), the function returns it from the interned strings table.
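A sketch of this second case, again assuming CPython 3. Both strings are built separately at runtime, so they start out as separate objects; the identity check at the end is an addition here to make the sharing observable:

```python
import sys

words = ['why', 'do', 'pangolins', 'dream', 'of', 'quiche']

a = sys.intern(' '.join(words))  # stored in the interned strings table
b = sys.intern(' '.join(words))  # same value: the stored object is returned

print(b)
print(b is a)  # True: both names refer to the one interned object
```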
Even though it is not immediately visible, because the string 'why do pangolins dream of quiche' has been interned before, b now holds the same string object as a.

If we create the same string without using intern, we end up with two different string objects that have the same value.
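For contrast, a sketch without interning. Whether two equal strings end up as separate objects is an implementation detail, so the identity result below is what CPython does for strings built at runtime and is not a language guarantee:

```python
words = ['why', 'do', 'pangolins', 'dream', 'of', 'quiche']

# Two strings with the same value, each built separately at runtime.
c = ' '.join(words)
d = ' '.join(words)

print(c == d)  # True: same value
print(c is d)  # False on CPython: two distinct objects
```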
By using sys.intern you ensure that you never keep two string objects that have the same value: when you intern a second string with the same value as an already interned string, you receive a reference to the pre-existing string object. This way, you are saving memory. Also, comparing interned strings can be very efficient, because two interned strings are equal exactly when they are the same object, so it is enough to compare their memory addresses instead of their content.

They weren't talking about a keyword intern, because there is no such thing in Python. They were talking about the non-essential built-in function intern, which in py3k has been moved to sys.intern. The docs have an exhaustive description.

It returns a canonical instance of the string.
Therefore if you have many string instances that are equal you save memory, and in addition you can also compare canonicalized strings by identity instead of equality which is faster.
Essentially intern looks up (or stores if not present) the string in a collection of interned strings, so all interned instances will share the same identity. You trade the one-time cost of looking up this string for faster comparisons (the compare can return True after just checking for identity, rather than having to compare each character), and reduced memory usage.
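For the seen set from the question, a rough sketch of the memory effect, assuming CPython. The helper names intern_pair and distinct_objects and the sample data are made up for illustration:

```python
import sys

def intern_pair(pair):
    """Return the pair with both strings replaced by their interned copies."""
    first, second = pair
    return (sys.intern(first), sys.intern(second))

def distinct_objects(pairs):
    """Count distinct string objects (by identity), not distinct values."""
    return len({id(s) for pair in pairs for s in pair})

# Strings built at runtime, as they would be when read from a file or socket.
raw_pairs = [('user' + str(i % 3), 'item' + str(i % 2)) for i in range(6)]
interned_pairs = [intern_pair(p) for p in raw_pairs]

seen = set(interned_pairs)

print(distinct_objects(raw_pairs))       # one object per occurrence on CPython
print(distinct_objects(interned_pairs))  # one object per distinct value
print(len(seen))                         # membership results are unchanged
```

The set behaves the same with or without interning. What changes is how many string objects are kept alive, so the gain depends on how often the same values repeat.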
However, Python (CPython, at least) will automatically intern some strings, such as ones that are small or look like identifiers, so you may find you get no improvement because your strings are already being interned behind the scenes. For example:
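A sketch of that behaviour, assuming CPython. Automatic interning is an implementation detail, so the first two results are not guaranteed by the language, and the variable names are illustrative:

```python
import sys

ident = 'hello_world'                 # identifier-like literal
built = ''.join(['hello', '_world'])  # same value, built at runtime

# On CPython the literal is already the interned object, so intern is a no-op.
print(sys.intern(ident) is ident)

# The runtime-built copy is a separate object on CPython...
print(built is ident)

# ...until it is interned, which yields the one canonical object.
print(sys.intern(built) is sys.intern(ident))  # True
```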
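In the past, one disadvantage was that interned strings were permanent. Once interned, the string memory was never freed, even after all references were dropped. This is no longer the case for more recent versions of Python: the documentation notes that interned strings are not immortal, so you must keep a reference to the return value of intern to benefit from it.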
This idea appears in several languages, including Python and Java. See: String Interning