I have table called Message which has column content of type JSON
My Model definition as requested is as following
class Message(db.Model):
...
content = db.Column(JSON)
...
now I perform text matching for simple search with the following query
Message.query.filter(Message.content['summary'].cast(Unicode).match(term))
it works nicely, until term has a utf-8 character, like German umlauts or French accents.
what would be the solution here?
also keep in mind I'm using Python 3
The problem lies in cast(Unicode)
of a Postgresql json
column. It simply CASTs the json
to the text type underlying SQLAlchemy's Unicode
, in case of Postgresql VARCHAR
. In other words it produces a string representation of JSON instead of extracting the text content. If your input contained escaped unicode codepoints, they're output as is in this case. Given a simple Test
model with a json
column data:
In [7]: t = Test(data={'summary': 'Tämä on summary.'})
In [8]: session.add(t)
In [9]: session.commit()
In [11]: session.query(Test.data['summary'].cast(Unicode)).scalar()
Out[11]: '"T\\u00e4m\\u00e4 on summary."'
It should be evident why a match with unescaped unicode characters will fail. The correct way to extract the text content, unescaping escaped unicode, is to use astext
, which uses the ->>
operator in Postgresql:
In [13]: session.query(Test.data['summary'].astext).scalar()
Out[13]: 'Tämä on summary.'
Quoting the JSON functions and operators documentation:
Note: Many of these functions and operators will convert Unicode escapes in JSON strings to the appropriate single character. This is a non-issue if the input is type jsonb, because the conversion was already done; but for json input, this may result in throwing an error, as noted in Section 8.14.
So in your case:
Message.query.\
filter(Message.content['summary'].astext.match(term))
Note that this only applies to json
type, not jsonb
, because the json
type does not convert unicode escapes on input. jsonb
on the other hand converts all unicode escapes to equivalent ASCII or UTF-8 characters for storage. If our Test
model contained a second column data2 jsonb
, with the exact same input, then the result would be:
In [11]: session.query(Test.data['summary'].cast(Unicode),
...: Test.data2['summary'].cast(Unicode)).first()
Out[11]: ('"T\\u00e4m\\u00e4 on summary."', '"Tämä on summary"')
Still, you should use astext
, if you want text instead of a string representation of JSON.