SQLAlchemy Text Matching data inside JSON field with utf-8

Posted 2019-04-14 14:08

Question:

I have a table called Message which has a column content of type JSON.

My model definition, as requested, is as follows:

class Message(db.Model):
    ...
    content = db.Column(JSON)
    ...

Now I perform text matching for a simple search with the following query:

Message.query.filter(Message.content['summary'].cast(Unicode).match(term))

It works nicely until term contains a utf-8 character, like German umlauts or French accents.

What would be the solution here?

Also keep in mind that I'm using Python 3.

Answer 1:

The problem lies in cast(Unicode) of a Postgresql json column. It simply CASTs the json to the text type underlying SQLAlchemy's Unicode, in the case of Postgresql VARCHAR. In other words, it produces a string representation of the JSON instead of extracting the text content. If your input contained escaped unicode codepoints, they are output as is in this case. Given a simple Test model with a json column data:

In [7]: t = Test(data={'summary': 'Tämä on summary.'})

In [8]: session.add(t)

In [9]: session.commit()

In [11]: session.query(Test.data['summary'].cast(Unicode)).scalar()
Out[11]: '"T\\u00e4m\\u00e4 on summary."'
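The escaped output above can be reproduced with nothing but Python's standard library: json.dumps escapes non-ASCII code points by default (ensure_ascii=True), which is essentially the string representation the CAST hands back. A minimal sketch:

```python
import json

summary = 'Tämä on summary.'

# json.dumps escapes non-ASCII code points by default (ensure_ascii=True),
# producing the same escaped string representation as the CAST above.
escaped = json.dumps(summary)
print(escaped)  # "T\u00e4m\u00e4 on summary."

# A match against the unescaped term can no longer succeed, because the
# string literally contains no 'ä' anymore:
print('Tämä' in escaped)  # False
```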

It should be evident why a match with unescaped unicode characters will fail. The correct way to extract the text content, unescaping escaped unicode, is to use astext, which uses the ->> operator in Postgresql:

In [13]: session.query(Test.data['summary'].astext).scalar()
Out[13]: 'Tämä on summary.'
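What ->> (and hence astext) does on the database side corresponds roughly to decoding the JSON string back into plain text. As a stdlib-only illustration (the actual unescaping happens inside Postgresql, not in Python):

```python
import json

# The escaped representation that CAST(... AS VARCHAR) yields:
escaped = '"T\\u00e4m\\u00e4 on summary."'

# Decoding the JSON string recovers the actual text, much like ->> does
# server-side before the value ever reaches SQLAlchemy.
text = json.loads(escaped)
print(text)  # Tämä on summary.
```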

Quoting the JSON functions and operators documentation:

Note: Many of these functions and operators will convert Unicode escapes in JSON strings to the appropriate single character. This is a non-issue if the input is type jsonb, because the conversion was already done; but for json input, this may result in throwing an error, as noted in Section 8.14.

So in your case:

Message.query.\
    filter(Message.content['summary'].astext.match(term))

Note that this only applies to the json type, not jsonb, because json does not convert unicode escapes on input; jsonb, on the other hand, converts all unicode escapes to the equivalent ASCII or UTF-8 characters for storage. If our Test model contained a second column data2 of type jsonb with the exact same input, the result would be:

In [11]: session.query(Test.data['summary'].cast(Unicode),
    ...:               Test.data2['summary'].cast(Unicode)).first()
Out[11]: ('"T\\u00e4m\\u00e4 on summary."', '"Tämä on summary."')
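The difference roughly mirrors json.dumps with and without ensure_ascii: a json column stores whatever text form the client serialized (escapes and all), while jsonb resolves unicode escapes to the real characters on the way in. A loose stdlib analogy:

```python
import json

summary = 'Tämä on summary.'

# json column: the serialized input is stored verbatim, escapes included
# (default ensure_ascii=True escapes non-ASCII characters).
as_json = json.dumps(summary)

# jsonb column: escapes are resolved on input, so the stored value contains
# the real characters (approximated here with ensure_ascii=False).
as_jsonb = json.dumps(summary, ensure_ascii=False)

print(as_json)   # "T\u00e4m\u00e4 on summary."
print(as_jsonb)  # "Tämä on summary."
```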

Still, you should use astext if you want text instead of a string representation of JSON.