How to avoid Twitter emoticon character while proc

Posted 2019-06-26 21:08

I'm working on processing Tweets from Twitter and storing them in a database (MySQL).

I have my process running perfectly but sometimes I get an error like this one:

2012-08-31 08:11:23,303 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper  - SQL Error: 1366, SQLState: HY000
2012-08-31 08:11:23,304 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper  - Incorrect string value: '\xF0\x9F\x98\x9D #...' for column 'twe_text' at row 1

When looking for the problematic tweet in my logs I find the following one:

2012-08-31 08:11:22,971 INFO com.myapp.TweetLoaderJob  - Text for tweet 241175722096480256: RT @totallytoyosi_: My goodies, my goodies, not your goodies  <U+1F61D> #m&ms #sweeties #goodies #food  @ The Ritzy Cinema Café, Brixton htt ...

And finally, looking up what the hell <U+1F61D> is, I discovered that it is an emoticon that Twitter sends as-is.

I have debugged using only this specific tweet, and Eclipse does not seem to recognize this character. So the question is: how can I handle this exception? I looked into reconfiguring my MySQL database, but I cannot change the encoding (it's a requirement), so my options are to skip this kind of tweet or to suppress the problematic character.

But how to do it, if Java does not recognize it?

1 answer
走好不送
#2 · 2019-06-26 21:29

You could filter your strings and remove the undesired part (with a simple regexp like <U\+[^>]+>, where the + is escaped so that it matches a literal plus sign) before storing them in your database.
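
The filtering idea can be sketched in Java as below. This is a minimal sketch under a few assumptions, not a definitive fix. The class name TweetTextSanitizer and both method names are hypothetical. The bytes \xF0\x9F\x98\x9D in the error are the 4-byte UTF-8 encoding of U+1F61D, and the sketch assumes the column uses MySQL's 3-byte utf8 character set, which cannot store code points above U+FFFF. In that case the string held in Java most likely contains the real character (a surrogate pair) rather than the literal text <U+1F61D> that the log prints, so the first method drops every supplementary code point. The second method covers the case where the text really does contain the literal placeholder.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class TweetTextSanitizer {

    // Matches a literal placeholder such as "<U+1F61D>" (note the escaped '+').
    private static final Pattern PLACEHOLDER = Pattern.compile("<U\\+[0-9A-Fa-f]+>");

    // Removes every code point above U+FFFF (for example most emoji),
    // i.e. everything that needs 4 bytes in UTF-8.
    static String stripSupplementary(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isSupplementaryCodePoint(cp)) {
                sb.appendCodePoint(cp);
            }
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Removes literal "<U+XXXX>" placeholders, if the text contains them as plain characters.
    static String stripPlaceholders(String text) {
        return PLACEHOLDER.matcher(text).replaceAll("");
    }
}
```

Calling TweetTextSanitizer.stripSupplementary(tweetText) just before the entity is saved would keep the rest of the tweet intact while dropping only the characters the column cannot hold (tweetText here stands for whatever variable holds the tweet text).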
