I'm working on processing Tweets from Twitter and storing them in a database (MySQL).
I have my process running perfectly but sometimes I get an error like this one:
2012-08-31 08:11:23,303 WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper - SQL Error: 1366, SQLState: HY000
2012-08-31 08:11:23,304 ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper - Incorrect string value: '\xF0\x9F\x98\x9D #...' for column 'twe_text' at row 1
When looking for the problematic tweet in my logs I find the following one:
2012-08-31 08:11:22,971 INFO com.myapp.TweetLoaderJob - Text for tweet 241175722096480256: RT @totallytoyosi_: My goodies, my goodies, not your goodies <U+1F61D> #m&ms #sweeties #goodies #food @ The Ritzy Cinema Café, Brixton htt ...
And finally, looking up what the hell <U+1F61D> is, I discovered that it is an emoticon that Twitter sends as-is.
I have debugged with only this specific tweet, and my Eclipse does not seem to recognize this character. So the question is: how can I handle this exception? I looked into reconfiguring my MySQL database, but I cannot change the encoding (it's a requirement), so my options are to skip this kind of tweet or to suppress the problematic character.
But how do I do that if Java does not recognize it?
You could filter your strings and remove the undesired part (with a simple regexp like
<U\+[^>]+>
) before storing them in your database.
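The filtering suggested above could be sketched in Java roughly as follows. One caveat: the bytes in the MySQL error (\xF0\x9F\x98\x9D) are the 4-byte UTF-8 encoding of U+1F61D, so the `<U+1F61D>` text in the log may only be how the logger printed the character, and the string itself probably holds the real code point. MySQL's `utf8` character set stores at most 3 bytes per character, which is why the insert fails. The sketch therefore covers both cases. The class and method names (`TweetSanitizer`, `stripPlaceholders`, `stripSupplementary`) are made up for illustration and are not part of any library.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the names here are illustrative only.
public class TweetSanitizer {

    // Matches a literal placeholder such as "<U+1F61D>".
    // The '+' after 'U' is escaped so it is not read as a quantifier.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("<U\\+[0-9A-Fa-f]+>");

    /** Removes literal "<U+XXXX>" placeholders, if the text really contains them. */
    public static String stripPlaceholders(String text) {
        return PLACEHOLDER.matcher(text).replaceAll("");
    }

    /**
     * Removes every code point above U+FFFF. Those are the characters
     * that need 4 bytes in UTF-8 and that a 3-byte MySQL utf8 column rejects.
     */
    public static String stripSupplementary(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isSupplementaryCodePoint(cp)) {
                sb.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars depending on whether cp was a surrogate pair.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\uD83D\uDE1D" is the surrogate pair for U+1F61D.
        String tweet = "not your goodies \uD83D\uDE1D #sweeties";
        System.out.println(stripSupplementary(tweet));
    }
}
```

Calling `stripSupplementary` on the tweet text just before handing the entity to Hibernate should be enough when the column encoding cannot be changed. Characters inside the Basic Multilingual Plane, such as the "é" in "Café", are left untouched.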