Parsing a CS:GO language file with encoding in Python

Posted 2020-03-27 05:46

Question:

This topic is related to the "Parsing a CS:GO script file in Python" question, but the problem is different. I'm working with content from CS:GO, and I'm trying to write a Python tool that imports all the data from the /scripts/ folder into Python dictionaries.

The next step after parsing that data is to parse the language resource file from /resources and link the dictionaries to the localized strings.

Here is the original file for the English localization: https://github.com/spec45as/PySteamBot/blob/master/csgo_english.txt

The file format is similar to the one in the previous task, but I have run into other problems. All language files are encoded in UTF-16-LE, and I don't understand how to work with encoded files and strings in Python (I mostly work with Java). I have tried a few solutions based on open(fileName, encoding='utf-16-le').read(), but I don't know how to use such strings with pyparsing. The parser fails with:

pyparsing.ParseException: Expected quoted string, starting with " ending with " (at char 0), (line:1, col:1)

Another problem is lines containing \"-style escape sequences, for example:

"musickit_midnightriders_01_desc"       "\"HAPPY HOLIDAYS, ****ERS!\"\n    -Midnight Riders"

How can I parse these symbols if I want to leave such lines as they are?

Answer 1:

There are a few new wrinkles to this input file that were not in the original CS:GO example:

  1. embedded \" escaped quotes in some of the value strings
  2. some of the quoted value strings span multiple lines
  3. some of the values end with a trailing environment condition (such as [$WIN32], [$OSX])
  4. embedded comments in the file, marked with '//'

The first two are addressed by modifying the definition of value_qs. Since values are now more fully-featured than keys, I decided to use separate QuotedString definitions for them:

key_qs = QuotedString('"').setName("key_qs")
value_qs = QuotedString('"', escChar='\\', multiline=True).setName("value_qs")
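
As a quick check of what these two definitions accept, here is a minimal sketch. The sample strings are made up for illustration and are not taken from csgo_english.txt; only the two QuotedString definitions come from the answer.

```python
from pyparsing import QuotedString

key_qs = QuotedString('"').setName("key_qs")
value_qs = QuotedString('"', escChar='\\', multiline=True).setName("value_qs")

# An escaped quote inside the value is kept as a plain quote character.
escaped = value_qs.parseString(r'"say \"hi\" now"')[0]
print(escaped)

# A value containing a real line break is accepted because multiline=True.
multi = value_qs.parseString('"first line\nsecond line"')[0]
print(repr(multi))
```

key_qs, which is defined without multiline=True, would reject the second string, so keys remain restricted to a single line.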

The third requires a bit more refactoring. These qualifying conditions work much like #ifdef macros in C: a definition is enabled only if the environment matches its condition. Some of these conditions are even boolean expressions:

  • [!$PS3]
  • [$WIN32||$X360||$OSX]
  • [!$X360&&!$PS3]

This could lead to duplicate keys in the definition file, such as in these lines:

"Menu_Dlg_Leaderboards_Lost_Connection"     "You must be connected to Xbox LIVE to view Leaderboards. Please check your connection and try again." [$X360]
"Menu_Dlg_Leaderboards_Lost_Connection"     "You must be connected to PlayStation®Network and Steam to view Leaderboards. Please check your connection and try again." [$PS3]
"Menu_Dlg_Leaderboards_Lost_Connection"     "You must be connected to Steam to view Leaderboards. Please check your connection and try again."

which contain 3 definitions for the key "Menu_Dlg_Leaderboards_Lost_Connection", depending on what environment values were set.

In order to not lose these values when parsing the file, I chose to modify the key at parse time by appending the condition if one was present. This code implements the change:

LBRACK,RBRACK = map(Suppress, "[]")
qualExpr = Word(alphanums+'$!&|')
qualExprCondition = LBRACK + qualExpr + RBRACK

key_value = Group(key_qs + value + Optional(qualExprCondition("qual")))
def addQualifierToKey(tokens):
    tt = tokens[0]
    if 'qual' in tt:
        tt[0] += '/' + tt.pop(-1)
key_value.setParseAction(addQualifierToKey)

With this change, the sample above yields 3 keys:

  • Menu_Dlg_Leaderboards_Lost_Connection/$X360
  • Menu_Dlg_Leaderboards_Lost_Connection/$PS3
  • Menu_Dlg_Leaderboards_Lost_Connection
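
Once the keys carry their conditions, a later step could collapse them back to plain keys for one target environment. The following is only a sketch under stated assumptions: condition_holds and resolve are hypothetical helpers, not part of pyparsing or of the answer's parser; conditions are assumed to contain no parentheses, with && binding tighter than ||; keys are assumed not to contain '/' themselves; and letting a matching qualified entry override the unqualified one is a design guess, not documented game behavior.

```python
def condition_holds(cond, defined):
    """Evaluate a qualifier such as '!$X360&&!$PS3' against a set of symbols.

    Assumption: no parentheses, and && binds tighter than ||.
    """
    def term(t):
        negate = t.startswith('!')
        name = t.lstrip('!').lstrip('$')
        return (name in defined) != negate

    return any(all(term(t) for t in alt.split('&&'))
               for alt in cond.split('||'))


def resolve(entries, defined):
    """Collapse 'key/condition' entries to plain keys for one environment."""
    resolved = {}
    # First pass: unqualified entries act as defaults.
    for key, value in entries.items():
        if '/' not in key:
            resolved[key] = value
    # Second pass: qualified entries whose condition holds override defaults.
    for key, value in entries.items():
        if '/' in key:
            base, cond = key.rsplit('/', 1)
            if condition_holds(cond, defined):
                resolved[base] = value
    return resolved


entries = {
    'Menu_Dlg_Leaderboards_Lost_Connection/$X360': 'Xbox LIVE text',
    'Menu_Dlg_Leaderboards_Lost_Connection/$PS3': 'PlayStation text',
    'Menu_Dlg_Leaderboards_Lost_Connection': 'Steam text',
}
print(resolve(entries, {'WIN32'}))
print(resolve(entries, {'X360'}))
```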

Lastly, the handling of comments, probably the easiest. Pyparsing has built-in support for skipping over comments, just like whitespace. You just need to define the expression for the comment, and have the top-level parser ignore it. To support this feature, several common comment forms are pre-defined in pyparsing. In this case, the solution is just to add this line to the final parser definition:

parser.ignore(dblSlashComment)
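
To see ignore() at work in isolation, here is a small sketch on a made-up input. The pairs grammar is a simplified stand-in for illustration, not the answer's full parser.

```python
from pyparsing import QuotedString, Group, ZeroOrMore, dblSlashComment

qs = QuotedString('"')
# Simplified stand-in grammar: a flat list of "key" "value" pairs.
pairs = ZeroOrMore(Group(qs + qs))
pairs.ignore(dblSlashComment)

text = '''
// header comment
"key1" "value1" // trailing comment
// "commented" "out"
"key2" "value2"
'''
result = pairs.parseString(text).asList()
print(result)
```

The commented-out pair never reaches the results, because the comment expression is skipped wherever whitespace would be skipped.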

And LASTLY lastly, there is a minor bug in the implementation of QuotedString, in which standard whitespace string literals like \t and \n are not handled, and are just treated as an unnecessarily-escaped 't' or 'n'. So for now, when this line is parsed:

"SFUI_SteamOverlay_Text"  "This feature requires Steam Community In-Game to be enabled.\n\nYou might need to restart the game after you enable this feature in Steam:\nSteam -> File -> Settings -> In-Game: Enable Steam Community In-Game\n" [$WIN32]

For the value string you just get:

This feature requires Steam Community In-Game to be enabled.nnYou 
might need to restart the game after you enable this feature in 
Steam:nSteam -> File -> Settings -> In-Game: Enable Steam Community 
In-Gamen

instead of:

This feature requires Steam Community In-Game to be enabled.

You might need to restart the game after you enable this feature in Steam:
Steam -> File -> Settings -> In-Game: Enable Steam Community In-Game

I will have to fix this behavior in the next release of pyparsing.
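
Until then, one possible workaround is to decode the escapes yourself. The helper below is a sketch, not part of pyparsing: unescape_value is a hypothetical name, and it assumes you have the raw quoted text, including the surrounding quotes (for example by constructing the QuotedString with unquoteResults=False).

```python
import re

# Escape letters and the characters they stand for; any other escaped
# character is kept as-is without the backslash.
_ESCAPES = {'n': '\n', 't': '\t', 'r': '\r', '"': '"', '\\': '\\'}


def unescape_value(raw):
    """Strip the surrounding quotes from raw and decode backslash escapes."""
    body = raw[1:-1]
    return re.sub(r'\\(.)',
                  lambda m: _ESCAPES.get(m.group(1), m.group(1)),
                  body, flags=re.DOTALL)


raw = r'"\"HAPPY HOLIDAYS\"\n    -Midnight Riders"'
print(unescape_value(raw))
```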

Here is the final parser code:

from pyparsing import (Suppress, QuotedString, Forward, Group, Dict, 
    ZeroOrMore, Word, alphanums, Optional, dblSlashComment)

LBRACE,RBRACE = map(Suppress, "{}")

key_qs = QuotedString('"').setName("key_qs")
value_qs = QuotedString('"', escChar='\\', multiline=True).setName("value_qs")

# use this code to convert integer values to ints at parse time
def convert_integers(tokens):
    if tokens[0].isdigit():
        tokens[0] = int(tokens[0])
value_qs.setParseAction(convert_integers)

LBRACK,RBRACK = map(Suppress, "[]")
qualExpr = Word(alphanums+'$!&|')
qualExprCondition = LBRACK + qualExpr + RBRACK

value = Forward()
key_value = Group(key_qs + value + Optional(qualExprCondition("qual")))
def addQualifierToKey(tokens):
    tt = tokens[0]
    if 'qual' in tt:
        tt[0] += '/' + tt.pop(-1)
key_value.setParseAction(addQualifierToKey)

struct = (LBRACE + Dict(ZeroOrMore(key_value)) + RBRACE).setName("struct")
value <<= (value_qs | struct)
parser = Dict(key_value)
parser.ignore(dblSlashComment)

sample = open('cs_go_sample2.txt').read()
config = parser.parseString(sample)


print(config.keys())
for k in config.lang.keys():
    print('- ' + k)

#~ config.lang.pprint()
print(config.lang.Tokens.StickerKit_comm01_burn_them_all)
print(config.lang.Tokens['SFUI_SteamOverlay_Text/$WIN32'])
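
On the encoding part of the question: the "Expected quoted string ... (at char 0)" error is consistent with a byte order mark (BOM) at the start of the file, though that is an assumption about the file rather than something confirmed here. Python's 'utf-16-le' codec leaves a BOM in the decoded text as the character U+FEFF, while the 'utf-16' codec consumes it. The sketch below demonstrates the difference on a small temporary file that stands in for csgo_english.txt.

```python
import os
import tempfile

# Build a tiny stand-in file: UTF-16-LE text preceded by a BOM.
text = '"lang"\n{\n"Language" "English"\n}\n'
path = os.path.join(tempfile.mkdtemp(), 'sample_utf16.txt')
with open(path, 'wb') as f:
    f.write(b'\xff\xfe' + text.encode('utf-16-le'))

# 'utf-16-le' keeps the BOM as the first character of the string.
with open(path, encoding='utf-16-le') as f:
    with_bom = f.read()

# 'utf-16' detects the byte order from the BOM and strips it.
with open(path, encoding='utf-16') as f:
    without_bom = f.read()

print(repr(with_bom[0]))
print(repr(without_bom[0]))
```

If the real file does start with a BOM, reading it with encoding='utf-16' gives an ordinary str that can be passed to parser.parseString() without further changes.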