How to use Stanford LexParser for Chinese text?

2019-05-21 01:12发布

问题:

I can't seem to get the correct input encoding for Stanford NLP's LexParser.

How do I use the Stanford LexParser for Chinese text?

I've done the following to download the tool:

$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-04-20.zip
$ unzip stanford-parser-full-2015-04-20.zip 
$ cd stanford-parser-full-2015-04-20/

And my input text is in UTF-8:

$ echo "应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏心 乐事 。" > input.txt

$ echo "应有尽有#VV 的#DEC 丰富#JJ 选择#NN 定#VV 将#AD 为#P 您#PN 的#DEG 旅程#NN 增添#VV 无数#CD 的#DEG 赏心#NN 乐事#NN  。#PUNCT" > pos-input.txt

According to the README.txt, the parser was trained on:

Chinese There are Chinese grammars trained just on mainland material from Xinhua and more mixed material from the LDC Chinese Treebank. The default input encoding is GB18030.

So I've tried with the UTF-8 file first:

$ bash lexparser-lang.sh Chinese 80 edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz parsed input.txt
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ...  done [1.0 sec].
Parsing file: input.txt
Parsing [sent. 1 len. 16]: 应有尽有 的1�7 丰富 选择 宄1�7 射1�7 丄1�7 悄1�7 的1�7 旅程 增添 无数 的1�7 赏心 乐事 〄1�7
Parsed file: input.txt [1 sentences].
Parsed 16 words in 1 sentences (21.00 wds/sec; 1.31 sents/sec).

It didn't seem to work. The parser produced this file, input.txt.parsed.80.stp

[out]:

$ cat input.txt.parsed.80.stp 
(FRAG (NR 应有尽有) (NR 的1�7) (NT 丰富) (NT 选择) (NN 宄1�7) (NN 射1�7) (NN 丄1�7) (NN 悄1�7) (NR 的1�7) (NT 旅程) (NT 增添) (NN 无数) (NN 的1�7) (NR 赏心) (NR 乐事) (VV 〄1�7))

Then i'ved tried to encode the sentence into GB18030:

$ bash lexparser-lang.sh Chinese 80 edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz parsed input-gb18030.txt
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ...  done [1.0 sec].
Parsing file: input-gb18030.txt
Parsing [sent. 1 len. 16]: Ӧ�о��� �� �ḻ ѡ�� �� �� Ϊ �� �� �ó� ���� ���� �� ���� ���� ��
Parsed file: input-gb18030.txt [1 sentences].
Parsed 16 words in 1 sentences (19.90 wds/sec; 1.24 sents/sec).
alvas@ubi:~/stanford-parser-full-2015-04-20$ cat input-gb18030.txt.parsed.80.stp 
(IP
  (NP
    (CP
      (IP
        (VP (VV Ӧ�о���)))
      (DEC ��))
    (ADJP (JJ �ḻ))
    (NP (NN ѡ��)))
  (VP (VV ��)
    (VP
      (ADVP (AD ��))
      (PP (P Ϊ)
        (NP
          (DNP
            (NP (PN ��))
            (DEG ��))
          (NP (NN �ó�))))
      (VP (VV ����)
        (NP
          (DNP
            (ADJP (JJ ����))
            (DEG ��))
          (NP (NN ����) (NN ����))))))
  (PU ��))

It seems like it's working but how do I convert the file back into UTF8?

I've tried this but it didn't work:

$ cat input-gb18030.txt.parsed.80.stp | python -c "print raw_input().decode('GB18030').encode('utf8')"
(IP

Here's some concluding question:

  • How do I convert between GB18030 to UTF8 and UTF8 to GB18030?
  • How do I use the Stanford LexParser for Chinese UTF8 text?

回答1:

I followed your steps and it shows that you can simply use encoding convertors to achieve your goal.

I use iconv in my testing.

iconv -f GB18030 -t UTF-8 input2.txt.parsed.80.stp -o output

Here is my output:

dmk@dmk-debian /t/stanford-parser-full-2015-04-20 ❯❯❯ cat input2.txt.parsed.80.stp
(IP
  (NP
    (CP
      (IP
        (VP (VV Ӧ�о���)))
      (DEC ��))
    (ADJP (JJ �ḻ))
    (NP (NN ѡ��)))
  (VP (VV ��)
    (VP
      (ADVP (AD ��))
      (PP (P Ϊ)
        (NP
          (DNP
            (NP (PN ��))
            (DEG ��))
          (NP (NN �ó�))))
      (VP (VV ����)
        (NP
          (DNP
            (ADJP (JJ ����))
            (DEG ��))
          (NP (NN ����) (NN ����))))))
  (PU ��))

dmk@dmk-debian /t/stanford-parser-full-2015-04-20 ❯❯❯ iconv -f GB18030 -t UTF-8 input2.txt.parsed.80.stp -o output
dmk@dmk-debian /t/stanford-parser-full-2015-04-20 ❯❯❯ cat output
(IP
  (NP
    (CP
      (IP
        (VP (VV 应有尽有)))
      (DEC 的))
    (ADJP (JJ 丰富))
    (NP (NN 选择)))
  (VP (VV 定)
    (VP
      (ADVP (AD 将))
      (PP (P 为)
        (NP
          (DNP
            (NP (PN 您))
            (DEG 的))
          (NP (NN 旅程))))
      (VP (VV 增添)
        (NP
          (DNP
            (ADJP (JJ 无数))
            (DEG 的))
          (NP (NN 赏心) (NN 乐事))))))
  (PU 。))