-->

How can lexing efficiency be improved?

2019-02-15 02:48发布

站内文章 / 移动开发

30 0

问题:

In parsing a large 3 gigabyte file with DCG, efficiency is of importance.

The current version of my lexer is using mostly the or predicate ;/2 but I read that indexing can help.

Indexing is a technique used to quickly select candidate clauses of a predicate for a specific goal. In most Prolog systems, indexing is done (only) on the first argument of the head. If this argument is instantiated to an atom, integer, float or compound term with functor, hashing is used to quickly select all clauses where the first argument may unify with the first argument of the goal. SWI-Prolog supports just-in-time and multi-argument indexing. See section 2.18.

Can someone give an example of using indexing for lexing and possibly explain how it improves efficiency?

Details

Note: I changed some of the names before coping the source code into this question. If you find a mistake feel free to edit it here or leave me a comment and I will gladly fix it.

Currently my lexer/tokenizer (based on mzapotoczny/prolog-interpreter parser.pl) is this

% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If double_quotes flag is set to `code`, the the values with "" will not be matched.

:- use_module(library(pio)). 
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).

lexer(Tokens) -->
   white_space,
   (
       (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta}
      ;  "}",       !, { Token = tokRMusta}
      ;  "\\",      !, { Token = tokSlash}
      ;  "->",      !, { Token = tokImpl}
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

white_space -->
   [Char], { code_type(Char, space) }, !, white_space.
white_space -->
    "--", whole_line, !, white_space.
white_space -->
   [].

whole_line --> "\n", !.
whole_line --> [_], whole_line.

digit(D) -->
   [D],
      { code_type(D, digit) }.

digits([D|T]) -->
   digit(D),
   !,
   digits(T).
digits([]) -->
   [].

number(D, N) -->
   digits(Ds),
      { number_chars(N, [D|Ds]) }.

letter(L) -->
   [L], { code_type(L, alpha) }.

alphanum([A|T]) -->
   [A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
   [].

alphanum([]).
alphanum([H|T]) :- code_type(H, alpha), alphanum(T).

identifier(L, Id) -->
   alphanum(As),
      { atom_codes(Id, [L|As]) }.

Here are some helper predicates used for development and testing.

read_file_for_lexing_and_user_review(Path) :-
    open(Path,read,Input),
    read_input_for_user_review(Input), !,
    close(Input).

read_file_for_lexing_and_performance(Path,Limit) :-
    open(Path,read,Input),
    read_input_for_performance(Input,0,Limit), !,
    close(Input).

read_input(Input) :-
    at_end_of_stream(Input).

read_input(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    read_input(Input).

read_input_for_user_review(Input) :-
    at_end_of_stream(Input).

read_input_for_user_review(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line_for_user_review(Line),
    nl,
    print('Press spacebar to continue or any other key to exit: '),
    get_single_char(Key),
    process_user_continue_or_exit_key(Key,Input).

read_input_for_performance(Input,Count,Limit) :-
    Count >= Limit.

read_input_for_performance(Input,_,_) :-
    at_end_of_stream(Input).

read_input_for_performance(Input,Count0,Limit) :-
    % print(Count0),
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    Count is Count0 + 1,
    read_input_for_performance(Input,Count,Limit).

process_user_continue_or_exit_key(32,Input) :-  % space bar
    nl, nl,
    read_input_for_user_review(Input).

process_user_continue_or_exit_key(Key) :-
    Key \= 32.

lex_line_for_user_review(Line) :-
    lex_line(Line,TokList),
    print(Line),
    nl,
    print(TokList),
    nl.

lex_line(Line,TokList) :-
    string_chars(Line,Code_line),
    phrase(lexer(TokList),Code_line).

lex_line(Line) :-
    string_chars(Line,Code_line),
    phrase(lexer(TokList),Code_line).

read_user_input_for_lexing_and_user_review :-
    print('Enter a line to parse or just Enter to exit: '),
    nl,
    read_string(user, "\n", "\r", _, String),
    nl,
    lex_line_for_user_review(String),
    nl,
    continue_user_input_for_lexing_and_user_review(String).

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,N),
    N > 0,
    read_user_input_for_lexing_and_user_review.

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,0).

read_user_input_for_lexing_and_user_review/0 allows a user to enter a string at the terminal for lexing and review the tokens.

read_file_for_lexing_and_user_review/1 Reads a file for lexing and review the tokens for each line one line at a time.

read_file_for_lexing_and_performance/2 Reads a file for lexing with a limit on the number of lines to lex. This is for use with gathering basic performance statistics to measure efficiency. Meant to be used with time/1.

Examples of debug and testing predicates.

?- read_user_input_for_lexing_and_user_review.
'Enter a line to parse or just Enter to exit: '
|: ID   GRAA_HUMAN              Reviewed;         262 AA.

"ID   GRAA_HUMAN              Reviewed;         262 AA."
[tokVar('ID'),tokVar('GRAA'),tokUnderscore,tokVar('HUMAN'),tokVar('Reviewed'),tokSemicolon,tokNumber(262),tokVar('AA'),tokPeriod]

'Enter a line to parse or just Enter to exit: '
|: 
...

?- read_file_for_lexing_and_user_review('C:/example_lines.dat').
"ID   GRAA_HUMAN              Reviewed;         262 AA."
[tokVar('ID'),tokVar('GRAA'),tokUnderscore,tokVar('HUMAN'),tokVar('Reviewed'),tokSemicolon,tokNumber(262),tokVar('AA'),tokPeriod]

'Press spacebar to continue or any other key to exit: '

"AC   P12544; A4PHN1; Q6IB36;"
[tokVar('AC'),tokVar('P12544'),tokSemicolon,tokVar('A4PHN1'),tokSemicolon,tokVar('Q6IB36'),tokSemicolon]

'Press spacebar to continue or any other key to exit: '

"DT   01-OCT-1989, integrated into UniProtKB/Swiss-Prot."
[tokVar('DT'),tokNumber(1),tokMinus,tokVar('OCT'),tokMinus,tokNumber(1989),tokComma,tokVar(integrated),tokVar(into),tokVar('UniProtKB'),tokForwardSlash,tokVar('Swiss'),tokMinus,tokVar('Prot'),tokPeriod]

'Press spacebar to continue or any other key to exit: '

"//"
[tokForwardSlash,tokForwardSlash]

'Press spacebar to continue or any other key to exit: '

true.

?- time(read_file_for_lexing_and_performance('C:/example_lines.dat',1000000)).
% 815,414,350 inferences, 63.094 CPU in 64.251 seconds (98% CPU, 12923853 Lips)
true.

Related question

How to see all of the answers in SWI-Prolog without pressing spacebar? Specifically see the section on details.

Real world example data. (UniProt Knowledgebase Structure of a sequence entry)

File format

ID   GRAA_HUMAN              Reviewed;         262 AA.
AC   P12544; A4PHN1; Q6IB36;
DT   01-OCT-1989, integrated into UniProtKB/Swiss-Prot.
DT   11-JAN-2011, sequence version 2.
DT   07-JUN-2017, entry version 186.
DE   RecName: Full=Granzyme A;
DE            EC=3.4.21.78;
DE   AltName: Full=CTL tryptase;
DE   AltName: Full=Cytotoxic T-lymphocyte proteinase 1;
DE   AltName: Full=Fragmentin-1;
DE   AltName: Full=Granzyme-1;
DE   AltName: Full=Hanukkah factor;
DE            Short=H factor;
DE            Short=HF;
DE   Flags: Precursor;
GN   Name=GZMA; Synonyms=CTLA3, HFSP;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM ALPHA), AND VARIANT THR-121.
RC   TISSUE=T-cell;
RX   PubMed=3257574; DOI=10.1073/pnas.85.4.1184;
RA   Gershenfeld H.K., Hershberger R.J., Shows T.B., Weissman I.L.;
RT   "Cloning and chromosomal assignment of a human cDNA encoding a T cell-
RT   and natural killer cell-specific trypsin-like serine protease.";
RL   Proc. Natl. Acad. Sci. U.S.A. 85:1184-1188(1988).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] (ISOFORM ALPHA), AND VARIANT
RP   THR-121.
RA   Ebert L., Schick M., Neubert P., Schatten R., Henze S., Korn B.;
RT   "Cloning of human full open reading frames in Gateway(TM) system entry
RT   vector (pDONR201).";
RL   Submitted (JUN-2004) to the EMBL/GenBank/DDBJ databases.
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RX   PubMed=15372022; DOI=10.1038/nature02919;
RA   Schmutz J., Martin J., Terry A., Couronne O., Grimwood J., Lowry S.,
RA   Gordon L.A., Scott D., Xie G., Huang W., Hellsten U., Tran-Gyamfi M.,
RA   She X., Prabhakar S., Aerts A., Altherr M., Bajorek E., Black S.,
RA   Branscomb E., Caoile C., Challacombe J.F., Chan Y.M., Denys M.,
RA   Detter J.C., Escobar J., Flowers D., Fotopulos D., Glavina T.,
RA   Gomez M., Gonzales E., Goodstein D., Grigoriev I., Groza M.,
RA   Hammon N., Hawkins T., Haydu L., Israni S., Jett J., Kadner K.,
RA   Kimball H., Kobayashi A., Lopez F., Lou Y., Martinez D., Medina C.,
RA   Morgan J., Nandkeshwar R., Noonan J.P., Pitluck S., Pollard M.,
RA   Predki P., Priest J., Ramirez L., Retterer J., Rodriguez A.,
RA   Rogers S., Salamov A., Salazar A., Thayer N., Tice H., Tsai M.,
RA   Ustaszewska A., Vo N., Wheeler J., Wu K., Yang J., Dickson M.,
RA   Cheng J.-F., Eichler E.E., Olsen A., Pennacchio L.A., Rokhsar D.S.,
RA   Richardson P., Lucas S.M., Myers R.M., Rubin E.M.;
RT   "The DNA sequence and comparative analysis of human chromosome 5.";
RL   Nature 431:268-274(2004).
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] (ISOFORM ALPHA), AND VARIANT
RP   THR-121.
RC   TISSUE=Blood;
RX   PubMed=15489334; DOI=10.1101/gr.2596504;
RG   The MGC Project Team;
RT   "The status, quality, and expansion of the NIH full-length cDNA
RT   project: the Mammalian Gene Collection (MGC).";
RL   Genome Res. 14:2121-2127(2004).
RN   [5]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA] OF 1-72, NUCLEOTIDE SEQUENCE [MRNA]
RP   OF 1-34 (ISOFORM BETA), ALTERNATIVE PROMOTER USAGE, AND INDUCTION.
RX   PubMed=17180578; DOI=10.1007/s10038-006-0099-9;
RA   Ruike Y., Katsuma S., Hirasawa A., Tsujimoto G.;
RT   "Glucocorticoid-induced alternative promoter usage for a novel 5'
RT   variant of granzyme A.";
RL   J. Hum. Genet. 52:172-178(2007).
RN   [6]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA] OF 1-23.
RA   Goralski T.J., Krensky A.M.;
RT   "The upstream region of the human granzyme A locus contains both
RT   positive and negative transcriptional regulatory elements.";
RL   Submitted (NOV-1995) to the EMBL/GenBank/DDBJ databases.
RN   [7]
RP   PROTEIN SEQUENCE OF 29-53.
RX   PubMed=3047119;
RA   Poe M., Bennett C.D., Biddison W.E., Blake J.T., Norton G.P.,
RA   Rodkey J.A., Sigal N.H., Turner R.V., Wu J.K., Zweerink H.J.;
RT   "Human cytotoxic lymphocyte tryptase. Its purification from granules
RT   and the characterization of inhibitor and substrate specificity.";
RL   J. Biol. Chem. 263:13215-13222(1988).
RN   [8]
RP   PROTEIN SEQUENCE OF 29-40, AND CHARACTERIZATION.
RX   PubMed=3262682;
RA   Hameed A., Lowrey D.M., Lichtenheld M., Podack E.R.;
RT   "Characterization of three serine esterases isolated from human IL-2
RT   activated killer cells.";
RL   J. Immunol. 141:3142-3147(1988).
RN   [9]
RP   PROTEIN SEQUENCE OF 29-39, AND CHARACTERIZATION.
RX   PubMed=3263427;
RA   Kraehenbuhl O., Rey C., Jenne D.E., Lanzavecchia A., Groscurth P.,
RA   Carrel S., Tschopp J.;
RT   "Characterization of granzymes A and B isolated from granules of
RT   cloned human cytotoxic T lymphocytes.";
RL   J. Immunol. 141:3471-3477(1988).
RN   [10]
RP   FUNCTION AS SET PROTEASE.
RX   PubMed=11555662; DOI=10.1074/jbc.M108137200;
RA   Beresford P.J., Zhang D., Oh D.Y., Fan Z., Greer E.L., Russo M.L.,
RA   Jaju M., Lieberman J.;
RT   "Granzyme A activates an endoplasmic reticulum-associated caspase-
RT   independent nuclease to induce single-stranded DNA nicks.";
RL   J. Biol. Chem. 276:43285-43293(2001).
RN   [11]
RP   FUNCTION AS SET PROTEASE.
RX   PubMed=12628186; DOI=10.1016/S0092-8674(03)00150-8;
RA   Fan Z., Beresford P.J., Oh D.Y., Zhang D., Lieberman J.;
RT   "Tumor suppressor NM23-H1 is a granzyme A-activated DNase during CTL-
RT   mediated apoptosis, and the nucleosome assembly protein SET is its
RT   inhibitor.";
RL   Cell 112:659-672(2003).
RN   [12]
RP   FUNCTION, AND INTERACTION WITH APEX1.
RX   PubMed=12524539; DOI=10.1038/ni885;
RA   Fan Z., Beresford P.J., Zhang D., Xu Z., Novina C.D., Yoshida A.,
RA   Pommier Y., Lieberman J.;
RT   "Cleaving the oxidative repair protein Ape1 enhances cell death
RT   mediated by granzyme A.";
RL   Nat. Immunol. 4:145-153(2003).
RN   [13]
RP   FUNCTION AS SET PROTEASE.
RX   PubMed=16818237; DOI=10.1016/j.molcel.2006.06.005;
RA   Chowdhury D., Beresford P.J., Zhu P., Zhang D., Sung J.S., Demple B.,
RA   Perrino F.W., Lieberman J.;
RT   "The exonuclease TREX1 is in the SET complex and acts in concert with
RT   NM23-H1 to degrade DNA during granzyme A-mediated cell death.";
RL   Mol. Cell 23:133-142(2006).
RN   [14]
RP   GLYCOSYLATION [LARGE SCALE ANALYSIS] AT ASN-170.
RC   TISSUE=Liver;
RX   PubMed=19159218; DOI=10.1021/pr8008012;
RA   Chen R., Jiang X., Sun D., Han G., Wang F., Ye M., Wang L., Zou H.;
RT   "Glycoproteomics analysis of human liver tissue by combination of
RT   multiple enzyme digestion and hydrazide chemistry.";
RL   J. Proteome Res. 8:651-661(2009).
RN   [15]
RP   3D-STRUCTURE MODELING OF 29-262.
RX   PubMed=3237717; DOI=10.1002/prot.340040306;
RA   Murphy M.E.P., Moult J., Bleackley R.C., Gershenfeld H.,
RA   Weissman I.L., James M.N.G.;
RT   "Comparative molecular model building of two serine proteinases from
RT   cytotoxic T lymphocytes.";
RL   Proteins 4:190-204(1988).
RN   [16]
RP   X-RAY CRYSTALLOGRAPHY (2.4 ANGSTROMS) OF 29-262 IN COMPLEX WITH A
RP   TRIPEPTIDE CMK INHIBITOR.
RX   PubMed=12819769; DOI=10.1038/nsb944;
RA   Bell J.K., Goetz D.H., Mahrus S., Harris J.L., Fletterick R.J.,
RA   Craik C.S.;
RT   "The oligomeric structure of human granzyme A is a determinant of its
RT   extended substrate specificity.";
RL   Nat. Struct. Biol. 10:527-534(2003).
RN   [17]
RP   X-RAY CRYSTALLOGRAPHY (2.5 ANGSTROMS) OF 29-262 IN COMPLEX WITH
RP   SUBSTRATE.
RX   PubMed=12819770; DOI=10.1038/nsb945;
RA   Hink-Schauer C., Estebanez-Perpina E., Kurschus F.C., Bode W.,
RA   Jenne D.E.;
RT   "Crystal structure of the apoptosis-inducing human granzyme A dimer.";
RL   Nat. Struct. Biol. 10:535-540(2003).
CC   -!- FUNCTION: Abundant protease in the cytosolic granules of cytotoxic
CC       T-cells and NK-cells which activates caspase-independent cell
CC       death with morphological features of apoptosis when delivered into
CC       the target cell through the immunological synapse. It cleaves
CC       after Lys or Arg. Cleaves APEX1 after 'Lys-31' and destroys its
CC       oxidative repair activity. Cleaves the nucleosome assembly protein
CC       SET after 'Lys-189', which disrupts its nucleosome assembly
CC       activity and allows the SET complex to translocate into the
CC       nucleus to nick and degrade the DNA. {ECO:0000269|PubMed:11555662,
CC       ECO:0000269|PubMed:12524539, ECO:0000269|PubMed:12628186,
CC       ECO:0000269|PubMed:16818237}.
CC   -!- CATALYTIC ACTIVITY:
CC       Reaction=Hydrolysis of proteins, including fibronectin, type IV
CC         collagen and nucleolin. Preferential cleavage: -Arg-|-Xaa-,
CC         -Lys-|-Xaa- >> -Phe-|-Xaa- in small molecule substrates.;
CC         EC=3.4.21.78;
CC   -!- SUBUNIT: Homodimer; disulfide-linked. Interacts with APEX1.
CC       {ECO:0000269|PubMed:12524539, ECO:0000269|PubMed:12819769,
CC       ECO:0000269|PubMed:12819770}.
CC   -!- SUBCELLULAR LOCATION: Isoform alpha: Secreted. Cytoplasmic
CC       granule.
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative promoter usage; Named isoforms=2;
CC       Name=alpha;
CC         IsoId=P12544-1; Sequence=Displayed;
CC       Name=beta;
CC         IsoId=P12544-2; Sequence=VSP_038571, VSP_038572;
CC   -!- INDUCTION: Dexamethasone (DEX) induces expression of isoform beta
CC       and represses expression of isoform alpha. The alteration in
CC       expression is mediated by binding of glucocorticoid receptor to
CC       independent promoters adjacent to the alternative first exons of
CC       isoform alpha and isoform beta. {ECO:0000269|PubMed:17180578}.
CC   -!- SIMILARITY: Belongs to the peptidase S1 family. Granzyme
CC       subfamily. {ECO:0000255|PROSITE-ProRule:PRU00274}.
CC   -!- CAUTION: Exons 1a and 1b of the sequence reported in
CC       PubMed:17180578 are of human origin, however exon 2 shows strong
CC       similarity to the rat sequence. {ECO:0000305}.
CC   -!- WEB RESOURCE: Name=Atlas of Genetics and Cytogenetics in Oncology
CC       and Haematology;
CC       URL="http://atlasgeneticsoncology.org/Genes/GZMAID51130ch5q11.html";
DR   EMBL; M18737; AAA52647.1; -; mRNA.
DR   EMBL; CR456968; CAG33249.1; -; mRNA.
DR   EMBL; AC091977; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   EMBL; BC015739; AAH15739.1; -; mRNA.
DR   EMBL; AB284134; BAF56159.1; -; mRNA.
DR   EMBL; U40006; AAD00009.1; -; Genomic_DNA.
DR   CCDS; CCDS3965.1; -. [P12544-1]
DR   PIR; A31372; A31372.
DR   RefSeq; NP_006135.1; NM_006144.3.
DR   UniGene; Hs.90708; -.
DR   PDB; 1HF1; Model; -; A=29-262.
DR   PDB; 1OP8; X-ray; 2.50 A; A/B/C/D/E/F=29-262.
DR   PDB; 1ORF; X-ray; 2.40 A; A=29-262.
DR   PDBsum; 1HF1; -.
DR   PDBsum; 1OP8; -.
DR   PDBsum; 1ORF; -.
DR   ProteinModelPortal; P12544; -.
DR   SMR; P12544; -.
DR   BioGrid; 109256; 17.
DR   IntAct; P12544; 4.
DR   STRING; 9606.ENSP00000274306; -.
DR   ChEMBL; CHEMBL4307; -.
DR   MEROPS; S01.135; -.
DR   iPTMnet; P12544; -.
DR   PhosphoSitePlus; P12544; -.
DR   BioMuta; GZMA; -.
DR   DMDM; 317373360; -.
DR   MaxQB; P12544; -.
DR   PaxDb; P12544; -.
DR   PeptideAtlas; P12544; -.
DR   PRIDE; P12544; -.
DR   DNASU; 3001; -.
DR   Ensembl; ENST00000274306; ENSP00000274306; ENSG00000145649. [P12544-1]
DR   GeneID; 3001; -.
DR   KEGG; hsa:3001; -.
DR   UCSC; uc003jpm.4; human. [P12544-1]
DR   CTD; 3001; -.
DR   DisGeNET; 3001; -.
DR   GeneCards; GZMA; -.
DR   HGNC; HGNC:4708; GZMA.
DR   HPA; HPA054134; -.
DR   MIM; 140050; gene.
DR   neXtProt; NX_P12544; -.
DR   OpenTargets; ENSG00000145649; -.
DR   PharmGKB; PA29086; -.
DR   eggNOG; KOG3627; Eukaryota.
DR   eggNOG; COG5640; LUCA.
DR   GeneTree; ENSGT00760000118895; -.
DR   HOGENOM; HOG000251820; -.
DR   HOVERGEN; HBG013304; -.
DR   InParanoid; P12544; -.
DR   KO; K01352; -.
DR   OMA; KEFPYPC; -.
DR   OrthoDB; EOG091G0AH5; -.
DR   PhylomeDB; P12544; -.
DR   TreeFam; TF333630; -.
DR   BRENDA; 3.4.21.78; 2681.
DR   SIGNOR; P12544; -.
DR   EvolutionaryTrace; P12544; -.
DR   GeneWiki; GZMA; -.
DR   GenomeRNAi; 3001; -.
DR   PRO; PR:P12544; -.
DR   Proteomes; UP000005640; Chromosome 5.
DR   Bgee; ENSG00000145649; Expressed in 150 organ(s), highest expression level in leukocyte.
DR   CleanEx; HS_GZMA; -.
DR   Genevisible; P12544; HS.
DR   GO; GO:0005576; C:extracellular region; IEA:UniProtKB-SubCell.
DR   GO; GO:0001772; C:immunological synapse; TAS:UniProtKB.
DR   GO; GO:0005634; C:nucleus; TAS:UniProtKB.
DR   GO; GO:0042803; F:protein homodimerization activity; IDA:UniProtKB.
DR   GO; GO:0004252; F:serine-type endopeptidase activity; IDA:UniProtKB.
DR   GO; GO:0006915; P:apoptotic process; TAS:UniProtKB.
DR   GO; GO:0019835; P:cytolysis; IEA:UniProtKB-KW.
DR   GO; GO:0006955; P:immune response; TAS:UniProtKB.
DR   GO; GO:0043392; P:negative regulation of DNA binding; IDA:UniProtKB.
DR   GO; GO:0032078; P:negative regulation of endodeoxyribonuclease activity; IDA:UniProtKB.
DR   GO; GO:0051354; P:negative regulation of oxidoreductase activity; IDA:UniProtKB.
DR   GO; GO:0043065; P:positive regulation of apoptotic process; IDA:UniProtKB.
DR   GO; GO:0051603; P:proteolysis involved in cellular protein catabolic process; IDA:UniProtKB.
DR   CDD; cd00190; Tryp_SPc; 1.
DR   InterPro; IPR009003; Peptidase_S1_PA.
DR   InterPro; IPR001314; Peptidase_S1A.
DR   InterPro; IPR001254; Trypsin_dom.
DR   InterPro; IPR018114; TRYPSIN_HIS.
DR   InterPro; IPR033116; TRYPSIN_SER.
DR   Pfam; PF00089; Trypsin; 1.
DR   PRINTS; PR00722; CHYMOTRYPSIN.
DR   SMART; SM00020; Tryp_SPc; 1.
DR   SUPFAM; SSF50494; SSF50494; 1.
DR   PROSITE; PS50240; TRYPSIN_DOM; 1.
DR   PROSITE; PS00134; TRYPSIN_HIS; 1.
DR   PROSITE; PS00135; TRYPSIN_SER; 1.
PE   1: Evidence at protein level;
KW   3D-structure; Alternative promoter usage; Apoptosis;
KW   Complete proteome; Cytolysis; Direct protein sequencing;
KW   Disulfide bond; Glycoprotein; Hydrolase; Polymorphism; Protease;
KW   Reference proteome; Secreted; Serine protease; Signal; Zymogen.
FT   SIGNAL        1     26       In isoform alpha.
FT   PROPEP       27     28       Activation peptide (in isoform alpha).
FT                                {ECO:0000269|PubMed:3047119,
FT                                ECO:0000269|PubMed:3262682,
FT                                ECO:0000269|PubMed:3263427}.
FT                                /FTId=PRO_0000027393.
FT   CHAIN        29    262       Granzyme A.
FT                                /FTId=PRO_0000027394.
FT   DOMAIN       29    259       Peptidase S1. {ECO:0000255|PROSITE-
FT                                ProRule:PRU00274}.
FT   ACT_SITE     69     69       Charge relay system.
FT   ACT_SITE    114    114       Charge relay system.
FT   ACT_SITE    212    212       Charge relay system.
FT   CARBOHYD    170    170       N-linked (GlcNAc...) asparagine.
FT                                {ECO:0000269|PubMed:19159218}.
FT   DISULFID     54     70
FT   DISULFID    148    218
FT   DISULFID    179    197
FT   DISULFID    208    234
FT   VAR_SEQ       1     17       Missing (in isoform beta).
FT                                {ECO:0000303|PubMed:17180578}.
FT                                /FTId=VSP_038571.
FT   VAR_SEQ      18     23       LLLIPE -> MTKGLR (in isoform beta).
FT                                {ECO:0000303|PubMed:17180578}.
FT                                /FTId=VSP_038572.
FT   VARIANT     121    121       M -> T (in dbSNP:rs3104233).
FT                                {ECO:0000269|PubMed:15489334,
FT                                ECO:0000269|PubMed:3257574,
FT                                ECO:0000269|Ref.2}.
FT                                /FTId=VAR_024291.
FT   CONFLICT     33     34       NE -> DT (in Ref. 5; no nucleotide
FT                                entry). {ECO:0000305}.
FT   CONFLICT     36     36       T -> V (in Ref. 5; no nucleotide entry).
FT                                {ECO:0000305}.
FT   CONFLICT     47     47       S -> K (in Ref. 5; no nucleotide entry).
FT                                {ECO:0000305}.
FT   CONFLICT     49     52       DRKT -> KPDS (in Ref. 5; no nucleotide
FT                                entry). {ECO:0000305}.
FT   CONFLICT     62     62       D -> N (in Ref. 5; no nucleotide entry).
FT                                {ECO:0000305}.
FT   CONFLICT     71     72       NL -> IP (in Ref. 5; no nucleotide
FT                                entry). {ECO:0000305}.
FT   STRAND       43     47       {ECO:0000244|PDB:1ORF}.
FT   STRAND       49     51       {ECO:0000244|PDB:1ORF}.
FT   STRAND       53     60       {ECO:0000244|PDB:1ORF}.
FT   STRAND       63     66       {ECO:0000244|PDB:1ORF}.
FT   STRAND       77     81       {ECO:0000244|PDB:1ORF}.
FT   STRAND       83     87       {ECO:0000244|PDB:1ORF}.
FT   STRAND       93     95       {ECO:0000244|PDB:1ORF}.
FT   STRAND       97    102       {ECO:0000244|PDB:1ORF}.
FT   TURN        108    110       {ECO:0000244|PDB:1ORF}.
FT   STRAND      116    122       {ECO:0000244|PDB:1ORF}.
FT   STRAND      127    130       {ECO:0000244|PDB:1ORF}.
FT   STRAND      147    154       {ECO:0000244|PDB:1ORF}.
FT   STRAND      156    160       {ECO:0000244|PDB:1ORF}.
FT   STRAND      167    174       {ECO:0000244|PDB:1ORF}.
FT   HELIX       176    179       {ECO:0000244|PDB:1ORF}.
FT   TURN        182    189       {ECO:0000244|PDB:1ORF}.
FT   STRAND      195    199       {ECO:0000244|PDB:1ORF}.
FT   STRAND      215    218       {ECO:0000244|PDB:1ORF}.
FT   STRAND      221    228       {ECO:0000244|PDB:1ORF}.
FT   STRAND      241    245       {ECO:0000244|PDB:1ORF}.
FT   HELIX       248    258       {ECO:0000244|PDB:1ORF}.
SQ   SEQUENCE   262 AA;  28999 MW;  FD773628BA6F301B CRC64;
     MRNSYRFLAS SLSVVVSLLL IPEDVCEKII GGNEVTPHSR PYMVLLSLDR KTICAGALIA
     KDWVLTAAHC NLNKRSQVIL GAHSITREEP TKQIMLVKKE FPYPCYDPAT REGDLKLLQL
     MEKAKINKYV TILHLPKKGD DVKPGTMCQV AGWGRTHNSA SWSDTLREVN ITIIDRKVCN
     DRNHYNFNPV IGMNMVCAGS LRGGRDSCNG DSGSPLLCEG VFRGVTSFGL ENKCGDPRGP
     GVYILLSKKH LNWIIMTIKG AV
//

AA (Amino Acid) codes

回答1:

Solution:

You should replace the following:

lexer(Tokens) -->
   white_space,
   (
      (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta}
      ;  "}",       !, { Token = tokRMusta}
      ;  "\\",      !, { Token = tokSlash}
      ;  "->",      !, { Token = tokImpl}
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

with

lexer(Tokens) -->
   white_space,
   (
      (
         op_token(Token), ! % replace ;/2 long chain searched blindly with call to new predicate op_token//1 which clauses have indexed access by first arg in Prolog standard way
      ;
         digit(D),  !, number(D, N),
         { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
         {  member((Id, Token), [ (div, tokDiv),
                                 (mod, tokMod),
                                 (where, tokWhere)]),
            !
      ;  Token = tokVar(Id)
         }
      ;  [_],
         { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;
      [],
      { Tokens = [] }
   ).

%%%
op_token(tokColon)      --> ";".
op_token(tokLParen)     --> "(".
op_token(tokRParen)     --> ")".
op_token(tokLMusta)     --> "{".
op_token(tokRMusta)     --> "}".
op_token(tokBackSlash)  --> "\\".
op_token(tokImpl)       --> "->".
op_token(tokPlus)       --> "+".
op_token(tokMinus)      --> "-".
op_token(tokTimes)      --> "*".
op_token(tokEqual)      --> "=".
op_token(tokLt)         --> "<".
op_token(tokGt)         --> ">".
op_token(tokUnderscore) --> "_".
op_token(tokPeriod)     --> ".".
op_token(tokSlash)      --> "/".
op_token(tokComma)      --> ",".
op_token(tokSemicolon)  --> ";".

Edit by Guy Coder

I ran a test using the example data posted in the question into a list where each item in the list was a line in the data converted to character codes. Then with time/1 called lexer on each item in the list and repeated the test for the list 10000 times. The reason the data was loaded into a list and converted to characters codes before time/1 was so that those processes did not skew the results. Each of these runs was repeated 5 times to get a consistency of data.

In the following runs below, for all of the different versions the lexer was extended to cover all of the 7-bit ASCII characters which significantly increased the number of cases for special characters.

For the version in the question.

Version: 1

:- set_prolog_flag(double_quotes,chars).

% 694,080,002 inferences, 151.141 CPU in 151.394 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 150.813 CPU in 151.059 seconds (100% CPU, 4602271 Lips)
% 694,080,001 inferences, 152.063 CPU in 152.326 seconds (100% CPU, 4564439 Lips)
% 694,080,001 inferences, 151.141 CPU in 151.334 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 151.875 CPU in 152.139 seconds (100% CPU, 4570074 Lips)

For the version as posted above in this answer

Version: 2

:- set_prolog_flag(double_quotes,chars).

% 773,260,002 inferences, 77.469 CPU in 77.543 seconds (100% CPU, 9981573 Lips)
% 773,260,001 inferences, 77.344 CPU in 77.560 seconds (100% CPU, 9997705 Lips)
% 773,260,001 inferences, 77.406 CPU in 77.629 seconds (100% CPU, 9989633 Lips)
% 773,260,001 inferences, 77.891 CPU in 77.967 seconds (100% CPU, 9927511 Lips)
% 773,260,001 inferences, 78.422 CPU in 78.644 seconds (100% CPU, 9860259 Lips)

Version 2 gives a dramatic improvement by using indexing from Version 1.

In doing further research on the code, upon looking at op_token which is DCG and has two hidden variables for implicitly passing around a state representation, using listing/1 showed:

op_token(tokUnderscore,['_'|A], A).

Notice that the first parameter is not the character being searched and that in this answer the indexing code is written as

c_digit(0'0,0).

where the first parameter is the character being searched and the second parameter is the result.

So change this

op_token(Token), !

to this

[S], { special_character_indexed(S,Token) }

with indexed clauses as

special_character_indexed( ';' ,tokSemicolon).

Version: 3

:- set_prolog_flag(double_quotes,chars).

% 765,800,002 inferences, 74.125 CPU in 74.348 seconds (100% CPU, 10331197 Lips)
% 765,800,001 inferences, 74.766 CPU in 74.958 seconds (100% CPU, 10242675 Lips)
% 765,800,001 inferences, 74.734 CPU in 74.943 seconds (100% CPU, 10246958 Lips)
% 765,800,001 inferences, 74.828 CPU in 75.036 seconds (100% CPU, 10234120 Lips)
% 765,800,001 inferences, 74.547 CPU in 74.625 seconds (100% CPU, 10272731 Lips)

Version 3 gives a slightly better but consistently better result than Version 2.

Lastly just changing double_quotes flag to atom as noted in a comment by AntonDanilov

Version: 4

:- set_prolog_flag(double_quotes,atom).

% 765,800,003 inferences, 84.234 CPU in 84.539 seconds (100% CPU, 9091300 Lips)
% 765,800,001 inferences, 74.797 CPU in 74.930 seconds (100% CPU, 10238396 Lips)
% 765,800,001 inferences, 75.125 CPU in 75.303 seconds (100% CPU, 10193677 Lips)
% 765,800,001 inferences, 75.078 CPU in 75.218 seconds (100% CPU, 10200042 Lips)
% 765,800,001 inferences, 75.031 CPU in 75.281 seconds (100% CPU, 10206414 Lips)

Version 4 is almost the same as Version 3.

Just looking at CPU numbers, using indexing is faster, e.g. (Version: 1) 151.875 vs (Version: 3) 74.547

回答2:

One thing it means is that this is silly code:

token(T) -->
    ( "1", !, { T = one }
    ; "2", !, { T = two }
    ; "3", !, { T = three }
    )

This is less silly code:

token(T) --> one_two_three(T).

one_two_three(one) --> "1".
one_two_three(two) --> "2".
one_two_three(three) --> "3".

But still not so good. Maybe better:

token(T) --> [X], { one_two_three(X, T) }.

one_two_three(0'1, one).
one_two_three(0'2, two).
one_two_three(0'3, three).

Last example also starts to look silly but remember that now you have indexing on first argument. You read once, no choice point, no backtrack.

But if you want to really know how to write efficient you need to measure where the time and space goes. Have you measured?

But if you really want to know how to fix you maybe read "Craft of Prolog", I do not understand all of this book but I remember it had big section on DCG.

But if you really want to parse such formats large files maybe find existing libraries in other languages, it might be much faster than fastest Prolog.