Javascript and Regex: Get index of a captured stri

Posted 2019-09-17 21:55

Question:

Here is my problem:

  • I have a regular expression that contains one, and only one, capture group.
  • This regular expression cannot be changed.
  • I have a string that will be matched against this regular expression.
  • The regex matches the complete string; it is not a search. If the regex cannot be matched against the string, the function fails before reaching this step.

=> I want to get the position of the captured substring in the string, and its length.

Example:

If my regex is

^.*?\/F?L?(\d+)$

my string is

"( 413) 250/FL250"

I want to get 14, and 3.

In those conditions, search would return 1.

This is a simple example, and the regex can be extremely complex, but the principle is always the same: there is one and only one capture group, and I need the position of the captured string within the main one.

Thanks a lot for your help, I'm stuck.

EDIT:

So I made something with Ant (our base work environment is Ant), which consists of getting the leftContext of the capture group and then determining its size. To get the leftContext, I simply move the parentheses of the capture group to the left part. Ex: \d(\s) becomes (\d)\s.
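The idea can be illustrated in JavaScript on the example from the question. This is only a sketch: the moved regex is written by hand here, and the helper name positionFromLeftContext is hypothetical. It assumes both regexes match the same way, which holds when the pattern has no backreferences.

```javascript
const text = "( 413) 250/FL250";
const original = /^.*?\/F?L?(\d+)$/;  // group 1 is the captured digits
const moved = /^(.*?\/F?L?)\d+$/;     // group 1 is now the left context

// Hypothetical helper: the length of the left context is the 0-based
// start index of the captured text (offset by where the match begins).
function positionFromLeftContext(input, originalRe, movedRe) {
  const o = originalRe.exec(input);
  const m = movedRe.exec(input);
  if (o === null || m === null) return null;
  return { start: m.index + m[1].length, length: o[1].length };
}

const pos = positionFromLeftContext(text, original, moved);
console.log(pos.start, pos.start + 1, pos.length); // 13 14 3
```

The 0-based start is 13; counting from 1, as the Ant macro does, gives the 14 expected in the question.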

So I have a question about it:

<macrodef name="Get_CaptureGroup_Position" >
    <attribute name="text" />
    <attribute name="mask" />
    <attribute name="start" />
    <attribute name="end" />
    <sequential>

        <var name="_GMLCS_modified_regex"       unset="true"/>
        <var name="_GMLCS_leftContext"          unset="true"/>
        <var name="_GMLCS_leftContext_len"      unset="true"/>
        <var name="_GMLCS_CapturedGroup"        unset="true"/>
        <var name="_GMLCS_CapturedGroup_len"    unset="true"/>

        <propertyregex property="_GMLCS_modified_regex" override="yes"  input="@{mask}" regexp="(.*[^\\])\)([^?].*)" replace="\1\2" />  
        <propertyregex property="_GMLCS_modified_regex" override="yes" input="${_GMLCS_modified_regex}" regexp="(.*[^\\])\(([^?].*)" replace="\1)\2" />
        <var name="_GMLCS_modified_regex" value="(${_GMLCS_modified_regex}" />

        <propertyregex property="_GMLCS_leftContext"    override="yes" input="@{text}" regexp="${_GMLCS_modified_regex}" select="\1" />
        <propertyregex property="_GMLCS_CapturedGroup"  override="yes" input="@{text}" regexp="@{mask}" select="\1" />

        <getAttributeLength text="${_GMLCS_leftContext}"    property="_GMLCS_leftContext_len" />
        <getAttributeLength text="${_GMLCS_CapturedGroup}"  property="_GMLCS_CapturedGroup_len" />

        <math result="_GMLCS_leftContext_len"   operation="+" operand1="${_GMLCS_leftContext_len}" operand2="1" />
        <math result="_GMLCS_CapturedGroup_len" operation="+" operand1="${_GMLCS_leftContext_len}" operand2="${_GMLCS_CapturedGroup_len}" />

        <var name="@{start}" value="${_GMLCS_leftContext_len}" />
        <var name="@{end}" value="${_GMLCS_CapturedGroup_len}" />

        <var name="_GMLCS_modified_regex"       unset="true"/>
        <var name="_GMLCS_leftContext"          unset="true"/>
        <var name="_GMLCS_leftContext_len"      unset="true"/>
        <var name="_GMLCS_CapturedGroup"        unset="true"/>
        <var name="_GMLCS_CapturedGroup_len"    unset="true"/>
    </sequential>
</macrodef>

My question is this: when I pass this regex:

(?:A|.*)/F?L?(\d+)\s*\d*(?:A|.*)

I get:

First property regex:

(?:A|.*)/F?L?(\d+\s*\d*(?:A|.*) = CORRECT

Second property regex:

(?:A|.*)/F?L?)\d+\s*\d*(?:A|.*) = CORRECT

Var:

((?:A|.*)/F?L?)\d+\s*\d*(?:A|.*) = CORRECT

Start and End: 7 and 10 = CORRECT.

This is actually correct, but I believe it should not be. My question is: why were the ")" at the end of the (?:...) blocks not removed?
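A likely explanation is that the leading .* in each propertyregex is greedy, so each pass edits only the last eligible parenthesis. The final ")" is not eligible because [^?] requires a character after it, and every "(?:" is skipped because its "(" is followed by "?". The sketch below replays the three steps in JavaScript, assuming Ant's regex engine backtracks the same way and replaces a single match.

```javascript
const mask = String.raw`(?:A|.*)/F?L?(\d+)\s*\d*(?:A|.*)`;

// First propertyregex: the greedy .* backtracks from the end of the mask,
// so the LAST ")" that is followed by a character other than "?" is dropped.
const step1 = mask.replace(/(.*[^\\])\)([^?].*)/, "$1$2");

// Second propertyregex: likewise, the LAST "(" not followed by "?"
// is turned into ")".
const step2 = step1.replace(/(.*[^\\])\(([^?].*)/, "$1)$2");

// Var: open the new group at the very start.
const step3 = "(" + step2;

console.log(step1); // (?:A|.*)/F?L?(\d+\s*\d*(?:A|.*)
console.log(step2); // (?:A|.*)/F?L?)\d+\s*\d*(?:A|.*)
console.log(step3); // ((?:A|.*)/F?L?)\d+\s*\d*(?:A|.*)
```

The ")" of the first (?:A|.*) is kept only because a later ")" wins the greedy search, so a mask whose last eligible ")" belongs to a non-capturing group would be rewritten incorrectly.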

Answer 1:

Here is the final answer we have for our issue. It's done in Ant, but I think it can be transposed to JavaScript:

<macrodef name="Get_CaptureGroup_Position" >
<attribute name="text" />
<attribute name="mask" />
<attribute name="start" />
<attribute name="end" />
<sequential>

    <var name="_GMLCS_modified_regex"       unset="true"/>
    <var name="_GMLCS_leftContext"          unset="true"/>
    <var name="_GMLCS_leftContext_len"      unset="true"/>
    <var name="_GMLCS_CapturedGroup"        unset="true"/>
    <var name="_GMLCS_CapturedGroup_len"    unset="true"/>

    <propertyregex property="_GMLCS_modified_regex" override="yes" input="@{mask}" regexp="^((?:|(?:[^\\]|\\.)*))\(([^?].*)$" replace="(\1\2" />

    <propertyregex property="_GMLCS_leftContext"    override="yes" input="@{text}" regexp="${_GMLCS_modified_regex}" select="\1" />
    <propertyregex property="_GMLCS_CapturedGroup"  override="yes" input="@{text}" regexp="@{mask}" select="\1" />

    <getAttributeLength text="${_GMLCS_leftContext}"    property="_GMLCS_leftContext_len" />
    <getAttributeLength text="${_GMLCS_CapturedGroup}"  property="_GMLCS_CapturedGroup_len" />

    <math result="@{start}" operation="-" operand1="${_GMLCS_leftContext_len}" operand2="${_GMLCS_CapturedGroup_len}" datatype="int"/>
    <math result="@{start}" operation="+" operand1="${@{start}}" operand2="1" datatype="int"/>
    <var name="@{end}" value="${_GMLCS_leftContext_len}" />

    <var name="_GMLCS_modified_regex"       unset="true"/>
    <var name="_GMLCS_leftContext"          unset="true"/>
    <var name="_GMLCS_leftContext_len"      unset="true"/>
    <var name="_GMLCS_CapturedGroup"        unset="true"/>
    <var name="_GMLCS_CapturedGroup_len"    unset="true"/>
</sequential>
</macrodef>
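A possible JavaScript transposition of this macro is sketched below. The function name captureGroupPosition is hypothetical. The sketch assumes the mask has exactly one capture group, no "(" inside a character class, no backreferences, and that nothing follows the capture group inside an enclosing non-capturing group. It returns a 0-based start where the macro counts from 1.

```javascript
// Hypothetical helper mirroring the Ant macro above.
function captureGroupPosition(text, mask) {
  // Same rewrite as the propertyregex: drop the "(" that opens the capture
  // group and open a new group at the start of the pattern, so group 1
  // becomes "left context + captured text".
  const rewritten = mask.replace(
    /^((?:|(?:[^\\]|\\.)*))\(([^?].*)$/,
    "($1$2"
  );
  const widened = new RegExp(rewritten).exec(text);
  const original = new RegExp(mask).exec(text);
  if (widened === null || original === null) return null;

  const length = original[1].length;
  // End of the widened group (exclusive) minus the captured length.
  const start = widened.index + widened[1].length - length;
  return { start, length };
}

console.log(captureGroupPosition("( 413) 250/FL250", String.raw`^.*?\/F?L?(\d+)$`));
// { start: 13, length: 3 }
```

Adding 1 to start gives the 1-based value that the macro stores in @{start}.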



Answer 2:

Getting the length is trivial, as the two methods below show, but in the general case it is impossible to get the start and end index of the text captured by a capturing group.

The first method with String.match, for non-global RegExp only:

// reNonGlobal can be a variable containing RegExp object
// or a RegExp object directly specified.
var result = inputString.match(reNonGlobal);

if (result != null) {
    console.log(result[groupNumber].length);
}

The second method with RegExp.exec, for any RegExp:

var arr;
// The RegExp object must be assigned to a variable
var re = ...;

if (re.global) {
    while ((arr = re.exec(inputString)) != null) {
        console.log(arr[groupNumber].length);

        // lastIndex is not advanced when empty string is matched
        // Need to manually advance it to prevent infinite loop
        if (arr[0].length == 0) {
            re.lastIndex += 1;
        }
    }
} else {
    if ((arr = re.exec(inputString)) != null) {
        console.log(arr[groupNumber].length);
    }
}

Using indexOf (or any other method) to locate the index of the captured text is unreliable and depends on the particular regex and/or input.
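The string from the question shows the problem: the captured text "250" also occurs earlier in the input, so indexOf reports the wrong place. The sketch below also shows one way around it, assuming a runtime that implements the ES2022 d (hasIndices) flag. The helper name groupSpan is hypothetical, and it copies the regex with the extra flag rather than changing the pattern itself.

```javascript
const text = "( 413) 250/FL250";
const re = /^.*?\/F?L?(\d+)$/;

const captured = re.exec(text)[1];      // "250"
const guessed = text.indexOf(captured); // 7: the first "250", not the captured one

// Hypothetical helper; requires engine support for the "d" flag.
function groupSpan(regex, input, group) {
  const flags = regex.flags.includes("d") ? regex.flags : regex.flags + "d";
  const m = new RegExp(regex.source, flags).exec(input);
  if (m === null || m.indices[group] === undefined) return null;
  const [start, end] = m.indices[group]; // end is exclusive
  return { start, length: end - start };
}

const span = groupSpan(re, text, 1);
console.log(guessed, span); // 7 { start: 13, length: 3 }
```

On engines without the d flag, the RegExp constructor throws a SyntaxError for the unknown flag, so the limitation described above still applies there.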