Parsing strings: extracting words and phrases [Ja

Posted 2019-01-12 03:47

I need to support exact phrases (enclosed in quotes) in an otherwise space-separated list of terms. Splitting the string on the space character is therefore no longer sufficient.

Example:

input : 'foo bar "lorem ipsum" baz'
output: ['foo', 'bar', 'lorem ipsum', 'baz']

I wonder whether this could be achieved with a single RegEx, rather than performing complex parsing or split-and-rejoin operations.

Any help would be greatly appreciated!

10 answers
孤傲高冷的网名
#2 · 2019-01-12 04:12

Here's a general solution that's easy to understand. It works for any delimiter and any 'join' character, and it supports 'joined' phrases of more than two words, i.e. lists like

"hello my name is 'jon delaware smith fred' I have a 'long name'"....

A bit like the answer by AC but a bit neater...

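function split(input, delimiter, joiner){
    var output = [];
    var joint = []; // pieces of the phrase currently being collected
    input.split(delimiter).forEach(function(element){
        // A phrase is open and this piece ends with the joiner: close the phrase.
        if (joint.length > 0 && element.indexOf(joiner) === element.length - 1)
        {
            output.push(joint.join(delimiter) + delimiter + element);
            joint = [];
        }
        // A phrase is open, or this piece starts with the joiner: collect it.
        if (joint.length > 0 || element.indexOf(joiner) === 0)
        {
            joint.push(element);
        }
        // Standalone word outside any phrase.
        if (joint.length === 0 && element.indexOf(joiner) !== element.length - 1)
        {
            output.push(element);
        }
    });
    return output; // phrases keep their surrounding joiner characters
}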
地球回转人心会变
#3 · 2019-01-12 04:13
var str = 'foo bar "lorem ipsum" baz';  
var results = str.match(/("[^"]+"|[^"\s]+)/g);

... returns the array you're looking for.
Note, however:

  • The bounding quotes are included; remove them with replace(/^"([^"]+)"$/,"$1") on each result.
  • Whitespace between the quotes stays intact. So, if there are three spaces between lorem and ipsum, they'll be in the result. You can fix this by running replace(/\s+/g," ") on each result.
  • If there's no closing " after ipsum (i.e. an incorrectly-quoted phrase) you'll end up with: ['foo', 'bar', 'lorem', 'ipsum', 'baz']
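
The match and the two clean-up steps from the notes above can be combined into one helper. This is only a minimal sketch: the name parseTerms is made up for illustration, and the g flag on the whitespace regex is an assumption so that every run of spaces inside a phrase is collapsed, not just the first one.

```javascript
// Hypothetical helper (the name is illustrative, not from the answer above).
function parseTerms(str) {
    // match() returns null when nothing matches, so fall back to [].
    var matches = str.match(/"[^"]+"|[^"\s]+/g) || [];
    return matches.map(function (term) {
        return term
            .replace(/^"([^"]+)"$/, "$1") // strip the bounding quotes
            .replace(/\s+/g, " ");        // collapse whitespace runs
    });
}

console.log(parseTerms('foo bar "lorem   ipsum" baz'));
// → [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
```

An unclosed quote is still skipped rather than reported, so parseTerms('foo "lorem ipsum') yields the three separate words.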
Melony?
#4 · 2019-01-12 04:14

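How about:

output = input.match(/".+?"|\w+/g)

then do a pass on output to strip the quotes.

Alternatively, use capture groups and call exec repeatedly (with the g flag, each call returns the next match, or null at the end):

re = /"(.+?)"|(\w+)/g

then keep whichever capture is non-empty in each match.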
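
Since exec returns only one match per call (the full match followed by its captures), collecting every term takes a loop. Here is a minimal sketch using the capturing pattern; the function name extractTerms is hypothetical, and only the regex comes from this answer.

```javascript
// Hypothetical helper; only the regex is taken from the answer above.
function extractTerms(input) {
    var re = /"(.+?)"|(\w+)/g;
    var output = [];
    var m;
    // With the g flag, each exec() call resumes at re.lastIndex
    // and returns null once there are no more matches.
    while ((m = re.exec(input)) !== null) {
        // Exactly one of the two captures is defined per match.
        output.push(m[1] !== undefined ? m[1] : m[2]);
    }
    return output;
}

console.log(extractTerms('foo bar "lorem ipsum" baz'));
// → [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
```

Picking the defined capture removes the quotes and the empty captures in one step, so no second pass is needed.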

疯言疯语
#5 · 2019-01-12 04:14

Thanks a lot for the quick responses!

Here's a summary of the options, for posterity:

var input = 'foo bar "lorem ipsum" baz';

output = input.match(/("[^"]+"|[^"\s]+)/g);
output = input.match(/"[^"]*"|\w+/g);
output = input.match(/("[^"]*")|([^\s"]+)/g);
output = input.match(/(".+?"|\w+)/g);
output = Array.from(input.matchAll(/"(.+?)"|(\w+)/g), function(m) { return m[1] || m[2]; });
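
These patterns are not interchangeable on every input. The sketch below uses a made-up sample string to show two differences: \w+ splits terms at characters such as hyphens while [^"\s]+ keeps them whole, and "[^"]*" accepts an empty pair of quotes while "[^"]+" does not.

```javascript
var sample = 'foo-bar "lorem ipsum" "" baz'; // illustrative input, not from the thread

var byWord = sample.match(/"[^"]*"|\w+/g);
var byNonSpace = sample.match(/("[^"]+"|[^"\s]+)/g);

console.log(byWord);
// → [ 'foo', 'bar', '"lorem ipsum"', '""', 'baz' ]
console.log(byNonSpace);
// → [ 'foo-bar', '"lorem ipsum"', 'baz' ]
```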

For the record, here's the abomination I had come up with:

var input = 'foo bar "lorem ipsum" "dolor sit amet" baz';
var terms = input.split(" ");

var items = [];
var buffer = [];
for(var i = 0; i < terms.length; i++) {
    if(terms[i].indexOf('"') != -1) { // outer phrase fragment -- N.B.: assumes quote is either first or last character
        if(buffer.length === 0) { // beginning of phrase
            //console.log("start:", terms[i]);
            buffer.push(terms[i].slice(1));
        } else { // end of phrase
            //console.log("end:", terms[i]);
            buffer.push(terms[i].slice(0, -1));
            items.push(buffer.join(" "));
            buffer = [];
        }
    } else if(buffer.length != 0) { // inner phrase fragment
        //console.log("cont'd:", terms[i]);
        buffer.push(terms[i]);
    } else { // individual term
        //console.log("standalone:", terms[i]);
        items.push(terms[i]);
    }
    //console.log(items, "\n", buffer);
}
items = items.concat(buffer);

//console.log(items);
爱情/是我丢掉的垃圾
#6 · 2019-01-12 04:24

ES6 solution supporting:

  • Splits on spaces, except inside quotes
  • Removes quotes, except backslash-escaped ones
  • Escaped quotes become plain quotes

Code:

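input.match(/\\?.|^$/g).reduce((p, c) => {
    if (c === '"') {
        p.quote ^= 1;   // toggle the "inside quotes" flag
    } else if (!p.quote && c === ' ') {
        p.a.push('');   // unquoted space: start a new term
    } else {
        p.a[p.a.length - 1] += c.replace(/\\(.)/, "$1");   // append, unescaping \x to x
    }
    return p;
}, {a: ['']}).a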

Output:

[ 'foo', 'bar', 'lorem ipsum', 'baz' ]
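
To see the escaping behaviour listed above, the expression can be wrapped in a function and fed an input with backslash-escaped quotes. This is a sketch only: the name splitTerms and the sample string are made up for illustration.

```javascript
// Hypothetical wrapper around the reduce-based expression above.
const splitTerms = (input) =>
    input.match(/\\?.|^$/g).reduce((p, c) => {
        if (c === '"') {
            p.quote ^= 1;
        } else if (!p.quote && c === ' ') {
            p.a.push('');
        } else {
            p.a[p.a.length - 1] += c.replace(/\\(.)/, '$1');
        }
        return p;
    }, { a: [''] }).a;

// In source code the backslashes themselves must be escaped,
// so the actual input here is: say \"hi\" "lorem ipsum"
console.log(splitTerms('say \\"hi\\" "lorem ipsum"'));
// → [ 'say', '"hi"', 'lorem ipsum' ]
```

The escaped quotes survive as literal quote characters and do not start a phrase, while the unescaped pair groups lorem ipsum and is dropped from the result.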
疯言疯语
#7 · 2019-01-12 04:25

If you are just wondering how to build the regex yourself, you might want to check out Expresso. It's a great tool for learning to build regular expressions, so you know what the syntax means.

Once you've built your own expression, you can pass it to .match on the input string.
