RegEx for “fixing” e-mail headers, making them a s

Posted 2019-09-21 11:09

Question:

Possible Duplicate:
How to do unfolding RFC 822
Parsing e-mail-like headers (similar to RFC822)

I have some input data that is similar to e-mail data, in that long lines are wrapped to the next line. For example:

robot-useragent: ABCdatos BotLink/1.0.2 (test links)
robot-language: basic
robot-description: This robot is used to verify availability of the ABCdatos
                   directory entries (http://www.abcdatos.com), checking
                   HTTP HEAD. Robot runs twice a week. Under HTTP 5xx
                   error responses or unable to connect, it repeats
                   verification some hours later, verifiying if that was a
                   temporary situation.

The robot-description field is "too long" for one line, and is wrapped onto the following lines. To make this data easier to parse, I would like a RegEx that can be used with preg_replace() to make replacements under the following conditions:

  • Replace newline characters that are followed by whitespace
  • Do not replace newline characters that are followed by additional newline characters

Example output:

robot-description: This robot is used to verify availability of the ABCdatos directory entries (http://www.abcdatos.com), checking HTTP HEAD. Robot runs twice a week. Under HTTP 5xx error responses or unable to connect, it repeats verification some hours later, verifiying if that was a temporary situation.

I am new to RegEx. How can I build such an expression? If you choose to answer, please include a brief explanation of the components in the expression. I'd really like to learn how to do these.

I've started with this, and it is close: \n([^\S])* (see http://codepad.org/iMObpgFX)

Answer 1:

Maybe you could try:

(\r|\n)\s+

(\r|\n) # matches a single carriage return or a single newline character
\s+     # one or more whitespace characters (tabs, spaces, and also newlines)
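
As a quick check, here is a minimal sketch that applies this pattern to a shortened copy of the sample from the question. It uses Python's re module rather than preg_replace(), only because that is easy to run anywhere. The pattern text itself is unchanged, and the helper name unfold is made up for this illustration.

```python
import re

# Shortened sample from the question; continuation lines start with spaces.
text = (
    "robot-language: basic\n"
    "robot-description: This robot is used to verify availability of the ABCdatos\n"
    "                   directory entries (http://www.abcdatos.com), checking\n"
    "                   HTTP HEAD.\n"
)

# The pattern from this answer: one CR or LF, then one or more whitespace characters.
FOLD = re.compile(r"(\r|\n)\s+")

def unfold(s: str) -> str:
    """Replace each line break plus the indentation after it with one space.

    'unfold' is a hypothetical helper name, not part of any library.
    """
    return FOLD.sub(" ", s)

print(unfold(text))
```

The line break after "basic" is left alone because the next line starts with a non-whitespace character, while the wrapped description lines are joined with single spaces.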

Try it!
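
One caveat: \s also matches newline characters, so a blank line between two records is swallowed too, which conflicts with the second condition in the question. The sketch below compares the pattern above with a stricter variant that only accepts spaces and tabs after the line break. The variant \r?\n[ \t]+ is a suggestion made here, not something from the original answer, and the sample records are invented. It is written with Python's re module for easy testing; the same pattern text should carry over to preg_replace() once wrapped in delimiters.

```python
import re

LOOSE = re.compile(r"(\r|\n)\s+")    # pattern from the answer
STRICT = re.compile(r"\r?\n[ \t]+")  # assumed variant: only spaces/tabs after the break

# Two made-up records separated by a blank line; the second field is wrapped.
records = "robot-id: a\nrobot-note: first\n    second\n\nrobot-id: b\n"

loose = LOOSE.sub(" ", records)
strict = STRICT.sub(" ", records)

print(repr(loose))   # the blank line is gone, so both records run together
print(repr(strict))  # the wrapped line is joined, the blank line survives
```

In the strict pattern, \r?\n matches a Unix or Windows line ending, and [ \t]+ requires at least one space or tab, so a newline followed directly by another newline is never replaced.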