phpBB BBCode to HTML (regex or otherwise)

Posted 2019-08-13 01:20

I'm in the process of migrating content from phpBB to WordPress. I have succeeded up to the point of translating the BBCode into HTML.

The BBCode is complicated by an alphanumeric string that is injected into each tag.

A common post will contain text like so...

[url=url] Click here [/url:583ow9wo]

[b:583ow9wo] BOLD [/b:583ow9wo]

[img:583ow9wo] jpg [/img:583ow9wo]

I am inexperienced with regular expressions but believe they may be a way out. I found some help in the following post, https://stackoverflow.com/a/5505874/4356865 (use regex [/?b:\d{5}] ), but that regex only handles a numeric suffix, whereas the string in my tags is alphanumeric.

Any help appreciated.

1 answer
家丑人穷心不美
Reply #2 · 2019-08-13 02:13

Something like this will work for tags that have no attributes:

\[(b|i|u)(:[a-z0-9]+)?\](.*?)\[\/\1(?:\2)?\]

\[               -- matches literal "[" 
  (b|i|u)        -- matches b, i, or u, captures as backreference 1
  (:[a-z0-9]+)?  -- matches colon and then alphanumeric string, captures as backreference 2
                 -- the question mark allows the :string not to be present.
\]               -- matches literal "]"
(.*?)            -- matches anything*, as few times as required to finish the match, creates backreference 3.
\[               -- matches literal "["
  \/             -- matches literal "/"
  \1             -- invokes backreference 1 to make sure the opening/closing tags match
  (?:\2)?        -- invokes backreference 2 to further make sure it's the same tag
\]               -- matches literal "]"
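
The breakdown above maps closely onto a "verbose" regex. As a minimal sketch, assuming Python as the host language (the name TAG_RE and the sample strings are made up for illustration and are not part of the question), this shows what each backreference ends up holding:

```python
import re

# Same pattern as above, written with re.VERBOSE so that whitespace and
# "#" comments inside the pattern are ignored.
TAG_RE = re.compile(r"""
    \[ (b|i|u)          # group 1: tag name
       (:[a-z0-9]+)?    # group 2: optional ":" + alphanumeric string
    \]
    (.*?)               # group 3: tag contents, as short as possible
    \[ / \1 (?:\2)? \]  # closing tag must repeat the name and the string
""", re.VERBOSE)

with_id = TAG_RE.search("[b:583ow9wo] BOLD [/b:583ow9wo]")
print(with_id.groups())     # ('b', ':583ow9wo', ' BOLD ')

without_id = TAG_RE.search("[i]text[/i]")
print(without_id.groups())  # ('i', None, 'text')

# Mismatched tag names are rejected because of the \1 backreference.
print(TAG_RE.search("[b:583ow9wo]text[/i:583ow9wo]"))  # None
```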

Matching a tag like url is easy enough too. Tags that have attributes each do different things with those attributes, so it's probably easier to handle a tag like URL separately from a tag like IMG.

\[(url)(?:\s*=\s*(.*?))?(:[a-z0-9]+)\](.*?)\[\/\1(?:\3)?\]

\[                    -- matches literal "["
  (url)               -- matches literal "url", in parentheses so we can invoke backreference 1 later, easier for you to modify
  (?:                 -- ?: signifies a non-capturing group, so it creates a group without creating a backreference, or altering the backreference count.
    \s*=\s*           -- matches literal "=", padded by any amount of whitespace on either side
    (.*?)             -- matches any character, as few times as possible, to complete the match, creates backreference 2
  )                   -- closes the noncapturing group
  (:[a-z0-9]+)        -- matches the colon and alphanumeric string as backreference 3 (required here, unlike in the b/i/u pattern).
\]                    -- matches literal "]"
(.*?)                 -- matches any character as few times as possible to complete the match, backreference 4
\[                    -- matches literal "["
  \/                  -- matches literal "/"
  \1                  -- invokes backreference 1
  (?:\3)?             -- invokes backreference 3
\]                    -- matches literal "]"

For your replacing, the contents of the tags are in backreferences themselves so you can do something like this for the b/i/u tags.

<\1>\3</\1>
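
As a rough sketch of how that replacement could be applied, assuming Python's re module (the function name convert_simple_tags is hypothetical): a single substitution pass does not rescan text it has already replaced, so a tag nested inside another tag would be left behind. Repeating the substitution until nothing changes is one simple way around that.

```python
import re

TAG_RE = re.compile(r"\[(b|i|u)(:[a-z0-9]+)?\](.*?)\[\/\1(?:\2)?\]")

def convert_simple_tags(text):
    """Replace [b]/[i]/[u] BBCode with HTML, repeating until nothing changes
    so that tags nested inside other tags are converted too."""
    while True:
        converted = TAG_RE.sub(r"<\1>\3</\1>", text)
        if converted == text:
            return converted
        text = converted

print(convert_simple_tags("[b:583ow9wo] BOLD [/b:583ow9wo]"))
# <b> BOLD </b>
print(convert_simple_tags("[b:583ow9wo][i:583ow9wo]both[/i:583ow9wo][/b:583ow9wo]"))
# <b><i>both</i></b>
```

Each pass removes at least one pair of bracketed tags, so the loop always terminates.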

For the url tag, it's something like this

<A href="\2">\4</A>
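
A minimal sketch of the url replacement, again assuming Python. The helper url_to_anchor, the example.com address, the fallback to the tag contents when there is no "=..." attribute, and the HTML-escaping of the href are all assumptions added for illustration. A plain template such as r'<a href="\2">\4</a>' would also work when the attribute is always present.

```python
import re
from html import escape

URL_RE = re.compile(r"\[(url)(?:\s*=\s*(.*?))?(:[a-z0-9]+)\](.*?)\[\/\1(?:\3)?\]")

def url_to_anchor(match):
    # Group 2 is the "=..." attribute. When it is absent, fall back to the
    # tag contents as the link target (an assumption about the intent).
    href = match.group(2) or match.group(4).strip()
    return '<a href="{}">{}</a>'.format(escape(href, quote=True), match.group(4))

post = "[url=https://example.com/page:583ow9wo]Click here[/url:583ow9wo]"
print(URL_RE.sub(url_to_anchor, post))
# <a href="https://example.com/page">Click here</a>

bare = "[url:583ow9wo]https://example.com[/url:583ow9wo]"
print(URL_RE.sub(url_to_anchor, bare))
# <a href="https://example.com">https://example.com</a>
```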

In several places I say that the dot/period matches any character. Strictly speaking, it matches any character except a newline. You can make it match newlines as well by adding the "dotall" modifier s, like this:

/(.*)<foo>/s
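
In Python, which has no /.../s literal syntax, the equivalent is the re.DOTALL flag (or an inline (?s) at the start of the pattern). A small sketch with a made-up two-line post shows the difference:

```python
import re

text = "[b:583ow9wo]line one\nline two[/b:583ow9wo]"
pattern = r"\[(b|i|u)(:[a-z0-9]+)?\](.*?)\[\/\1(?:\2)?\]"

# Without DOTALL, "." stops at the newline, so the tag is not matched.
print(re.search(pattern, text))  # None

# With DOTALL, "." also matches "\n" and the whole tag is found.
multi = re.search(pattern, text, re.DOTALL)
print(repr(multi.group(3)))  # 'line one\nline two'
```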