How to correctly reformat dimension values that ha

2019-07-16 12:10发布

I'm trying to create an Excel macro that formats dimension values into our company's format. That way we can easily import the data into our systems without having to do thousands of dimensions manually. There are a few problems I'm running into though:

  1. There are many variations of dimensions that suppliers send us, which makes it difficult for me to come up with some sort of regex to capture all of the values.
  2. Even if I were able to come up with some sort of regex to process the values, I'm not sure how I'd replace the values with the correct form since I'm not sure if it's possible to replace a regex match with a captured regex group value. If it is, I have not clue how I'd go about it with the situation.

Our company's standard format for Dimensions is as follows:

Can have up to 3 parameters per value

Attribute1:Value1:Unit1;Attribute2:Value2:Unit2;Attribute3:Value3:Unit3

Example: 1" L x 2" W x 3" H Translates to Length:1:in;Width:2:in;Height:3:in

Possible Values that are able to be used with it are:

  • Length
  • Width
  • Height
  • Arc
  • Area
  • Circumference
  • Depth
  • Dia
  • Thickness

Some of the variations I've noted in the past year include:

  • Length - L or L.
  • Width - W or W.
  • Height - H, H., Heigth
  • Circumference - Round
  • Depth - D, D., Deep
  • Dia - Diameter or Dia.
  • Thickness - Thick
  • in - inch, inches, in., ", ''(2 apostrophes)
  • ft - feet

A small sample of Product Dimensions (Note the inconsistencies):

3 3/4" Width x 2 1/2" Height
L 4 3/4" x W 1 1/2" x H 3"
3 1/2" W x 2 1/8" H x 2 7/8" D
3 5/8" W x 2 1/2" H x 5/8" Depth
3 3/4" W x 1" H
1 1/4" W x 3 1/4" H
2 3/8" Diameter
3" W x 2 1/2" H
2" W x 3" H
2 1/2" W x 2" H
1 3/8" W x 2 1/8" H
3 1/2" W x 3 1/2" H
1 1/2" W x 3" H
2" W x 1 7/8" H x 1 1/2" D
4 3/4" W x 3 1/2" H
4 3/4" W x 4" H x 1 1/4" D
3 1/2" W x 3 1/2" H x 3 1/2" D
3-1/2" W x 2-3/4" H 
3.5" W x 4" H
3" H
3 1/4" W x 2 1/4" H
4 7/16" W x 6 1/4" H
3 1/4" W x 3 1/4" H
5" W x 7" H

So far, I've come up with the regex (\d+(.| |/|)\d+((/)\d+|)|\d+), which appears to pick up all the numbers, but I'm not quite sure how I would go about finding all the different variations of the attributes and units. The only thing I thought might work would be lookbehinds & lookaheads, but I'm not well-versed enough with this flavor of regex to figure it.

Question 1: Is regex the best way to go about this task or is there a better way?

Question 2: The ultimate question, How can I accomplish this complex task or is it even remotely possible using vba?

1条回答
2楼-- · 2019-07-16 12:26

You can create a parser / render, the below example shows how that may be implemented in EBNF parser based on RegEx, put the code into standard VBA module:

Option Explicit

Private sBuffer As String
Private oTokens As Object
Private oRegEx As Object

Sub TestParserRender()

    Dim sScr As String
    Dim sResult As String

    sScr = ReadTextFile(ThisWorkbook.Path & "\Source.txt", -2)
    sResult = Parse(sScr)
    WriteTextFile sResult, ThisWorkbook.Path & "\Result.txt", -1

End Sub

Function Parse(ByVal sSample As String) As String

    ' Init
    sBuffer = sSample
    Set oTokens = CreateObject("Scripting.Dictionary")
    Set oRegEx = CreateObject("VBScript.RegExp")
    With oRegEx
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        ' Cast variations in attributes and units
        .Pattern = "\bL\.(?=\s|$)|\bL\b"
        sBuffer = .Replace(sBuffer, "Length")
        .Pattern = "\bW\.(?=\s|$)|\bW\b"
        sBuffer = .Replace(sBuffer, "Width")
        .Pattern = "\bH\.(?=\s|$)|\bH\b|\bHeigth\b"
        sBuffer = .Replace(sBuffer, "Height")
        .Pattern = "\bRound\b"
        sBuffer = .Replace(sBuffer, "Circumference")
        .Pattern = "\bD\.(?=\s|$)|\bD\b|\bDeep\b"
        sBuffer = .Replace(sBuffer, "Depth")
        .Pattern = "\bDia\.(?=\s|$)|\bDiameter\b"
        sBuffer = .Replace(sBuffer, "Dia")
        .Pattern = "\bThick\b"
        sBuffer = .Replace(sBuffer, "Thickness")
        .Pattern = "(?:\""|'')(?=\s|$)"
        sBuffer = .Replace(sBuffer, " in")
        .Pattern = "\binch\b|\binches\b|\bin\.(?=\s|$)"
        sBuffer = .Replace(sBuffer, "in")
        .Pattern = "\bfeet\b"
        sBuffer = .Replace(sBuffer, "ft")
        ' Tokenize instances
        .Pattern = "<\d+[savedpun]>"
        Tokenize "e" ' Escape reserved sequence
        .Pattern = "\b(?:\d+((?:[ -]\d+)?(?:\/|\.)\d+)?)(?=\D)"
        Tokenize "n" ' Number
        .Pattern = "\b(?:Length|Width|Height|Arc|Area|Circumference|Depth|Dia|Thickness)\b"
        Tokenize "a" ' Attribute
        .Pattern = "\b(?:in|ft)\b"
        Tokenize "u" ' Units
        .Pattern = "<\d+n>[ \t]*<\d+u>"
        Tokenize "v" ' Number + Unit = Value
        .Pattern = "(<\d+v>)([ \t]*)(<\d+a>)"
        sBuffer = .Replace(sBuffer, "$3$2$1") ' Swap Value + Attribute = Attribute + Value
        .Pattern = "<\d+a>[ \t]*<\d+v>"
        Tokenize "p" ' Attribute + Value = Parameter
        .Pattern = "^[ \t]*<\d+p>(?:[ \t]*X[ \t]*<\d+p>){0,2}[ \t]*$"
        Tokenize "d" ' Parameter X Parameter X Parameter = Dimension
        .MultiLine = False
        .Pattern = "^(?:\r\n)*<\d+d>(?:(?:\r\n)+<\d+d>)*(?:\r\n)*$"
        Tokenize "s" ' Dimension * N times = Structure
        .Pattern = "^<\d+s>$" ' Top level Structure single element
        If .Test(sBuffer) And oTokens.Exists(sBuffer) Then
            Parse = Retrieve(sBuffer) ' Render if success
        Else
            MsgBox "Parsing failed"
            .Pattern = "^([\s\S]+?)(<\d+[savedpun]>)"
            sBuffer = .Replace(sBuffer, "[$1]$2") ' Put failed from begin in brackets
            .Pattern = "(<\d+[savedpun]>)([\s\S]+?)(?=<\d+[savedpun]>|$)"
            sBuffer = .Replace(sBuffer, "$1[$2]") ' Put failed between tokens in brackets
            .Pattern = "\[\r\n\]"
            sBuffer = .Replace(sBuffer, vbCrLf) ' Recover dummy new lines in brackets
            .Global = False
            .Pattern = "<\d+[savedpun]>" ' Retrieve the rest tokens
            Do
                With .Execute(sBuffer)
                    If .Count = 0 Then Exit Do
                    sBuffer = Replace(sBuffer, .Item(0).value, oTokens(.Item(0).value))
                End With
            Loop
            Parse = sBuffer
        End If
    End With
    Set oTokens = Nothing
    Set oRegEx = Nothing

End Function

Private Sub Tokenize(sType)

    Dim aContent() As String
    Dim lCopyIndex As Long
    Dim i As Long
    Dim sKey As String

    With oRegEx.Execute(sBuffer)
        If .Count = 0 Then Exit Sub
        ReDim aContent(0 To .Count - 1)
        lCopyIndex = 1
        For i = 0 To .Count - 1
            With .Item(i)
                sKey = "<" & oTokens.Count & sType & ">"
                oTokens(sKey) = .value
                aContent(i) = Mid(sBuffer, lCopyIndex, .FirstIndex - lCopyIndex + 1) & sKey
                lCopyIndex = .FirstIndex + .Length + 1
            End With
        Next
    End With
    sBuffer = Join(aContent, "") & Mid(sBuffer, lCopyIndex, Len(sBuffer) - lCopyIndex + 1)

End Sub

Private Function Retrieve(sTokenKey As String) As String

    Dim sTokenValue As String
    Dim aTokens() As String
    Dim i As Long
    Dim aContent() As String

    sTokenValue = oTokens(sTokenKey)
    Select Case Left(Right(sTokenKey, 2), 1)
        Case "s", "d"
            aTokens = Split(sTokenValue, "<")
            ReDim aContent(UBound(aTokens) - 1)
            For i = 1 To UBound(aTokens)
                aContent(i - 1) = Retrieve("<" & Split(aTokens(i), ">", 2)(0) & ">")
            Next
            Retrieve = Join(aContent, IIf(Left(Right(sTokenKey, 2), 1) = "s", vbCrLf, ";"))
        Case "p", "v"
            aTokens = Split(sTokenValue, "<")
            Retrieve = _
                Retrieve("<" & Split(aTokens(1), ">", 2)(0) & ">") & _
                ":" & _
                Retrieve("<" & Split(aTokens(2), ">", 2)(0) & ">")
        Case "a", "u", "n"
            Retrieve = sTokenValue
    End Select

End Function

Function ReadTextFile(sPath As String, lFormat As Long) As String
    ' lFormat -2 - System default, -1 - Unicode, 0 - ASCII
    With CreateObject("Scripting.FileSystemObject").OpenTextFile(sPath, 1, False, lFormat)
        ReadTextFile = ""
        If Not .AtEndOfStream Then ReadTextFile = .ReadAll
        .Close
    End With
End Function

Sub WriteTextFile(sContent As String, sPath As String, lFormat As Long)
    With CreateObject("Scripting.FileSystemObject").OpenTextFile(sPath, 2, True, lFormat)
        .Write (sContent)
        .Close
    End With
End Sub

Save a sample as ANSI or Unicode to text file Source.txt in the same folder as Excel file, and run TestParserRender(). The output will be saved to text file Result.txt. Processing starts from parsing. Variations in attributes and units are cast by RegEx replacement first. Then matched to the RegEx patterns parts are folded into tokens. Wrong Value + Attribute sequences are rectified with RegEx sub-matches swapping by replacement. At the end of the parsing the single top level Structure token should left, otherwise the error is raised. If parsing is failed, unrecognized parts are put into braces in output. If it's success then the reverse process of content retrieving with rendering continues up to the last token.

The parsing algorithm in outline can be represented by the EBNF grammar below (simplified, replacement not shown):

structure ::= ( "\n\r" )* dimension ( ( "\n\r" )+ dimension )* ( "\n\r" )*
dimension ::= ( " " | "\t" )* parameter ( ( " " | "\t" )+ "X" ( " " | "\t" )+ parameter )? ( ( " " | "\t" )+ "X" ( " " | "\t" )+ parameter )? ( " " | "\t" )*
parameter ::= attribute ( " " | "\t" )* value
attribute ::= "\b" ( "Length" | "Width" | "Height" | "Arc" | "Area" | "Circumference" | "Depth" | "Dia" | "Thickness" ) "\b"
value ::= number ( " " | "\t" ) unit
number ::= digits ( ( ( ( ' ' | '-' ) digits )? '/' | '.' ) digits )?
digits ::= digit+
digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
unit ::= "\b" ( "in" | "ft" ) "\b"

and related diagram:

diagram

The output for the sample you provided is as follows:

Width:3 3/4:in;Height:2 1/2:in
Length:4 3/4:in;Width:1 1/2:in;Height:3:in
Width:3 1/2:in;Height:2 1/8:in;Depth:2 7/8:in
Width:3 5/8:in;Height:2 1/2:in;Depth:5/8:in
Width:3 3/4:in;Height:1:in
Width:1 1/4:in;Height:3 1/4:in
Dia:2 3/8:in
Width:3:in;Height:2 1/2:in
Width:2:in;Height:3:in
Width:2 1/2:in;Height:2:in
Width:1 3/8:in;Height:2 1/8:in
Width:3 1/2:in;Height:3 1/2:in
Width:1 1/2:in;Height:3:in
Width:2:in;Height:1 7/8:in;Depth:1 1/2:in
Width:4 3/4:in;Height:3 1/2:in
Width:4 3/4:in;Height:4:in;Depth:1 1/4:in
Width:3 1/2:in;Height:3 1/2:in;Depth:3 1/2:in
Width:3-1/2:in;Height:2-3/4:in
Width:3.5:in;Height:4:in
Height:3:in
Width:3 1/4:in;Height:2 1/4:in
Width:4 7/16:in;Height:6 1/4:in
Width:3 1/4:in;Height:3 1/4:in
Width:5:in;Height:7:in

BTW I used the same approach in VBA JSON parser.

查看更多
登录 后发表回答