I have to parse some STEP files (ISO 10303-21) from different CAD systems, and each system structures them differently. These are the forms that might appear:
#95=STYLED_ITEM('',(#94),#92);
#12 = CARTESIAN_POINT ( 'NONE', ( 1.213489432997839200,
5.617300827691964000, -7.500000000000001800 ) ) ;
#263 = TEST ( 'Spaces must not be ignored here' ) ;
I thought that a regular expression would help me, so I created this one (http://rubular.com/r/EtJ25Hfg77):
(\#\d+)\s*=\s*([A-Z_]+)\s*\(\s*(.*)*\s*\)\s*;
This gives me:
Match 1:
1: #95
2: STYLED_ITEM
3:
Match 2:
1: #12
2: CARTESIAN_POINT
3:
Match 3:
1: #263
2: TEST
3:
So the first two groups work as intended. But I also need the attributes inside the parentheses, like this:
Match 1:
1: #95
2: STYLED_ITEM
3: ''
4: (#94)
5: #92
Match 2:
1: #12
2: CARTESIAN_POINT
3: 'NONE'
4: ( 1.213489432997839200, 5.617300827691964000, -7.500000000000001800 )
Match 3:
1: #263
2: TEST
3: 'Spaces must not be ignored here'
Please help me find the correct expression for the last group ((.*) at the moment).
JSDAI is a free and open source Java toolkit for working with STEP files, available under an AGPL license for non-commercial use.
http://www.jsdai.net/
The STEPcode project is BSD licensed, so it is always free and open source. It generates C++ and Python APIs and an example STEP file reader/writer, and it is used by other open source projects such as BRL-CAD, SCView and OpenVSP.
www.stepcode.org
OpenCasCade has a C++ API, pythonOCC a Python API, and node-occ a JavaScript API for working with data that is translated from STEP; all three are also free and open source. OCE works across more platforms and has more bug fixes.
https://github.com/tpaviot/oce
feuerball, you asked for a regex... This one captures the five groups you want.
I formatted the regex in free-spacing mode to make it easier to read. I did not explain it in detail, but each line is commented, so it should be easy to follow. :)
regexp = /(?x) # free-spacing mode
^ # assert start of a line (in Ruby, ^ matches at every line start)
(\#\d+) # captures the digits into Group 1
\s*=\s* # gets us past the equal and spaces
([A-Z_]+) # captures the name into Group 2
\s*\(\s*' # gets us inside the opening quote
([^']*?)' # captures the string in Group 3
(?: # start optional non-capturing group, let's call it A
\s*,\s* # get over the comma and spaces
(\([^)]*?\)) # capture parens to Group 4
(?:\s*,\s* # start optional non-capturing group, let's call it B
([^\s)]+) # capture last string to Group 5
)? # end optional non-capturing group B
)? # end optional non-capturing group A
\s*\)\s*; # close string
/
subject.scan(regexp) {|result|
# If the regex has capturing groups, result is an array with the text matched by each group (but without the overall match); groups that did not take part in the match are nil
# If the regex has no capturing groups, result is a string with the overall regex match
}
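As a quick check, the pattern above can be run against the three sample statements from the question. This is only a minimal sketch: the constant names STEP_LINE, SAMPLE and ROWS are made up for illustration, and the pattern is the same one written with Ruby's /x flag instead of the inline (?x).

```ruby
# Hypothetical constant names, introduced only for this example.
STEP_LINE = /
  ^                 # start of a line
  (\#\d+)           # Group 1: entity id
  \s*=\s*
  ([A-Z_]+)         # Group 2: entity name
  \s*\(\s*'
  ([^']*?)'         # Group 3: first quoted string
  (?:
    \s*,\s*
    (\([^)]*?\))    # Group 4: parenthesised list, no nesting
    (?:\s*,\s*
      ([^\s)]+)     # Group 5: trailing attribute
    )?
  )?
  \s*\)\s*;
/x

# The three statements from the question, including the one split over two lines.
SAMPLE = <<~'STEP'
  #95=STYLED_ITEM('',(#94),#92);
  #12 = CARTESIAN_POINT ( 'NONE', ( 1.213489432997839200,
  5.617300827691964000, -7.500000000000001800 ) ) ;
  #263 = TEST ( 'Spaces must not be ignored here' ) ;
STEP

# String#scan returns one array of group captures per match.
ROWS = SAMPLE.scan(STEP_LINE)
ROWS.each { |row| p row }
```

Groups 4 and 5 come back as nil when the optional parts are absent, as in the TEST statement. Note also that Group 4 stops at the first closing parenthesis, so a list that contains another list would not be captured correctly by this pattern.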
I don't think regular expressions are the way to go in this case. STEP is a pretty common format and there are parsers for it. Since you're using Java, why not take a look at this:
http://www.steptools.com/support/stdev_docs/javalib/programming.html#SEC0-5-0
I think this is the format you are using, right?
Unless you take the entire schema into account, you're bound to run into issues with regular expressions. Even if you do manage to account for everything, you've just written a kind of parser anyway. Why reinvent the wheel?
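To give an idea of what "a kind of parser" means here, this is a rough sketch of a hand-written reader for a single statement, using StringScanner from Ruby's standard library. The method names parse_value, parse_list and parse_statement are hypothetical, and the sketch covers only quoted strings, nested lists and bare tokens. It is nowhere near the full ISO 10303-21 grammar (no comments, no complex instances, no typed parameters), which is exactly why an existing library is the safer choice.

```ruby
require 'strscan'

# Hypothetical helper names, introduced for illustration only.

# Parses one value: a quoted string, a nested list, or a bare token
# such as #92, a number, $ or an enumeration like .T.
def parse_value(s)
  s.skip(/\s*/)
  if s.scan(/'((?:[^']|'')*)'/)
    s[1].gsub("''", "'")            # '' inside a string is an escaped apostrophe
  elsif s.check(/\(/)
    parse_list(s)
  else
    s.scan(/[^\s,()]+/) or raise "unexpected input at offset #{s.pos}"
  end
end

# Parses "( value, value, ... )" into an array, recursing for nested lists.
def parse_list(s)
  s.skip(/\s*\(\s*/) or raise "expected '(' at offset #{s.pos}"
  items = []
  until s.skip(/\s*\)/)
    items << parse_value(s)
    s.skip(/\s*,/)                  # optional separator before the next value
  end
  items
end

# Parses "#id = NAME ( ... ) ;" into [id, name, attributes].
# Names may contain digits (for example AXIS2_PLACEMENT_3D).
def parse_statement(text)
  s = StringScanner.new(text)
  s.scan(/\s*(\#\d+)\s*=\s*([A-Z_][A-Z0-9_]*)/) or raise "not an entity instance"
  id, name = s[1], s[2]
  attrs = parse_list(s)
  s.skip(/\s*;/) or raise "expected ';' at offset #{s.pos}"
  [id, name, attrs]
end

p parse_statement("#95=STYLED_ITEM('',(#94),#92);")
# => ["#95", "STYLED_ITEM", ["", ["#94"], "#92"]]
```

Unlike a single regular expression, the recursion in parse_list handles lists nested to any depth, and the attributes come back as a real array instead of a fixed number of capture groups. All values stay strings here; converting numbers and resolving #references would be further work.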