I'm parsing some text from a source outside my control, and it is not in a very convenient format. I have lines like this:
Problem Category: Human Endeavors Problem Subcategory: Space ExplorationProblem Type: Failure to LaunchSoftware Version: 9.8.77.omni.3Problem Details: Issue with signal barrier chamber.
I want to split the line by keys like this:
Problem_Category = "Human Endeavors"
Problem_Subcategory = "Space Exploration"
Problem_Type = "Failure to Launch"
Software_Version = "9.8.77.omni.3"
Problem_Details = "Issue with signal barrier chamber."
The keys will always be in the same order and are always followed by a colon, but there is not necessarily a space or newline between a value and the next key. I'm not sure what can be used as a delimiter to parse this, since colons and spaces can appear in the values as well. How can I parse this text?
If your block of text is this string:
Then
yields the dict
So you could assign
and then you could access the subcategories with dict indexing:
Caution, though: deeply nested dicts/DataFrames of lists of dicts are usually a bad design. As the Zen of Python says, "Flat is better than nested."
Given that you know the keywords ahead of time, partition the text into "current keyword", "remaining text", then continue to partition the remaining text with the next keyword.
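A minimal sketch of that partitioning idea, assuming Python and `str.partition`. The `KEYS` list and the `parse_line` helper are names made up for this illustration, and the key list is taken from the sample line in the question:

```python
# Keys are assumed to appear in this fixed order, each followed by a colon.
KEYS = [
    "Problem Category",
    "Problem Subcategory",
    "Problem Type",
    "Software Version",
    "Problem Details",
]


def parse_line(line, keys=KEYS):
    """Split `line` into a dict by peeling off one known key at a time."""
    result = {}
    # Drop everything up to and including the first key marker.
    _, _, remaining = line.partition(keys[0] + ":")
    for current, following in zip(keys, keys[1:]):
        # Text before the next key marker is the value of the current key.
        value, _, remaining = remaining.partition(following + ":")
        result[current.replace(" ", "_")] = value.strip()
    # Whatever is left belongs to the last key.
    result[keys[-1].replace(" ", "_")] = remaining.strip()
    return result


line = (
    "Problem Category: Human Endeavors Problem Subcategory: Space Exploration"
    "Problem Type: Failure to Launch"
    "Software Version: 9.8.77.omni.3"
    "Problem Details: Issue with signal barrier chamber."
)

for key, value in parse_line(line).items():
    print(f'{key} = "{value}"')
```

One caveat with this sketch: if a value happens to contain the text of the next key followed by a colon, or if a key is missing from the line, the split will land in the wrong place, so it only fits input that really follows the fixed key order.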
This prints:
I hate and fear regex, so here's a solution using only built-in methods.
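A regex-free approach could look something like the sketch below, which uses only `str.find` and slicing. The `FIELD_NAMES` list and the `split_by_keys` function are hypothetical names chosen for this example, and the field names are assumed to be the ones from the question:

```python
# Field names as they appear in the input, in their fixed order (assumed).
FIELD_NAMES = [
    "Problem Category",
    "Problem Subcategory",
    "Problem Type",
    "Software Version",
    "Problem Details",
]


def split_by_keys(text, keys=FIELD_NAMES):
    """Find each 'key:' marker with str.find, then slice out the values between them."""
    markers = [key + ":" for key in keys]
    positions = []
    search_from = 0
    for marker in markers:
        # Search only after the previous marker, since the order is fixed.
        start = text.find(marker, search_from)
        if start == -1:
            raise ValueError(f"missing key: {marker!r}")
        positions.append(start)
        search_from = start + len(marker)
    # The last value runs to the end of the text.
    positions.append(len(text))

    parsed = {}
    for i, marker in enumerate(markers):
        value_start = positions[i] + len(marker)
        value_end = positions[i + 1]
        parsed[keys[i]] = text[value_start:value_end].strip()
    return parsed


sample = (
    "Problem Category: Human Endeavors Problem Subcategory: Space Exploration"
    "Problem Type: Failure to Launch"
    "Software Version: 9.8.77.omni.3"
    "Problem Details: Issue with signal barrier chamber."
)
parsed = split_by_keys(sample)
```

Searching for each marker only after the previous one keeps a colon or a stray word inside an earlier value from being mistaken for a later key, though a value that contains a full later key plus colon would still confuse it.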
Result (newlines added by me for readability):
That's just the job for general BNF parsing, which handles ambiguity nicely. I used Perl and Marpa, a general BNF parser. Hope this helps.
This prints: