For each subtitle, I need to extract the number, the in and out timecodes, and all lines of the text.
9
00:09:48,347 --> 00:09:52,818
- Let's see... what else she's got?
- Yea... ha, ha.

10
00:09:56,108 --> 00:09:58,788
What you got down there, missy?

11
00:09:58,830 --> 00:10:00,811
I wouldn't do that!

12
00:10:03,566 --> 00:10:07,047
-Shit, that's not enough!
-Pull her back!
I'm currently using this pattern, but it misses every subtitle whose text spans two lines:
(?<Order>\d+)\r\n(?<StartTime>(\d\d:){2}\d\d,\d{3}) --> (?<EndTime>(\d\d:){2}\d\d,\d{3})\r\n(?<Sub>.+)(?=\r\n\r\n\d+|$)
Any help would be much appreciated.
I think there are two problems with the regex. The first is that the . near the end, in (?<Sub>.+), does not match newlines. So you could modify it to:
(?<Sub>(.|[\r\n])+?)
Or you could specify RegexOptions.Singleline as an option to the regex. The only thing that option does is make the dot match newlines.
The second problem is that .+ matches as many lines as it can. You can make it non-greedy, like this:
(?<Sub>(.|[\r\n])+?(?=\r\n\r\n|$))
This matches the least amount of text that ends with an empty line or the end of the string.
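To make the two fixes concrete, here is a minimal sketch. It is written in Ruby rather than C#, purely as an assumption to keep the example short and runnable. Ruby's /m flag plays the role of RegexOptions.Singleline, and \z is used instead of $ because Ruby's $ also matches at the end of every line. The names SAMPLE, CUE and parse_cues are illustrative and not from the question.

```ruby
# Two cues with CRLF line endings, as in the question.
SAMPLE = [
  "9", "00:09:48,347 --> 00:09:52,818",
  "- Let's see... what else she's got?", "- Yea... ha, ha.", "",
  "10", "00:09:56,108 --> 00:09:58,788",
  "What you got down there, missy?"
].join("\r\n")

# /m makes "." match line breaks; ".+?" is lazy, so Sub stops at the first
# blank line (\r\n\r\n) or at the end of the string (\z).
CUE = /(?<Order>\d+)\r\n(?<StartTime>(?:\d\d:){2}\d\d,\d{3}) --> (?<EndTime>(?:\d\d:){2}\d\d,\d{3})\r\n(?<Sub>.+?)(?=\r\n\r\n|\z)/m

def parse_cues(text)
  cues = []
  text.scan(CUE) do
    m = Regexp.last_match  # MatchData for the cue just found
    cues << {
      order: m[:Order].to_i,
      start_time: m[:StartTime],
      end_time: m[:EndTime],
      lines: m[:Sub].split("\r\n")
    }
  end
  cues
end

parse_cues(SAMPLE).each { |cue| puts "#{cue[:order]}: #{cue[:lines].inspect}" }
```

With the greedy .+ the first match would swallow everything up to the end of the string; the lazy quantifier plus the lookahead is what keeps each cue separate.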
If I were you, I'd step back from a regex-based implementation and look at a state machine, walking through the file line by line. Your format looks simple enough to handle with maybe 20-40 lines of easy-to-understand code, but too complex for a reasonable regex.
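A rough sketch of what such a state machine could look like, in Ruby (the language choice, the Cue struct, and the names TIMING and parse_srt_lines are all assumptions for illustration). It assumes well-formed input: a number line, a timing line, one or more text lines, then a blank line.

```ruby
TIMING = /\A(?<start_time>(?:\d\d:){2}\d\d,\d{3}) --> (?<end_time>(?:\d\d:){2}\d\d,\d{3})\z/

Cue = Struct.new(:order, :start_time, :end_time, :lines)

def parse_srt_lines(text)
  cues = []
  current = nil
  state = :order                # :order -> :timing -> :text -> :order ...
  text.each_line do |raw|
    line = raw.chomp            # drops a trailing "\r\n", "\n" or "\r"
    case state
    when :order
      next if line.strip.empty? # tolerate extra blank lines between cues
      # Base 10 is given explicitly so a number like "09" is not read as octal.
      current = Cue.new(Integer(line.strip, 10), nil, nil, [])
      state = :timing
    when :timing
      m = TIMING.match(line.strip)
      raise ArgumentError, "bad timing line: #{line.inspect}" unless m
      current.start_time = m[:start_time]
      current.end_time = m[:end_time]
      state = :text
    when :text
      if line.strip.empty?      # a blank line closes the cue
        cues << current
        current = nil
        state = :order
      else
        current.lines << line
      end
    end
  end
  # The last cue may not be followed by a blank line.
  cues << current if current && state == :text
  cues
end
```

The only regex left is the one for the timing line; everything else is plain control flow, which makes malformed input easier to report.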
Personally, I would split the file into an array of lines and loop over it, using a regex match only to spot the StartTime --> EndTime lines. Some fairly simple logic then gets the rest: take Order from the line before each timing line, and take the text from the lines that follow it (search ahead for the next StartTime --> EndTime line and backtrack two lines).
I think this chops the problem up a little, so that you don't have a single regex trying to do it all.
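The array-based idea could be sketched roughly like this, in Ruby (an assumed language choice; TIMING_LINE and parse_by_timing_lines are made-up names). It assumes every timing line is directly preceded by its number line.

```ruby
TIMING_LINE = /\A(?<start_time>(?:\d\d:){2}\d\d,\d{3}) --> (?<end_time>(?:\d\d:){2}\d\d,\d{3})\z/

def parse_by_timing_lines(text)
  lines = text.lines.map(&:chomp)
  # Indices of every "start --> end" line.
  timing_at = lines.each_index.select { |i| TIMING_LINE.match?(lines[i]) }

  timing_at.each_with_index.map do |i, n|
    m = TIMING_LINE.match(lines[i])
    # The text stops before the next cue's number line (one line above the
    # next timing line), or at the end of the file for the last cue.
    stop = n + 1 < timing_at.size ? timing_at[n + 1] - 1 : lines.size
    body = lines[(i + 1)...stop].reject { |l| l.strip.empty? }
    # The cue number sits on the line just above the timing line.
    order = i.zero? ? nil : lines[i - 1].to_i
    { order: order, start_time: m[:start_time], end_time: m[:end_time], lines: body }
  end
end
```

Dropping blank lines from the slice has the same effect as backtracking two lines from the next timing line, and it also copes with more than one blank line between cues.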
I am using the following regular expression to parse .srt files:
@"(?<number>\d+)\r\n(?<start>\S+)\s-->\s(?<end>\S+)\r\n(?<text>(.|[\r\n])+?)\r\n\r\n"
See also: Regular Expression Language - Quick Reference
I used this regex in my Ruby parser:
slines.scan(/(^[0-9]+)\r?\n(.*? --> .*?)\r?\n(.*?)(?=^[0-9]+\r?\n|\s+\Z)/im).map{|z| [z[0],[z[1],z[2].strip]]}
where "slines" is the whole subtitle file read into memory.
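For context, a small wrapper around that scan might look like the sketch below. The names SRT_SCAN and scan_srt and the file name are hypothetical. One thing worth knowing: the lookahead \s+\Z needs at least one whitespace character before the end of the string, so the wrapper appends a newline when the text lacks one; otherwise the last subtitle would not match.

```ruby
SRT_SCAN = /(^[0-9]+)\r?\n(.*? --> .*?)\r?\n(.*?)(?=^[0-9]+\r?\n|\s+\Z)/im

def scan_srt(slines)
  # Make sure the text ends with a line break so the final cue is matched.
  slines += "\n" unless slines.end_with?("\n")
  # Each result is [number, [timing line, text]] with the text stripped.
  slines.scan(SRT_SCAN).map { |z| [z[0], [z[1], z[2].strip]] }
end

# Usage (hypothetical file name):
#   scan_srt(File.read("movie.srt")).each { |num, (timing, text)| puts num, timing, text }
```

Note that a text line consisting only of digits would be taken as the start of the next subtitle, since the lookahead treats any such line as a cue number.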