Parse subtitle file using regex C#

Posted 2019-02-27 17:06

I need to find the number, the in and out timecode points and all lines of the text.

9
00:09:48,347 --> 00:09:52,818
- Let's see... what else she's got?
- Yea... ha, ha.

10
00:09:56,108 --> 00:09:58,788
What you got down there, missy?

11
00:09:58,830 --> 00:10:00,811
I wouldn't do that!

12
00:10:03,566 --> 00:10:07,047
-Shit, that's not enough!
-Pull her back!

I'm currently using this pattern, but it misses every subtitle whose text spans two lines:

(?<Order>\d+)\r\n(?<StartTime>(\d\d:){2}\d\d,\d{3}) --> (?<EndTime>(\d\d:){2}\d\d,\d{3})\r\n(?<Sub>.+)(?=\r\n\r\n\d+|$)

Any help would be much appreciated.

5 answers
萌系小妹纸
Reply #2 · 2019-02-27 17:22

I used this regex in my Ruby parser:

slines.scan(/(^[0-9]+)\r?\n(.*? --> .*?)\r?\n(.*?)(?=^[0-9]+\r?\n|\s+\Z)/im).map{|z| [z[0],[z[1],z[2].strip]]}

where "slines" is the whole subtitle file read into memory.

甜甜的少女心
Reply #3 · 2019-02-27 17:23

I am using the following regular expression to parse .srt files:

@"(?<number>\d+)\r\n(?<start>\S+)\s-->\s(?<end>\S+)\r\n(?<text>(.|[\r\n])+?)\r\n\r\n"

Regular Expression Language - Quick Reference

成全新的幸福
Reply #4 · 2019-02-27 17:24

I think there are two problems with the regex. The first is that the . near the end, in (?<Sub>.+), does not match newlines. You could modify it to:

(?<Sub>(.|[\r\n])+?)

Or you could specify RegexOptions.Singleline as an option to the regex. The only thing the option does is make the dot match newlines.

The second problem is that .+ is greedy: once the dot can match newlines, it matches as many lines as it can. You can make it non-greedy like this:

(?<Sub>(.|[\r\n])+?(?=\r\n\r\n|$))

This matches the least amount of text that ends with an empty line or the end of the string.
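Putting both fixes together, a minimal C# sketch could look like the following. The class and property names (SrtRegexParser, Cue) are made up for illustration, and the pattern also accepts bare \n line endings and trailing whitespace at the end of the file, which goes slightly beyond what the question asked for.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one subtitle entry (not from the original post).
public sealed class Cue
{
    public int Order { get; set; }
    public string StartTime { get; set; }
    public string EndTime { get; set; }
    public string Text { get; set; }
}

// Hypothetical helper that applies the corrected pattern to a whole file.
public static class SrtRegexParser
{
    // Singleline makes '.' match newlines; '.+?' is lazy, so the text stops
    // at the first blank line or at the end of the input.
    private static readonly Regex CuePattern = new Regex(
        @"(?<Order>\d+)\r?\n" +
        @"(?<StartTime>(?:\d\d:){2}\d\d,\d{3}) --> (?<EndTime>(?:\d\d:){2}\d\d,\d{3})\r?\n" +
        @"(?<Sub>.+?)(?=\r?\n\r?\n|\s*\z)",
        RegexOptions.Singleline);

    public static List<Cue> Parse(string srt)
    {
        var cues = new List<Cue>();
        foreach (Match m in CuePattern.Matches(srt))
        {
            cues.Add(new Cue
            {
                Order = int.Parse(m.Groups["Order"].Value),
                StartTime = m.Groups["StartTime"].Value,
                EndTime = m.Groups["EndTime"].Value,
                Text = m.Groups["Sub"].Value // may contain embedded line breaks
            });
        }
        return cues;
    }
}
```

Calling SrtRegexParser.Parse(File.ReadAllText(path)) should then return one Cue per subtitle, with multi-line text kept in a single Text string.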

我只想做你的唯一
Reply #5 · 2019-02-27 17:32

I would personally split the file into an array of lines and loop through it, using a regex match only to find the "StartTime --> EndTime" lines. From there, fairly simple logic can grab Order from the previous line and the text from the lines that follow (search ahead to the next "StartTime --> EndTime" line and backtrack two lines).

I think this way chops the problem up a little so that you don't have a regex expression trying to do it all.
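As a rough illustration of this approach, here is a small C# sketch. The name SrtArrayParser and the tuple layout are assumptions made for the example; it also assumes each entry is followed by a blank line, as in the sample in the question, and it drops empty lines from the text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the regex only recognises timing lines,
// everything else is plain index arithmetic on the array of lines.
public static class SrtArrayParser
{
    private static readonly Regex Timing = new Regex(
        @"^(?<StartTime>(?:\d\d:){2}\d\d,\d{3}) --> (?<EndTime>(?:\d\d:){2}\d\d,\d{3})\s*$");

    public static List<(string Order, string Start, string End, string Text)> Parse(string srt)
    {
        string[] lines = srt.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);

        // Collect the indices of all timing lines (index 0 cannot be one,
        // because the order number must sit on the line before it).
        var timingRows = new List<int>();
        for (int i = 1; i < lines.Length; i++)
        {
            if (Timing.IsMatch(lines[i])) timingRows.Add(i);
        }

        var cues = new List<(string Order, string Start, string End, string Text)>();
        for (int k = 0; k < timingRows.Count; k++)
        {
            int row = timingRows[k];
            // Text ends two lines before the next timing line
            // (blank separator + order number), or at the end of the file.
            int end = k + 1 < timingRows.Count ? timingRows[k + 1] - 2 : lines.Length;

            var text = new List<string>();
            for (int j = row + 1; j < end; j++)
            {
                if (lines[j].Length > 0) text.Add(lines[j]);
            }

            Match m = Timing.Match(lines[row]);
            cues.Add((lines[row - 1].Trim(),
                      m.Groups["StartTime"].Value,
                      m.Groups["EndTime"].Value,
                      string.Join("\n", text)));
        }
        return cues;
    }
}
```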

Juvenile、少年°
Reply #6 · 2019-02-27 17:37

If I were you, I'd step back from a regex-based implementation and look at a state machine, walking through the file line by line. Your format looks simple enough to handle with maybe 20-40 lines of easy-to-understand code, but too complex for a reasonable regex.
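To give an idea of what such a state machine might look like, here is a sketch in C#. The names (SrtLineParser, SrtCue, the State enum) are invented for the example, and the error handling is an assumption: malformed entries are silently skipped rather than reported.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical container for one subtitle entry.
public sealed class SrtCue
{
    public int Order { get; set; }
    public string Start { get; set; }
    public string End { get; set; }
    public List<string> Lines { get; } = new List<string>();
}

// Hypothetical line-by-line parser with three states.
public static class SrtLineParser
{
    private enum State { ExpectOrder, ExpectTiming, ReadText }

    public static List<SrtCue> Parse(TextReader reader)
    {
        var cues = new List<SrtCue>();
        var state = State.ExpectOrder;
        SrtCue current = null;
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            switch (state)
            {
                case State.ExpectOrder:
                {
                    // Blank or unexpected lines are skipped until a number shows up.
                    if (int.TryParse(line.Trim(), out int order))
                    {
                        current = new SrtCue { Order = order };
                        state = State.ExpectTiming;
                    }
                    break;
                }
                case State.ExpectTiming:
                {
                    string[] parts = line.Split(new[] { " --> " }, StringSplitOptions.None);
                    if (parts.Length == 2)
                    {
                        current.Start = parts[0].Trim();
                        current.End = parts[1].Trim();
                        state = State.ReadText;
                    }
                    else
                    {
                        state = State.ExpectOrder; // malformed entry: drop it
                    }
                    break;
                }
                case State.ReadText:
                {
                    if (line.Trim().Length == 0)
                    {
                        cues.Add(current);         // a blank line closes the entry
                        state = State.ExpectOrder;
                    }
                    else
                    {
                        current.Lines.Add(line);
                    }
                    break;
                }
            }
        }

        // The file may end without a trailing blank line.
        if (state == State.ReadText) cues.Add(current);
        return cues;
    }
}
```

Because it reads from a TextReader, the same method works with a StreamReader over a file or a StringReader over text already in memory.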
