可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I'm trying to match SHA1's in generic text with a regular expression.
Ideally I want to avoid matching words.
It's safe to say that full SHA1's have a distinctive pattern (they're long and a consistent length) - so I can match these reliably - but what about abbreviated SHA1's?
Can I rely on the presence of numbers?
Looking at the SHA1's in my commit log - numbers always appear in the first 3 characters. But is this too short? How many characters of SHA1 do I need to consider before I can assume a number would have appeared?
This does not have to be 100% accurate - I just need to match an abbreviated SHA1 99% of the time.
回答1:
You can consider the SHA1 hashes to be completely random, so this reduces to a matter of probabilities. The probability that a given digit is not a number is 6/16, or 0.375. The probability that three SHA1 digits are all not numbers is 0.375 ** 3, or 0.0527 (5% ish). At six digits, this reduces again to 0.00278 (0.2%). At five digits, the probability of all letters drops below 1% (you said you wanted to match 99% of the time).
It's easy to craft a regular expression that always matches SHA1 values:
\b[0-9a-f]{5,40}\b
However, this may also match perfectly good five letter words, like "added" or "faded". In my /usr/share/dict/words
file, there are several six letter words that would match: "accede", "beaded", "bedded", "decade", "deface", "efface", and "facade" are the most likely. At seven letters, there is only "deedeed" which is unlikely to appear in prose. It all depends on how many false positives you can tolerate, and what the likely words you will encounter actually are.
回答2:
What exactly are you trying to do? You shouldn't need to parse anything git outputs with heuristics -- you can always request exactly the data you need.
If you want to match a full hex representation of an SHA1 sum, try:
/\b([a-f0-9]{40})\b/
That is, a word consisting of 40 characters which are either digits or the letters a through f.
If you only have a few characters and don't know where they are, then you are pretty much out of luck. Is "e78fd98" an abbreviated commit ID? Maybe, but what about "1234567"? Is that a commit ID? A problem ticket number? A number that makes a test fail?
Without context, you can't really know what the data means.
To answer your direct question, there is no property of SHA1 that would make the first three characters (in hex form) digits. You are just lucky, or perhaps unlucky, depending on how you look at it.
回答3:
I'm going to assume you want to match against hexadecimal printed representation of a SHA1, and not against the equivalent 20 raw bytes. Furthermore, I'm going to assume that the SHA1's in question use only lower-case letters to represent hex digits. You'll have to adjust the regular expression if your requirements differ.
grep -o -E -e "[0-9a-f]{40}"
Will match such a SHA1. You'll need to translate the above regular expression from egrep's dialect to whatever tool you happen to be using. Since the match must be exactly 40 characters long I don't think you're in danger of accidentally matching words. I don't know of any 40-character words that consist only of the letters a through f.
edit:
Better yet: use A Regex to match a SHA1 as his solution includes checking for word boundaries at both ends. I overlooked that above.
回答4:
If you have access to the repo, you can use git cat-file -e
to check for sure that it represents an object in the repo. This is very fast, too. If you further want to restrict this to just commits and tags, you can use git cat-file -t
to find out the type of the object.
This could be used, for example, to search human-generated text for mentions of git commits and generate hyperlinks to a git web interface.
回答5:
For this type of hash : 43:A4:02:B7:B6:1D:89:86:C5:CE:AD:52:96:D9:2E:7B:64:98:45:6A
:
/^[0-9A-F]{2}(:[0-9A-F]{2}){19}$/
回答6:
I use this in ruby. It allows for a short version of the sha (6 - 8 in case of clashes) and for the full sha at 40 chars long.
\A(([0-9a-f]{40})|([0-9a-f]{6,8}))\z