I took this example from the web. My first document contains:
Document 1 :
Purpose of visit : For physical check up.
History of patient : This is the first admission for this 56 year old woman, who states she was in her usual state of good health until one week prior to admission. At that time she noticed the abrupt onset (over a few seconds to a minute) of chest pain which she describes as dull and aching in character. The pain began in the left para-sternal area and radiated up to her neck.
Medications : 1. Critizin. 2. p.n.b.s
Review of Systems :
HEENT:
1 or 2 beers each weekend; 1 glass of wine once a week with dinner.
Cardiovascular:
See HPI
Document 2 contains :
Purpose of visit : For physical check up.
History of patient : This is the first admission for this 56 year old woman, who states she was in her usual state of good health until one week prior to admission. At that time she noticed the abrupt onset (over a few seconds to a minute) of chest pain which she describes as dull and aching in character. The pain began in the left para-sternal area and radiated up to her neck. She does not smoke nor does she have diabetes. She was diagnosed with hypertension 3 years ago and had a TAH with BSO 6 years ago. She is not on hormone replacement therapy. There is a family history of premature CAD. She does not know her cholesterol level.
Medications : 1. Critizin. 2. Flexon
Review of Systems :
HEENT:
1 or 2 beers each weekend; 1 glass of wine once a week with dinner.
Cardiovascular: See HPI
Genitourinary: No complaints of dysuria, nocturia, polyuria, hematuria, or vaginal bleeding.
I was thinking of splitting each line of the file into sentences on the period (.) and splitting the sections on the colon (:). But the files sometimes also contain decimals such as 3.5, and in the Medications section the items are separated by periods too, for example "1 hello. 2 hi."
How can I calculate a similarity score between the corresponding sections of the two files?
You can use the difflib module. In your case, you need difflib.SequenceMatcher, a class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.
Sample example:
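A minimal sketch, assuming the two Medications lines from the question as input; the variable names are illustrative and not part of the original answer:

```python
from difflib import SequenceMatcher

# Medications lines from the two documents in the question.
meds_1 = "1. Critizin. 2. p.n.b.s"
meds_2 = "1. Critizin. 2. Flexon"

# The first argument is an optional "junk" filter; None disables it.
matcher = SequenceMatcher(None, meds_1, meds_2)

# Find the longest block of characters the two strings share.
match = matcher.find_longest_match(0, len(meds_1), 0, len(meds_2))
common = meds_1[match.a:match.a + match.size]
print(repr(common))
```

Strings are compared character by character here. Passing lists of words instead compares word by word, since any sequence of hashable elements is accepted.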
Now, to measure the similarity of the sequences, use ratio(), which returns a float in [0, 1]. As a rule of thumb, a ratio() value over 0.6 means the sequences are close matches.
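To get one score per section, one possible approach is to split each file on its section headers only and compare the section bodies as lists of words, which avoids splitting on periods altogether (so "3.5" and "1. Critizin." stay intact). The sketch below assumes a header is a short alphabetic label followed by a colon at the start of a line; the HEADER pattern and the helpers split_sections and section_similarity are hypothetical names, and the sample texts are shortened versions of the documents in the question.

```python
import re
from difflib import SequenceMatcher

# Assumed header format: alphabetic label, optional spaces, colon, at line start.
HEADER = re.compile(r"^([A-Za-z][A-Za-z ]*?)\s*:\s*(.*)$")

def split_sections(text):
    """Map each section header to its body text."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = HEADER.match(line)
        if m:
            current = m.group(1)
            sections[current] = m.group(2)
        elif current is not None:
            # A line without a header continues the previous section.
            sections[current] = (sections[current] + " " + line).strip()
    return sections

def section_similarity(text_a, text_b):
    """Return {header: ratio} for every header found in either text."""
    a, b = split_sections(text_a), split_sections(text_b)
    scores = {}
    for header in sorted(set(a) | set(b)):
        # Compare word by word; a missing section counts as empty.
        words_a = a.get(header, "").split()
        words_b = b.get(header, "").split()
        scores[header] = SequenceMatcher(None, words_a, words_b).ratio()
    return scores

doc1 = """Purpose of visit : For physical check up.
Medications : 1. Critizin. 2. p.n.b.s
Cardiovascular:
See HPI"""

doc2 = """Purpose of visit : For physical check up.
Medications : 1. Critizin. 2. Flexon
Cardiovascular: See HPI
Genitourinary: No complaints of dysuria."""

for header, score in section_similarity(doc1, doc2).items():
    print(f"{header}: {score:.2f}")
```

A section that exists in only one file scores 0.0 under this scheme. A body line that itself begins with a label and a colon would be mistaken for a header, so the pattern may need tightening for real records.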