I have a Persian text file that has some lines like this:
ذوب 6 خوی 7 بزاق ،آبدهان ، یم 10 زهاب، 11 آبرو، حیثیت، شرف
I want to generate a list of words from this line. For me, the word boundaries are the numbers (such as 6 and 7 in the line above) and the ، character.
So the list should be:
[ 'ذوب','خوی','بزاق','آبدهان','یم','زهاب','آبرو','حیثیت' ,'شرف']
I want to do this in Python 3.3. What is the best way to do it? I would really appreciate any help.
EDIT:
I got a number of answers, but when I tried them on another test case they didn't work. The test case is this:
منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن
and I expect to get a list of tokens like this:
['منهدم کردن','خراب کردن', 'ویران کردن', 'تخریب کردن','نابود کردن', 'از بین بردن']
Use re.split to split on whitespace (\s), digits (\d) and the ، character.

Note that the \u200c you are seeing in the output list is a non-printing character (U+200C, zero-width non-joiner) that is actually contained in the original string. Python escapes it because it is showing the representation of the list and the strings it contains, not printing each string for display. This is similar to how Python handles newline characters.
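As a minimal sketch of the re.split approach and of the repr-versus-print difference described above: the character class [\s\d،]+ is one possible way to write the split pattern, and the position of U+200C inside the fourth word is an assumption made for illustration.

```python
import re

# Assumption: the zero-width non-joiner sits inside this word of the sample line.
word = 'آب' + '\u200c' + 'دهان'
line = 'ذوب 6 خوی 7 بزاق ،' + word + ' ، یم 10 زهاب، 11 آبرو، حیثیت، شرف'

# Split on runs of whitespace, digits, or the Persian comma; drop empty pieces.
words = [w for w in re.split(r'[\s\d،]+', line) if w]

print(words)        # list repr: the U+200C character is shown escaped as \u200c
print(words[3])     # plain display: the same character is there but invisible
print(repr('a\nb')) # newlines get the same treatment in a repr
```

The strings in the list are unchanged by the escaping; only the way they are displayed differs.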
For your updated sample, the pattern (?:[^\W\d_]|[\s])+ follows falsetru's findall strategy but uses only the built-in re module.

The pattern is a little strange because Python's re module has no equivalent to the regex package's "letter" class \p{L}, so instead we use the solution proposed here: https://stackoverflow.com/a/8923988/66349

In summary, it matches one or more characters (+) that are either (|) Unicode letters ([^\W\d_]) or whitespace (\s).

falsetru's method is probably more readable, but it requires the third-party regex library.
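A short sketch of that pattern applied to the updated sample with re.findall. The strip-and-filter step at the end is an assumption about how the matches get cleaned up (the same idea appears in UPDATE2 of the other answer); it is not part of the pattern itself.

```python
import re

text = 'منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن'

# One or more characters that are Unicode letters ([^\W\d_]) or whitespace (\s).
pattern = re.compile(r'(?:[^\W\d_]|[\s])+')

# Strip surrounding whitespace from each match and drop whitespace-only matches.
tokens = [m.strip() for m in pattern.findall(text) if m.strip()]
print(tokens)
```

One caveat: [^\W\d_] does not match U+200C, so with this pattern a word containing that character would be split at that point.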
Using the regex package, together with str.replace.

\p{L} or \p{Letter} matches any kind of letter from any language. See Regex Tutorial - Unicode Characters and Properties.
UPDATE
To also include U+200C, use [\p{Cf}\p{L}]+ instead (\p{Cf} or \p{Format} matches invisible formatting characters).

The result looks different from what you want, but the strings are equal.
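To illustrate why U+200C falls under \p{Cf} and why the escaped form is the same string, here is a small standard-library check. It uses unicodedata and ast rather than the regex package, and the sample word with U+200C inside it is an assumption for illustration.

```python
import ast
import unicodedata

zwnj = '\u200c'
# 'Cf' is the Unicode general category "Format", which is what \p{Cf} matches.
category = unicodedata.category(zwnj)
print(category)

word = 'آب' + zwnj + 'دهان'
escaped = repr(word)   # the form shown inside a printed list
print(escaped)
print(word)            # the form shown when the string itself is printed

# Evaluating the escaped literal gives back exactly the same string.
same = ast.literal_eval(escaped) == word
print(same)
```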
UPDATE2
Some words in the edited question contain a space.
I added \s to the pattern so that it also matches spaces, then stripped the leading and trailing spaces from the matched strings and filtered out empty strings.
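If installing the regex package is not an option, the same steps can be approximated with the standard library alone. This is a swapped-in technique, not the regex-based code from this answer: it classifies characters with unicodedata instead of matching [\p{Cf}\p{L}\s]+, and the helper names is_word_char and tokenize are hypothetical.

```python
import unicodedata
from itertools import groupby

def is_word_char(ch):
    # Letters (categories L*), format characters (Cf) and whitespace stay inside
    # a token; anything else (digits, punctuation such as ، or :) separates tokens.
    cat = unicodedata.category(ch)
    return cat.startswith('L') or cat == 'Cf' or ch.isspace()

def tokenize(text):
    # Collect maximal runs of token characters, roughly like findall with a + quantifier.
    runs = (''.join(group) for keep, group in groupby(text, key=is_word_char) if keep)
    # Strip leading and trailing whitespace, then filter out empty strings.
    return [t for t in (run.strip() for run in runs) if t]

text = 'منهدم کردن : 1 خراب کردن، ویران کردن، تخریب کردن 2 نابود کردن، از بین بردن'
tokens = tokenize(text)
print(tokens)
```

Because format characters are kept, a word that contains U+200C stays in one piece instead of being split.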