d_hsp={"1":"I","2":"II","3":"III","4":"IV","5":"V","6":"VI","7":"VII","8":"VIII",
"9":"IX","10":"X","11":"XI","12":"XII","13":"XIII","14":"XIV","15":"XV",
"16":"XVI","17":"XVII","18":"XVIII","19":"XIX","20":"XX","21":"XXI",
"22":"XXII","23":"XXIII","24":"XXIV","25":"XXV"}
HSP_OLD['tryl'] = HSP_OLD['tryl'].replace(d_hsp, regex=True)
HSP_OLD is a dataframe, tryl is one column of HSP_OLD, and here are some example values in tryl:
SAF/HSP: Secondary diagnosis E code 1
SAF/HSP: Secondary diagnosis E code 11
I use a dictionary for the replacement. It works for 1-10, but 11 becomes "II" and 12 becomes "III".
You need to keep the order of the items and start searching with the longest substring, so that "11" is matched before "1" gets a chance to replace each of its digits.
You may use an OrderedDict here. To initialize it, use a list of tuples. You can reverse the order right at initialization, or do it later.
import collections
import pandas as pd
# My test data
HSP_OLD = pd.DataFrame({'tryl':['1. Text', '11. New Text', '25. More here']})
d_hsp_lst = [("1", "I"), ("2", "II"), ("3", "III"), ("4", "IV"), ("5", "V"),
             ("6", "VI"), ("7", "VII"), ("8", "VIII"), ("9", "IX"), ("10", "X"),
             ("11", "XI"), ("12", "XII"), ("13", "XIII"), ("14", "XIV"), ("15", "XV"),
             ("16", "XVI"), ("17", "XVII"), ("18", "XVIII"), ("19", "XIX"), ("20", "XX"),
             ("21", "XXI"), ("22", "XXII"), ("23", "XXIII"), ("24", "XXIV"), ("25", "XXV")]
d_hsp = collections.OrderedDict(d_hsp_lst) # Creating the OrderedDict
d_hsp = collections.OrderedDict(reversed(d_hsp.items())) # Here, reversing
>>> HSP_OLD['tryl'] = HSP_OLD['tryl'].replace(d_hsp, regex=True)
>>> HSP_OLD
             tryl
0         I. Text
1    XI. New Text
2  XXV. More here
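To see why the order matters, here is a minimal sketch in plain Python. It assumes the replacements are applied one after another in dictionary order, which is what the behaviour described in the question suggests. The three-entry `mapping` and the helper `replace_in_order` are hypothetical names used only for this illustration.

```python
import re

# Hypothetical miniature mapping (three keys only) to keep the sketch short.
mapping = {"1": "I", "2": "II", "11": "XI"}

def replace_in_order(text, pairs):
    """Apply each (pattern, replacement) pair in turn, like a chain of re.sub calls."""
    for pattern, repl in pairs:
        text = re.sub(pattern, repl, text)
    return text

text = "SAF/HSP: Secondary diagnosis E code 11"

# Shortest keys first: "1" rewrites each digit of "11" before "11" is ever tried.
short_first = sorted(mapping.items(), key=lambda kv: len(kv[0]))
wrong = replace_in_order(text, short_first)

# Longest keys first: "11" is consumed whole, so nothing is left for "1" to match.
long_first = sorted(mapping.items(), key=lambda kv: len(kv[0]), reverse=True)
right = replace_in_order(text, long_first)

print(wrong)
print(right)
```

Sorting by key length is a slightly more general way to get the same effect as reversing a list that is already in ascending numeric order.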
Sorry, I didn't notice that you are not merely updating the field but actually want to replace a number at the end. Even so, it is much better to properly convert the number to Roman numerals than to map every possible value (what would happen with your code if a number were larger than 25?). Here's one way to do it:
ROMAN_MAP = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'), (100, 'C'), (90, 'XC'),
             (50, 'L'), (40, 'XL'), (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]

def romanize(data):
    if not data or not isinstance(data, str):  # we know how to work with strings only
        return data
    data = data.rstrip()  # remove potential extra whitespace at the end
    space_pos = data.rfind(" ")  # find the last space before the number
    if space_pos != -1:
        try:
            number = int(data[space_pos + 1:])  # get the number at the end
            if number <= 0:  # Roman numerals cannot express zero or negatives
                return data
            roman_number = ""
            for i, r in ROMAN_MAP:  # loop-reduce substitution based on the ROMAN_MAP
                while number >= i:
                    roman_number += r
                    number -= i
            return data[:space_pos + 1] + roman_number  # put everything back together
        except (TypeError, ValueError):
            pass  # couldn't extract a number
    return data
So now if we create your data frame as:
HSP_OLD = pd.DataFrame({"tryl": ["SAF/HSP: Secondary diagnosis E code 1",
                                 None,
                                 "SAF/HSP: Secondary diagnosis E code 11",
                                 "Something else without a number at the end"]})
We can now easily apply our function over the whole column with:
HSP_OLD['tryl'] = HSP_OLD['tryl'].apply(romanize)
Which results in:
                                         tryl
0       SAF/HSP: Secondary diagnosis E code I
1                                        None
2      SAF/HSP: Secondary diagnosis E code XI
3  Something else without a number at the end
Of course, you can adapt the romanize() function to your needs to search for any number within your string and turn it into Roman numerals. This is just an example of how to quickly find the number at the end of the string.
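As a rough sketch of such an adaptation, the variant below converts every run of digits in the string rather than only the one at the end. The names `to_roman` and `romanize_all` are hypothetical and not part of the answer above, and leaving a literal 0 untouched is an assumption about the desired behaviour.

```python
import re

ROMAN_MAP = [(1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'), (100, 'C'), (90, 'XC'),
             (50, 'L'), (40, 'XL'), (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I')]

def to_roman(number):
    """Convert a positive integer to a Roman numeral (returns "" for 0)."""
    result = ""
    for value, numeral in ROMAN_MAP:
        while number >= value:
            result += numeral
            number -= value
    return result

def romanize_all(data):
    """Replace every run of ASCII digits in a string with its Roman numeral."""
    if not isinstance(data, str):  # pass None, NaN, etc. through unchanged
        return data
    # Assumption: a literal 0 has no Roman form, so the original digits are kept.
    return re.sub(r"[0-9]+", lambda m: to_roman(int(m.group())) or m.group(), data)

print(romanize_all("Chapter 4, section 11"))
```

It can be applied to the column the same way as before, with HSP_OLD['tryl'].apply(romanize_all).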