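This question follows up on an earlier post that was answered by @ecortazar. This time, however, I want to insert an entry (the string "None" in the example below) between two consecutive elements of a pd.Series when neither of them contains a certain string (here 'href'), using only pandas/NumPy. Note: the text of every href line is different.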
import pandas as pd
import numpy as np
table = pd.Series(
["<td class='test'>AA</td>", # 0
"<td class='test'>A</td>", # 1
"<td class='test'><a class='test' href=...", # 2
"<td class='test'>B</td>", # 3
"<td class='test'><a class='test' href=...", # 4
"<td class='test'>BB</td>", # 5
"<td class='test'>C</td>", # 6
"<td class='test'><a class='test' href=...", # 7
"<td class='test'>F</td>", # 8
"<td class='test'>G</td>", # 9
"<td class='test'><a class='test' href=...", # 10
"<td class='test'>X</td>"]) # 11
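# Flag rows that do not contain 'href' but whose next row does
dups = ~table.str.contains('href') & table.shift(-1).str.contains('href')
# Insert "None" before each flagged position
array = np.insert(table.values, dups[dups].index, "None")
pd.Series(array)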
# OUTPUT:
# 0 <td class='test'>AA</td>
# 1 None
# 2 <td class='test'>A</td>
# 3 <td class='test'><a class='test' href=...
# 4 None   <-- incorrect
# 5 <td class='test'>B</td>
# 6 <td class='test'><a class='test' href=...
# 7 <td class='test'>BB</td>
# 8 None
# 9 <td class='test'>C</td>
# 10 <td class='test'><a class='test' href=...
# 11 <td class='test'>F</td>
# 12 None
# 13 <td class='test'>G</td>
# 14 <td class='test'><a class='test' href=...
# 15 <td class='test'>X</td>
Here is the output I actually want.
# OUTPUT:
# 0 <td class='test'>AA</td>
# 1 None
# 2 <td class='test'>A</td>
# 3 <td class='test'><a class='test' href=...
# 4 <td class='test'>B</td>
# 5 <td class='test'><a class='test' href=...
# 6 <td class='test'>BB</td>
# 7 None
# 8 <td class='test'>C</td>
# 9 <td class='test'><a class='test' href=...
# 10 <td class='test'>F</td>
# 11 None
# 12 <td class='test'>G</td>
# 13 <td class='test'><a class='test' href=...
# 14 <td class='test'>X</td>
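For reference, one possible way to get the desired output is sketched below. In that output, "None" appears only where a row without 'href' directly follows another row without 'href', so the sketch compares each row with the previous one (`shift(1)`) instead of the next one. The names `has_href`, `prev_has_href`, `positions`, and `result` are made up for illustration, and the sketch assumes the Series has no missing values.

```python
import numpy as np
import pandas as pd

table = pd.Series(
    ["<td class='test'>AA</td>",                   # 0
     "<td class='test'>A</td>",                    # 1
     "<td class='test'><a class='test' href=...",  # 2
     "<td class='test'>B</td>",                    # 3
     "<td class='test'><a class='test' href=...",  # 4
     "<td class='test'>BB</td>",                   # 5
     "<td class='test'>C</td>",                    # 6
     "<td class='test'><a class='test' href=...",  # 7
     "<td class='test'>F</td>",                    # 8
     "<td class='test'>G</td>",                    # 9
     "<td class='test'><a class='test' href=...",  # 10
     "<td class='test'>X</td>"])                   # 11

# True where the row contains 'href' (plain substring match, no regex)
has_href = table.str.contains('href', regex=False).astype(bool)

# Same mask for the previous row; the first row has no predecessor,
# so it is filled with True to make sure nothing is inserted before it
prev_has_href = has_href.shift(1, fill_value=True)

# Insert only where neither the row nor the row before it contains 'href'
positions = np.flatnonzero((~has_href & ~prev_has_href).to_numpy())

# np.insert places "None" before each of the original positions
result = pd.Series(np.insert(table.to_numpy(dtype=object), positions, "None"))
print(result)
```

With the sample data, `positions` comes out as 1, 6 and 9 (counted in the original Series), so "None" ends up at rows 1, 7 and 11 of `result`, which matches the desired output above.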