Querying a column with lists in it

I have a dataframe with columns with lists in them. How can I query these?

>>> df1.shape
(1812871, 7)
>>> df1.dtypes
CHROM     object
POS        int32
ID        object
REF       object
ALT       object
QUAL        int8
FILTER    object
dtype: object
>>> df1.head()
  CHROM    POS           ID REF   ALT  QUAL  FILTER
0    20  60343  rs527639301   G   [A]   100  [PASS]
1    20  60419  rs538242240   A   [G]   100  [PASS]
2    20  60479  rs149529999   C   [T]   100  [PASS]
3    20  60522  rs150241001   T  [TC]   100  [PASS]
4    20  60568  rs533509214   A   [C]   100  [PASS]
>>> df2 = df1.head(30)
>>> df3 = df1.head(3000)

I found a previous question, but the solutions do not quite work for me. The accepted solution does not work:

>>> df2[df2.ALT.apply(lambda x: x == ['TC'])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
    indexer = check = labels.get_indexer(objarr)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
    indexer = self._engine.get_indexer(target._ndarray_values)
  File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
  File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
TypeError: unhashable type: 'numpy.ndarray'

The reason is that the booleans get nested:

>>> df2.ALT.apply(lambda x: x == ['TC']).head()
0    [False]
1    [False]
2    [False]
3     [True]
4    [False]
Name: ALT, dtype: object

So I tried the second answer, which seemed to work:

>>> c = np.empty(1, object)
>>> c[0] = ['TC']
>>> df2[df2.ALT.values == c]
  CHROM    POS           ID REF   ALT  QUAL  FILTER
3    20  60522  rs150241001   T  [TC]   100  [PASS]

But strangely, it doesn't work when I try it on the larger dataframe:

>>> df3[df3.ALT.values == c]
Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
    return self._getitem_column(key)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
    return self._get_item_cache(key)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    values = self._data.get(item)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False

This is probably because the result of the boolean comparison is different:

>>> df3.ALT.values == c
False
>>> df2.ALT.values == c
array([False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False])

This is completely baffling to me.

Tags: pandas nested

1 Answer

对你真心纯属浪费

Reply #2 · 2019-08-26 23:04

I found a hacky solution that works for me: convert the lists to tuples.

import pandas as pd

# Small reproduction of the frame from the question
df = pd.DataFrame({'CHROM': [20] * 5,
                   'POS': [60343, 60419, 60479, 60522, 60568],
                   'ID': ['rs527639301', 'rs538242240', 'rs149529999', 'rs150241001', 'rs533509214'],
                   'REF': ['G', 'A', 'C', 'T', 'A'],
                   'ALT': [['A'], ['G'], ['T'], ['TC'], ['C']],
                   'QUAL': [100] * 5,
                   'FILTER': [['PASS']] * 5})
# Convert each list cell to a tuple
df['ALT'] = df['ALT'].apply(tuple)

# Compare the whole cell against a one-element tuple
df[df['ALT'] == ('C',)]

This works because comparing two tuples yields a single True or False, and pandas treats a tuple on the right-hand side of == as one value to compare against every cell. A list or NumPy array, by contrast, is treated as array-like and compared elementwise, which is what produced the nested Booleans in your Series.
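
If you would rather not change the column itself, the same idea can be applied only while building the mask. The following is a minimal sketch, not a definitive recipe: the helper name alt_equals and its target parameter are made up for illustration, and the small frame only mimics the one in the question. Each cell is converted to a tuple inside the helper, so every comparison returns one plain bool whether the cell is a list or a NumPy array.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (assumption: cells are list-like)
df = pd.DataFrame({
    'POS': [60343, 60419, 60479, 60522, 60568],
    'ALT': [['A'], ['G'], ['T'], ['TC'], ['C', 'G']],
})

def alt_equals(cell, target):
    """Hypothetical helper: whole-cell comparison that returns one bool."""
    # tuple == tuple never broadcasts, so the result is a single True/False
    return tuple(cell) == tuple(target)

# Build a flat boolean mask; the ALT column itself is left untouched
mask = df['ALT'].apply(alt_equals, target=['TC'])
result = df[mask]
print(result)
```

Because the helper returns a plain bool for every row, the mask is an ordinary boolean Series and can be used for indexing on a frame of any size, including rows whose cells hold more than one value.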
