Question:
Checking the documentation on memoryview:
memoryview objects allow Python code to access the internal data of an
object that supports the buffer protocol without copying.
class memoryview(obj)
Create a memoryview that references obj. obj must support the
buffer protocol. Built-in objects that support the buffer protocol
include bytes and bytearray.
Then we are given the sample code:
>>> v = memoryview(b'abcefg')
>>> v[1]
98
>>> v[-1]
103
>>> v[1:4]
<memory at 0x7f3ddc9f4350>
>>> bytes(v[1:4])
b'bce'
Quotation over, now let's take a closer look:
>>> b = b'long bytes stream'
>>> b.startswith(b'long')
True
>>> v = memoryview(b)
>>> vsub = v[5:]
>>> vsub.startswith(b'bytes')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'memoryview' object has no attribute 'startswith'
>>> bytes(vsub).startswith(b'bytes')
True
>>>
So what I gather from the above:
We create a memoryview object to expose the internal data of a buffer object without
copying. However, in order to do anything useful with the object (by calling the methods
provided by the object), we have to create a copy!
Usually memoryview (or the old buffer object) would be needed when we have a large object,
and the slices can be large too. The need for a better efficiency would be present
if we are making large slices, or making small slices but a large number of times.
With the above scheme, I don't see how it can be useful for either situation, unless
someone can explain to me what I'm missing here.
Edit1:
We have a large chunk of data, and we want to process it by advancing through it from start to
end, for example extracting tokens from the start of a string buffer until the buffer is consumed. In C terms, this is advancing a pointer through the buffer, and the pointer can be passed
to any function expecting the buffer type. How can something similar be done in Python?
People suggest workarounds, for example many string and regex functions take position
arguments that can be used to emulate advancing a pointer. There are two issues with this: first,
it's a workaround, so you are forced to change your coding style to overcome the shortcomings, and
second, not all functions have position arguments: regex functions and startswith
do, but encode()/decode() don't.
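For reference, the position-argument workaround mentioned above might look roughly like this. It is only a sketch: the names data, pos and word are made up for illustration, and it uses the optional start argument of startswith and the pos argument of a compiled pattern's match().

```python
import re

data = b"long bytes stream"
pos = 5  # plays the role of the advancing C pointer

# startswith() accepts an optional start offset, so no slice is created.
starts_with_bytes = data.startswith(b"bytes", pos)

# Compiled patterns accept a pos argument for match() and search().
word = re.compile(rb"\w+")
m = word.match(data, pos)
token = m.group()    # the matched token, copied out as a small bytes object
next_pos = m.end()   # offset to continue scanning from
```

This avoids copying the tail of the buffer, but every call site has to carry pos around, which is exactly the style change complained about above.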
Others might suggest loading the data in chunks, or processing the buffer in small
segments larger than the max token. Okay, so we are aware of these possible
workarounds, but we are supposed to work in a more natural way in Python without
trying to bend the coding style to fit the language, aren't we?
Edit2:
A code sample would make things clearer. This is what I want to do, and what I assumed memoryview would allow me to do at first glance. Let's use pmview (proper memory view) for the functionality I'm looking for:
tokens = []
xlarge_str = get_string()
xlarge_str_view = pmview(xlarge_str)
while True:
    token = get_token(xlarge_str_view)
    if token:
        xlarge_str_view = xlarge_str_view.vslice(len(token))
        # vslice: view slice: default stop parameter at end of buffer
        tokens.append(token)
    else:
        break
Answer 1:
One reason memoryviews are useful is because they can be sliced without copying the underlying data, unlike bytes/str.
For example, take the following toy example.
# Python 2: print is a statement and str is a byte string
import time
for n in (100000, 200000, 300000, 400000):
    data = 'x'*n
    start = time.time()
    b = data
    while b:
        b = b[1:]  # copies the remaining bytes on every iteration
    print 'bytes', n, time.time()-start
for n in (100000, 200000, 300000, 400000):
    data = 'x'*n
    start = time.time()
    b = memoryview(data)
    while b:
        b = b[1:]  # creates a new view, no data is copied
    print 'memoryview', n, time.time()-start
On my computer, I get
bytes 100000 0.200068950653
bytes 200000 0.938908100128
bytes 300000 2.30898690224
bytes 400000 4.27718806267
memoryview 100000 0.0100269317627
memoryview 200000 0.0208270549774
memoryview 300000 0.0303030014038
memoryview 400000 0.0403470993042
You can clearly see the quadratic complexity of the repeated string slicing. Even with only 400000 iterations, it's already unmanageable. Meanwhile, the memoryview version has linear complexity and is lightning fast.
Edit: Note that this was done in CPython. There was a bug in PyPy up to 4.0.1 that caused memoryviews to have quadratic performance.
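To see why the view version is cheap, here is a small Python 3 sketch showing that a slice of a memoryview still refers to the original buffer rather than to a copy. The variable names are illustrative only, and a bytearray is used so the buffer can be mutated.

```python
data = bytearray(b"abcdefgh")
view = memoryview(data)

tail = view[4:]            # new view object, no bytes are copied
copied = bytes(data[4:])   # slicing the bytearray itself copies 4 bytes

data[4] = ord("X")         # mutate the underlying buffer in place

tail_now = bytes(tail)     # the view reflects the change
same_object = tail.obj is data  # the slice still references the original object
```

Here tail_now is b"Xfgh" while copied is still b"efgh", so each b[1:] on a view only creates a small view object, whatever the size of the data behind it.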
Answer 2:
memoryview objects are great when you need subsets of binary data that only need to support indexing. Instead of having to take slices (and create new, potentially large objects) to pass to another API, you can just take a memoryview object.
One such API example would be the struct module. Instead of passing in a slice of the large bytes object to parse out packed C values, you pass in a memoryview of just the region you need to extract values from.
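As a rough sketch of that idea, the loop below walks a buffer of length-prefixed records, handing struct.unpack_from a memoryview and advancing by re-slicing the view. The record layout (2-byte big-endian length followed by the payload) and the name iter_records are assumptions made up for this example.

```python
import struct

def iter_records(buf):
    """Yield each record payload as a zero-copy memoryview slice.

    Assumes a hypothetical layout: 2-byte big-endian length, then payload.
    """
    view = memoryview(buf)
    while len(view) >= 2:
        (size,) = struct.unpack_from(">H", view)  # reads from the view directly
        yield view[2:2 + size]
        view = view[2 + size:]  # advance the "pointer" without copying

# Build a small sample stream of three records.
stream = b"".join(
    struct.pack(">H", len(p)) + p for p in (b"long", b"bytes", b"stream")
)
records = [bytes(r) for r in iter_records(stream)]  # copy only the small payloads
```

Only the final bytes(r) calls copy anything, and they copy just the individual payloads, not the remainder of the stream.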
memoryview objects, in fact, support struct unpacking natively; you can target a region of the underlying bytes object with a slice, then use .cast() to 'interpret' the underlying bytes as long integers, or floating point values, or n-dimensional lists of integers. This makes for very efficient binary file format interpretations, without having to create more copies of the bytes.
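A minimal sketch of .cast(), assuming a made-up layout of a 4-byte header followed by three native unsigned ints. Note that cast() works with native struct format codes, so the sample data is packed in native mode as well.

```python
import struct

# Hypothetical record: 4-byte tag, then three native unsigned ints.
raw = struct.pack("4s3I", b"HDR0", 10, 20, 30)
view = memoryview(raw)

body = view[4:]                  # skip the header without copying
values = body.cast("I").tolist() # reinterpret the remaining bytes as unsigned ints

# A flat byte buffer can also be viewed as an n-dimensional array.
grid = memoryview(bytes(range(6))).cast("B", shape=[2, 3]).tolist()
```

Here values is [10, 20, 30] and grid is [[0, 1, 2], [3, 4, 5]]. The tolist() calls are only there to make the result easy to inspect; the cast views themselves can be indexed directly.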
Answer 3:
Let me spell out where the misunderstanding lies here.
The questioner, like myself, expected to be able to create a memoryview that selects a slice of an existing array (for example a bytes or bytearray). We therefore expected something like:
desired_slice_view = memoryview(existing_array, start_index, end_index)
Alas, there is no such constructor, and the docs don't make a point of what to do instead.
The key is that you have to first make a memoryview that covers the entire existing array. From that memoryview you can create a second memoryview that covers a slice of the existing array, like this:
whole_view = memoryview(existing_array)
desired_slice_view = whole_view[10:20]
In short, the purpose of the first line is simply to provide an object whose slice implementation (__getitem__) returns a memoryview.
That might seem untidy, but one can rationalize it a couple of ways:
Our desired output is a memoryview that is a slice of something. Normally we get a sliced object from an object of that same type, by using the slice operator [10:20] on it. So there's some reason to expect that we need to get our desired_slice_view from a memoryview, and that therefore the first step is to get a memoryview of the whole underlying array.
The naive expectation of a memoryview constructor with start and end arguments fails to consider that the slice specification really needs all the expressivity of the usual slice operator (including things like [3::2] or [:-4] etc). There is no way to just use the existing (and understood) operator in that one-liner constructor. You can't attach it to the existing_array argument, as that will make a slice of that array, instead of telling the memoryview constructor some slice parameters. And you can't use the operator itself as an argument, because it's an operator and not a value or object.
Conceivably, a memoryview constructor could take a slice object:
desired_slice_view = memoryview(existing_array, slice(1, 5, 2))
... but that's not very satisfactory, since users would have to learn about the slice object and what its constructor's parameters mean, when they already think in terms of the slice operator's notation.
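For what it's worth, a slice object can already be used as a subscript on the whole-array view, so the one-liner can be approximated with a tiny wrapper. The sketch below assumes a helper named view_slice, which is not part of the standard library.

```python
def view_slice(buf, *slice_args):
    """Hypothetical helper: zero-copy view of buf[slice(*slice_args)]."""
    return memoryview(buf)[slice(*slice_args)]

existing_array = bytearray(range(10))
whole_view = memoryview(existing_array)

a = whole_view[1:5:2]                    # the usual slice operator
b = whole_view[slice(1, 5, 2)]           # the same slice, passed as an object
c = view_slice(existing_array, 1, 5, 2)  # the one-liner via the helper
```

All three views cover the same two elements, [1, 3], of existing_array.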
Answer 4:
Excellent example by Antimony.
Actually, in Python 3, you can replace data = 'x'*n with data = bytes(n) and add parentheses to the print statements as below:
import time
for n in (100000, 200000, 300000, 400000):
    #data = 'x'*n
    data = bytes(n)
    start = time.time()
    b = data
    while b:
        b = b[1:]
    print('bytes', n, time.time()-start)
for n in (100000, 200000, 300000, 400000):
    #data = 'x'*n
    data = bytes(n)
    start = time.time()
    b = memoryview(data)
    while b:
        b = b[1:]
    print('memoryview', n, time.time()-start)
Answer 5:
Here is the Python 3 code.
#!/usr/bin/env python3
import time
for n in (100000, 200000, 300000, 400000):
    data = b'x'*n
    start = time.time()
    b = data
    while b:
        b = b[1:]
    print('bytes {:d} {:f}'.format(n, time.time()-start))
for n in (100000, 200000, 300000, 400000):
    data = b'x'*n
    start = time.time()
    b = memoryview(data)
    while b:
        b = b[1:]
    print('memview {:d} {:f}'.format(n, time.time()-start))