I'd like to shorten a string using textwrap.shorten or a function like it. The string can potentially have non-ASCII characters. What's special here is that the maximum width applies to the encoded bytes of the string rather than to its characters. This problem is motivated by the fact that several database column definitions and some message buses have a byte-based maximum length.
For example:
>>> import textwrap
>>> s = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'
# Available function that I tried:
>>> textwrap.shorten(s, width=27)
'☺ Ilsa, le méchant ☺ [...]'
>>> len(_.encode())
31 # I want ⩽27
# Desired function:
>>> shorten_to_bytes_width(s, width=27)
'☺ Ilsa, le méchant [...]'
>>> len(_.encode())
27 # I want and get ⩽27
It's okay for the implementation to use a width greater than or equal to the length of the whitespace-stripped placeholder [...], i.e. 5.
The text should not be shortened any more than necessary. Some buggy implementations can use optimizations which on occasion result in excessive shortening.
Using textwrap.wrap with bytes count is a similar question, but it's different enough from this one since it is about textwrap.wrap, not textwrap.shorten. Only the latter function uses a placeholder ([...]), which makes this question sufficiently unique.
Caution: Do not rely on any of the answers here for shortening a JSON-encoded string to a fixed number of bytes. For that, substitute text.encode() with json.dumps(text).
In theory it's enough to encode your string and then check if it fits in the "width" constraint. If it does, the string can simply be returned. Otherwise you can take the first "width" bytes from the encoded string (minus the bytes needed for the placeholder). To make sure it works like textwrap.shorten, one also needs to find the last whitespace in the remaining bytes and return everything before the whitespace plus the placeholder. If there is no whitespace, only the placeholder needs to be returned.

Given that you mentioned you really want the result byte-constrained, the function throws an exception if the placeholder is too large. Having a placeholder that wouldn't fit into the byte-constrained container/data structure simply doesn't make sense, and this avoids a lot of edge cases that could result in an inconsistent "maximum byte size" and "placeholder byte size".
The code could look like this (a sketch of the approach just described; the name shorten_bytes and the exact details are illustrative):
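def shorten_bytes(text, width, placeholder='[...]', normalize_spaces=True):
    placeholder_bytes = placeholder.encode()
    if len(placeholder_bytes) > width:
        raise ValueError('placeholder too large for width')
    if normalize_spaces:
        # Collapse newlines, tabs and repeated spaces to single spaces.
        text = ' '.join(text.split())
    encoded = text.encode()
    if len(encoded) <= width:
        return text  # Already fits; nothing to do.
    # Keep only the bytes that fit alongside the placeholder.
    encoded = encoded[:width - len(placeholder_bytes)]
    # Cut at the last space so neither a word nor a multi-byte character
    # is split in half (a space byte is never part of a multi-byte
    # UTF-8 sequence, so everything before it decodes cleanly).
    space_index = encoded.rfind(b' ')
    if space_index == -1:
        return placeholder
    return encoded[:space_index].decode() + ' ' + placeholder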
And a simple test case:
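s = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'
shortened = shorten_bytes(s, width=27)
print(shortened)
print(len(shortened.encode()))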
Which returns:
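☺ Ilsa, le méchant [...]
27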
The function also has an argument for normalizing the spaces. That could be helpful in case you have different kinds of whitespace (newlines, etc.) or multiple sequential spaces, although it will be a bit slower.
Performance
I did a quick test using simple_benchmark (a library I wrote) to ensure it's actually faster.

For the benchmark I create a string containing random unicode characters where, on average, one out of 8 characters is a whitespace. I also use half the length of the string as the byte width to split at. Neither choice has a special reason, but it could bias the benchmarks, which is why I wanted to mention it.
The benchmark setup could look like this (a sketch; the wrapper names are illustrative, shorten_bytes is the sketch above, and shorten_to_bytes_width is the baseline from the answer further down):
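import random
from simple_benchmark import benchmark

def random_text(size):
    # Random characters; on average one out of 8 is a space.
    return ''.join(
        ' ' if random.random() < 1 / 8 else chr(random.randint(33, 0x10FF))
        for _ in range(size)
    )

def bench_shorten_bytes(text):
    # Shorten to half the encoded length, as described above.
    return shorten_bytes(text, len(text.encode()) // 2)

def bench_shorten_to_bytes_width(text):
    return shorten_to_bytes_width(text, len(text.encode()) // 2)

arguments = {size: random_text(size) for size in (2**exp for exp in range(4, 16))}
b = benchmark([bench_shorten_bytes, bench_shorten_to_bytes_width],
              arguments, 'string length')
b.plot()  # requires matplotlib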
I also did a second benchmark excluding the shorten_to_bytes_width function so I could benchmark even longer strings.

Here is a solution that tries to solve this problem directly, without playing trial and error with textwrap.shorten() using different input strings. It uses a recursive algorithm based on educated guesses about the minimum and maximum length of the string. Partial solutions (based on the guessed minimum length) are used to reduce the problem size quickly.
The solution has two parts (a sketch of both follows this list):

- bytes_to_char_length() computes the maximum number of chars in a string that fit within a number of bytes (see below for examples of how it works).
- shorten_to_bytes(), which uses the result of bytes_to_char_length() to calculate the placeholder position.
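One possible implementation consistent with the walkthrough below (the walkthrough calls the first parameter input; MAX_BYTES_BY_CHAR is 4 for UTF-8; exact details are illustrative):

MAX_BYTES_BY_CHAR = 4  # UTF-8 needs at most 4 bytes per character

def bytes_to_char_length(text, bytes_left, start=0, max_length=None):
    # Maximum number of characters of text[start:] that fit in bytes_left bytes.
    if max_length is None:
        # A character needs at least 1 byte, so at most bytes_left characters fit.
        max_length = min(len(text) - start, bytes_left)
    if max_length <= 0 or bytes_left <= 0:
        return 0
    bytes_too_much = len(text[start:start + max_length].encode()) - bytes_left
    if bytes_too_much <= 0:
        # Everything we were allowed to take fits.
        return max_length
    # Removing one character frees at least 1 byte, so at least
    # max_length - bytes_too_much characters certainly fit ...
    min_length = max(0, max_length - bytes_too_much)
    # ... and it frees at most MAX_BYTES_BY_CHAR bytes, so the upper
    # bound shrinks by the ceiling of bytes_too_much / MAX_BYTES_BY_CHAR.
    max_length -= -(-bytes_too_much // MAX_BYTES_BY_CHAR)
    # Keep the guaranteed partial solution and recurse on the rest.
    consumed = len(text[start:start + min_length].encode())
    return min_length + bytes_to_char_length(
        text, bytes_left - consumed, start + min_length, max_length - min_length)

def shorten_to_bytes(text, width, placeholder='[...]'):
    text = ' '.join(text.split())  # normalize whitespace like textwrap.shorten
    if len(text.encode()) <= width:
        return text
    placeholder_bytes = len(placeholder.encode())
    if width < placeholder_bytes + 1:
        raise ValueError('placeholder too large for width')
    # Budget for the kept text: the width minus placeholder and joining space.
    cut = bytes_to_char_length(text, width - placeholder_bytes - 1)
    head = text[:cut]
    if cut < len(text) and text[cut] != ' ':
        # The cut landed inside a word; drop the partial word.
        head = head.rsplit(' ', 1)[0] if ' ' in head else ''
    head = head.rstrip()
    return head + ' ' + placeholder if head else placeholder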
Examples for how bytes_to_char_length() works

For illustration purposes, let's assume each digit in the string is encoded to its value in bytes. So '1', '2', '3', '4' take 1, 2, 3 and 4 Bytes respectively.

For bytes_to_char_length('11111', 3) we'll get:
- max_length is set to 3 by default.
- input[start:start + max_length] = '111', which has 3 Bytes, so bytes_too_much = 0.
- Since bytes_too_much is not positive, max_length == 3 is returned.

For bytes_to_char_length('441111', 10):
- max_length is set to 6.
- input[start:start + max_length] = '441111' with 12 Bytes, so bytes_too_much = 2.
- min_length is set to max_length - 2 == 4. (It takes at maximum 2 characters to take up 2 Bytes.)
- max_length is reduced by 1. (It takes at least 1 character to take 2 Bytes.)
- The 4 guaranteed characters ('4411') already consume all 10 Bytes, so the function recurses with bytes_left = 0, max_length = 1.
- The recursive call returns 0 because there are no bytes left. The result is min_length + 0 == 4.

For bytes_to_char_length('111144', 10):
- max_length is set to 6 (as before).
- input[start:start + max_length] = '111144' with 12 Bytes, so bytes_too_much = 2.
- min_length is set to max_length - 2 == 4.
- max_length is reduced by 1.
- The 4 guaranteed characters ('1111') consume only 4 Bytes, so the function recurses with new_start = 4, remaining_bytes = 6, max_length = 1, i.e. it returns 4 + bytes_to_char_length('111144', 6, start=4, max_length=1).
- In the recursive call, input[start:start + max_length] = '4' with 4 Bytes, so bytes_too_much = -2.
- Since at most max_length == 1 more characters may be taken, the recursive call returns 1, and 5 is returned as the overall result.

Formally it makes the following assumptions:
- Every character takes at least 1 and at most MAX_BYTES_BY_CHAR Bytes in the encoded string.
- If you split the string s into substrings s == s1 + s2, then s.encode() == s1.encode() + s2.encode().
I will propose a naive solution: loop over the characters and check the length of each encoded character with len(text[index].encode()). I also added timings for the improvement proposed in this comment.
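Such a loop could look like this (a sketch; naive_shorten and its details are illustrative):

def naive_shorten(text, width, placeholder=' [...]'):
    text = ' '.join(text.split())
    if len(text.encode()) <= width:
        return text
    budget = width - len(placeholder.encode())
    if budget < 0:
        raise ValueError('placeholder too large for width')
    used = index = 0
    while index < len(text):
        char_bytes = len(text[index].encode())  # per-character byte count
        if used + char_bytes > budget:
            break
        used += char_bytes
        index += 1
    head = text[:index]
    if index < len(text) and text[index] != ' ':
        # Stopped mid-word: drop the partial word.
        head = head[:head.rfind(' ')] if ' ' in head else ''
    head = head.rstrip()
    return head + placeholder if head else placeholder.lstrip()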
This solution is inefficient, but it does appear to always work correctly and without ever shortening excessively. It serves as a canonical baseline for testing any efficient solutions.
It first shortens pretending that the text is an ASCII string; this can shorten insufficiently but never excessively. It then inefficiently shortens one character at a time, and no more than necessary.
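A sketch of this approach (the early return of the bare placeholder for very short texts is an added assumption):

import textwrap

def shorten_to_bytes_width(text: str, width: int) -> str:
    # Allow at least the width of the whitespace-stripped placeholder.
    width = max(width, 5)
    # First pass: shorten as if each character were one byte (ASCII).
    # A character takes at least one byte, so this never cuts too much.
    text = textwrap.shorten(text, width)
    # Second pass: while the encoded form is still too long, shorten by
    # one character at a time.
    while len(text.encode()) > width:
        if len(text) <= 5:
            return '[...]'  # Only the placeholder fits (assumed guard).
        text = textwrap.shorten(text, len(text) - 1)
    return text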
Credit: Thanks to Sanyash for an improvement.
Test
Testing a candidate answer
Any candidate answer can be tested by comparing its outputs with the outputs of my function for width of range(50, -1, -1), or at minimum range(50, 5, -1). Given a candidate function, the code below implements the unit test (a sketch; candidate stands for the function under test):
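import unittest

class TestShortenToBytesWidth(unittest.TestCase):
    def test_against_baseline(self):
        # `candidate` is the function under test; define or import it first.
        text = '☺ Ilsa, le méchant ☺ ☺ gardien ☺'
        for width in range(50, -1, -1):
            expected = shorten_to_bytes_width(text, width)
            actual = candidate(text, width)
            self.assertEqual(actual, expected, f'width={width}')
            self.assertLessEqual(len(actual.encode()), max(width, 5))

if __name__ == '__main__':
    unittest.main()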