Emacs, unicode, xterm mouse escape sequences, and

Posted 2020-02-23 08:15

Question:

Short version: When using emacs' xterm-mouse-mode, Somebody (emacs? bash? xterm?) intercepts xterm's control sequences and replaces them with \0. This is a pain on wide monitors because only the first 223 columns get mouse support.

What is the culprit, and how can I work around it?

From what I can tell this has something to do with Unicode/UTF-8 support, because it wasn't a problem 5-6 years ago when I last had a big monitor.

Gory details follow...

Thanks!

Emacs xterm-mouse-mode has a well-known weakness handling mouse clicks starting around x=95. A workaround, adopted by recent versions of emacs, pushes the problem off to x=223.

Several years ago I figured out that xterm encodes positions in 7-bit octets. Given position 'x' to encode, with X=x-96, send:

\40+x (x < 96)  
\300+X/64 \200+X%64 (otherwise)  

We have to add one to the x position reported by emacs, because positions in xterm start at one, not zero. Hence the magic number x=95: it is coded as "\300\200", the first escaped number. Somebody (emacs? bash? xterm?) treats those like "C0" control sequences from ISO 2022. Starting at x=159, we change to "C1" sequences (\301\200), which are also part of ISO 2022.
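For anyone who wants to check the arithmetic, here is a minimal C sketch of the two-octet scheme exactly as described above. It models what I observed, not xterm's actual source, and the function name `encode_observed` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Encode a 1-based xterm column x with the scheme described above.
 * Writes 1 or 2 octets to out and returns how many were written.
 * This models the observed behaviour; it is not taken from xterm. */
size_t encode_observed(int x, unsigned char out[2])
{
    assert(x >= 1);
    if (x < 96) {
        out[0] = (unsigned char)(040 + x);       /* \40 + x */
        return 1;
    }
    int X = x - 96;
    out[0] = (unsigned char)(0300 + X / 64);     /* \300 + X/64 */
    out[1] = (unsigned char)(0200 + X % 64);     /* \200 + X%64 */
    return 2;
}
```

With this model, emacs column 95 (xterm x=96) comes out as \300\200, emacs column 159 as \301\200, and emacs column 223 as \302\200, which matches the thresholds mentioned in the text.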

Trouble hits with \302 sequences, which corresponds to the current x=223 limit. Several years ago I was able to extend the hack to intercept \302 and \303 sequences manually, which got past the problem. Fast forward a few years, and today I find that I'm stuck back at x=223 because Somebody is replacing those sequences with \0.

So, where I'd expect clicking at line 1, col 250 to produce

ESC [ M SPC \303\207 ! ESC [ M # \303\207 !

Instead emacs reports (for any col > 223)

ESC [ M SPC C-@ ! ESC [ M # C-@ !

I suspect that Unicode/UTF-8 support is the culprit. Some digging shows that the Unicode standard allowed C0 and C1 sequences as part of UTF-8 until Nov 2000, and I guess Somebody didn't get the memo (fortunately). However, \302\200 - \302\237 are Unicode control sequences, so Somebody slurps them up (doing who-knows-what with them!) and returns \0 instead.

Some more detailed questions:
- Who is this Somebody that intercepts the codes before they reach emacs' lossage buffer?
- If it's really just about control sequences, how come characters after \302\237, which are UTF-8 encodings of printable Unicode, also come back as \0 ?
- What makes emacs decide whether to display lossage as unicode characters or octal escape sequences, and why don't the two match? For example, my self-built cygwin emacs 23.2.1 (xterm 229) reports \301\202 for column 161, but my rhel5.5-supplied emacs 22.3.1 (xterm 215) reports "Â" (latin A with circumflex), which is actually \303\202 in UTF-8!

Update:

Here's a patch against xterm-261 which makes it emit mouse positions in utf-8 format:

diff -r button.c button.utf-8-fix.c
--- a/button.c  Sat Aug 14 08:23:00 2010 +0200
+++ b/button.c  Thu Aug 26 16:16:48 2010 +0200
@@ -3994,1 +3994,27 @@
-#define MOUSE_LIMIT (255 - 32)
+#define MOUSE_LIMIT (2047 - 32)
+#define MOUSE_UTF_8_START (127 - 32)
+
+static unsigned
+EmitMousePosition(Char line[], unsigned count, int value)
+{
+    /* Add pointer position to key sequence
+     * 
+     * Encode large positions as two-byte UTF-8 
+     *
+     * NOTE: historically, it was possible to emit 256, which became
+     * zero by truncation to 8 bits. While this was arguably a bug,
+     * it's also somewhat useful as a past-end marker so we keep it.
+     */
+    if(value == MOUSE_LIMIT) {
+       line[count++] = CharOf(0);
+    }
+    else if(value < MOUSE_UTF_8_START) {
+       line[count++] = CharOf(' ' + value + 1);
+    }
+    else {
+       value += ' ' + 1;
+       line[count++] = CharOf(0xC0 + (value >> 6));
+       line[count++] = CharOf(0x80 + (value & 0x3F));
+    }
+    return count;
+}
@@ -4001,1 +4027,1 @@
-    Char line[6];
+    Char line[9]; /* \e [ > M Pb Pxh Pxl Pyh Pyl */
@@ -4021,2 +4047,0 @@
-    else if (row > MOUSE_LIMIT)
-       row = MOUSE_LIMIT;
@@ -4028,1 +4052,5 @@
-    else if (col > MOUSE_LIMIT)
+
+    /* Limit to representable mouse dimensions */
+    if (row > MOUSE_LIMIT)
+       row = MOUSE_LIMIT;
+    if (col > MOUSE_LIMIT)
@@ -4090,2 +4118,2 @@
-       line[count++] = CharOf(' ' + col + 1);
-       line[count++] = CharOf(' ' + row + 1);
+       count = EmitMousePosition(line, count, col);
+       count = EmitMousePosition(line, count, row);
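
To make the encoding easier to experiment with outside of xterm, here is a standalone C sketch of the same logic as `EmitMousePosition` in the patch. The names `emit_position`, `SKETCH_LIMIT` and `SKETCH_UTF8_START` are invented for this sketch, and plain `unsigned char` stands in for xterm's `Char` type.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_LIMIT      (2047 - 32)  /* same value as MOUSE_LIMIT in the patch */
#define SKETCH_UTF8_START (127 - 32)   /* same value as MOUSE_UTF_8_START */

/* Append the encoding of a 0-based position to buf at index count.
 * Returns the new count. The caller must leave room for 2 bytes. */
size_t emit_position(unsigned char *buf, size_t count, int value)
{
    assert(value >= 0 && value <= SKETCH_LIMIT);
    if (value == SKETCH_LIMIT) {
        buf[count++] = 0;                          /* past-end marker */
    } else if (value < SKETCH_UTF8_START) {
        buf[count++] = (unsigned char)(' ' + value + 1);
    } else {
        int v = value + ' ' + 1;                   /* 128..2047 */
        buf[count++] = (unsigned char)(0xC0 + (v >> 6));
        buf[count++] = (unsigned char)(0x80 + (v & 0x3F));
    }
    return count;
}
```

For example, position 94 still fits in the single byte 0x7F, position 95 becomes 0xC2 0x80, and position 250 becomes 0xC4 0x9B.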

Hopefully this (or something like it) will appear in a future version of xterm... the patch makes xterm work out of the box with emacs-23 (which assumes utf-8 input) and fixes the existing problems with xt-mouse.el also. To use it with emacs-22 requires a redefinition of the function it uses to decode mouse positions (the new definition works fine with emacs-23 also):

(defadvice xterm-mouse-event-read (around utf-8 compile activate)
  (setq ad-return-value
        (let ((c (read-char)))
          (cond
           ;; mouse clicks outside the encodable range produce 0
           ((= c 0) #x800)
           ;; must convert UTF-8 to unicode ourselves
           ((and (>= c #xC2) (< emacs-major-version 23))
            (logior (lsh (logand c #x1F) 6) (logand (read-char) #x3F)))
           ;; normal case
           (t c)))))

Distribute the advice as part of the .emacs on all machines you log into, and patch the xterm on any machines you work from. Voila!
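
For applications other than emacs, the receiving side might look roughly like the C sketch below. It follows the same cases as the advice, with one difference: it subtracts the 33 offset itself and returns a 0-based position, whereas the advice returns the raw character and leaves the subtraction to its caller. `read_position` and `SKETCH_LIMIT` are hypothetical names, not part of xterm or emacs.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_LIMIT (2047 - 32)   /* same value as MOUSE_LIMIT in the patch */

/* Decode one position from buf (len bytes available).
 * Stores the 0-based position in *value and returns the number of
 * bytes consumed, or 0 if the input is truncated or malformed. */
size_t read_position(const unsigned char *buf, size_t len, int *value)
{
    assert(buf != NULL && value != NULL);
    if (len == 0)
        return 0;

    unsigned char c = buf[0];
    if (c == 0) {                        /* past-end marker */
        *value = SKETCH_LIMIT;
        return 1;
    }
    if (c >= 0xC2 && c <= 0xDF) {        /* lead byte of a two-byte sequence */
        if (len < 2 || (buf[1] & 0xC0) != 0x80)
            return 0;
        *value = (((c & 0x1F) << 6) | (buf[1] & 0x3F)) - (' ' + 1);
        return 2;
    }
    if (c > ' ' && c < 0x80) {           /* single byte, same as before the patch */
        *value = c - (' ' + 1);
        return 1;
    }
    return 0;
}
```

Returning the number of bytes consumed lets the caller read the column and the row back to back from the same buffer.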

WARNING: Applications which use xterm's mouse modes but do not treat their input as utf-8 will get confused by this patch because the mouse escape sequences get longer. However, those applications break horribly with the current xterm because mouse positions with x > 95 look like utf-8 codes but aren't. I'd create a new mouse mode for xterm, but certain applications (gnu screen!) filter out unknown escape sequences. Emacs is the only terminal-mouse app I use, so I consider the patch a net win, but YMMV.

Answer 1:

xterm-262 includes the patch inlined above; however, this patch is quite broken by design. Rxvt-unicode's developers realized this and added another, much better extension for reporting mouse coordinates.

Right now I'm working on getting widespread support for this. Rxvt-unicode and iTerm2 already support both extensions. I created patches for xterm (to support the urxvt extension), and for gnome-terminal, konsole and putty to support both new extensions. As for the applications, I've added support for the urxvt extension to Midnight Commander.

Please join me in my effort and try to convince more terminal developers and applications to implement these extensions (at least the urxvt one, because the other one can't be properly automatically recognized by applications).

See http://www.midnight-commander.org/ticket/2662 for technical details and further pointers.



Answer 2:

OK, figured it out. There are actually two issues.

First, some source diving shows that xterm clips the mouse-enabled region of the window to 223x223 chars, and sends 0x0 for all other positions.

Second, emacs-23 is UTF-8 aware and gets confused by mouse events having x>160 and y>94; in those cases xterm's encoding for x and y looks like a two-byte UTF-8 character (e.g. 0xC2 0x80) and as a result the mouse sequence seems one character short.
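
The thresholds can be checked with a small C sketch. It assumes 0-based positions and the unpatched single-byte encoding `' ' + pos + 1`, and classifies the resulting byte the way a UTF-8 decoder would see it. The enum and the function name `classify_legacy_byte` are made up for illustration.

```c
#include <assert.h>

enum byte_kind { BYTE_ASCII, BYTE_CONTINUATION, BYTE_LEAD2, BYTE_OTHER };

/* Classify the single byte that unpatched xterm sends for a 0-based
 * position, from the point of view of a UTF-8 decoder. */
enum byte_kind classify_legacy_byte(int pos)
{
    assert(pos >= 0 && pos <= 222);      /* 223 and up no longer fit in a byte */
    unsigned char b = (unsigned char)(' ' + pos + 1);
    if (b < 0x80)
        return BYTE_ASCII;
    if (b < 0xC0)
        return BYTE_CONTINUATION;        /* 0x80..0xBF */
    if (b >= 0xC2 && b < 0xE0)
        return BYTE_LEAD2;               /* starts a valid two-byte sequence */
    return BYTE_OTHER;                   /* 0xC0, 0xC1, 0xE0 and above */
}
```

Under those assumptions, a column above 160 yields a two-byte lead byte and a row above 94 yields a continuation byte, so the pair can be swallowed as a single character and the mouse sequence ends up one character short.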

I'm working on a patch for xterm to make mouse events emit UTF-8 (which would both unconfuse emacs-23 and allow terminals up to 2047x2047), but I'm not sure yet how it will turn out.



Answer 3:

I think the problem that caused your workaround (and the upstream fix that was included in one of the v22 releases) to stop working in 23.2 is within Emacs itself. 23.1 can handle mouse clicks after column 95 using urxvt, gnu screen, putty or iTerm, but 23.2 can't. Setting everything to latin-1 makes no difference. 23.1 has the same code in xt-mouse.el. src/lread.c and src/character.h changed, however, and at a glance I'd guess the bug is in there somewhere. As to what happens after column 223, I've got no clue.

For the benefit of anyone else who's annoyed by the xt-mouse regression in 23.2 here's a modified version of xterm-mouse-event-read that works with mouse clicks up to col 222 (credit to Ryan for the >222 overflow handling which my original fix lacked). This probably won't work in 23.1 or before.

(defun xterm-mouse-event-read ()
  (let ((c (read-char)))
    (cond
     ;; for positions past col 222 emacs just delivers
     ;; 0x0, best we can do is stay at eol
     ((= c 0) #x100)
     ;; values 0..255 pass through unchanged
     ((= 0 (logand c (- #x100))) c)
     ;; anything larger: keep only the low 8 bits
     ((logand c #xff)))))

... Edit: Here's the version from Emacs 24 (bzr head). It works again in 23.2 up to col 222, but lacks the >222 overflow eol handling Ryan suggested:

(defun xterm-mouse-event-read ()
  (let ((c (read-char)))
    (if (> c #x3FFF80)
        (+ 128 (- c #x3FFF80))
      c)))


Answer 4:

While xterm now works in utf-8 mode with a patch, this utf-8 hack will break in the worst possible way in any other locale, as the unicode characters will just be dropped unless representable.

rxvt-unicode has (in releases after 9.09) a 1015 mode that sends replies of the form "ESC [ code ; x ; y M", using decimal numbers. This has the advantage of not needing any probing from apps and also working in non-utf-8 locales.
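
As a rough illustration of why the decimal form is easy to consume, here is a C sketch that parses a reply of the shape "ESC [ code ; x ; y M". The function name `parse_urxvt_mouse` is made up, and the sketch only extracts the three numbers; how `code` maps to buttons and modifiers is not covered here. Note that `sscanf` is lenient (it accepts signs and leading whitespace before each number), so a real parser may want stricter checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Parse "ESC [ code ; x ; y M" with decimal numbers.
 * Returns the number of characters consumed, or 0 if s does not
 * start with a complete reply of that form. */
int parse_urxvt_mouse(const char *s, int *code, int *x, int *y)
{
    assert(s != NULL && code != NULL && x != NULL && y != NULL);
    char final = 0;
    int consumed = 0;
    /* %n records how many characters were read; it is not counted
     * in sscanf's return value, so a full match returns 4. */
    if (sscanf(s, "\033[%d;%d;%d%c%n", code, x, y, &final, &consumed) != 4)
        return 0;
    if (final != 'M')
        return 0;
    return consumed;
}
```

Because every field is plain ASCII, the reply survives any locale and there is no upper limit baked into the encoding.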