Remote-shell protocols traditionally work by conveying a byte-stream from the server to the client, to be interpreted by the client's terminal. (This includes TELNET, RLOGIN, and SSH.) Mosh works differently and at a different layer. With Mosh, the server and client both maintain a snapshot of the current screen state. The problem becomes one of state-synchronization: getting the client to the most recent server-side screen as efficiently as possible.
This is accomplished using a new protocol called the State Synchronization Protocol, for which Mosh is the first application. SSP runs over UDP, synchronizing the state of any object from one host to another. Datagrams are encrypted and authenticated using AES-128 in OCB3 mode. While SSP takes care of the networking protocol, it is the implementation of the object being synchronized that defines the ultimate semantics of the protocol.
Roaming with SSP becomes easy: the client sends datagrams to the server with increasing sequence numbers, including a "heartbeat" at least once every three seconds. Every time the server receives an authentic packet from the client with a sequence number higher than any it has previously received, the IP source address of that packet becomes the server's new target for its outgoing packets. By doing roaming “statelessly” in this manner, roaming works in and out of NATs, even ones that may themselves be roaming. Roaming works even when the client is not aware that its Internet-visible IP address has changed. The heartbeats allow Mosh to inform the user when it hasn't heard from the server in a while (unlike SSH, where users may be unaware of a dropped connection until they try to type).
Mosh runs two copies of SSP, one in each direction of the connection. The connection from client to server synchronizes an object that represents the keys typed by the user, and with TCP-like semantics. The connection from server to client synchronizes an object that represent the current screen state, and the goal is always to convey the client to the most recent server-side state, possibly skipping intermediate frames.
Because SSP works at the object layer and can control the rate of synchronization (in other words, the frame rate), it does not need to send every byte it receives from the application. That means Mosh can regulate the frames so as not to fill up network buffers, retaining the responsiveness of the connection and making sure Control-C always works quickly. Protocols that must send every byte can't do this.
Careful terminal emulation
One benefit of working at the terminal layer was the opportunity to build a clean UTF-8 terminal emulator from scratch. Mosh fixes several Unicode bugs in existing terminals and in SSH, and was designed as a fresh start to try to be robust and correct even for pathological inputs.
Tricky unicode Only Mosh and the OS X Terminal correctly handle a Unicode combining character in the first column.
xterm: circumflex on wrong letter. xterm: circumflex on wrong letter.
GNOME Terminal: no circumflex at all. GNOME Terminal: no circumflex at all.
OS X Terminal.app gets it right. OS X Terminal.app gets it right.
Mosh gets it right too. Mosh gets it right too. ISO 2022 locking escapes Only Mosh will never get stuck in hieroglyphs when a nasty program writes to the terminal. (See Markus Kuhn's discussion of the relationship between ISO 2022 and UTF-8.)
xterm xterm
GNOME Terminal GNOME Terminal
OS X Terminal.app OS X Terminal.app
Mosh Mosh Evil escape sequences Only Mosh and GNOME Terminal have a defensible rendering when Unicode mixes with an ECMA-48/ANSI escape sequence. The OS X Terminal unwisely tries to normalize its input before the vt500 state machine, causing it to misinterpret and become unusable after receiving the following input!* (This also means the OS X Terminal's interpretation of the incoming octet stream varies depending on how the incoming octets are split across TCP segments, because the normalization only looks ahead to available bytes.) * We earlier wrote that this misbehaving sequence "crashes" the OS X Terminal.app. This was mistaken—instead, Terminal.app interprets the escape sequence as shutting off keyboard input, and because of an unrelated bug in Terminal.app, it is not possible for the user to restore keyboard input by resetting the terminal from the menu.
xterm: circumflex on wrong letter. xterm: circumflex on wrong letter.
GNOME Terminal's circumflex placement is defensible. GNOME Terminal's circumflex placement is defensible.
OS X Terminal.app applies circumflex to part of escape sequence, then irretrievably shuts off keyboard input. OS X Terminal.app applies circumflex to part of escape sequence, then irretrievably shuts off keyboard input.
Mosh gets this one right. Mosh gets this one right. Mosh sets IUTF8 In the POSIX framework, the kernel needs to know whether the user is typing in an 8-bit character set or in UTF-8, because in canonical mode (i.e. "cooked" mode), the kernel needs to be able to delete a typed multibyte character sequence from an input buffer. On OS X and Linux, this is done with the "IUTF8" termios flag.) (See diagnostic explaining the need for this flag.) Mosh sets the IUTF8 flag when possible and stubbornly refuses to start up unless the user has a UTF-8-clean environment. SSH does not set the IUTF8 flag, which can lead to garbage in input buffers.
Instant local echo and line editing
The other major benefit of working at the terminal-emulation layer is that the Mosh client is free to scribble on the local screen without lasting consequence. We use this to implement intelligent local echo. The client runs a predictive model in the background of the server's behavior, hypothesizing that each keystroke will be echoed at the cursor location and that the backspace and left- and right-arrow keys will have their traditional effect. But only when a prediction is confirmed by the server are these effects actually shown to the user. (In addition, by default predictions are only displayed on high-delay connections or during a network “glitch.”) Predictions are done in epochs: when the user does something that might alter the echo behavior — like hit ESC or carriage return or an up- or down-arrow — Mosh goes back into making background predictions until a prediction from the new batch can be confirmed as correct.
Thus, unlike previous attempts at local echo with TELNET and RLOGIN, Mosh's local echo can be used everywhere, even in full-screen programs like emacs and vi.
Real-world benefits
We evaluated Mosh using traces contributed by six users, covering about 40 hours of real-world usage and including 9,986 total keystrokes. These traces included the timing and contents of all writes from the user to the host and vice versa. The users were asked to contribute "typical, real-world sessions." In practice, the traces include use of popular programs such as the bash shell and zsh shells, the alpine and mutt e-mail clients, the emacs and vim text editors, the irssi and barnowl chat clients, the links text-mode Web browser, and several programs unique to each user.
To evaluate typical usage of a "mobile" terminal, we replayed the traces over an otherwise unloaded Sprint commercial EV-DO (3G) cellular Internet connection in Cambridge, Mass. A client-side process played the user portion of the traces, and a server-side process waited for the expected user input and then replied (in time) with the prerecorded server output. We speeded up long periods with no activity. The average round-trip time on the link was about half a second.
We replayed the traces over two different transports, SSH and Mosh, and recorded the user interface response latency to each simulated user keystroke. The Mosh predictive algorithm was frozen prior to collecting the traces and was not adjusted in response to their contents or results.
The results
Cumulative distribution of keystroke response times with Sprint EV-DO (3G) Internet service
Mosh reduced the median keystroke response time from 503 ms to nearly instant (because more than 70% of the keystrokes could be immediately displayed), and reduced the mean keystroke response time from 515 ms to 173 ms. Qualitatively, Mosh makes remote servers "feel" more like the local machine!