Xterm.js: Buffer performance improvements

Created on 13 Jul 2017  ·  73 Comments  ·  Source: xtermjs/xterm.js

Problem

Memory

Right now our buffer takes up too much memory, particularly for an application that launches multiple terminals with large scrollbacks set. For example, the demo using a 160x24 terminal with a 5000-line scrollback filled takes around 34 MB of memory (see https://github.com/Microsoft/vscode/issues/29840#issuecomment-314539964); remember that's just a single terminal, and 1080p monitors would likely use wider terminals. Also, in order to support truecolor (https://github.com/sourcelair/xterm.js/issues/484), each character will need to store 2 additional number types, which will almost double the current memory consumption of the buffer.
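To put the memory claim in perspective, here is a rough back-of-envelope derived from the numbers above (the per-cell figure is inferred, not measured independently):

```typescript
// Back-of-envelope for the 160x24 terminal with 5000 scrollback example.
const cols = 160;
const rows = 24;
const scrollback = 5000;
const cells = cols * (rows + scrollback); // 803,840 cells total

// With an observed footprint of ~34 MB, that works out to roughly:
const bytesPerCell = (34 * 1024 * 1024) / cells; // ~44 bytes per cell
console.log(cells, Math.round(bytesPerCell));
```

Dozens of bytes per cell is a lot for what is conceptually one character plus a handful of attribute bits, which is what motivates the solutions below.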

Slow fetching of a row's text

There is also the problem of needing to fetch the actual text of a line swiftly. This is slow because of the way the data is laid out: a line contains an array of characters, each holding a single-character string, so we construct the full string on demand and it becomes eligible for garbage collection immediately afterwards. Previously we didn't need to do this at all because the text was pulled from the line buffer (in order) and rendered to the DOM. However, it is becoming increasingly useful as we improve xterm.js further; features like selection and links both pull this data. Again using the 160x24/5000 scrollback example, it takes 30-60ms to copy the entire buffer on a Mid-2014 MacBook Pro.

Supporting the future

Another potential problem is that in the future we may introduce a view model which needs to duplicate some or all of the data in the buffer. This sort of thing will be needed to implement reflow (https://github.com/sourcelair/xterm.js/issues/622) properly (https://github.com/sourcelair/xterm.js/pull/644#issuecomment-298058556) and may also be needed to properly support screen readers (https://github.com/sourcelair/xterm.js/issues/731). It would certainly be good to have some wiggle room when it comes to memory.

This discussion started in https://github.com/sourcelair/xterm.js/issues/484; this issue goes into more detail and proposes some additional solutions.

I'm leaning towards solution 3 and moving towards solution 5 if there is time and it shows a marked improvement. Would love any feedback! /cc @jerch, @mofux, @rauchg, @parisk

1. Simple solution

This is basically what we're doing now, just with truecolor fg and bg added.

// [0]: the character
// [1]: width
// [2]: attributes
// [3]: truecolor bg
// [4]: truecolor fg
type CharData = [string, number, number, number, number];

type LineData = CharData[];

Pros

  • Very simple

Cons

  • Too much memory consumed, would nearly double our current memory usage which is already too high.

2. Pull text out of CharData

This would store the string against the line rather than in each character. This would probably see very large gains in selection and linkifying, and quick access to a line's entire string will become more useful as time goes on.

interface ILineData {
  // This would provide fast access to the entire line which is becoming more
  // and more important as time goes on (selection and links need to construct
  // this currently). This would need to reconstruct text whenever charData
  // changes though. We cannot lazily evaluate text due to the chars not being
  // stored in CharData
  text: string;
  charData: CharData[];
}

// [0]: charIndex
// [1]: attributes
// [2]: truecolor bg
// [3]: truecolor fg
type CharData = Int32Array;

Pros

  • No need to reconstruct the line whenever we need it.
  • Lower memory than today due to the use of an Int32Array

Cons

  • Slow to update individual characters, the entire string would need to be regenerated for single character changes.
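A small sketch of why single-character updates are slow in this layout (names are illustrative, not the real xterm.js API): the line's text lives outside CharData, so every cell write forces an O(n) string rebuild:

```typescript
// Sketch of the update cost in solution 2: the text string is stored on
// the line, so changing one character means rebuilding the whole string.
interface ILineData {
  text: string;
  charData: Int32Array[]; // [charIndex, attributes, bg, fg] per cell
}

function setChar(line: ILineData, col: number, ch: string): void {
  // O(line length) string rebuild for a single-cell change.
  line.text = line.text.slice(0, col) + ch + line.text.slice(col + 1);
  line.charData[col][0] = col; // charIndex still points into text
}

const line: ILineData = {
  text: "hello",
  charData: [0, 1, 2, 3, 4].map(i => new Int32Array([i, 0, 0, 0]))
};
setChar(line, 1, "a");
console.log(line.text); // "hallo"
```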

3. Store attributes in ranges

Pulling the attributes out and associating them with a range. Since there can never be overlapping attributes, this can be laid out sequentially.

type LineData = CharData[]

// [0]: The character
// [1]: The width
type CharData = [string, number];

class CharAttributes {
  public readonly _start: [number, number];
  public readonly _end: [number, number];
  private _data: Int32Array;

  // Getters pull data from _data (woo encapsulation!)
  public get flags(): number;
  public get truecolorBg(): number;
  public get truecolorFg(): number;
}

class Buffer extends CircularList<LineData> {
  // Sorted list since items are almost always pushed to end
  private _attributes: CharAttributes[];

  public getAttributesForRows(start: number, end: number): CharAttributes[] {
    // Binary search _attributes and return all visible CharAttributes to be
    // applied by the renderer
  }
}
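A sketch of the binary search the comment above refers to, simplified to 1-D row positions and illustrative names; the real implementation would work on [row, col] pairs:

```typescript
// Find all attribute ranges overlapping [start, end). Since ranges never
// overlap and _attributes is kept sorted, both starts and ends are sorted.
interface Range { start: number; end: number; }

function getAttributesForRows(attrs: Range[], start: number, end: number): Range[] {
  // Binary search for the first range whose end lies past `start`.
  let lo = 0, hi = attrs.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (attrs[mid].end <= start) lo = mid + 1; else hi = mid;
  }
  // Collect ranges until one starts at or after `end`.
  const out: Range[] = [];
  for (let i = lo; i < attrs.length && attrs[i].start < end; ++i) out.push(attrs[i]);
  return out;
}

const ranges = [{ start: 0, end: 5 }, { start: 5, end: 9 }, { start: 9, end: 20 }];
console.log(getAttributesForRows(ranges, 6, 10).length); // 2
```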

Pros

  • Lower memory than today even though we're also storing truecolor data
  • Can optimize application of attributes, rather than checking every single character's attribute and diffing it to the one before
  • Encapsulates the complexity of storing the data inside an array (.flags instead of [0])

Cons

  • Changing attributes of a range of characters inside another range is more complex
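To illustrate that con with a minimal sketch (1-D positions and hypothetical names instead of the real [row, col] structures): applying a new attribute inside an existing range splits it into up to three ranges:

```typescript
// Restyling a sub-range of an existing attribute range splits it in three:
// the untouched head, the new middle, and the untouched tail.
interface AttrRange { start: number; end: number; attr: number; }

function applyAttr(ranges: AttrRange[], start: number, end: number, attr: number): AttrRange[] {
  const out: AttrRange[] = [];
  for (const r of ranges) {
    if (r.end <= start || r.start >= end) { out.push(r); continue; }
    // Keep the non-overlapping head/tail pieces, drop the overlapped middle.
    if (r.start < start) out.push({ start: r.start, end: start, attr: r.attr });
    if (r.end > end) out.push({ start: end, end: r.end, attr: r.attr });
  }
  out.push({ start, end, attr });
  out.sort((a, b) => a.start - b.start);
  return out;
}

// One range covering 0..10; restyle 3..5 → three ranges.
const result = applyAttr([{ start: 0, end: 10, attr: 1 }], 3, 5, 2);
console.log(result.length); // 3
```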

4. Put attributes in a cache

The idea here is to leverage the fact that there generally aren't that many styles in any one terminal session, so we should create as few as necessary and reuse them.

// [0]: the character
// [1]: width
type CharData = [string, number, CharAttributes];

type LineData = CharData[];

class CharAttributes {
  private _data: Int32Array;

  // Getters pull data from _data (woo encapsulation!)
  public get flags(): number;
  public get truecolorBg(): number;
  public get truecolorFg(): number;
}

interface ICharAttributeCache {
  // Never construct duplicate CharAttributes, figuring how the best way to
  // access both in the best and worst case is the tricky part here
  getAttributes(flags: number, fg: number, bg: number): CharAttributes;
}
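A minimal sketch of what such a cache could look like, assuming a string key built from the three values; as the comment above notes, picking the actual lookup strategy is the tricky part:

```typescript
// Sketch of ICharAttributeCache: identical styles share one object, so
// cells store a reference rather than their own copy of the attributes.
class CharAttributes {
  constructor(public flags: number, public fg: number, public bg: number) {}
}

class CharAttributeCache {
  private _map = new Map<string, CharAttributes>();

  getAttributes(flags: number, fg: number, bg: number): CharAttributes {
    const key = `${flags},${fg},${bg}`; // a packed-int key may be faster
    let attrs = this._map.get(key);
    if (!attrs) {
      attrs = new CharAttributes(flags, fg, bg);
      this._map.set(key, attrs);
    }
    return attrs;
  }
}

const cache = new CharAttributeCache();
console.log(cache.getAttributes(1, 2, 3) === cache.getAttributes(1, 2, 3)); // true
```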

Pros

  • Similar memory usage to today even though we're also storing truecolor data
  • Encapsulates the complexity of storing the data inside an array (.flags instead of [0])

Cons

  • Less memory savings than the ranges approach

5. Hybrid of 3 & 4

type LineData = CharData[]

// [0]: The character
// [1]: The width
type CharData = [string, number];

class CharAttributes {
  private _data: Int32Array;

  // Getters pull data from _data (woo encapsulation!)
  public get flags(): number;
  public get truecolorBg(): number;
  public get truecolorFg(): number;
}

interface CharAttributeEntry {
  attributes: CharAttributes,
  start: [number, number],
  end: [number, number]
}

class Buffer extends CircularList<LineData> {
  // Sorted list since items are almost always pushed to end
  private _attributes: CharAttributeEntry[];
  private _attributeCache: ICharAttributeCache;

  public getAttributesForRows(start: number, end: number): CharAttributeEntry[] {
    // Binary search _attributes and return all visible CharAttributeEntry's to
    // be applied by the renderer
  }
}

interface ICharAttributeCache {
  // Never construct duplicate CharAttributes, figuring how the best way to
  // access both in the best and worst case is the tricky part here
  getAttributes(flags: number, fg: number, bg: number): CharAttributes;
}

Pros

  • Potentially the fastest and most memory efficient
  • Very memory efficient when the buffer contains many blocks with styles but only from a few styles (the common case)
  • Encapsulates the complexity of storing the data inside an array (.flags instead of [0])

Cons

  • More complex than the other solutions; it may not be worth including the cache if we already keep a single CharAttributes per block?
  • Extra overhead in CharAttributeEntry object
  • Changing attributes of a range of characters inside another range is more complex

6. Hybrid of 2 & 3

This takes solution 3 but also adds a lazily evaluated text string for fast access to the line text. Since we're also storing the characters in CharData, we can evaluate it lazily.

type LineData = {
  text: string,
  charData: CharData[]
}

// [0]: The character
// [1]: The width
type CharData = [string, number];

class CharAttributes {
  public readonly _start: [number, number];
  public readonly _end: [number, number];
  private _data: Int32Array;

  // Getters pull data from _data (woo encapsulation!)
  public get flags(): number;
  public get truecolorBg(): number;
  public get truecolorFg(): number;
}

class Buffer extends CircularList<LineData> {
  // Sorted list since items are almost always pushed to end
  private _attributes: CharAttributes[];

  public getAttributesForRows(start: number, end: number): CharAttributes[] {
    // Binary search _attributes and return all visible CharAttributes to be
    // applied by the renderer
  }

  // If we construct the line, hang onto it
  public getLineText(line: number): string;
}

Pros

  • Lower memory than today even though we're also storing truecolor data
  • Can optimize application of attributes, rather than checking every single character's attribute and diffing it to the one before
  • Encapsulates the complexity of storing the data inside an array (.flags instead of [0])
  • Faster access to the actual line string

Cons

  • Extra memory due to hanging onto line strings
  • Changing attributes of a range of characters inside another range is more complex
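The lazy evaluation described above could be sketched like this (illustrative shape, not the proposed API): the text is rebuilt from CharData on first access and the cached string is invalidated on writes:

```typescript
type CharData = [string, number]; // [char, width]

// Sketch of solution 6's lazy line text: built once on demand, cached
// until the line changes.
class LineData {
  private _text: string | null = null;
  constructor(public charData: CharData[]) {}

  get text(): string {
    if (this._text === null) {
      this._text = this.charData.map(c => c[0]).join("");
    }
    return this._text;
  }

  setChar(col: number, ch: string, width: number): void {
    this.charData[col] = [ch, width];
    this._text = null; // invalidate; rebuilt lazily on next access
  }
}

const line = new LineData([["h", 1], ["i", 1]]);
console.log(line.text); // "hi" (built and cached)
line.setChar(1, "o", 1);
console.log(line.text); // "ho" (rebuilt after invalidation)
```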

Solutions that won't work

  • Storing the string as an int inside an Int32Array will not work as it takes far too long to convert the int back to a character.
Labels: area/performance, type/plan, type/proposal

Most helpful comment

Current state: (screenshot)

After: (screenshot)

All 73 comments

Another approach that could be mixed in: use indexeddb, websql or filesystem api to page out inactive scrollback entries to disk 🤔

Great proposal. I agree that 3. is the best way to go for now, as it lets us save memory while supporting true color as well.

If we reach there and things continue to go well, we can then optimize as proposed in 5. or in any other way that comes in our minds at that time and makes sense.

3. is great 👍.

@mofux, while there is definitely a use case for using disk-storage-backed techniques to reduce the memory footprint, this can degrade the user experience of the library in browser environments that ask the user for permission for using disk storage.

Regarding Supporting the future:
The more I think about it, the more the idea of having a WebWorker that does all the heavy work of parsing the tty data, maintaining the line buffers, matching links, matching search tokens and such appeals to me. Basically doing the heavy work in a separate background thread without blocking the ui. But I think this should be part of a separate discussion maybe towards a 4.0 release 😉

+100 on WebWorker in the future, but I think we need to change the list of browser versions we support, because not all of them can use it...

When I say Int32Array, this will be a regular array if it is not supported by the environment.

@mofux good thinking with WebWorker in the future 👍

@AndrienkoAleksandr yeah if we wanted to use WebWorker we would need to still support the alternative as well via feature detection.

Wow nice list :)

I also tend to lean towards 3. since it promises a big cut in memory consumption for over 90% of the typical terminal usage. Imho memory optimisation should be the main goal at this stage. Further optimization for specific use cases might be applicable on top of this (what comes to my mind: "canvas like apps" like ncurses and such will use tons of single cell updates and kinda degrade the [start, end] list over time).

@AndrienkoAleksandr yeah, I like the webworker idea too since it could lift _some_ burden from the main thread. Problem here (beside the fact that it might not be supported by all wanted target systems) is the _some_: the JS part is not such a big deal anymore with all the optimizations xterm.js has seen over time. The real performance issue is the layouting/rendering of the browser...

@mofux The paging thing to some "foreign memory" is a good idea, though it should be part of some higher abstraction and not of the "gimme an interactive terminal widget thing" that xterm.js is. This could be achieved by an addon imho.

Offtopic: Did some tests with arrays vs. typedarrays vs. asm.js. All I can say - OMG, it is like 1 : 1,5 : 10 for simple variable loads and sets (on FF even more). If pure JS speed really starts to hurt, "use asm" might be there for the rescue. But I would see this as a last resort since it would imply fundamental changes. And webassembly is not yet ready to ship.

Offtopic: Did some tests with arrays vs. typedarrays vs. asm.js. All I can say - OMG, it is like 1 : 1,5 : 10 for simple variable loads and sets (on FF even more)

@jerch to clarify, is that arrays vs typedarrays is 1:1 to 1:5?

Woops nice catch with the comma - i meant 10:15:100 speed wise. But only on FF typed arrays were slightly faster than normal arrays. asm is at least 10 times faster than js arrays on all browsers - tested with FF, webkit (Safari), blink/V8 (Chrome, Opera).

@jerch cool, a 50% speed up from typedarrays in addition to better memory would definitely be worth investing in for now.

Idea for memory saving - maybe we could get rid of the width for every character. Gonna try to implement a less expensive wcwidth version.

@jerch we need to access it quite a bit, and we can't lazy load it or anything because when reflow comes we will need the width of every character in the buffer. Even if it was fast we might still want to keep it around.

Might be better to make it optional, assuming 1 if it's not specified:

type CharData = [string, number?]; // not sure if this is valid syntax

[
  // 'a'
  ['a'],
  // '文'
  ['文', 2],
  // after wide
  ['', 0],
  ...
]

@Tyriar Yeah - well since Ive already written it, please have a look at it in PR #798
Speedup is 10 to 15 times on my computer, at the cost of 16k bytes for the lookup table. Maybe a combination of both is possible if still needed.

Some more flags we'll support in the future: https://github.com/sourcelair/xterm.js/issues/580

Another thought: Only the bottom portion of the terminal (Terminal.ybase to Terminal.ybase + Terminal.rows) is dynamic. The scrollback which makes up the bulk of the data is completely static, perhaps we can leverage this. I didn't know this until recently, but even things like delete lines (DL, CSI Ps M) don't bring the scrollback back down but rather insert another line. Similarly, scroll up (SU, CSI Ps S) deletes the item at Terminal.scrollTop and inserts an item at Terminal.scrollBottom.

Managing the bottom dynamic portion of the terminal independently and pushing to scrollback when the line is pushed out could lead to some significant gains. For example, the bottom portion could be more verbose to favor modifying of attributes, faster access, etc. whereas the scrollback can be more of an archival format as proposed in the above.
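A sketch of that split, with hypothetical types: a mutable, verbose viewport list plus an append-only, compact scrollback that lines are converted into as they scroll out:

```typescript
// Sketch of a buffer split into a mutable viewport and an immutable,
// archival scrollback (types are illustrative, not the xterm.js API).
interface ViewportLine { cells: [string, number][]; } // verbose, easy to mutate
interface ArchivedLine { text: string; }              // compact, never changes

class SplitBuffer {
  scrollback: ArchivedLine[] = [];
  viewport: ViewportLine[] = [];
  constructor(private rows: number) {}

  pushLine(line: ViewportLine): void {
    this.viewport.push(line);
    if (this.viewport.length > this.rows) {
      // Archive the scrolled-out line into the cheap format.
      const out = this.viewport.shift()!;
      this.scrollback.push({ text: out.cells.map(c => c[0]).join("") });
    }
  }
}

const buf = new SplitBuffer(2);
buf.pushLine({ cells: [["a", 1]] });
buf.pushLine({ cells: [["b", 1]] });
buf.pushLine({ cells: [["c", 1]] });
console.log(buf.scrollback.length, buf.scrollback[0].text); // 1 "a"
```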

Another thought: it's probably a better idea to restrict CharAttributeEntry to rows as that's how most applications seem to work. Also if the terminal is resized then "blank" padding is added to the right which doesn't share the same styles.

eg:

(screenshot: terminal showing red/green diff output)

To the right of the red/green diffs are unstyled "blank" cells.

@Tyriar
Any chance to get this issue back on the agenda? At least for output-intensive programs, a different way of holding the terminal data might save a lot of memory and time. Some hybrid of 2/3/4 would give a huge throughput boost if we could avoid splitting and saving single chars of the input string. Also, saving the attributes only when they change would help save memory.

Example:
With the new parser we could save a bunch of input characters without messing around with the attributes, since we know that they will not change in the middle of that string. The attributes of that string could be saved in some other data structure or attribute, along with wcwidths (yeah, we still need those to find the line breaks) and line breaks and stops. This would basically give up the cell model while data is incoming.
Problem arises if something steps in and wants to have a drill-down representation of the terminal data (e.g. the renderer or some escape sequence/user wants to move the cursor). We still have to do the cell calculations if that happens, but it should be sufficient to do this only for the content within the terminal cols and rows. (Not sure yet about the scrolled out content, that might be even further cachable and cheap to redraw.)

@jerch I'll be meeting @mofux one day in Prague in a couple of weeks and we were going to do/start some internal improvements of how text attributes are handled which covers this 😃

From https://github.com/xtermjs/xterm.js/pull/1460#issuecomment-390500944

The algo is kinda expensive since every char needs to be evaluated twice

@jerch if you have any ideas on faster access of text from the buffer, let us know. Currently most of it is just a single character as you know, but it could be an ArrayBuffer, a string, etc. I've been thinking we should think about taking more advantage of scrollback being immutable somehow.

Well I have experimented alot with ArrayBuffers in the past:

  • they are slightly worse than Array regarding runtime for the typical methods (maybe still less optimized by engine vendors)
  • new UintXXArray is far worse than literal array creation with []
  • they pay off several times if you can prealloc and reuse the data structure (up to 10 times), this is where the linked list nature of mixed arrays eats the performance due to heavy alloc and gc behind the scenes
  • for string data the back and forth conversion eats all the benefits; a pity JS does not provide a native string to Uint16Array converter (partly doable with TextEncoder, though)

My findings about ArrayBuffer suggest not to use it for string data due to the conversion penalty. In theory the terminal could use ArrayBuffers from node-pty all the way up to the terminal data (this would save several conversions on the way to the frontend), but I'm not sure the rendering can be done that way; I think rendering always needs a final uint16_t to string conversion. Even that one last string creation would eat most of the runtime saved, and furthermore it would turn the terminal internals into an ugly C-ish beast. Therefore I gave up on this approach.

TL;DR ArrayBuffer is superior if you can prealloc and reuse the data structure. For everything else normal arrays are better. Strings are not worth to be squeezed into ArrayBuffers.
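For reference, the conversions in question have to be done by hand in JS (TextEncoder only targets UTF-8, so UTF-16 needs a manual loop), which is where the penalty comes from:

```typescript
// Manual string <-> Uint16Array conversion, one UTF-16 code unit per slot.
function stringToUint16(s: string): Uint16Array {
  const arr = new Uint16Array(s.length);
  for (let i = 0; i < s.length; ++i) arr[i] = s.charCodeAt(i);
  return arr;
}

function uint16ToString(arr: Uint16Array): string {
  // Fine for short arrays; very large ones would need chunking to stay
  // under the engine's argument-count limit.
  return String.fromCharCode(...Array.from(arr));
}

const round = uint16ToString(stringToUint16("ls -lR /usr/lib"));
console.log(round); // "ls -lR /usr/lib"
```

Both directions touch every code unit, so a round trip costs two full passes over the data on top of the allocations.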

A new idea I came up with tries to lower string creation as much as possible, esp. tries to avoid the nasty splits and joins. It is kinda based on your 2nd idea above with the new InputHandler.print method, wcwidth and line stops in mind:

  • print now gets whole strings up to several terminal lines
  • save those strings in a simple pointer list without any alteration (no string alloc or gc, list alloc can be avoided if used with a prealloc'd structure) along with current attributes
  • advance cursor by wcwidth(string) % cols
  • special case \n (hard line break): advance cursor by one line, mark position in pointer list as hard break
  • special case line overflow with wrapAround: mark position in string as soft line break
  • special case \r: load last line content (from current cursor position to last line break) into some line buffer to get overwritten
  • data flows like above, despite the \r case no cell abstraction nor string splitting is needed
  • attribute changes are no problem, as long as noone requests the real cols x rows representation (they just change the attr flag that gets saved along with the whole string)

Btw the wcwidths are a subset of the grapheme algo, so this might be interchangeable in the future.

Now the hazardous part 1 - someone wants to move the cursor within the cols x rows:

  • move cols backwards in line breaks - the beginning of the current terminal content
  • every line break denotes a real terminal line
  • load stuff into cell model for just one page (not sure yet if this can also be omitted with clever string positioning)
  • do the nasty work: if attribute changes are requested we are kinda out of luck and have to fall back to either the full cell model or a string split & insert model (the latter might introduce bad performance)
  • data flows again, now with degraded strings & attrs data in the buffer for that page

Now the hazardous part 2 - the renderer wants to draw something:

  • kinda depends on the renderer if we need to drilldown to a cell model or can just provide the string offsets with line breaks and text attrs

Pros:

  • very fast data flow
  • optimized for the most common InputHandler method - print
  • makes reflowing lines upon terminal resize possible

Cons:

  • almost every other InputHandler method will be hazardous in the sense of interrupting this flow model and the need of some intermediate cell abstraction
  • renderer integration unclear (to me at least atm)
  • might degrade performance for curses like apps (they typically contain more "hazardous" sequences)

Well this is a rough draft of the idea, far from being useable atm since many details are not covered yet. Esp. the "hazardous" parts could get nasty with many performance problems (such as degrading the buffer with even worse gc behavior etc pp)

@jerch

Strings are not worth to be squeezed into ArrayBuffers.

I think Monaco stores its buffer in ArrayBuffers and is pretty high performance. I haven't looked too deeply into the implementation yet.

esp. tries to avoid the nasty splits and joins

Which ones?

I've been thinking we should think about taking more advantage of scrollback being immutable somehow.

One idea was to separate scrollback from the viewport section. Once a line goes to scrollback it gets pushed into the scrollback data structure. You could imagine 2 CircularList objects, one whose lines are optimized for never changing, one for the opposite.

@Tyriar About the scrollback - yes since this is never reachable by the cursor it might save some memory to just drop the cell abstraction for scrolled out lines.

@Tyriar
It makes sense to store strings in ArrayBuffer if we can limit the conversion down to one (maybe the final one for render output). That is slightly better than string handling all over the place. This would be doable since node-pty can give raw data as well (and also the websocket can give us raw data).

esp. tries to avoid the nasty splits and joins

Which ones?

The whole approach is to _minimize_ splits. If no one requests cursor jumps into the buffered data, strings would never be split and could go right to the renderer (if supported). No cell splits and later joins at all.

@jerch well it can if the viewport is expanded, I think we may also pull in the scrollback when a line is deleted? Not 100% sure on that or even if it's correct behavior.

@Tyriar Ah right. Not sure about the latter either, I think native xterm allows this only for real mouse or scrollbar scrolling. Even SD/SU does not move scrollbuffer content back into the "active" terminal viewport.

Could you point me to the source of the monaco editor where the ArrayBuffer is used? Seems I can't find it myself :blush:

Hmm, just reread the TextEncoder/Decoder spec; with ArrayBuffers from node-pty up to the frontend we are basically stuck with UTF-8, unless we translate it the hard way at some point. Making xterm.js UTF-8 aware? Idk, this would involve many intermediate codepoint calculations for the higher unicode chars. Upside: it would save memory for ASCII chars.

@rebornix could you give us some pointers to where monaco stores the buffer?

Here are some numbers for typed arrays and the new parser (was easier to adopt):

  • UTF-8 (Uint8Array): print action jumps from 190 MB/s to 290 MB/s
  • UTF-16 (Uint16Array): print action jumps from 190 MB/s to 320 MB/s

Overall UTF-16 performs much better, but that was expected since the parser is optimized for that. UTF-8 suffers from the intermediate codepoint calculation.

The string to typed array conversion eats ~4% of the JS runtime in my benchmark ls -lR /usr/lib (always far below 100 ms, done via a loop in InputHandler.parse). I did not test the reverse conversion (this is implicitly done atm in InputHandler.print at the cell-by-cell level). The overall runtime is slightly worse than with strings (the time saved in the parser does not compensate for the conversion time). This might change when other parts are also typed-array aware.

And the corresponding screenshots (tested with ls -lR /usr/lib):

with strings:
(profiler screenshot)

with Uint16Array:
(profiler screenshot)

Note difference for EscapeSequenceParser.parse, which can profit from a typed array (~30% faster). The InputHandler.parse does the conversion, thus it is worse for the typed array version. Also GC Minor has more to do for typed array (since I throw the array away).

Edit: Another aspect can be seen in the screenshots - the GC becomes relevant with ~20% runtime, the long running frames (red flagged) are all GC related.

Just another somewhat radical idea:

  1. Create own arraybuffer based virtual memory, something big (>5 MB)
    If the arraybuffer has a length of multiples of 4 transparent switches from int8 to int16 to int32 types are possible. The allocator returns a free index on the Uint8Array, this pointer can be converted to a Uint16Array or Uint32Array position by a simple bit shift.
  2. Write incoming strings into the memory as uint16_t type for UTF-16.
  3. Parser runs on the string pointers and calls methods in InputHandler with pointers to this memory instead of string slices.
  4. Create the terminal data buffer inside the virtual memory as a ring buffer array of a struct like type instead of native JS objects, maybe like this (still cell based):
struct Cell {
    uint32_t *char_start;  // start pointer of cell content (JS with pointers hurray!)
    uint8_t length;        // length of content (8 bit here is sufficient)
    uint32_t attr;         // text attributes (might grow to hold true color someday)
    uint8_t width;         // wcwidth (maybe merge with other member, always < 4)
    .....                  // some other cell based stuff
}
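The pointer conversion mentioned in step 1 can be sketched with plain typed-array views over one buffer (toy heap size here; the byte order check assumes the little-endian platforms JS engines practically run on):

```typescript
// One ArrayBuffer, three typed views over the same memory. A byte
// "pointer" converts to a uint16/uint32 index by a simple bit shift.
const heap = new ArrayBuffer(1024); // a real heap would be several MB
const HEAP8 = new Uint8Array(heap);
const HEAP16 = new Uint16Array(heap);
const HEAP32 = new Uint32Array(heap);

const bytePtr = 8;                 // index handed out by the allocator
HEAP32[bytePtr >> 2] = 0x61;       // write 'a' as a uint32 at byte offset 8
console.log(HEAP8[bytePtr]);       // 97 (same memory viewed as bytes, little-endian)
console.log(HEAP16[bytePtr >> 1]); // 97 (same memory viewed as uint16)
```

This is essentially the asm.js/emscripten heap model: no per-object GC pressure, at the price of doing all bookkeeping by hand.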

Pros:

  • omits JS objects and thus GC where possible (only few local objects will remain)
  • only one initial data copy into the virtual memory needed
  • almost no malloc and free costs (depends on cleverness of the allocator/deallocator)
  • will save alot of memory (avoids the JS objects memory overhead)

Cons:

  • Welcome to the Cavascript Horror Show :scream:
  • hard to implement, changes kinda everything
  • speed benefit is unclear until really implemented

:smile:

~~hard~~ fun to implement, changes kinda everything 😉

This is closer to how Monaco works, I remembered this blog post which discusses the strategy for storing character metadata https://code.visualstudio.com/blogs/2017/02/08/syntax-highlighting-optimizations

Yup thats basically the same idea.

Hope my answer to where monaco stores the buffer is not too late.

Alex and I are in favor of Array Buffer and most of the time it gives us good performance. Some places we use ArrayBuffer:

We use simple strings for the text buffer instead of ArrayBuffer, as V8 strings are easier to manipulate:

  • We do the encoding/decoding at the very beginning of loading a file, so files are converted to JS strings. V8 decides whether to use one byte or two to store a character.
  • We do edits on the text buffer very often, strings are easier to handle.
  • We are using nodejs native module and have access to V8 internals when necessary.

The following list is just a quick summary of interesting concepts I stumbled over that might help to lower memory usage and/or runtime:

  • FlatJS (https://github.com/lars-t-hansen/flatjs) - meta language to help coding with arraybuffer based heaps
  • http://2ality.com/2017/01/shared-array-buffer.html (announced as part of ES2017, future might be uncertain due to Spectre, beside that very promising idea with real concurrency and real atomics)
  • webassembly/asm.js (current state? usable yet? Have not followed its development for some time, used emscripten to asm.js years ago with a C lib for a game AI with impressive results though)
  • https://github.com/AssemblyScript/assemblyscript

To get something rolling here, here is a quick hack showing how we could "merge" text attributes.

The code is mainly driven by the idea of saving memory for the buffer data (runtime will suffer; not tested yet how much). Esp. the text attributes with RGB for foreground and background (once supported) will make xterm.js eat tons of memory with the current cell-by-cell layout. The code tries to circumvent this by using a resizable ref-counting atlas for attributes. This is imho an option since a single terminal will hardly hold more than 1M cells, which would grow the atlas to 1M * entry_size only if all cells differ.

The cell itself just needs to hold the index into the attribute atlas. On cell changes the old index needs to be unref'd and the new one ref'd. The atlas index would replace the current attribute field of the terminal object and would itself be changed by SGR.

The atlas currently only addresses text attributes, but could be extended to all cell attributes if needed. While the current terminal buffer holds 2 32bit numbers for attribute data (4 with RGB in the current buffer design) the atlas would reduce it to only one 32bit number. The atlas entries can be packed further too.

interface TextAttributes {
    flags: number;
    foreground: number;
    background: number;
}

const enum AtlasEntry {
    FLAGS = 1,
    FOREGROUND = 2,
    BACKGROUND = 3
}

class TextAttributeAtlas {
    /** data storage */
    private data: Uint32Array;
    /** flag lookup tree, not happy with that yet */
    private flagTree: any = {};
    /** holds freed slots */
    private freedSlots: number[] = [];
    /** tracks biggest idx to shortcut new slot assignment */
    private biggestIdx: number = 0;
    constructor(size: number) {
        this.data = new Uint32Array(size * 4);
    }
    private setData(idx: number, attributes: TextAttributes): void {
        this.data[idx] = 0;
        this.data[idx + AtlasEntry.FLAGS] = attributes.flags;
        this.data[idx + AtlasEntry.FOREGROUND] = attributes.foreground;
        this.data[idx + AtlasEntry.BACKGROUND] = attributes.background;
        if (!this.flagTree[attributes.flags])
            this.flagTree[attributes.flags] = [];
        if (this.flagTree[attributes.flags].indexOf(idx) === -1)
            this.flagTree[attributes.flags].push(idx);
    }

    /**
     * convenient method to inspect attributes at slot `idx`.
     * For better performance atlas idx and AtlasEntry
     * should be used directly to avoid number conversions.
     * @param {number} idx
     * @return {TextAttributes}
     */
    getAttributes(idx: number): TextAttributes {
        return {
            flags: this.data[idx + AtlasEntry.FLAGS],
            foreground: this.data[idx + AtlasEntry.FOREGROUND],
            background: this.data[idx + AtlasEntry.BACKGROUND]
        };
    }

    /**
     * Returns a slot index in the atlas for the given text attributes.
     * To be called upon attributes changes, e.g. by SGR.
     * NOTE: The ref counter is set to 0 for a new slot index, thus
     * values will get overwritten if not referenced in between.
     * @param {TextAttributes} attributes
     * @return {number}
     */
    getSlot(attributes: TextAttributes): number {
        // find matching attributes slot
        const sameFlag = this.flagTree[attributes.flags];
        if (sameFlag) {
            for (let i = 0; i < sameFlag.length; ++i) {
                let idx = sameFlag[i];
                if (this.data[idx + AtlasEntry.FOREGROUND] === attributes.foreground
                    && this.data[idx + AtlasEntry.BACKGROUND] === attributes.background) {
                    return idx;
                }
            }
        }
        // try to insert into a previously freed slot
        // (compare against undefined - slot index 0 is a valid value)
        const freed = this.freedSlots.pop();
        if (freed !== undefined) {
            this.setData(freed, attributes);
            return freed;
        }
        // else assign new slot
        for (let i = this.biggestIdx; i < this.data.length; i += 4) {
            if (!this.data[i]) {
                this.setData(i, attributes);
                if (i > this.biggestIdx)
                    this.biggestIdx = i;
                return i;
            }
        }
        // could not find a valid slot --> resize storage
        const data = new Uint32Array(this.data.length * 2);
        data.set(this.data);
        const idx = this.data.length;
        this.data = data;
        this.setData(idx, attributes);
        this.biggestIdx = idx;
        return idx;
    }

    /**
     * Increment ref counter.
     * To be called for every terminal cell that holds `idx` as text attributes.
     * @param {number} idx
     */
    ref(idx: number): void {
        this.data[idx]++;
    }

    /**
     * Decrement ref counter. Once dropped to 0 the slot will be reused.
     * To be called for every cell that gets removed or reused with another value.
     * @param {number} idx
     */
    unref(idx: number): void {
        this.data[idx]--;
        if (!this.data[idx]) {
            // no more references: remove the slot from the flag lookup
            // and mark it for reuse
            const treePart = this.flagTree[this.data[idx + AtlasEntry.FLAGS]];
            treePart.splice(treePart.indexOf(idx), 1);
            this.freedSlots.push(idx);
        }
    }
}

let atlas = new TextAttributeAtlas(2);
let a1 = atlas.getSlot({flags: 12, foreground: 13, background: 14});
atlas.ref(a1);
// atlas.unref(a1);
let a2 = atlas.getSlot({flags: 12, foreground: 13, background: 15});
atlas.ref(a2);
let a3 = atlas.getSlot({flags: 13, foreground: 13, background: 16});
atlas.ref(a3);
let a4 = atlas.getSlot({flags: 13, foreground: 13, background: 16});
console.log(atlas);
console.log(a1, a2, a3, a4);
console.log('a1', atlas.getAttributes(a1));
console.log('a2', atlas.getAttributes(a2));
console.log('a3', atlas.getAttributes(a3));
console.log('a4', atlas.getAttributes(a4));

Edit:
The runtime penalty is almost zero; for my benchmark with ls -lR /usr/lib it adds less than 1 ms to the total runtime of ~2.3 s. Interesting side note - the command creates fewer than 64 different text attribute slots for 5 MB of output data, which will save more than 20 MB once fully implemented.

Made some prototype PRs to test some changes to the buffer (see https://github.com/xtermjs/xterm.js/pull/1528#issue-196949371 for the general idea behind the changes):

  • PR #1528 : attribute atlas
  • PR #1529 : remove wcwidth and charCode from buffer
  • PR #1530 : replace string in buffer by codepoints / cell storage index value

@jerch it might be a good idea to stay away from the word atlas for this so that "atlas" always means "texture atlas". Something like store or cache would probably be better?

oh ok, "cache" is fine.

Guess I am done with the testbed PRs. Please also have a look at the PR comments to get the background of the following rough summary.

Proposal:

  1. Build an AttributeCache to hold everything needed to style a single terminal cell. See #1528 for an early ref counting version that can also hold true color specs. The cache could also be shared between different terminal instances to save further memory in multi-terminal apps.
  2. Build a StringStorage to hold short terminal content data strings. The version in #1530 even avoids storing single char strings by "overloading" the pointer meaning. wcwidth should be moved here.
  3. Shrink the current CharData from [number, string, number, number] to [number, number], where the numbers are pointers (index numbers) to:

    • AttributeCache entry

    • StringStorage entry

The attributes are unlikely to change a lot, so a single 32 bit number will save much memory over time. The StringStorage pointer is a real unicode codepoint for single chars and can therefore be used as the code entry of CharData. The actual string can be accessed by StringStorage.getString(idx). The fourth field wcwidth of CharData could be accessed by StringStorage.wcwidth(idx) (not yet implemented). There is almost no runtime penalty from getting rid of code and wcwidth in CharData (tested in #1529).

  4. Move the shrunk CharData into a dense Int32Array based buffer implementation. Also tested in #1530 with a stub class (far from fully functional); the final benefits are likely to be:

    • 80% smaller memory footprint of the terminal buffer (from 5.5 MB to 0.75 MB)

    • slightly faster (not testable yet, I expect to gain 20% - 30% speed)

    • Edit: much faster - script runtime for ls -lR /usr/lib dropped to 1.3s (master is at 2.1s) while the old buffer is still active for cursor handling, once removed I expect the runtime to drop below 1s

Downside - step 4 is quite a lot of work since it will need some rework of the buffer interface. But hey - for saving 80% of the RAM and still gaining runtime performance it is no biggy, is it? :smile:
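The pointer "overloading" from step 2 can be sketched like this (a hypothetical API, not the actual #1530 code): values below 0x110000 are treated as a literal codepoint, anything above is an index into a string table, so single-char cells never allocate a string at all.

```typescript
// Sketch only: illustrates the "overloaded" pointer idea for a StringStorage.
const CODEPOINT_LIMIT = 0x110000;  // max Unicode codepoint + 1

class StringStorage {
    private strings: string[] = [];

    // store a cell's content, returning a numeric "pointer"
    store(content: string): number {
        const cp = content.codePointAt(0);
        if (cp !== undefined && String.fromCodePoint(cp) === content) {
            // single codepoint: encode it directly, no storage needed
            return cp;
        }
        // multi-codepoint content (e.g. combining chars): store in the table
        this.strings.push(content);
        return CODEPOINT_LIMIT + this.strings.length - 1;
    }

    getString(idx: number): string {
        return idx < CODEPOINT_LIMIT
            ? String.fromCodePoint(idx)
            : this.strings[idx - CODEPOINT_LIMIT];
    }
}
```

With this scheme the common case (a plain ASCII or BMP char) costs nothing beyond the number already stored in the cell.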

There is another issue I stumbled over - the current empty cell representation. Imho a cell can have 3 states:

  • empty: initial cell state, nothing has been written to it yet or the content was deleted. It has a width of 1 but no content. Currently used in blankLine and eraseChar, but with a space as content.
  • null: cell after a full width char, indicating that it has no width for visual representation.
  • normal: cell holds some content and has a visual width (1 or 2, maybe bigger once we support real grapheme/bidi stuff, not sure about that yet lol)

Problem I see here is that an empty cell is not distinguishable from a normal cell with a space inserted; both look the same at buffer level (same content, same width). I did not write any of the renderer/output code, but I expect this to lead to awkward situations at the output front. Especially the handling of the right end of a line might get cumbersome.
A terminal with 15 cols, first some string output, that got wrapped around:

1: 'H', 'e', 'l', 'l', 'o', ' ', 't', 'e', 'r', 'm', 'i', 'n', 'a', 'l', ' '
2: 'w', 'o', 'r', 'l', 'd', '!', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '

versus a folder listing with ls:

1: 'R', 'e', 'a', 'd', 'm', 'e', '.', 'm', 'd', ' ', ' ', ' ', ' ', ' ', ' '
2: 'f', 'i', 'l', 'e', 'A', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' '

The first example contains a real space after the word 'terminal'; the second example never touched the cells after 'Readme.md'. The way it is represented at buffer level makes perfect sense for the standard case of printing the stuff as terminal output to the screen (the room needs to be taken anyways), but for tools that try to deal with the content strings, like a mouse selection or a reflow manager, it is no longer clear where the spaces came from.

More or less this leads to the next question - how to determine the actual content length of a line (the number of cells, counted from the left, that contain something)? A simple approach would count the empty cells from the right side, but again the double meaning from above makes this hard.

Proposal:
Imho this is easily fixable by using some other placeholder for empty cells, e.g. a control char or the empty string, and replacing those in the render process if needed. Maybe the screen renderer can also benefit from this, since it might not have to handle those cells at all (depends on the way the output is generated).
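A rough sketch of that proposal, assuming untouched cells keep a reserved code 0 instead of a space (names and the sentinel value are illustrative, not the actual implementation):

```typescript
// Sketch: with a dedicated EMPTY sentinel, a real space (code 32) stays
// distinguishable from a never-written cell, and the used length of a line
// is simply "cells up to the last non-empty one".
const EMPTY_CODE = 0;  // hypothetical placeholder for never-written/erased cells

function getTrimmedLength(codes: number[]): number {
    for (let i = codes.length - 1; i >= 0; --i) {
        if (codes[i] !== EMPTY_CODE) {
            return i + 1;
        }
    }
    return 0;
}
```

A selection or reflow implementation could then stop at `getTrimmedLength` instead of guessing whether trailing spaces are real content.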

Btw, for the wrapped string above this also leads to the isWrapped problem, which is essential for a reflowing resize or correct copy&paste selection handling. Imho we cannot remove it, but need to integrate it better than it is atm.

@jerch impressive work! :smiley:

1 Build an AttributeCache to hold everything needed to style a single terminal cell. See #1528 for an early ref counting version that can also hold true color specs. The cache could also be shared between different terminal instances to save further memory in multi-terminal apps.

Made some comments on #1528.

2 Build a StringStorage to hold short terminal content data strings. The version in #1530 even avoids storing single char strings by "overloading" the pointer meaning. wcwidth should be moved here.

Made some comments on #1530.

4 Move the shrunk CharData into a dense Int32Array based buffer implementation. Also tested in #1530 with a stub class (far from fully functional), the final benefits are likely to be:

Not totally sold on this idea yet, I think it will bite us hard when we implement reflow. It does look like each of these steps can pretty much be done in order so we can see how things go and see if it makes sense to do this once we get 3 done.

There is another issue I stumbled over - the current empty cell representation. Imho a cell can have 3 states

Here's an example of a bug that came out of this https://github.com/xtermjs/xterm.js/issues/1286, :+1: to differentiating whitespace cells and "empty" cells

Btw, for the wrapped string above this also leads to the isWrapped problem, that is essential for a reflowing resize or a correct copy&paste selection handling. Imho we cannot remove that, but need to integrate it better than it is atm.

I see isWrapped going away when we tackle https://github.com/xtermjs/xterm.js/issues/622 as CircularList will only contain unwrapped rows.

Not totally sold on this idea yet, I think it will bite us hard when we implement reflow. It does look like each of these steps can pretty much be done in order so we can see how things go and see if it makes sense to do this once we get 3 done.

Yes I am with you (still fun to play around with this totally different approach). 1 and 2 could be cherry-picked, 3 can be applied depending on 1 or 2. 4 is optional; we could just stick to the current buffer layout. The memory savings are like this:

  1. 1 + 2 + 3 in CircularList: saves 50% (~2.8MB of ~5.5 MB)
  2. 1 + 2 + 3 + 4 halfway - just put the line data in a typed array but stick to the row index access: saves 82% (~0.9MB)
  3. 1 + 2 + 3 + 4 fully dense array with pointer arithmetics: saves 87% (~0.7MB)

1. is very easy to implement; the memory behavior with bigger scrollback will still show the bad scaling shown in https://github.com/xtermjs/xterm.js/pull/1530#issuecomment-403542479, but on a less toxic level
2. Slightly harder to implement (some more indirections at line level needed), but will make it possible to keep the higher API of Buffer intact. Imho the option to go for - big mem save and still easy to integrate.
3. 5% more mem save than option 2, hard to implement, will touch all APIs and thus literally the whole code base. Imho more of academic interest or for boring rainy days to be implemented lol.

@Tyriar I did some further tests with rust for webassembly usage and rewrote the parser. Note that my rust skills are a bit "rusty" since I have not yet dug deeper into it, so the following might be the result of weak rust code. Results:

  • The data handling within the wasm part is slightly faster (5 - 10%).
  • Calls from JS into wasm create some overhead that eats all the benefits from above. In fact it was ~20% slower.
  • The "binary" will be smaller than the JS counterpart (not really measured since I did not implement all stuff).
  • To get the JS <--> wasm transition easily done, some boilerplate code is needed to handle the JS types (I only did the string translation).
  • We cannot avoid the JS to wasm translation since the browser DOM and events are not accessible there. It could only be used for core parts, which are not that performance critical anymore (besides the mem consumption).

Unless we want to rewrite the whole core libs in rust (or any other wasm capable language) we cannot gain anything from moving to a wasm language imho. A plus of today's wasm languages is that most support explicit memory handling (which could help us with the buffer problem); downsides are the introduction of a totally different language into a primarily TS/JS focused project (a high barrier for code additions) and the translation costs between wasm and JS land.

TL;DR
xterm.js is too deeply tied to general JS stuff like the DOM and events to gain anything from webassembly, even for a rewrite of the core parts.

@jerch nice investigation :smiley:

Calls from JS into wasm create some overhead that eats all the benefits from above. In fact it was ~20% slower.

This was the major problem with monaco going native as well, which has primarily informed my stance on this (though that was with a native node module, not wasm). I believe working with ArrayBuffers wherever possible should give us the best balance between perf and simplicity (ease of implementation, barrier to entry).

@Tyriar Gonna try to come up with an AttributeStorage to hold the RGB data. Not sure yet about the BST; for the typical use case with only a few color settings in a terminal session it will be worse in runtime, so maybe it should be a runtime drop-in once the colors surpass a given threshold. Also memory consumption will rise a lot again, though it will still save memory since the attributes are stored only once and not along with every single cell (a worst case scenario with every cell holding different attributes will suffer though).
Do you know why the current fg and bg 256 colors value is 9 bit based instead of 8 bits? What is the additional bit used for? Here: https://github.com/xtermjs/xterm.js/blob/6691f809069a549b4808cd2e055398d2da15db37/src/InputHandler.ts#L1596
Could you give me the current bit layout of attr? I think a similar approach to the "double meaning" of the StringStorage pointer can further save memory, but that would require the MSB of attr to be reserved for the pointer distinction and not used for any other purpose. This might limit the possibility to support further attribute flags later on (FLAGS already uses 7 bits) - are we still missing some fundamental flags that are likely to come?

A 32 bit attr number in the term buffer could be packed like this:

# 256 indexed colors
32:       0 (no RGB color)
31..25:   flags (7 bits)
24..17:   fg (8 bits, see question above)
16..9:    bg
8..1:     unused

# RGB colors
32:       1 (RGB color)
31..25:   flags (7 bits)
24..1:    pointer to RGB data (address space is 2^24, which should be sufficient)

This way the storage only needs to hold the RGB data in two 32 bit numbers while the flags can stay in the attr number.
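The two variants above could be packed and unpacked roughly like this (a sketch with made-up helper names; bit positions follow the diagrams, with the MSB selecting direct attrs vs. a pointer into the RGB storage):

```typescript
// Sketch: 32 bit attr packing per the proposed layout.
// indexed: 0 | flags(7) | fg(8) | bg(8) | unused(8)
// RGB:     1 | flags(7) | pointer(24)
const RGB_FLAG = 0x80000000;

function packIndexed(flags: number, fg: number, bg: number): number {
    return ((flags & 0x7f) << 24) | ((fg & 0xff) << 16) | ((bg & 0xff) << 8);
}

function packPointer(flags: number, ptr: number): number {
    // >>> 0 keeps the result an unsigned 32 bit number
    return (RGB_FLAG | ((flags & 0x7f) << 24) | (ptr & 0xffffff)) >>> 0;
}

function isRGB(attr: number): boolean {
    return (attr & RGB_FLAG) !== 0;
}

function getFlags(attr: number): number {
    return (attr >>> 24) & 0x7f;
}
```

Non-RGB cells never touch the storage this way; only `isRGB` attrs dereference the 24 bit pointer.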

@jerch by the way I sent you an email, probably got eaten by the spam filter again 😛

Do you know why the current fg and bg 256 colors value is 9 bit based instead of 8 bits? What is the additional bit used for?

I think it's used for the default fg/bg color (which could be dark or light), so it's actually 257 colors.

https://github.com/xtermjs/xterm.js/pull/756/files

Could you give me the current bit layout of attr?

I think it's this:

19+:     flags (see `FLAGS` enum)
18..18:  default fg flag
17..10:  256 fg
9..9:    default bg flag
8..1:    256 bg

You can see what I landed on for truecolor in the old PR https://github.com/xtermjs/xterm.js/pull/756/files:

/**
 * Character data, the array's format is:
 * - string: The character.
 * - number: The width of the character.
 * - number: Flags that decorate the character.
 *
 *        truecolor fg
 *        |   inverse
 *        |   |   underline
 *        |   |   |
 *   0b 0 0 0 0 0 0 0
 *      |   |   |   |
 *      |   |   |   bold
 *      |   |   blink
 *      |   invisible
 *      truecolor bg
 *
 * - number: Foreground color. If default bit flag is set, color is the default
 *           (inherited from the DOM parent). If truecolor fg flag is true, this
 *           is a 24-bit color of the form 0xxRRGGBB, if not it's an xterm color
 *           code ranging from 0-255.
 *
 *        red
 *        |       blue
 *   0x 0 R R G G B B
 *      |     |
 *      |     green
 *      default color bit
 *
 * - number: Background color. The same as foreground color.
 */
export type CharData = [string, number, number, number, number];

So in this I had 2 flags; one for the default color (whether to ignore all color bits) and one for truecolor (whether to do 256 or 16 mil color).

This might limit the possibility to support further attribute flags later on (because FLAGS already uses 7 bits), are we still missing some fundamental flags that are likely to come?

Yes we want some room for additional flags, for example https://github.com/xtermjs/xterm.js/issues/580, https://github.com/xtermjs/xterm.js/issues/1145, I'd say at least leave > 3 bits where possible.

Instead of pointer data inside the attr itself, there could be another map that holds references to the RGB data? mapAttrIdxToRgb: { [idx: number]: RgbData }

@Tyriar Sorry, was a few days not online and I fear the email got eaten by the spam filter. Could you resend it please? :blush:

Played a bit with more clever lookup data structures for the attrs storage. Most promising regarding space and search/insert runtime are trees, with a skiplist as a cheaper alternative. In theory lol. In practice neither can outperform my simple array search, which seems very weird to me (bug in the code somewhere?)
I uploaded a test file here https://gist.github.com/jerch/ff65f3fb4414ff8ac84a947b3a1eec58 with array vs. a left-leaning red-black tree, that tests up to 10M entries (which is almost the complete addressing space). Still the array is far ahead compared to the LLRB; I suspect the break-even to be around 10M though. Tested on my 7 year old laptop; maybe someone can test it as well and even better - point me to some bugs in the impl/tests.

Here are some results (with running numbers):

prefilled             time for inserting 1000 * 1000 (summed up, ms)
items                 array        LLRB
100-10000             3.5 - 5      ~13
100000                ~12          ~15
1000000               8            ~18
10000000              20-25        21-28

What really surprises me is the fact that the linear array search does not show any growth in the lower regions at all; it is stable at ~4 ms up to 10k entries (might be cache related). The 10M test shows a worse runtime than expected for both, maybe due to mem paging or whatnot. Maybe JS is too far away from the machine with the JIT and all the opts/deopts happening; still I think they can't eliminate a complexity step (though the LLRB seems to be heavy on a single _n_, thus moving the break-even point for O(n) vs. O(log n) upwards).

Btw with random data the difference is even worse.

I think it's used for the default fg/bg color (which could be dark or light), so it's actually 257 colors.

So this is to distinguish SGR 39 or SGR 49 from one of the 8 palette colors?

Instead of pointer data inside the attr itself, there could be another map that holds references to the RGB data? mapAttrIdxToRgb: { [idx: number]: RgbData }

This would introduce another indirection with additional memory usage. With the tests above I also measured the difference between always holding the flags in attrs vs. saving them along with the RGB data in the storage. Since the difference is ~0.5 ms for 1M entries, I would not go for the complicated attrs setup; instead copy the flags over to the storage once RGB is set. Still I would go for the 32nd bit distinction between direct attrs vs. pointer, since this avoids the storage entirely for non-RGB cells.

Also I think the default 8 palette colors for fg/bg are not sufficiently represented in the buffer currently. In theory the terminal should support following color modes:

  1. SGR 39 + SGR 49 default color for fg/bg (customizable)
  2. SGR 30-37 + SGR 40-47 8 low color palette for fg/bg (customizable)
  3. SGR 90-97 + SGR 100-107 8 high color palette for fg/bg (customizable)
  4. SGR 38;5;n + SGR 48;5;n 256 indexed palette for fg/bg (customizable)
  5. SGR 38;2;r;g;b + SGR 48;2;r;g;b RGB for fg/bg (not customizable)

Option 2.) and 3.) can be merged into a single byte (treating them as a single 16 color fg/bg palette), 4.) takes 2 bytes and 5.) finally takes 6 more bytes. We still need some bits to indicate the color mode.
To reflect this on the buffer level we would need the following:

bits        for
2           fg color mode (0: default, 1: 16 palette, 2: 256, 3: RGB)
2           bg color mode (0: default, 1: 16 palette, 2: 256, 3: RGB)
8           fg color for 16 palette and 256
8           bg color for 16 palette and 256
10          flags (currently 7, 3 more reserved for future usage)
----
30

So we need 30 bits of a 32 bit number, leaving 2 bits free for other purposes. The 32nd bit could hold the pointer vs. direct attr flag, omitting the storage for non-RGB cells.

Also I suggest wrapping the attr access in a convenient class to not expose implementation details to the outside (see the test file above; it contains an early version of a TextAttributes class to achieve this).

Sorry, was a few days not online and I fear the email got eaten by the spam filter. Could you resend it please?

Resent

Oh btw those numbers above for the array vs. llrb search are crap - I think they got spoiled by the optimizer doing some weird stuff in the for loop. With a slightly different test setup it clearly shows the O(n) vs. O(log n) growth much earlier (with 1000 elements prefilled the tree is already faster).

Current state vs. after: (memory profiler screenshots)

One fairly simple optimization is to flatten the array-of-arrays into a single array. I.e. instead of a BufferLine of _N_ columns having a _data array of _N_ CharData cells, where each CharData is an array of 4, just have a single array of _4*N_ elements. This eliminates the object overhead of _N_ arrays. It also improves cache locality, so it should be faster. A disadvantage is slightly more complicated and uglier code, but it seems worth it.
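The flattening could be sketched as follows (illustrative names; the field order mirrors CharData with the char stored as a codepoint, which is an assumption of this sketch):

```typescript
// Sketch: one flat array of 4*N elements instead of N 4-element arrays.
const CELL_SIZE = 4;
const ATTR = 0, CHAR = 1, WIDTH = 2, CODE = 3;  // field offsets within a cell

class FlatBufferLine {
    private _data: number[];

    constructor(public cols: number) {
        this._data = new Array(cols * CELL_SIZE).fill(0);
    }

    get(col: number, field: number): number {
        return this._data[col * CELL_SIZE + field];
    }

    set(col: number, attr: number, char: number, width: number, code: number): void {
        const i = col * CELL_SIZE;
        this._data[i + ATTR] = attr;
        this._data[i + CHAR] = char;
        this._data[i + WIDTH] = width;
        this._data[i + CODE] = code;
    }
}
```

Swapping the plain `number[]` for a `Uint32Array` once all fields are numeric would remove the per-element boxing as well.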

As a follow-up to my previous comment, it seems worth considering using a variable number of elements in the _data array for each cell. In other words a stateful representation. Random changes in position would be more expensive, but linear scanning from the start of a line can be pretty quick, especially since a simple array is optimized for cache locality. Typical sequential output would be fast, as would be rendering.

In addition to reducing space, an advantage of a variable number of elements per cell is increased flexibility: extra attributes (such as 24-bit color), annotations for specific cells or ranges, glyphs or nested DOM elements.

@PerBothner Thx for your ideas! Yeah, I already tested the single dense array layout with pointer arithmetic; it shows the best memory utilization. Problems arise when it comes to resize - it basically means rebuilding the whole memory chunk (copying over), or fast-copying into a bigger chunk and realigning parts. This is quite expensive and imho not justified by the mem saving (tested in some playground PR listed above; the saving was around ~10% compared to the new buffer line implementation).

About your second comment - we already discussed that as it would make handling of the wrapped lines easier. For now we decided to go with the row X col approach for the new buffer layout and get that first done. I think we should address this again once we do the reflow resize implementation.

About adding additional stuff to the buffer: we currently do here what most other terminals do - the cursor advance is determined by wcwidth, which ensures we stay compatible with pty/termios's idea of how the data should be laid out. This basically means that at buffer level we only handle things like surrogate pairs and combining chars. Any other "higher level" joining rules can be applied by the character joiner in the renderer (currently used by https://github.com/xtermjs/xterm-addon-ligatures for ligatures). I had a PR open to also support unicode graphemes at the early buffer level, but I think we cannot do this at that stage since most pty backends have no notion of it (is there any at all?) and we would end up with weird char conglomerates. Same goes for real BIDI support; I think graphemes and BIDI are better done at the renderer stage to keep the cursor/cell movements intact.

Support for DOM nodes attached to cells sounds really interesting, I like that idea. Currently that's not possible in a direct approach since we have different renderer backends (DOM, canvas 2D and the new shiny webgl renderer). I think this could still be achieved for all renderers by positioning an overlay where not natively supported (only the DOM renderer would be able to do it directly). We would need some kind of API at buffer level to announce that stuff and its size, and the renderer could do the dirty work. I think we should discuss/track this in a separate issue.

Thanks for your detailed response.

_"Problems arise when it comes to resize, it basically means to rebuild the whole memory chunk (copy over) or to fast copy into a bigger chunk and realign parts."_

Do you mean: on resize we would have to copy _4*N_ elements rather than just _N_ elements?

It might make sense for the array to contain all the cells of a logical (unwrapped) line. E.g. assume a 180-character line and an 80-column wide terminal. In that case you could have 3 BufferLine instances all sharing the same _4*180_-element _data buffer, but each BufferLine would also contain a start offset.

Well, I had everything in one big array that was built as [cols] x [rows] x [needed single cell space]. So it basically still worked as a "canvas" with a given height and width. This is really memory efficient and fast for normal input flow, but as soon as insertCell/deleteCell is invoked (a resize would do that), all the memory behind the position where the action takes place has to be shifted. For small scrollback (<10k) this isn't a problem; it really is a showstopper for >100k lines.
Note the current typed array impl still has to do those shifts, but less toxically, as it only has to move memory content up to the line end.
I thought about different layouts to circumvent the costly shifts; the main way to avoid needless memory shifts would be to actually separate the scrollback from the "hot terminal rows" (the most recent rows up to terminal.rows), since only those can be altered by cursor jumps and inserts/deletes.

Sharing the underlying memory between several buffer line objects is an interesting idea to solve the wrapping problem. Not sure yet how this can work reliably without introducing explicit ref handling and such. In another version I tried to do everything with explicit memory handling, but the ref counter was a real showstopper and felt wrong in GC land (see #1633 for the primitives).

Edit: Btw the explicit memory handling was on par with the current "memory per line" approach. I hoped for better performance due to better cache locality; I guess it got eaten by the slightly more expensive mem handling in the JS abstraction.

Support for DOM nodes attached to cells sounds really interesting, I like that idea. Currently thats not possible in a direct approach since we have different renderer backends (DOM, canvas 2D and the new shiny webgl renderer), I think this still could be achieved for all renderer by positioning an overlay where not natively supported (only DOM renderer would be able to do it directly).

It's a bit off topic but I see us eventually having DOM nodes associated with cells within the viewport which will act similar to the canvas render layers. That way consumers will be able to "decorate" cells using HTML and CSS and not need to get into the canvas API.

It might make sense for the array to contain all the cells for a logical (unwrapped) lines. E.g. assume a 180-character line and an 80-column wide terminal. In that case you could have 3 BufferLine instances all sharing the same 4*180-element _data buffer, but each BufferLine would also contain a start offset.

The plan for reflow that was mentioned above is captured in https://github.com/xtermjs/xterm.js/issues/622#issuecomment-375403572, basically we want to have the actual unwrapped buffer and then a view on top which manages the new lines for fast access to any given line (also optimizing for horizontal resizes).

Using the dense array approach may be something we could look at, but it seems like it wouldn't be worth the extra overhead in managing the unwrapped line breaks in such an array, and the mess that comes when rows are trimmed from the top of the scrollback buffer. Regardless, I don't think we should look into such changes until #791 is done and we're looking at #622.

With PR #1796 the parser gets typed array support, which opens the door to several further optimizations and also to other input encodings.

For now I decided to go with Uint16Array, since it is easily convertible back and forth to JS strings. This basically limits the game to UCS2/UTF16, while the parser in the current version can also handle UTF32 (UTF8 is not supported). The typed array based terminal buffer is currently laid out for UTF32; the UTF16 --> UTF32 conversion is done in InputHandler.print. From here on there are several directions possible:

  • make all UTF16, thus turn terminal buffer into UTF16 too
    Yeah, still not settled which road to take here, but I tested several buffer layouts and came to the conclusion that a 32 bit number gives enough room to store the actual charcode + wcwidth + a possible combining overflow (handled totally differently), while 16 bits cannot do that without sacrificing precious charcode bits. Note that even with a UTF16 buffer we still have to do the UTF32 conversion, since wcwidth works on unicode codepoints. Also note that a UTF16 based buffer would save more memory for lower charcodes; in fact, charcodes above the BMP will rarely occur. This still needs some investigation.
  • make parser UTF32
    That's quite simple - just replace all the typed arrays with the 32 bit variant. Downside - the UTF16 to UTF32 conversion would have to be done beforehand, meaning the whole input gets converted, even escape sequences that will never be formed of any charcode > 255.
  • make wcwidth UTF16 compatible
    Yeah if it turns out that UTF16 is more suitable for the terminal buffer this should be done.

About UTF8: The parser currently cannot handle native UTF8 sequences, mainly due to the fact that the intermediate chars clash with C1 control chars. Also, UTF8 needs proper stream handling with additional intermediate states; that's nasty and imho should not be added to the parser. UTF8 is better handled beforehand, maybe with a conversion right to UTF32 for easier codepoint handling all over the place.
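The UTF16 --> UTF32 conversion mentioned above boils down to surrogate pairing. A minimal sketch (hypothetical helper; it deliberately ignores the stream-boundary case of a pair split across two write chunks, which a real implementation has to buffer):

```typescript
// Sketch: decode a JS (UTF16) string into UTF32 codepoints.
function utf16ToUtf32(input: string, target: Uint32Array): number {
    let written = 0;
    for (let i = 0; i < input.length; ++i) {
        const code = input.charCodeAt(i);
        if (code >= 0xd800 && code <= 0xdbff && i + 1 < input.length) {
            const low = input.charCodeAt(i + 1);
            if (low >= 0xdc00 && low <= 0xdfff) {
                // combine high and low surrogate into one codepoint
                target[written++] = (code - 0xd800) * 0x400 + low - 0xdc00 + 0x10000;
                ++i;
                continue;
            }
        }
        target[written++] = code;
    }
    return written;  // number of codepoints written to target
}
```

Working on the resulting codepoints lets wcwidth and the buffer skip any surrogate handling.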

Regarding a possible UTF8 input encoding and the internal buffer layout I did a rough test. To rule out the much higher impact of the canvas renderer on the total runtime I did it with the upcoming webgl renderer. With my ls -lR /usr/lib benchmark I get the following results:

  • current master + webgl renderer: (profiler screenshot)

  • playground branch, applies #1796, parts of #1811 and webgl renderer: (profiler screenshot)

The playground branch does an early conversion from UTF8 to UTF32 before doing the parsing and storing (the conversion adds ~30 ms). The speedup is mainly gained in the 2 hot functions during input flow, EscapeSequenceParser.parse (120 ms vs. 35 ms) and InputHandler.print (350 ms vs. 75 ms). Both benefit a lot from the typed array switch by saving .charCodeAt calls.
I also compared these results with a UTF16 intermediate typed array - EscapeSequenceParser.parse is slightly faster (~25 ms) but InputHandler.print falls behind due to the needed surrogate pairing and codepoint lookup in wcwidth (120 ms).
Also note that I am already at the limit of how fast the system can provide the ls data (i7 with SSD) - the gained speedup adds idle time instead of making the run any faster.

Summary:
Imho the fastest input handling we can get is a mixture of UTF8 transport + UTF32 for the buffer representation. While UTF8 transport has the best byte pack rate for typical terminal input and removes nonsense conversions from the pty through several layers of buffers up to Terminal.write, the UTF32 based buffer can store the data pretty fast. The latter comes with a slightly higher memory footprint than UTF16, while UTF16 is slightly slower due to more complicated char handling with more indirections.

Conclusion:
We should go with the UTF32 based buffer layout for now. We should also consider switching to UTF8 input encoding, but this still needs some more thinking about the API changes and the implications for integrators (it seems electron's IPC mechanism cannot handle binary data without BASE64 encoding and JSON wrapping, which would counteract the perf efforts).

Buffer layout for the upcoming true color support:

Currently the typed array based buffer layout is the following (one cell):

|    uint32_t    |    uint32_t    |    uint32_t    |
|      attrs     |    codepoint   |     wcwidth    |

where attrs contains all needed flags plus the 9 bit based FG and BG colors, codepoint uses 21 bits (max. is 0x10FFFF for UTF32) plus 1 bit to indicate combining chars, and wcwidth uses 2 bits (ranges from 0-2).

Idea is to rearrange the bits for a better pack rate to make room for the additional RGB values:

  • put wcwidth into unused high bits of codepoint
  • split attrs into a FG and BG group with 32 bit, distribute flags into unused bits
|             uint32_t             |        uint32_t         |        uint32_t         |
|              content             |            FG           |            BG           |
| comb(1) wcwidth(2) codepoint(21) | flags(8) R(8) G(8) B(8) | flags(8) R(8) G(8) B(8) |

Upside of this approach is the relatively cheap access to every value by one index access and at most 2 bit operations (and/or plus shift).

The memory footprint is the same as the current variant, but still quite high at 12 bytes per cell. This could be further optimized, at the cost of some runtime, by switching to UTF16 and an attr indirection:

|        uint16_t        |              uint16_t               |
|    BMP codepoint(16)   | comb(1) wcwidth(2) attr pointer(13) |

Now we are down to 4 bytes per cell plus some room for the attrs, and the attrs could be recycled for other cells, too. Yay, mission accomplished! - Ehm, one second...

Comparing the memory footprint, the second approach clearly wins. Not so for the runtime - there are three major factors that increase the runtime a lot:

  • attr indirection
    The attr pointer needs one additional memory lookup into another data container.
  • attr matching
    To really save room with the second approach, a given attr has to be matched against the already saved attrs. This is a cumbersome action: a direct approach of simply looking through all existing values is O(n) for n attrs saved, and my RB tree experiments ended in almost no memory benefit while still being O(log n), compared to the O(1) index access of the 32 bit approach. Plus a tree has a worse runtime for few saved elements (it pays off somewhere around >100 entries with my RB tree impl).
  • UTF16 surrogate pairing
    With a 16 bit typed array we have to degrade to UTF16 for the codepoints, which introduces a runtime penalty as well (as described in the comment above). Note that codepoints beyond the BMP hardly ever occur; still, the check alone whether a codepoint would form a surrogate pair adds ~ 50 ms.

The sexiness of the second approach is the additional memory saving. Therefore I tested it with the playground branch (see comment above) with a modified BufferLine implementation:

(screenshot: benchmark results with the modified BufferLine)

Yeah, we are kinda back to where we started before changing to UTF8 + typed arrays in the parser. Memory usage dropped from ~ 1.5 MB to ~ 0.7 MB though (demo app with 87 cells and 1000 lines scrollback).

From here on it's a matter of saving memory vs. speed. Since we already saved a lot of memory by switching from JS arrays to typed arrays (dropped from ~ 5.6 MB to ~ 1.5 MB on the C++ heap, cutting off the toxic JS heap behavior and GC), I think we should go with the speedier variant here. Once memory usage becomes a pressing issue again, we can still switch over to a more compact buffer layout as described in the second approach here.

I agree, let's optimise for speed as long as memory consumption is not a concern. I'd also like to avoid the indirection as far as possible because it makes the code harder to read and maintain. We already have quite a lot of concepts and tweaks in our codebase that make it hard for people (including me 😅) to follow the code flow - and bringing in more of these should always be justified by a very good reason. IMO saving another megabyte of memory doesn't justify it.

Nevertheless, I'm really enjoying reading and learning from your exercises, thanks for sharing it in such a great detail!

@mofux Yeah that's true - code complexity is much higher (UTF16 surrogate read ahead, intermediate codepoint calcs, tree container with ref counting on attr entries).
And since the 32 bit layout is mostly flat memory (only combining chars need indirection), there are more optimizations possible (also part of #1811, not yet tested for the renderer).

There is one big advantage of indirecting to an attr object: It is much more extensible. You can add annotations, glyphs, or custom painting rules. You can store link information in a possibly-cleaner and more efficient way. Perhaps define an ICellPainter interface that knows how to render its cell, and that you can also hang custom properties on.

One idea is to use two arrays per BufferLine: a Uint32Array and an ICellPainter array, with one element each per cell. The current ICellPainter is a property of the parser state, so you just reuse the same ICellPainter as long as the color/attribute state doesn't change. If you need to add special properties to a cell, you first clone the ICellPainter (if it might be shared).

You can pre-allocate ICellPainter for the most common color/attribute combinations - at the very least have a unique object corresponding to the default colors/attributes.

Style changes (such as changing default foreground/background colors) can be implemented by just updating the corresponding ICellPainter instance(s), without having to update each cell.

There are possible optimizations: For example use different ICellPainter instances for single-width and double-width characters (or zero-width characters). (That saves 2 bits in each Uint32Array element.) There are 11 available attribute bits in Uint32Array (more if we optimize for BMP characters). These can be used to encode the most common/useful color/attribute combinations, which can be used to index the most common ICellPainter instances. If so, the ICellPainter array can be allocated lazily - i.e. only if some cell in the line requires a "less-common" ICellPainter.

One could also remove the _combined array for non-BMP characters, and store those in the ICellPainter. (That requires a unique ICellPainter for each non-BMP character, so there is a tradeoff here.)

@PerBothner Yeah, an indirection is more versatile and thus better suited for uncommon extras. But since they are uncommon I'd like not to optimize for them in the first place.

Few notes on what I've tried in several testbeds:

  • cell string content
    Coming myself from C++ I tried to look at the problem as I would in C++, so I started with pointers for the content. This was a simple string pointer, but pointing most of the time to a single char string. What a waste. My first optimization therefore was to get rid of the string abstraction by directly saving the codepoint instead of the address (much easier in C/C++ than in JS). This almost doubled the read/write access speed while saving 12 bytes per cell (8 bytes pointer + 4 bytes on string, 64bit with 32bit wchar_t). Sidenote - half of the speed boost here is cache related (cache misses due to random string locations). This became clear with my workaround for combining cell content - a chunk of memory I indexed into when the codepoint had the combined bit set (access was faster here due to better cache locality, tested with valgrind). Carried over to JS the speed boost was not that great due to the needed string to number conversion (still faster though), but the mem saving was even greater (I guess due to some additional management room for JS types). Problem was the global StringStorage for the combined stuff with its explicit memory management, a big antipattern in JS. A quickfix for that was the _combined object, which delegates the cleanup to the GC. It's still subject to change and btw is meant to store arbitrary cell related string content (did this with graphemes in mind, but we will not see them soon as they are not supported by any backend). So this is the place to store additional string content on a cell by cell basis.
  • attrs
    With the attrs I started to "think big" - with a global AttributeStorage for all attrs ever used across all terminal instances (see https://github.com/jerch/xterm.js/tree/AttributeStorage). Memory wise this worked out pretty well, mainly because ppl use only a small set of attrs even with true color support. Performance was not so good - mainly due to the ref counting (every cell had to peek into this foreign memory twice) and the attr matching. And when I tried to adapt the ref thing to JS it just felt wrong - the point where I pressed the "STOP" button. In the meantime it turned out that we already saved tons of memory and GC calls by switching to typed arrays, thus the slightly more costly flat memory layout can pay off its speed advantage here.
    What I tested yesterday (last comment) was a second typed array at line level for the attrs, with the tree from https://github.com/jerch/xterm.js/tree/AttributeStorage for the matching (pretty much like your ICellPainter idea). Well, the results are not promising, therefore I lean towards the flat 32 bit layout for now.

Now this flat 32 bit layout turns out to be optimized for the common stuff, and uncommon extras are not possible with it. True. Well, we still have markers (I'm not used to them, so I cannot tell right now what they are capable of), and yepp - there are still free bits in the buffer (which is a good thing for future needs, e.g. we could use them as flags for special treatment and such).

Tbh for me it's a pity that the 16 bit layout with attr storage performs that badly - halving the memory usage is still a big deal (esp. when ppl start to use scrollback >10k lines) - but the runtime penalty and the code complexity outweigh the higher mem needs atm imho.

Can you elaborate on the ICellPainter idea? Maybe I missed some crucial feature so far.

My goal for DomTerm was to enable and encourage richer interaction than what is enabled by a traditional terminal emulator. Using web technologies enables many interesting things, so it would be a shame to just focus on being a fast traditional terminal emulator. Especially since many use cases for xterm.js (such as REPLs for IDEs) can really benefit from going beyond simple text. Xterm.js does well on the speed side (is anyone complaining about speed?), but it does not do so well on features (people are complaining about missing truecolor and embedded graphics, for example). I think it may be worthwhile to focus a bit more on flexibility and slightly less on performance.

_"Can you elaborate on the ICellPainter idea?"_

In general, ICellPainter encapsulates all the per-cell data except the character code/value, which comes from the Uint32Array. That is for "normal" character cells - for embedded images and other "boxes" the character code/value may not make sense.

interface ICellPainter {
    drawOnCanvas(ctx: CanvasRenderingContext2D, code: number, x: number, y: number);
    // transitional - to avoid allocating IGlyphIdentifier we should replace
    //  uses by pair of ICellPainter and code.  Also, a painter may do custom rendering,
    // such that there is no 'code' or IGlyphIdentifier.
    asGlyph(code: number): IGlyphIdentifier;
    width(): number; // in pixels for flexibility?
    height(): number;
    clone(): ICellPainter;
}

Mapping a cell to an ICellPainter can be done various ways. The obvious one is for each BufferLine to have an ICellPainter array, but that requires an 8-byte pointer (at least) per cell. One possibility is to combine the _combined array with the ICellPainter array: if the IS_COMBINED_BIT_MASK is set, then the ICellPainter also includes the combined string. Another possible optimization is to use the available bits in the Uint32Array as an index into an array: that adds some extra complication and indirection, but saves space.

I'd like to encourage us to check if we can do it the way monaco-editor does it (I think they found a really smart and performant way). Instead of storing such information in the buffer, they allow you to create decorations. You create a decoration for a row / column range and it will stick to that range:

// decorations are buffer-dependant (we need to know which buffer to decorate)
const decoration = buffer.createDecoration({
  type: 'link',
  data: 'https://www.google.com',
  range: { startRow: 2, startColumn: 5, endRow: 2, endColumn: 25 }
});

Later on a renderer could pick up those decorations and draw them.

Please check out this small example that shows how the monaco-editor api looks like:
https://microsoft.github.io/monaco-editor/playground.html#interacting-with-the-editor-line-and-inline-decorations

For things like rendering pictures inside the terminal monaco uses a concept of view zones that can be seen (among other concepts) in an example here:
https://microsoft.github.io/monaco-editor/playground.html#interacting-with-the-editor-listening-to-mouse-events

@PerBothner Thx for clarification and the sketchup. A few notes on that.

We eventually plan to move the input chain + buffer into a webworker in the future. Thus the buffer is meant to operate on an abstract level, and we cannot use any render/representation related stuff there yet, like pixel metrics or DOM nodes. I see your needs for this due to DomTerm being highly customizable, but I think we should do that with an enhanced internal marker API, and we can learn here from monaco/vscode (thx for the pointers @mofux).
I really would like to keep the core buffer clean of uncommon things - maybe we should discuss possible marker strategies in a new issue?

I am still not satisfied with the outcome of the 16 bit layout test results. Since a final decision is not yet pressing (we won't see any of this before 3.11), I'm gonna keep testing it with a few changes (it's still the more intriguing solution to me than the 32 bit variant).

|             uint32_t             |        uint32_t         |        uint32_t         |
|              content             |            FG           |            BG           |
| comb(1) wcwidth(2) codepoint(21) | flags(8) R(8) G(8) B(8) | flags(8) R(8) G(8) B(8) |

I also think we should go with something close to this to start; we can explore other options later, but this will probably be the easiest to get up and running. Attribute indirection definitely has promise IMO, as there typically aren't that many distinct attributes in a terminal session.

I'd like to encourage us to check if we can do it the way monaco-editor does it (I think they found a really smart and performant way). Instead of storing such information in the buffer, they allow you to create decorations. You create a decoration for a row / column range and it will stick to that range:

Something like this is where I'd like to see things go. One idea I had along these lines was to allow embedders to attach DOM elements to ranges to enable custom things to be drawn. There are 3 things I can think of at the moment that I'd like to accomplish with this:

  • Draw link underlines this way (will simplify how they are drawn significantly)
  • Allow markers on rows, like a * or something
  • Allow rows to "flash" to indicate something happened

All of these could be achieved with an overlay and it's a pretty approachable type of API (exposing a DOM node) and can work regardless of renderer type.

I'm not sure we want to get into the business of allowing embedders to change how background and foreground colors are drawn.


@jerch I'll put this on the 3.11.0 milestone as I consider this issue finished when we remove the JS array implementation which is planned for then. https://github.com/xtermjs/xterm.js/pull/1796 is also planned to be merged then, but this issue was always meant to be about improving the buffer's memory layout.

Also, a lot of this later discussion would probably be better had over at https://github.com/xtermjs/xterm.js/issues/484 and https://github.com/xtermjs/xterm.js/issues/1852 (created as there wasn't a decorations issue).

@Tyriar Woot - finally closed :sweat_smile:

🎉 🕺 🍾
