Jsdom: innerText implementation

Created on 25 Sep 2015 · 26 Comments · Source: jsdom/jsdom

jsdom is a great tool for web scraping. However, textContent is a very inconvenient way to get readable text for HTML-to-text conversion.

There is a wonderful article about how useful the neglected innerText is in many cases:

http://perfectionkills.com/the-poor-misunderstood-innerText/

The author suggests getSelection().toString() as a very slow workaround, but getSelection is not implemented in jsdom yet.

Could you consider implementing innerText in jsdom? The author has explored it thoroughly and has even added a simple spec at the end.

feature layout

vsemozhetbyt

👍62 ❤7

Most helpful comment

In case anyone else runs into this issue: I took it one step further and used the sanitize-html package to get basically what the browser is doing. Note that I did not import the JSDOM setup, as I found it wasn't needed when putting this in my Jest setup file; if you're not using Jest, you'll want the global.Element = (new JSDOM()).window.Element setup that @bennypowers recommended:

const sanitizeHtml = require('sanitize-html');

Object.defineProperty(global.Element.prototype, 'innerText', {
  get() {
    return sanitizeHtml(this.textContent, {
      allowedTags: [], // remove all tags and return text content only
      allowedAttributes: {}, // remove all attributes as well
    });
  },
  configurable: true, // make it so that it doesn't blow chunks on re-running tests with things like --watch
});

jettlin on 6 Mar 2019

👍11 ❤8 🎉3

All 26 comments

And what a pity that rangy Selection and innerText library is not compatible with jsdom: https://github.com/timdown/rangy/issues/348

vsemozhetbyt on 25 Sep 2015

So, innerText is not standard, and not implemented in at least one major engine (Firefox). Without a standard, I don't think we should implement it.

domenic on 25 Sep 2015

👎58 👍8

Looks like there's some movement in this whole thing with a draft spec here. See also all the references. There are no issues on the repo though, so I wonder how complete it already is / how quick progress will be.

Sebmaster on 9 Oct 2015

Firefox has implemented: https://bugzilla.mozilla.org/show_bug.cgi?id=264412

WHATWG seems to approve: https://github.com/whatwg/compat/issues/5#issuecomment-168049752

vsemozhetbyt on 25 Jan 2016

👍16

From the spec, it seems like we can't implement innerText properly without basic layout support.

inikulin on 25 Jan 2016

Yeah, this is not really going to be implementable in jsdom anyway, without a lot of infrastructure work... nobody get their hopes up :(.

domenic on 25 Jan 2016

👍1

As to layout support requirement: https://github.com/rocallahan/innerText-spec/issues/2

vsemozhetbyt on 30 Jan 2016

Is there any plan to implement it because of WHATWG adoption?

vsemozhetbyt on 27 Aug 2016

Yeah... Although the spec requires a lot of stuff jsdom doesn't have, around CSS boxes :(. Not sure what to do.

domenic on 27 Aug 2016

👍3

Is there any lib for this to plug along with jsdom?

vsemozhetbyt on 27 Aug 2016

@domenic care to drop some knowledge on why this is such an infrastructure overhaul? We thought the 800 lb gorilla in the room would leave quietly, but it looks like it's not going anywhere. As you know, I have been wrapping my head around the innards of jsdom. Where would be a great place in the repo for a jsdom newbie to start reviewing code?

Thanks in advance 🙏 /cc @vsemozhetbyt

snuggs on 29 Aug 2016

The primary issue is that innerText leans on the layout engine for guidance, and jsdom has no layout engine. See https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute and http://perfectionkills.com/the-poor-misunderstood-innerText/

From the second link:

Notice how innerText almost precisely represents exactly how text appears on the page. textContent, on the other hand, does something strange — it ignores newlines created by
<br> and around styled-as-block elements (<span> in this case). But it preserves spaces as they are defined in the markup.

dmethvin on 29 Aug 2016
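
To make the difference described in the quoted passage concrete, here is a minimal sketch that runs without jsdom. It uses a toy node model (a string for text, or an object with `tag` and `children`), and the names `BLOCK_TAGS`, `textContent`, `render`, and `approxInnerText` are hypothetical helpers invented for illustration, not jsdom or browser APIs. The block-tag list is an assumption standing in for computed CSS `display`, which a real engine would take from layout, so the output only approximates what a browser returns.

```javascript
// Toy node model (hypothetical, not the jsdom API): a node is either a
// string (a text node) or an object of the form { tag, children }.

// Assumption: a fixed tag list stands in for "is styled as a block".
const BLOCK_TAGS = new Set(['div', 'p', 'li', 'h1', 'h2']);

// textContent: concatenate every descendant text node, ignoring elements.
function textContent(node) {
  if (typeof node === 'string') return node;
  return node.children.map(textContent).join('');
}

// Emit a newline for <br> and wrap block-level elements in newlines.
function render(node) {
  if (typeof node === 'string') return node;
  if (node.tag === 'br') return '\n';
  const inner = node.children.map(render).join('');
  return BLOCK_TAGS.has(node.tag) ? `\n${inner}\n` : inner;
}

// Crude innerText-like result: collapse runs of newlines and trim the ends.
// Real engines follow more detailed rules, so results can differ.
function approxInnerText(node) {
  return render(node).replace(/\n{2,}/g, '\n').trim();
}

const tree = {
  tag: 'div',
  children: [
    { tag: 'p', children: ['First'] },
    { tag: 'p', children: ['Second', { tag: 'br', children: [] }, 'line'] },
  ],
};

console.log(JSON.stringify(textContent(tree)));     // "FirstSecondline"
console.log(JSON.stringify(approxInnerText(tree))); // "First\nSecond\nline"
```

The first result runs the words together because element boundaries contribute nothing to textContent; the second keeps them on separate lines, which is the behaviour people in this thread are asking for.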

Still out of scope and no workaround?

vsemozhetbyt on 26 Apr 2017

👍2

Apparently the spec says:

If this element is not being rendered, or if the user agent is a non-CSS user agent, [emphasis added] then return the same value as the textContent IDL attribute on this element.

I think a workaround would be then to simply return textContent.

coreh on 25 May 2017

👍12

We implement enough CSS that I don't think that applies. We just don't implement the layout parts...

domenic on 25 May 2017

👍1

Hi guys, any news on this one?

Suzii on 24 Jan 2018

Just use headless chrome :)

Bnaya on 25 Jan 2018

👎25 ❤1 👍1

@domenic from that spec that @coreh mentioned:
https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute

If this element is not being rendered, or if the user agent is a non-CSS user agent, then return the same value as the textContent IDL attribute on this element.

https://html.spec.whatwg.org/multipage/rendering.html#being-rendered

An element is being rendered if it has any associated CSS layout boxes, SVG layout boxes, or some equivalent in other styling languages.

If jsdom doesn't implement the layout parts, doesn't that mean "not being rendered" applies?

Janpot on 5 Aug 2018

👍4

This message is for anyone reaching this GitHub thread who just wants a way to get their tests passing without changing their function implementations.

Copypasta for the top of your test files:

const { JSDOM } = require('jsdom');

// Expose JSDOM Element constructor
global.Element = (new JSDOM()).window.Element;
// 'Implement' innerText in JSDOM: https://github.com/jsdom/jsdom/issues/1245
Object.defineProperty(global.Element.prototype, 'innerText', {
  get() {
    return this.textContent;
  },
});

Naturally, caveats from the above discussion apply.

bennypowers on 10 Dec 2018

👍9

I had a similar need but wanted to go slightly further than just using textContent. Again, this won't be an accurate representation of what browsers actually do, especially with respect to elements hidden by CSS, but it's good enough for my use case:

function innerText(el) {
  el = el.cloneNode(true) // can skip if mutating the original isn't a concern
  el.querySelectorAll('script,style').forEach(s => s.remove())
  return el.textContent
}

macgyver on 11 Feb 2020

👍4 ❤1

What a pity!

ty5491003 on 10 Mar 2020

Apparently the spec says:

If this element is not being rendered, or if the user agent is a non-CSS user agent, [emphasis added] then return the same value as the textContent IDL attribute on this element.

I think a workaround would be then to simply return textContent.

We implement enough CSS that I don't think that applies. We just don't implement the layout parts...

@domenic please consider a more liberal interpretation of the spec

textContent is explicitly allowed as fallback, when application of CSS rules is too expensive

also, innerText is specified as getter and setter

milahu on 18 Sep 2020
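
Since the comment above points out that innerText is specified as both a getter and a setter, here is a minimal sketch of a fallback that covers both directions by delegating to textContent. `installInnerTextFallback` and `FakeElement` are hypothetical names used only for illustration; `FakeElement` is a stand-in so the sketch runs without jsdom. With jsdom one would presumably pass an element prototype from a jsdom window instead. The setter here is an assumption-level simplification: it does not try to reproduce the line-break handling that the real setter performs.

```javascript
// Hypothetical helper (not part of jsdom): installs a textContent-backed
// `innerText` accessor on any prototype whose instances have `textContent`.
function installInnerTextFallback(proto) {
  Object.defineProperty(proto, 'innerText', {
    get() {
      return this.textContent;
    },
    set(value) {
      // Simplification: delegate to textContent without special newline handling.
      this.textContent = value;
    },
    configurable: true, // allows the accessor to be redefined or removed later
  });
}

// Stand-in element class so the sketch is runnable on its own.
class FakeElement {
  constructor() {
    this.textContent = '';
  }
}

installInnerTextFallback(FakeElement.prototype);

const el = new FakeElement();
el.innerText = 'hello\nworld';
console.log(el.textContent === el.innerText); // both read the same string
```

As the rest of the thread makes clear, this is a test convenience and not an implementation of the specified behaviour.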

Given that I am the spec editor, I can state with certainty that "when application of CSS rules is too expensive" is not what the spec is saying.

domenic on 18 Sep 2020

👀1 👍1

... that was my interpretation of "if the user agent is a non-CSS user agent"

what's the difference between a "CSS user agent" and a "non-CSS user agent"?

what about:
a CSS user agent can "apply CSS rules" and output the result (graphic or textual)
a non-CSS user agent is too dumb to "apply CSS rules"

We implement enough CSS that I don't think that applies.

what do you mean? window.getComputedStyle?

a fallback to textContent is still better than not implementing a standard interface

milahu on 18 Sep 2020

Maybe we can just use the textContent value in place of innerText while running tests with jsdom. For example:

describe('mytest', () => {
  beforeAll(() => {
    Object.defineProperty(HTMLElement.prototype, 'innerText', {
      get() {
        return this.textContent;
      }
    });
  });
  it('should be ok', () => {
    // test assertions
  });
});
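
A property defined without `configurable: true` cannot be redefined or deleted afterwards, which is likely why an earlier comment sets that flag for watch-mode reruns. The sketch below shows one way to install the shim and undo it again, for example from `beforeAll` and `afterAll`. `shimInnerText` and `StubElement` are hypothetical names for illustration, and `StubElement` only exists so the example runs without jsdom or a test runner.

```javascript
// Hypothetical helper: shims `innerText` on a prototype and returns a
// function that restores whatever was there before.
function shimInnerText(proto) {
  const previous = Object.getOwnPropertyDescriptor(proto, 'innerText');
  Object.defineProperty(proto, 'innerText', {
    get() {
      return this.textContent;
    },
    configurable: true, // required so the shim can be removed again
  });
  return function restore() {
    if (previous) {
      Object.defineProperty(proto, 'innerText', previous);
    } else {
      delete proto.innerText;
    }
  };
}

// Stand-in element class so the sketch is runnable on its own.
class StubElement {
  constructor(text) {
    this.textContent = text;
  }
}

const restore = shimInnerText(StubElement.prototype);
const during = new StubElement('abc').innerText; // shim active: 'abc'
restore();
const after = new StubElement('abc').innerText; // shim removed: undefined
console.log(during, after);
```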