Latex3: Document catcodes for xparse's "verbatim" argument type, document how to reproduce \verb

Created on 25 Jun 2020 · 43 Comments · Source: latex3/latex3

When fontenc is loaded with its T1 option, \NewDocumentCommand with verbatim argument gobbles the first - if its content contains a -- (irrespective of the delimiters used):

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{xparse}
\NewDocumentCommand {\myverb} { v } {#1}
\begin{document}
\ttfamily
\verb|--all|

\myverb{-all}

\myverb{--all}

\myverb{---all}
\end{document}

bug documentation xparse

Source

dbitouze

All 43 comments

What you are seeing is the -- ligature in the typewriter font. If you write \texttt{--} you'll also see a single dash, but if you copy from the PDF, you'll see that it's indeed an en-dash. You can check that by feeding the grabbed argument to \showtokens or by using \@noligs (LaTeX uses that in \verb to have -- print --):

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{xparse}
\makeatletter
% \NewDocumentCommand {\myverb} { v } { \showtokens{#1} }
\NewDocumentCommand {\myverb} { v } {#1}
\begin{document}
\makeatletter
\ttfamily \@noligs
-- and \verb|--all|

- and \myverb{-all}

-- and \myverb{--all}

--- and \myverb{---all}
\end{document}

PhelypeOleinik on 25 Jun 2020
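
To see the ligature in isolation, here is a minimal sketch. The `\kern0pt` trick is only an illustration and is an assumption of this note, not something proposed at this point in the thread. In the T1-encoded typewriter font two adjacent hyphens combine into a single en-dash glyph, while a zero-width kern between them should keep them as two separate hyphens.

~~~~
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
\ttfamily
% Two adjacent hyphens: the font's ligature table turns them into one en-dash.
--all

% A zero-width kern between the hyphens prevents the ligature from forming.
-\kern0pt -all
\end{document}
~~~~

Copying the text from the resulting PDF, as suggested above, is one way to confirm which glyphs were actually used.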

But v in xparse is supposed to be verbatim (is it not?), and in LaTeX that means typewriter with ligatures suppressed, so v should do that too in my opinion.

FrankMittelbach on 25 Jun 2020

It just grabs verbatim (verbatim here being some equivalent of \let\do\@makeother \dospecials). \@noligs could be added in the catcode setup for scanning the argument. On the other hand, this would insert active tokens where (theoretically) there should only be catcode-other tokens, so if the argument is used for something other than typesetting, it could be problematic.

Perhaps some way to allow the command to add its own catcode settings, like:

\NewDocumentCommand {\myverb} { v{\@noligs} } {#1}

PhelypeOleinik on 25 Jun 2020

@FrankMittelbach I agree with PhelypeOleinik that “verbatim” means “grab whatever the user has written verbatim”. The <hyphen hyphen> to <endash> ligature is more a “font feature” than an “argument-grabbing bug”. Also, \ttfamily does not mean “monospaced font = no ligature whatsoever”. Some monospaced typefaces can be used as body type (not just code), so the hyphen ligatures should not be suppressed in such cases.

RuixiZhang42 on 25 Jun 2020

But isn't \NewDocumentCommand {\myverb} { v } {#1}\myverb{--all} supposed to behave as \verb|--all|?

dbitouze on 25 Jun 2020

@dbitouze — not really. \verb has two parts to it — argument grabbing and formatting. The “v” setting in \NewDocumentCommand only does the former.

@Phelype — unless I’m missing something, doesn’t your suggestion do no more than this?:

\NewDocumentCommand {\myverb} { v } { {\@noligs #1} }

In that case I don’t think it is necessary. So back to @dbitouze, the way to replicate \verb is something like:

\makeatletter
\NewDocumentCommand {\myverb} { v } { {\@noligs\ttfamily #1} }
\makeatother

wspr on 25 Jun 2020

👎1 👍1

But isn't \NewDocumentCommand {\myverb} { v } {#1}\myverb{--all} supposed to behave as \verb|--all|?

Not exactly: v is only about parsing the argument, which is read verbatim. \verb also typesets the content in a non-standard monospace font setup that suppresses ligatures. So rather than just #1, which typesets the argument in the current font, you'd need to do

\verbatim@font\@noligs
\language\l@nohyphenation

except that \@noligs requires the characters in
\def\verbatim@nolig@list{\do\`\do\<\do\>\do\,\do\'\do\-}
to be active. So we could consider either making v set those as active, or providing a wrapper around \scantokens that arranges that \@noligs can work here.

davidcarlisle on 25 Jun 2020

@dbitouze no, the similarity is only in the way the argument can be delimited: you can use \myverb!abc!. The result is documented as

which will result in the grabbed argument consisting of tokens of category codes 12 (“other”) and 13 (“active”), except spaces, which are given category code 10 (“space”).

The argument parser only reads an argument; it doesn't typeset it. It would make no sense to add font commands or other commands to it, or even to preprocess it to apply \@noligs by default: there are other ways to suppress ligatures. With luatex one would perhaps apply Ligatures=ResetAll, and with pdflatex one could use \pdfnoligatures with a slightly different font:

~~~~
\RequirePackage{fix-cm}
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{xfp,xparse}

\makeatletter
\NewDocumentCommand {\myverb} { v } {{\fontsize{\fpeval{\f@size+0.0001}}{\normalbaselineskip}\selectfont\pdfnoligatures\font #1}}
\makeatother

\begin{document}
--all

\verb|--all|

\myverb{-all}

\myverb{--all}

\myverb{---all}

\footnotesize
--all \myverb{--all}

\ttfamily
--all

\verb|--all|

\myverb{-all}

\myverb{--all}

\myverb{---all}

\footnotesize
--all \myverb{--all}

\end{document}
~~~~

u-fischer on 25 Jun 2020

@phelype — unless I’m missing something, doesn’t your suggestion do no more than this?:
\NewDocumentCommand {\myverb} { v } { {\@noligs #1} }

@wspr Kind of, but no: \@noligs changes the catcode of - (and a bunch of others) to 13, and then defines it as \def-{\leavevmode\kern\z@\char`-}. Being a catcode change, it has to be done before the argument is grabbed (unless we are considering \scantokens), hence my suggestion to allow a “catcode setup” argument to v (though it would have to be optional: \NewDocumentCommand {\myverb} { v[\@noligs] } {#1}).

PhelypeOleinik on 25 Jun 2020
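
The ordering constraint described above can be sketched without xparse. This is a minimal illustration under stated assumptions: `\demoverb` and `\demo@grab` are hypothetical names, and a classic `|`-delimited macro stands in for the v-type argument, which currently offers no hook for extra catcode setup. Because `\@noligs` runs before the text is tokenized, the hyphens arrive as active characters and no ligature should form.

~~~~
\documentclass{article}
\usepackage[T1]{fontenc}
\makeatletter
% Hypothetical names: \demoverb and \demo@grab are not part of any package.
\newcommand\demoverb{%
  \begingroup
  \ttfamily
  \@noligs      % make the characters in \verbatim@nolig@list active
                % BEFORE the following text is tokenized
  \demo@grab}
% Grab everything up to the next | and typeset it inside the group,
% while the active-character definitions are still in force.
\def\demo@grab#1|{#1\endgroup}
\makeatother
\begin{document}
\demoverb --all|
\end{document}
~~~~

This sketch only deals with ligatures. Unlike \verb or the v-type argument, it does not change the catcodes of special characters such as \ or {, so it is not a replacement for either.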

Thanks @phelype — it’s been a while since I looked inside that macro :)
In that case I like the idea of the setup argument… even if in this instance other approaches can also work to disable the ligatures.

wspr on 25 Jun 2020

We don't have optional data in the arg spec, so it would need a new letter (w?)

josephwright on 25 Jun 2020

Or a breaking change to v-type

josephwright on 25 Jun 2020

I would rather vote for V (matching that we have o and O and d and D) than consider a breaking change.

FrankMittelbach on 25 Jun 2020

We don't have optional data in the arg spec, so it would need a new letter (w?)

Can't we add one?

Or perhaps, since we have o and O{}, it seems natural to have v and V{}. Of course the argument would mean different things...

PhelypeOleinik on 25 Jun 2020

Imho if the catcodes should be customizable for the v-type it would make sense to use the cctab code, and not some arbitrary command like \@noligs. Then the reading of the command would only set the catcodes and definitions of active chars should then be done in the macro body.

u-fischer on 25 Jun 2020

@u-fischer So I better get that PR for l3cctab in ...

josephwright on 25 Jun 2020

👍1

Currently, I've no idea on how l3cctab works and how it could be helpful for the current issue but I am really interested :)

dbitouze on 25 Jun 2020

My point was that semantically v is 'verbatim' whereas what's needed here is not. Importantly, you have to worry if the delimiting chars are altered by the catcode table or whatever. Also, we've been consistent that uppercase letters -> some optional-arg variant of a lowercase one. So I'd say something like c{<table>} (= 'catcode') would be right.

josephwright on 25 Jun 2020

I'll get the cctab stuff sorted today or tomorrow if I can, so we can discuss.

josephwright on 25 Jun 2020

@dbitouze A catcode table is a way of having a 'fixed' set of catcodes for all chars(*). It means you get a one-token interface for the changes, so \c_document_cctab for normal catcodes, \c_initex_cctab for IniTeX, etc. The idea is that this is a lot clearer and more reliable than one-by-one setting.

In XeTeX, we don't have the necessary primitive, so I can only cover chars 0 to 255 with reasonable performance.

josephwright on 25 Jun 2020
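
As a rough illustration of the "one-token interface", the following sketch reads some text under a predefined catcode table. It assumes an expl3 release in which l3cctab works on all engines (newer than the state of the code discussed in this thread) and that `\c_str_cctab` gives every character catcode 12 except the space, which keeps catcode 10. `\demoverb` and `\__demo_grab:w` are hypothetical names chosen for this note.

~~~~
\documentclass{article}
\usepackage[T1]{fontenc}
\ExplSyntaxOn
% Hypothetical helper: switch to the "string" catcode table, then read
% everything up to the next | with those catcodes in force.
\cs_new_protected:Npn \demoverb
  {
    \cctab_begin:N \c_str_cctab
    \__demo_grab:w
  }
\cs_new_protected:Npn \__demo_grab:w #1 |
  {
    \cctab_end:     % restore the catcodes that were active before
    \texttt {#1}    % typeset the grabbed characters as they are
  }
\ExplSyntaxOff
\begin{document}
\demoverb \foo{#}$|
\end{document}
~~~~

A single `\cctab_begin:N` replaces the usual character-by-character catcode assignments. The sketch says nothing about ligatures or active characters, which would still need separate handling in the macro body.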

@josephwright (off topic) It seems to me that you wanted to add a footnote but Markdown didn’t know that.

joulev on 25 Jun 2020

Suppressing a known set of ligatures during output can also be done by using \tl_replace_all:Nnn and replacing the problematic character with something that won't form the ligature.

Skillmon on 25 Jun 2020
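
A minimal sketch of this replacement idea, restricted to the hyphen: each - in the grabbed argument is followed by a zero-width kern, so neighbouring hyphens cannot combine. The variable name `\l__demo_verb_tl` is hypothetical, and the other ligatures available in T1 typewriter fonts would each need their own `\tl_replace_all:Nnn` line.

~~~~
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{xparse}
\ExplSyntaxOn
\tl_new:N \l__demo_verb_tl   % hypothetical scratch variable
\NewDocumentCommand { \myverb } { v }
  {
    \tl_set:Nn \l__demo_verb_tl {#1}
    % Follow every hyphen with a zero-width kern; \scan_stop: ends the
    % dimension so that a following space in the argument is not swallowed.
    \tl_replace_all:Nnn \l__demo_verb_tl { - } { - \kern 0pt \scan_stop: }
    \group_begin:
      \ttfamily \l__demo_verb_tl
    \group_end:
  }
\ExplSyntaxOff
\begin{document}
\verb|a--b ---c|

\myverb|a--b ---c|
\end{document}
~~~~

The match works because the v-type argument delivers the hyphen with catcode 12, the same catcode it has in the pattern typed under \ExplSyntaxOn.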

@Skillmon Good point: one could take the verbatim material and replace tokens. As everything is strictly verbatim, that's probably an easier approach than worrying about catcode setup.

josephwright on 26 Jun 2020

Regardless of the outcome of this discussion, it will be useful to document in
xparse.pdf how to reproduce the behaviour of \verb using \NewDocumentCommand.

blefloch on 26 Jun 2020

👍3

@josephwright depending on the number of tokens to be replaced the performance will be a lot worse with the \tl_replace_all:Nnn approach.

Skillmon on 26 Jun 2020

Also, how will you know which characters (in a very large font) need replacing?

Further: what does typesetting ‘verbatim with a monospaced font’ mean for many scripts (non-European)?

car222222 on 27 Jun 2020

@car222222 in a very large font, font features are the only reasonable way to suppress them all; LaTeX can't know about all the ligatures possible in a font. But at least the characters supported by LaTeX2e could be covered easily (it's just a \tl_map_function:NN and \tl_replace_all:Nnn).

Further: AFAIK there are double spaced symbols in some monospaced fonts for some non-European scripts.

Skillmon on 27 Jun 2020

I would suggest just wrapping every character in an hbox. It seems to work
reasonably well, but I didn't test extensively.

\RequirePackage{xparse}
\ExplSyntaxOn
\NewDocumentCommand{\myverb}{v}{\texttt{\str_map_function:nN{#1}\hbox:n}}
\ExplSyntaxOff
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
\verb|a--b ---c ``<''|

\myverb|a--b ---c ``<''|
\end{document}

blefloch on 27 Jun 2020

I would suggest just wrapping every character in an hbox. It seems to work reasonably well, but I didn't test extensively.

Try with |a--bgrüße ---c ``<''|

u-fischer on 27 Jun 2020

Ok, second attempt (the v arg keeps active chars as is): insert \kern 0pt\relax before all non-active chars.

\RequirePackage{xparse}
\ExplSyntaxOn
\tl_new:N \l__myverb_tl
\cs_new:Npn \__myverb:n #1
  {
    \token_if_active:NF #1 { \kern 0pt\relax }
    \exp_not:n {#1}
  }
\NewDocumentCommand { \myverb } { v }
  {
    \tl_set:Nn \l__myverb_tl {#1}
    \tl_replace_all:Nnn \l__myverb_tl { ~ } { { ~ } }
    \group_begin:
      \use:c { verbatim@font }
      \use:x { \tl_map_function:NN \l__myverb_tl \__myverb:n }
    \group_end:
  }
\ExplSyntaxOff
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
\verb|a--bgrüße ----c ``<''|

\myverb|a--bgrüße ----c ``<''|
\end{document}

blefloch on 27 Jun 2020

Ok, second attempt (the v arg keeps active chars as is):

But not every one. E.g. the quote here is not active:

~~~~
\documentclass{article}
\usepackage[ngerman]{babel}
\begin{document}
"a

\ExplSyntaxOn
\NewDocumentCommand { \myverb } { m v }
{
\tl_analysis_show:n{#1}
\tl_analysis_show:n{#2}
}
\ExplSyntaxOff

\myverb{"a}|"a|
\end{document}
~~~~

gives

~~~~
The token list contains the tokens:

" (active character=macro:->\active@prefix "\active@char" )
a (the letter a).
}

l.29 \myverb{"a}|"a|

?
The token list contains the tokens:

" (the character ")
a (the character a).
}
~~~~

u-fischer on 27 Jun 2020

@u-fischer but which of these two outputs does one want in the ‘verbatim text’?

The non-active " case looks to me like what LaTeX used to mean by ‘verbatim’.

But maybe some people expect the output to be, for example, ä which does not look much like ‘verbatim’ to other people.

As I have written so many times: what does ‘verbatim’ mean outside printable 7-bit ASCII?

car222222 on 28 Jun 2020

@car222222 my example is about input, not output. I'm not outputting anything, only analysing what the argument grabbed by xparse looks like. From the documentation I expected the argument to leave active chars as they are and convert all other tokens to catcode 12, and spaces to catcode 10. But as some tests show, my expectation was wrong: active chars set up by babel are converted to catcode 12 too, as the argument parser contains a \dospecials.

u-fischer on 28 Jun 2020

One other question: Should this v-type argument collapse consecutive spaces into one token (with catcode 10), or preserve the number of spaces “verbatim”? Exactly how “verbatim” should it be (I don’t think it is well-defined right now in the manual)?

RuixiZhang42 on 28 Jun 2020

@u-fischer Input/output ?? But you answered my implied question.

You want to keep the active " but I think that verbatim should produce the non-active " so that no ä glyph can be output.

car222222 on 29 Jun 2020

When I say “I think” I mean that this is what I would expect to follow from the original (40 years ago) concept of ‘verbatim’ in TeX/LaTeX.

Maybe that concept+definitions needs to be changed, but to what exactly?

car222222 on 29 Jun 2020

Or as @RuixiZhang42 put it: how verbatim is 21st Century verbatim ?

car222222 on 29 Jun 2020

Interestingly, \verb*|Y Z| (with two consecutive tabs between Y and Z) gives a single space
because \verb changes the catcode of space but not of tab.

I agree v-type (and "+v" too) argument is not well-defined right now. Would the
following make sense, using a catcode table?

Give catcode 13 (active) to spaces and a few others (`\^^M for instance?).
Give catcode 12 (other) to all other bytes 0-127 (in pdfTeX), 0-255 (in XeTeX,
upTeX, pTeX), or 0-1114111 (LuaTeX).
In pdfTeX, give catcode 13 (active) to bytes 128-255.

Maybe it is better to keep some stuff as catcode 11 (letters)?

blefloch on 29 Jun 2020

You want to keep the active "

No I didn't say that. I only wrote that I expected this to happen after reading the documentation. This only implies that the documentation needs improving.

but I think that verbatim should produce the non-active " so that no ä glyph can be output.

You can get "a as output also with an active ": you only need to give it locally a suitable definition.

u-fischer on 29 Jun 2020

@u-fischer You can get "a as output also with an active "

Sure, but I do not know whether ‘verbatim mode’ should need such customisation? Maybe it should?

Back to the question: what does ‘verbatim’ mean, both for reading an input token list and also for output (including what font, with what ligatures, kerning, other font features, etc.,etc.).

Maybe something like this (for input of printable ASCII only):
no character is removed and no character code is changed; the catcode of most becomes 12, except the following, which are changed to (or kept at) 13: . . . .
Plus the following non-printable ASCII characters, which also become catcode 13 tokens: . . .

The environment must be customised to deal with the output (text representation) of any 7-bit ASCII character that might, by the above process, turn out to be internally a catcode 13 token.

[Quite a bit different from the original, but still covering only ASCII input, like the original.]

car222222 on 29 Jun 2020

@blefloch wrote: In pdfTeX, give catcode 13 (active) to bytes 128-255.

Would you do this even if inputenc is not being used? What definition would you give them?

I am unsure if anyone has thought much about utf-8 inputenc input to verbatim mode. Will this be supported, and what does it mean?

car222222 on 29 Jun 2020

I am unsure if anyone has thought much about utf-8 inputenc input to verbatim mode. Will this be supported, and what does it mean?

It is supported, at least for T1 encoding. For Greek or similar you would have to redefine \verbatim@font:

~~~
\documentclass{article}
\usepackage[LGR,T1]{fontenc}
\begin{document}
\verb|grüße € |

\makeatletter
\def\verbatim@font{\ttfamily}
\fontencoding{LGR}\selectfont
\verb|Γειά σου Κόσμε|
\end{document}
~~~

u-fischer on 29 Jun 2020

Possible suggestions for what the verbatim argument should do. I tend towards option 1, but I may be missing some aspects.

1. Update catcodes from 0 to 255 (in any engine), keeping catcodes 11, 12, 13 (letter/other/active) unchanged, changing catcode 10 (space) to catcode 13 (active), and all other catcodes to 12 (other). Then apply the catcode changes in \@noligs, namely make items of \verbatim@nolig@list active. Then grab the argument: this gives a result with catcodes 11, 12, 13 only. It is easy to convert back to a string for users who don't want active characters. For those wanting inputenc or babel-shorthand support, all active characters have been kept. It also supports the ligature suppression.
2. Use a catcode table \l_xparse_verbatim_cctab that can be changed by the user. This is hard to keep in sync with babel shorthands that may change mid-document. It is also unwieldy for the package writer, since they need to have a wrapper function that changes \l_xparse_verbatim_cctab before parsing the verbatim argument.
3. Variant of 2. where the cctab is given as an argument to v (optional argument, or new letter). Again, this cannot be kept in sync with babel shorthands and changes to the \verbatim@nolig@list.

blefloch on 13 May 2021
