Ctags: PHP heredoc (<<<) syntax breaks tags creation

Created on 20 Nov 2020  ·  8 Comments  ·  Source: universal-ctags/ctags

SUMMARY:

Tags generation stops when the PHP heredoc (<<<) syntax is encountered in a file. The nowdoc syntax is basically the same, so it breaks the file parsing too.

The name of the parser:

not sure about this. Assuming PHP

The command line you used to run ctags:
$ ctags --options=NONE foo.php
The content of input file:
<?php

class LivingBeings {

    public function doSomething()
    {
        $foo = <<<FOO
        FOO;
    }

    public function doSomethingElse()
    {
    }
}
The tags output you are not satisfied with:

The doSomethingElse method is missing from the tags output. As soon as I comment out the heredoc portion, the method is indexed normally, as you can see in the "expected output" section below.

!_TAG_FILE_FORMAT   2   /extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED   1   /0=unsorted, 1=sorted, 2=foldcase/
!_TAG_OUTPUT_EXCMD  mixed   /number, pattern, mixed, or combine/
!_TAG_OUTPUT_FILESEP    slash   /slash or backslash/
!_TAG_OUTPUT_MODE   u-ctags /u-ctags or e-ctags/
!_TAG_PATTERN_LENGTH_LIMIT  96  /0 for no limit/
!_TAG_PROC_CWD  /tmp/   //
!_TAG_PROGRAM_AUTHOR    Universal Ctags Team    //
!_TAG_PROGRAM_NAME  Universal Ctags /Derived from Exuberant Ctags/
!_TAG_PROGRAM_URL   https://ctags.io/   /official site/
!_TAG_PROGRAM_VERSION   5.9.0   /5a136315/
LivingBeings    foo.php /^class LivingBeings {$/;"  c
doSomething foo.php /^    public function doSomething()$/;" f   class:LivingBeings
The tags output you expect:
!_TAG_FILE_FORMAT   2   /extended format; --format=1 will not append ;" to lines/
!_TAG_FILE_SORTED   1   /0=unsorted, 1=sorted, 2=foldcase/
!_TAG_OUTPUT_EXCMD  mixed   /number, pattern, mixed, or combine/
!_TAG_OUTPUT_FILESEP    slash   /slash or backslash/
!_TAG_OUTPUT_MODE   u-ctags /u-ctags or e-ctags/
!_TAG_PATTERN_LENGTH_LIMIT  96  /0 for no limit/
!_TAG_PROC_CWD  /tmp/   //
!_TAG_PROGRAM_AUTHOR    Universal Ctags Team    //
!_TAG_PROGRAM_NAME  Universal Ctags /Derived from Exuberant Ctags/
!_TAG_PROGRAM_URL   https://ctags.io/   /official site/
!_TAG_PROGRAM_VERSION   5.9.0   /5a136315/
LivingBeings    foo.php /^class LivingBeings {$/;"  c
doSomething foo.php /^    public function doSomething()$/;" f   class:LivingBeings
doSomethingElse foo.php /^    public function doSomethingElse()$/;" f   class:LivingBeings
The version of ctags:
$ ctags --version
Universal Ctags 5.9.0(5a136315), Copyright (C) 2015 Universal Ctags Team
Universal Ctags is derived from Exuberant Ctags.
Exuberant Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert
  Compiled: Nov 20 2020, 11:46:20
  URL: https://ctags.io/
  Optional compiled features: +wildcards, +regex, +iconv, +option-directory, +xpath, +yaml, +packcc
How do you get ctags binary:

Building it locally:

$ cd ctags_source
$ make clean && make distclean
$ ./autogen.sh
$ ./configure --prefix=$HOME
$ make
$ make install
Parser buenhancement

All 8 comments

@jespinal, are you talking about this change: https://wiki.php.net/rfc/flexible_heredoc_nowdoc_syntaxes ?

$ git diff |cat
diff --git a/parsers/php.c b/parsers/php.c
index e3fdc241..ace25561 100644
--- a/parsers/php.c
+++ b/parsers/php.c
@@ -682,6 +682,8 @@ static void parseHeredoc (vString *const string)
            int extra = EOF;

            c = getcFromInputFile ();
+           if (c == ' ' || c == '\t')
+               c = getcFromInputFile ();
            for (len = 0; c != 0 && (c - delimiter[len]) == 0; len++)
                c = getcFromInputFile ();

$ cat input.php
<?php
// Taken from https://github.com/universal-ctags/ctags/issues/2717
// submitted by @jespinal
class LivingBeings {

    public function doSomething()
    {
        $foo = <<<FOO
        FOO;
    }

    public function doSomethingElse()
    {
    }
}
$ u-ctags -o - input.php
LivingBeings    input.php   /^class LivingBeings {$/;"  c
doSomething input.php   /^    public function doSomething()$/;" f   class:LivingBeings
$

@masatake that's not enough, because the ending marker used to have to be on its own line, while the new version lifts that restriction. I don't find the explanation very clear:

The implementation I am proposing avoids this problem by checking to see if a continuation of the found marker exists, and if so, then if it forms a valid identifier.

but I'd say it means that a line starting with the terminating marker ends the heredoc unless the marker is immediately followed by an identifier character. So END; is a termination (given that the marker is END), but ENDFOO isn't.
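That reading of the rule can be sketched in C as below. This is only an illustration, not the code in parsers/php.c: `is_identifier_char` and `is_closing_marker` are hypothetical helpers, and the sketch works on a whole line held in memory rather than on the character stream the parser reads.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: PHP identifier characters are ASCII letters,
 * digits, underscore, and any byte in the range 0x80-0xff. */
static bool is_identifier_char(unsigned char c)
{
    return isalnum(c) || c == '_' || c >= 0x80;
}

/* Hypothetical helper: returns true when `line`, after optional leading
 * spaces or tabs, starts with `marker` and the marker is not continued
 * by an identifier character. */
static bool is_closing_marker(const char *line, const char *marker)
{
    size_t len = strlen(marker);

    /* PHP 7.3+ allows the closing marker to be indented. */
    while (*line == ' ' || *line == '\t')
        line++;

    if (strncmp(line, marker, len) != 0)
        return false;

    /* "END;" or "END," terminates; "ENDFOO" is a different identifier. */
    return !is_identifier_char((unsigned char) line[len]);
}
```

With the marker END, the helper accepts "END;" and an indented "    END;" but rejects "ENDFOO".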

BTW, as this is a backward-incompatible syntactic change, I don't know what we want to do about it. But I guess if PHP is happy to break it, we can as well, especially as it's fairly unlikely to cause problems. Ideally I guess we'd use the current syntax for *.php[1-6] and the new one for the rest, but that might be more trouble than it's worth.

@jespinal, are you talking about this change: https://wiki.php.net/rfc/flexible_heredoc_nowdoc_syntaxes ?

Sorry, @masatake , for some reason I was not notified of your question.

Yes, I'm talking about that change. It was actually implemented in PHP 7.3 (the current stable version is 7.4, and version 8 is at hand). I'm not sure why this was not reported earlier, considering the vast user base of both ctags and PHP.

I'm adding a few screenshots of code snippets derived from the previous example in order to (hopefully) shed some light on what they consider valid/invalid with regard to the new heredoc/nowdoc syntax (the RFC is not clear enough, I think).

In this example, the second 'TEXT' is the ending marker. So the third one, 'TEXT;', is a syntactically invalid string in the view of the parser, as it would expect only a semicolon or a comma:

[screenshot: test-001-2020-11-24 22-33-30]

A case similar to the previous one:

[screenshot: test-003-2020-11-24 22-37-04]

Had it been a semicolon or a comma, the PHP parser would have been happy. E.g.

        echo <<<TEXT
            some string 
        TEXT, 'some other string';

In the eyes of the parser, this is the same as:

    echo 'some string', 'some other string';

The following is a valid example, as the parser knows that 'TEXT' and 'TEXTUAL' are two different strings:

[screenshot: test-002-2020-11-24 22-36-25]

Here are a couple of snippets that are invalid because of wrong indentation. Specifically, they violate the RFC statement: "If the closing marker is indented further than any lines of the body, then a ParseError will be thrown:"

[screenshot: test-004-2020-11-24 22-38-11]

[screenshot: test-005-2020-11-24 22-39-38]
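The quoted indentation rule could be checked roughly as follows. This is a sketch of the RFC sentence only, not of anything in ctags: `leading_whitespace` and `closing_indent_is_valid` are hypothetical helpers. It assumes blank body lines are exempt, and it does not look at mixed tabs and spaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: number of leading spaces/tabs on a line. */
static size_t leading_whitespace(const char *line)
{
    size_t n = 0;

    while (line[n] == ' ' || line[n] == '\t')
        n++;
    return n;
}

/* Hypothetical helper: returns true when no non-blank body line is
 * indented less than the line holding the closing marker. */
static bool closing_indent_is_valid(const char *const *body, size_t count,
                                    const char *closing_line)
{
    size_t closing = leading_whitespace(closing_line);

    for (size_t i = 0; i < count; i++) {
        size_t indent = leading_whitespace(body[i]);

        /* Assumption: lines with nothing but whitespace are exempt. */
        if (body[i][indent] == '\0')
            continue;
        if (indent < closing)
            return false; /* PHP would throw a ParseError here */
    }
    return true;
}
```

A body line indented by four spaces followed by a closing marker indented by eight would be rejected, which matches the kind of snippet described above.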

@jespinal thanks, but if you have normative text that would be even better :) It's always tricky to guess the logic based solely on a few cases, whereas if we have the normative text we can just implement that and it should hopefully work. And actually, I think we have enough with @masatake's link and your info :+1:

@masatake I don't promise anything given the little time I find lately, but I'll try to give this a look soon unless -- you beat me to it :)

BTW @jespinal if nobody complained I really think it's because those syntaxes see very little use, and we support the pre-7.3 syntax, so the only cases where one would see a problem are with 7.3+ syntax usage, which implies using heredoc/nowdoc in the first place :)

You were inactive for a while, so I didn't expect to get a comment from you.
But now we get the "self-assigned" sign from you. @b4n, thank you for the offer.

@masatake you were wise not to expect much of me, as I indeed didn't find time for many UCtags/Geany contributions lately :disappointed: . I'm trying to find a way to allocate time here again, so I hope I'll be more active, but I can't promise just yet.

Nonetheless, see #2734 for a fix for the issue at hand :)

Thank you!
