Ctags: parallel ctags

Created on 13 Jan 2016  ·  19 Comments  ·  Source: universal-ctags/ctags

As far as I understand it, ctags is single-threaded. Are there any plans to support parallelisation? It may speed things up on huge codebases.

All 19 comments

Hello,

While a built-in parallel implementation could be interesting, it's already possible to parallelize updating a big codebase by launching different ctags processes on different directories and then merging the generated files (which can be done simply by dropping the lines beginning with ! from all files but one and using sort --merge on all the files afterwards).
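Something along these lines, for example (a minimal sketch in bash; the directory names are made up, and the tag files are assumed to be sorted, which is the ctags default):

```sh
# Index two subtrees in parallel; "src/liba" and "src/libb" are illustrative paths.
ctags -f tags.a -R src/liba &
ctags -f tags.b -R src/libb &
wait

# Keep the !_TAG_* pseudo-tag header lines from one file only,
# then merge the already-sorted tag bodies (C locale to match ctags' sort order).
grep '^!' tags.a > tags
LC_ALL=C sort --merge <(grep -v '^!' tags.a) <(grep -v '^!' tags.b) >> tags
```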

However, I am not convinced you'll get any speedup from parallelized ctags, as I expect modern machines to be I/O bound. That would need to be profiled to make sure though.
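One rough way to check that (a sketch only, assuming Linux and permission to drop the page cache; the input path is just an example):

```sh
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches     # flush the page cache
time ctags --sort=no -o - -R ~/var/linux > /dev/null   # cold-cache run
time ctags --sort=no -o - -R ~/var/linux > /dev/null   # warm-cache run
```

If the cold and warm runs take about the same time, the workload is CPU bound rather than I/O bound.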

@mawww I'm sure https://github.com/ggreer/the_silver_searcher would disagree :wink:

Running multiple ctags processes would be extremely difficult to coordinate from standard Emacs tooling like https://github.com/bbatsov/projectile/blob/master/projectile.el#L180-L183

> @mawww I'm sure https://github.com/ggreer/the_silver_searcher would disagree :wink:

Good point.

> Running multiple ctags processes would be extremely difficult to coordinate from standard Emacs tooling like https://github.com/bbatsov/projectile/blob/master/projectile.el#L180-L183

A shell script wrapper could already go a long way, but yeah it might be more efficient to integrate that directly into ctags.

@fommil That guy's blog post on the matter isn't very clear about where it started from and where it ended up (well, you can read it between the lines), and anyway it's really not that much. I don't mean to disregard any of his work, but I'm not going to fully trust the results of someone who seemingly just learned about multithreading (especially given, e.g., how much a misused mutex destroys any performance multithreading can give). Not saying he's wrong, but I'll need to be convinced :)
And note how his own tests showed that on his machine too many worker threads quickly became worse than no parallelization at all. It's cute, but likely to depend heavily on the hardware, OS and data to process, so the thread count should probably be chosen more sensibly than "well, using N threads seemed to perform better in my tests".

Also, another reason why it doesn't appeal to me so much is that not only do I doubt it'll give us that much, but it'll also be a large amount of error-prone work. Currently the CTags code base is in absolutely no shape to support parallel tag-parsing threads. All you _might_ be able to split off relatively easily is the init/directory traversal and _one single_ parser thread.
And finally, I'm confident we have more sensible optimizations to perform just about everywhere in the codebase (and especially in the parsers).

So sure, multithreading probably _can_ have _some_ benefits if used very well, but it's likely not to be the most interesting improvement.

> Also, another reason why it doesn't appeal to me so much is that […] it'll also be a large amount of error-prone work. Currently the CTags code base is in absolutely no shape to support parallel tag-parsing threads. All you _might_ be able to split off relatively easily is the init/directory traversal and _one single_ parser thread.

BTW, I don't mean that improving this area of the code is not a good idea; I do think it is (especially for a possible future libctags). I just mean that if performance is the goal, it's probably not (currently) worth the effort, and that there are more important areas to focus on.

BTW, firing up a profiler and profiling a boatload of data in a gazillion ways would probably be interesting.
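For instance, something like this (a sketch assuming Linux perf is available and ctags was built with symbols; the input tree is illustrative):

```sh
# Record where the time goes while indexing a large tree, then inspect the report.
perf record -g ./ctags --sort=no -o - -R ~/var/linux > /dev/null
perf report
```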

GNU parallel may help you.
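For example, something along these lines could work (a sketch only; it assumes GNU parallel is installed, splits the work into one ctags job per top-level directory, and ignores files sitting directly in the project root):

```sh
# {} is the directory handed to each job, {#} is the job number.
find ~/src/project -mindepth 1 -maxdepth 1 -type d \
    | parallel 'ctags --sort=no -R -f tags.{#} {}'

# Strip the pseudo-tag headers and sort the combined output into one tags file.
grep -hv '^!' tags.* | LC_ALL=C sort > tags
```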

As mentioned before, optimizing reading can speed up ctags quite a bit.

Parallel execution of parsers could speed up things quite a bit if I/O comes from cache (and this is often the case the Nth time you run ctags on a directory from an editor).

@pragmaware IMO, a library should not fork.

If you can read Japanese, take a look at this article: https://qiita.com/dalance/items/c76141a097e25fabefe8 .
(After writing this comment, I found the Git repository for ptags: https://github.com/dalance/ptags. That page is written in English.)

It describes a tool named ptags that the author developed. The tool is written in Rust and wraps ctags.
It runs ctags in parallel over the set of input files.
I have not looked at its internals, but it obviously runs multiple ctags processes.

The result is quite impressive: 5 times faster than a single process. The number of CPUs is not stated. The amount of memory is probably sufficient (128 GB). The author ran ptags 10 times on the same input set to make the page cache hot.

Though these things should probably be done in wrappers like ptags, it is difficult to ignore this great result.
I hacked something up quickly: https://github.com/masatake/ctags/tree/parallel
The newly introduced option --_parallel runs multiple ctags processes in parallel.

The number of worker processes, 8, is hard-coded. My laptop PC has 8 cores.
Memory is 32 GB. The target input is the latest Linux kernel source tree.
My .ctags is hairy enough.

The result is mostly the same: 2~3 times faster.

[yamato@master]~/var/ctags-github% cat run.sh
for i in $(seq 1 5); do
    echo "#"
    echo "# TRAIL #$i"
    echo "#"
    echo "# parallel 8"
    time  ./ctags    --_parallel -R  ~/var/linux > /dev/null
    echo "# single"
    time  ./ctags -o - --sort=no -R  ~/var/linux > /dev/null
done
[yamato@master]~/var/ctags-github% bash run.sh
#
# TRAIL #1
#
# parallel 8

real    0m29.073s
user    3m5.791s
sys 0m32.347s
# single

real    1m21.397s
user    1m14.601s
sys 0m6.521s
#
# TRAIL #2
#
# parallel 8

real    0m29.746s
user    3m4.601s
sys 0m32.175s
# single

real    1m26.660s
user    1m19.176s
sys 0m7.191s
#
# TRAIL #3
#
# parallel 8

real    0m28.290s
user    3m2.524s
sys 0m31.081s
# single

real    1m21.927s
user    1m14.775s
sys 0m6.896s
#
# TRAIL #4
#
# parallel 8

real    0m28.644s
user    3m3.839s
sys 0m31.756s
# single

real    1m13.319s
user    1m7.294s
sys 0m5.843s
#
# TRAIL #5
#
# parallel 8

real    0m29.274s
user    3m9.387s
sys 0m32.363s
# single

real    1m13.621s
user    1m7.487s
sys 0m5.941s
[yamato@master]~/var/ctags-github% 

(I compared both tags files. There is no difference.)

Far from satisfying, but it is a good place to start.

I wonder whether the output of the workers must be gathered or not.

Hi @masatake, I'm trying to close all my open tickets that I don't plan to work on. If you're interested in working on this ticket, could you please copy the text into a new ticket?

I will work on this item in the future. I would like to keep this item open because the record of the discussion here will be valuable for me.

@masatake you can still link to this ticket from a new one and retain the full history. This would really help me out, as I am trying to clean up my "Issues" tab for a new job and I don't want clutter like this ticket getting in the way.

@fommil, I don't see how you can override @masatake, who is the driving force behind Universal Ctags, with 2,700 commits versus your commit count of zero. Once you open a bug (or, in GitHub parlance, an "issue"), the bug becomes the property of the project. I believe you can unwatch it and not receive any emails about it.

Reopening.

@dtikhonov @masatake please close this ticket. It is the only ticket in my https://github.com/issues view that is not relevant to my work.

It is not possible to remove a ticket from this view unless the ticket is closed. Even if I unsubscribe.

Indeed, I was not aware that repo owners would have this control when I created the ticket, otherwise I would not have done so.

If you want to work on this, please create a new ticket and reference this ticket; all discussion is preserved. Or just copy-paste the contents of https://github.com/universal-ctags/ctags/issues/761#issuecomment-373720839 into a fresh ticket.

I do not think this is too much for me to ask.

Could you make a temporary GitHub account just for the copy-paste?
That way you can do the copy-paste yourself.
Then you can remove the account.

Sure, if that's the only way to fix this, I can do that.

Done! Thanks for letting me close this ticket. It cleans up my TODO list significantly.

