Moby: COPY with excluded files is not possible

Created on 22 Aug 2015  ·  82 Comments  ·  Source: moby/moby

I need to COPY a _part_ of a context directory to the container (the other part is subject to another COPY). Unfortunately, the current possibilities for this are suboptimal:

  1. COPY and prune. I could remove the unwanted material after an unlimited COPY. The problem is that the unwanted material may have changed, so the cache is invalidated.
  2. COPY every file in a COPY instruction of its own. This adds a _lot_ of unnecessary layers to the image.
  3. Writing a wrapper around the "docker build" call that prepares the context in some way so that the Dockerfile can comfortably copy the wanted material. Cumbersome and difficult to maintain.
area/builder kind/enhancement kind/feature

All 82 comments

See https://docs.docker.com/reference/builder/#dockerignore-file
You can add entries to a .dockerignore file in the root of the project.

.dockerignore does not solve this issue. As I wrote, "the other part is subject to another COPY".

So you want to conditionally copy based on some other copy?

The context contains a lot of directories A1...A10 and a directory B. A1...A10 have one destination, B has another:

COPY A1 /some/where/A1/
COPY A2 /some/where/A2/
...
COPY A10 /some/where/A10/
COPY B /some/where/else/B/

And this is awkward.

What part of it is awkward? Listing them all individually?

COPY A* /some/where/
COPY B /some/where/else/

Does this work?

The names A1..A10, B were fake. Besides, COPY A* ... throws together the _contents_ of the directories.

There are a couple of options, I admit, but I think that all of them are awkward. I mentioned three in my original posting. A fourth option is to rearrange my source code permanently so that A1..A10 are moved into a new directory A. I was hoping that this would not be necessary, because an additional nesting level is not something to wish for, and my current tools would then need to special-case my dockerised projects.

(BTW, #6094 (following symlinks) would help in this case. But apparently, this is no option either.)

@bronger if COPY behaved exactly like cp, would that solve your use-case?

I'm not sure I 100% understand.
Maybe @duglin can have a look.

@bronger I think @cpuguy83 asked the right question, how would you solve this if you were using 'cp' ? I looked and didn't notice some kind of excludes option on 'cp' so I'm not sure how you would solve this outside of a 'docker build' either.

With cp behaviour, I could ameliorate the situation by saying

COPY ["A1", ... "A10", "/some/where/"]

It's still a mild maintenance problem because I would have to think of that line if I added an "A11" directory. But that would be acceptable.

Besides, cp does not need excludes, because copying everything and removing the unwanted parts has almost no performance impact beyond the copying itself. With docker's COPY, it means wrongly invalidated cache every time B is changed, and bigger images.

@bronger you can do:

COPY a b c d /some/where

just like you were suggesting.

As for doing a RUN rm ... after the COPY ..., yes, you'll have one extra layer, but you should still be able to use the cache. If you see a cache miss due to it, let me know; I don't think you should.

But

COPY a b c d /some/where/

copies the contents of the directories a b c d together, instead of creating the directories /some/where/{a,b,c,d}. It works like rsync with a slash appended to the src directory. Therefore, the _four_ instructions

COPY a /some/where/a/
COPY b /some/where/b/
COPY c /some/where/c/
COPY d /some/where/d/

are needed.
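
The difference can be reproduced with plain cp in a scratch directory. This is only a sketch: the directory and file names are made up, and `cp -R dir/. dest/` is used here as a stand-in for what COPY does with a directory source, not as a description of Docker's implementation.

```shell
# Scratch layout: two source directories with one file each (hypothetical names).
work=$(mktemp -d)
mkdir -p "$work/a" "$work/b" "$work/like_cp" "$work/like_copy"
echo 1 > "$work/a/one.txt"
echo 2 > "$work/b/two.txt"

# cp semantics: each source directory is recreated under the destination.
cp -R "$work/a" "$work/b" "$work/like_cp/"

# COPY-like semantics: only the contents are merged into the destination.
for d in a b; do
    cp -R "$work/$d/." "$work/like_copy/"
done

# Show both results side by side.
ls -R "$work/like_cp" "$work/like_copy"
```

In the first destination the files stay inside a/ and b/; in the second they land next to each other, which is why one COPY per directory is needed to keep the layout.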

As for the cache ... if I say

COPY . /some/where/
RUN rm -Rf /some/where/e

then the cache is not used if e changes, although e is not effectively included into the operation.
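
A rough way to see why: the sketch below hashes a directory tree with and without a subdirectory e. The `tree_hash` helper is hypothetical and only a stand-in for a content-based cache key; it is not Docker's actual algorithm, and it assumes file names without spaces.

```shell
# Stand-in for a cache key: hash of every file's path and content under a
# directory, optionally skipping one subdirectory. Not Docker's real algorithm.
tree_hash() {
    dir=$1; skip=${2:-}
    (cd "$dir" && find . -type f ! -path "./$skip/*" | LC_ALL=C sort |
        xargs sha256sum | sha256sum | cut -d' ' -f1)
}

ctx=$(mktemp -d)
mkdir -p "$ctx/src" "$ctx/e"
echo "app code" > "$ctx/src/main.txt"
echo "v1" > "$ctx/e/data.txt"

full_before=$(tree_hash "$ctx")
partial_before=$(tree_hash "$ctx" e)

echo "v2" > "$ctx/e/data.txt"   # only e changes

full_after=$(tree_hash "$ctx")
partial_after=$(tree_hash "$ctx" e)

[ "$full_before" != "$full_after" ] && echo "COPY . would be rebuilt"
[ "$partial_before" = "$partial_after" ] && echo "a COPY that skipped e could stay cached"
```

The later `RUN rm` cannot help here, because the key for the COPY step is computed from what goes in, before anything is removed.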

@bronger yep, sadly you're correct. I guess we could add a --exclude zzz type of flag, but per https://github.com/docker/docker/blob/master/ROADMAP.md#22-dockerfile-syntax it may not get a lot of traction right now.

Fair enough. Then I will use a COPY+rm for the time being and add a FixMe comment. Thank you for your time!

Just to :+1: this issue. I regularly regret that COPY doesn't mirror rsync's trailing slash semantics. It means you can't COPY multiple directories in a single statement, leading to layer proliferation.

I regularly encounter a case where I want to copy many directories except for one (which will be copied later, because I want it to have different layer-invalidation effects), so --exclude would be useful, as well.

Also, from man rsync:

       A trailing slash on the source changes this behavior to avoid creating
       an additional directory level at the destination. You can think of a
       trailing / on a source as meaning "copy the contents of this directory"
       as opposed to "copy the directory by name", but in both cases the
       attributes of the containing directory are transferred to the
       containing directory on the destination. In other words, each of the
       following commands copies the files in the same way, including their
       setting of the attributes of /dest/foo:

              rsync -av /src/foo /dest
              rsync -av /src/foo/ /dest/foo

I guess it can't be changed now without breaking a lot of wild Dockerfiles.

As a concrete example, let's say I have a directory looking like this:

/vendor
/part1
/part2
/part3
/...
/partN

I want something that looks like:

COPY /vendor /docker/vendor
RUN /vendor/build
COPY /part1 /part2 ... /partN /docker/ # copy directories part1-N to /docker/part{1..N}/
RUN /docker/build1-N.sh

So that part1-N doesn't invalidate building of /vendor. (since /vendor is rarely updated compared to part1-N).

I have previously worked around this by putting part1-N in their own directory, so:

/vendor
/src/part1-N

But I have also encountered this problem in projects that I am not at liberty to rearrange quite so easily.

@praller good example, we're facing the exact same issue. The main problem is that Go's filepath.Match doesn't allow much creativity compared to regular expressions (i.e. no negated patterns).

I just came up with a somewhat crack-brained workaround for this. COPY can't exclude directories, but ADD _can_ expand tgz.

It's one extra build step:
tar --exclude='./deferred_copy' -czf all_but_deferred.tgz .
docker build ...

Then in your Dockerfile:
ADD ./all_but_deferred.tgz /application_dir/
.. stuff in the rarely changing layers ..
ADD . /application_dir/
.. stuff in the often changing layers

That gives the full syntax of tar for including/excluding/whatever without gobs of wasted layers trying to include/exclude.

@jason-kane This is a nice trick, thanks for sharing. One small point: it looks like you can't add the z (gzip) flag to tar, because it changes the sha256 checksum value, which invalidates the Docker cache. Otherwise this approach works great for me.
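
A quick way to check the uncompressed variant before relying on it: build the archive twice, changing only the excluded directory in between, and compare checksums. This is a sketch assuming GNU tar and sha256sum; the directory and file names are hypothetical, and the archives are written outside the context so that tar does not try to archive its own output.

```shell
ctx=$(mktemp -d)       # hypothetical build context
out=$(mktemp -d)       # archives are written outside the context
mkdir -p "$ctx/app" "$ctx/deferred_copy"
echo "stable" > "$ctx/app/lib.txt"
echo "volatile" > "$ctx/deferred_copy/notes.txt"

# Same command as in the workaround, minus the z flag.
tar -C "$ctx" --exclude='./deferred_copy' -cf "$out/first.tar" .
echo "changed" > "$ctx/deferred_copy/notes.txt"   # modify only the excluded part
tar -C "$ctx" --exclude='./deferred_copy' -cf "$out/second.tar" .

sum1=$(sha256sum < "$out/first.tar")
sum2=$(sha256sum < "$out/second.tar")
[ "$sum1" = "$sum2" ] && echo "archive unchanged, the ADD layer can be reused"
```

If the two sums differ on a given system, the archive is picking up something volatile (for example file timestamps that changed between runs), and the ADD step would miss the cache just as it does with gzip.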

+1 for this issue, I think it could be supported in the same way a lot of glob libraries support it:

Here's a proposal to copy everything except node_modules

COPY . /app -node_modules/

I've come across the same problem as well, and it's kind of painful for me when my Java webapp is about 900 MB but almost 80% of that rarely changes.
It's an early stage of my application and the folder structure is somewhat stable, so I don't mind adding 6-7 COPY layers to be able to use the cache, but it will surely hurt in the long term when more and more files and directories are added.

👍

I have the same problem although with docker cp, I want to copy all files from a folder except for one

Exact same issue here. I want to copy a git repo and exclude the .git directory.

@oaxlin you could use the .dockerignore file for that.

@antoineco are you sure that will work? It's been a while since I tried but I'm pretty sure .dockerignore didn't work with docker cp, at least at the time

@kkozmic-seek absolutely sure :) But the docker cp CLI subcommand you mentioned is different from the COPY statement found in the Dockerfile, which is the scope of this issue.

docker cp indeed has nothing to do with the Dockerfile and .dockerignore, but on the other hand it's not used for building images.

Would really like this as well - to speed up build I could copy some folder in earlier parts of the build and then cache would help me out ...

I'm not sure I understand what the use case is but wouldn't just touching the files to exclude before COPY solve the problem?

RUN touch /app/node_modules
COPY . /app
RUN rm /app/node_modules

AFAIK COPY doesn't overwrite files, which is why I think this might work.

Oops, never mind that, looks like COPY actually overwrites files. I'm now a bit puzzled by https://nodejs.org/en/docs/guides/nodejs-docker-webapp/ which npm installs and then does a COPY . /usr/src/app. I guess it assumes that node_modules is docker ignored? On the other hand, having a COPY_NO_OVERWRITE (better name needed) command could be one way to achieve ignoring files during copy (you'd have to create empty files/dirs for stuff you want to ignore).

FWIW, I find this very ugly.

I found another hack solution:

Example project structure:
app/
config/
script/
spec/
static/
...

We want:

  1. Copy static/
  2. Copy other files
  3. Copy app/

Hack solution:
ADD ./static /home/app
ADD ["./[^s^a]*", "./s[^t]*", "/home/app/"]
ADD ./app /home/app

The second ADD is equivalent to: copy all, except "./st*" and "./a*".
Any ideas for improvements?
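
To check which top-level names the two patterns in the second ADD would select, they can be replayed with a shell `case` statement. This is a sketch, not Docker's matcher: POSIX sh writes the negated class as `[!...]` where the Dockerfile pattern uses `[^...]`, the helper name is made up, and "Gemfile" and "assets" are invented example names.

```shell
# Returns success if NAME would be picked up by the second ADD, i.e. by
# either "[^s^a]*" or "s[^t]*". POSIX sh spells the negated class with "!".
copied_by_second_add() {
    case $1 in
        [!sa^]*) return 0 ;;   # does not start with s, a, or a literal ^
        s[!t]*)  return 0 ;;   # starts with s but not with st
        *)       return 1 ;;
    esac
}

for name in app config script spec static Gemfile assets; do
    if copied_by_second_add "$name"; then
        echo "$name: copied"
    else
        echo "$name: skipped"
    fi
done
```

One limitation this makes visible: the patterns work on prefixes, so any other entry starting with "a" or "st" (such as the invented "assets") is skipped along with app/ and static/.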

What is the status of this issue?

👍

👍

👍

👍

What about having a .dockerignore file in the same fashion as .gitignore?

@mirestrepo See the first two follow-ups to this issue.

Currently this is a mega perf nerf for C# / dotnet development.

What I want:

  • First copy all external DLLs to the Docker image (everything except My*.dll)
  • Then copy all my DLLs (starting with My.*.dll).

Now it seems this is not (easily) possible because I cannot copy "everything except".

So either the DLLs are copied twice, which increases the Docker image size, or everything is copied in one layer.
The latter is a mega nerf because external DLLs are copied every time instead of being cached.

@adresdvila thanks for the solution. I was able to split it up into:

COPY ["[^M][^y]*","/app/"]
COPY ./My* /app/

Although this still leaves the problem that .json files are copied by the first command.

Just chiming in to say thanks to @antoineco my problem is solved. I no longer copy the .git directory into my docker images.

This dramatically improved the image size, and makes my image much more friendly to the docker caching system.

I have the same problem. I have a big file which I want to copy before the rest of files so any change in the context does not repeat it as it takes a lot of time to copy (7 GB bin file). Are there any new workarounds?

The issue with the COPY-and-prune approach is that the layer created before pruning still contains all the data.

COPY . --exclude=a --exclude=b would be extremely useful. What do you think, @cpuguy83?

@Nowaker I like it. Seems in line with tar and rsync anyway.
I guess this should support the same format as dockerignore?

@tonistiigi @dnephin

This case would be handled by #32507 I think.

@cpuguy83 Yeah. Most notably, in line with COPY --chown=uid:gid

@dnephin RUN --mount sounds like a totally different use case, centered around generating something based on data we don't need after the output has been generated. (E.g. compiling with Go, generating HTMLs from Markdown file, etc). RUN --mount is dope and I'd definitely use it in the project I'm currently working on (generating API docs using Sphinx).

COPY somedir --exclude=excludeddir1 --exclude=excludeddir2 is centered around copying data that has to end up in the image but splattered across multiple COPY statements, not just one. The goal is to avoid explicit COPY first second third .... eleventh destination/ when project has a lot of directories in root and it's subject to change/increase.

In my case, I want to first copy most of the files, except those that are non-essential, to make sure the cache is used if the source files didn't change. Then, compile/generate, and use the cache if the copied files didn't change (yay). At the very end, copy the files I excluded previously, which might have changed since the previous build but whose changes don't affect the compile/generate step. Obviously, I have a ton of files and directories in . that I want to COPY first, and only a couple that I want to COPY somewhere at the end.

The idea is that RUN --mount is able to solve a lot of problems. COPY --exclude solves only a single problem.

I'd rather add something that solves a lot of problems than add a bunch of syntax to solve individual problems. You would use RUN --mount... rsync --exclude ... (or some script that copies individual things) and it would be the equivalent of COPY --exclude.

@dnephin Oh, I didn't think of RUN --mount rsync! Excellent! 👍

That's excellent indeed. However you won't be able to leverage caching efficiently @Nowaker, because the cache will be invalidated if anything changes in the mounted directory, not only what you want to rsync.

If you use the output of that rsync as an input for something else and no files actually changed in there the cache will pick up again. If you are really up for it you can do this currently with something like https://gist.github.com/tonistiigi/38ead7a4ed60565996d207a7d589d9c4#file-gistfile1-txt-L130-L140 . Only change in RUN --mount (or LLB in buildkit) is that you don't have to actually copy files between stages but can access them directly so it is much faster.

How about using https://docs.docker.com/develop/develop-images/multistage-build/?

FROM php:7.2-apache as source
COPY ./src/ /var/www/html/
RUN rm -rf /var/www/html/vendor/
RUN rm -rf /var/www/html/tests/

FROM php:7.2-apache
COPY --from=source /var/www/html/ /var/www/html/

@antoineco Welp, then it's not excellent any more. Thanks for pointing that out.

@MartijnHols This is a good workaround. Thanks for posting.

To maintainers: that said, we could say "why implement --chown in COPY, you can use RUN chown in a multi-stage build". We need --exclude for sanity. There are too many workarounds in Dockerfiles these days.

I have a use case that would benefit from COPY --exclude. I have a big data folder which needs to be copied into the container in its entirety. The contents of that directory are subject to frequent changes. In order to improve cache performance, there is one single large file in the directory that I want to copy into its own layer, before copying the rest. As of now, it is unnecessarily verbose to describe this type of container.

What is the correct way of using layered caching centered around requirements.txt?

I have this:

/root-app
 - /Dockerfile
 - /requirements.txt
 - /LICENSE
 - /helloworld.py
 - /app-1
     -/app-1/script1
     -/app-1/script2
     -/app-1/script3
 - /app-2
     -/app-2/script1

And Dockerfile:

FROM python:slim
COPY ./requirements.txt /
RUN pip install --trusted-host pypi.python.org -r /requirements.txt
WORKDIR /root-app
COPY . /helloworld
CMD ["python", "helloworld.py"]

What is the correct way to use the second COPY command to exclude the requirements build cache... and similarly layer my app-1 and app-2 if they don't change a lot?

@axehayz Not sure if this is what you're asking, but I would do something similar to the node workflow in https://medium.com/@guillaumejacquart/node-js-docker-workflow-12febcc0eed8.

I.e. it's OK for your second copy to just be COPY .; as long as your pip install comes before it, this won't invalidate the cache for the installed packages.

Faced the same problem. At the moment I would prefer to expand the files in different directories.

I have another case for COPY --exclude=... --exclude=...

I'm trying to do a COPY --from=oldimage in order to cut down my image size and I need to copy most of the files, but without some of them. I can do it directory by directory, which is painful, but works... But being able to --exclude either a list of dirs/files or supply multiple --exclude options would be so so so much better and easier to maintain.

So after three and a half years there's no acknowledgement at all?

@asimonf There is tons of acknowledgement and back and forth to understand the use case. I assume you mean nobody has done this work? That is correct. We all have to make choices on the things we work on.

Honestly this can be done pretty easily using existing functionality even if it means you have to write a little bit of extra in your Dockerfile to make it happen.

# haven't tested exact syntax, but this is the general idea
FROM myRsync AS files
COPY . /foo
RUN mkdir /result && rsync -r --exclude=<pattern> /foo/ /result/

FROM scratch
COPY --from=files /result/* /

With buildkit you don't even need an extra stage

#syntax=docker/dockerfile:experimental
..
RUN --mount=target=/ctx rsync ... /ctx/ /src/

Unless I'm missing something using a multi stage build doesn't seem like the solution here. The cache is still invalidated at the COPY stage.

This is correct. As it is the issue I'm having right now.

Multi-stage works great for me.

I'm breaking my build into multiple stages to fully leverage the cache. It looks something like this:

FROM alpine as source

WORKDIR /app
COPY . ./
RUN scripts/stagify-files

FROM node:12.4.0

WORKDIR /app

# Step 0: Setup environments
COPY --from=source /app/stage0 ./
RUN stage0-build.sh

# Step 1: Install npm packages
COPY --from=source /app/stage1 ./
RUN stage1-build.sh

# Step 2: Build project
COPY --from=source /app/stage2 ./
RUN stage2-build.sh

@zenozen the challenge with that process is that you've had to arrange your application layout specifically for a docker build, something that many people don't want to do.
Using docker is one of many considerations to balance when figuring out how to lay out your application files (e.g. maintainability, ease of use for new hires, cross-project standards, framework requirements, etc.).

@cfletcher I'm not sure I completely get what you mean.

For the change I mentioned above, I actually moved my Dockerfile into a sub-sub-directory (which caused me many issues when trying to use rsync to stagify those files), meaning that I was trying to hide my Dockerfile.

The approach I proposed is general, I imagine. Let's say you have 100 files in your project. It simply picks 1 of them to make stage 0, then 1+10 of them to make stage 1, and then all 100 to form stage 2. Each stage is stacked on top of the previous one and has a different build command. For a complicated project structure, it just means the stagify-files logic would be complicated.

For me, the biggest reason is I split my code into modules, and I need all package.json files copied before running npm install.
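
The thread does not show scripts/stagify-files, so the following is only a guess at its shape, reduced to two stages: stage1 receives just the package.json files (so that npm install can be cached on them), and stage2 receives everything. The function name, directory names, and example project are all hypothetical, and the paths passed in are assumed to be absolute.

```shell
# Hypothetical stagify step: stage1 receives only the package.json files,
# with their directory structure preserved; stage2 receives everything.
stagify() {
    src=$1; out=$2
    mkdir -p "$out/stage1" "$out/stage2"
    (cd "$src" && find . -name package.json ! -path '*/node_modules/*' -type f) |
    while IFS= read -r f; do
        mkdir -p "$out/stage1/$(dirname "$f")"
        cp "$src/$f" "$out/stage1/$f"
    done
    cp -R "$src/." "$out/stage2/"
}

# Example project with two modules (made-up names).
proj=$(mktemp -d); staged=$(mktemp -d)
mkdir -p "$proj/mod-a" "$proj/mod-b"
echo '{}' > "$proj/package.json"
echo '{}' > "$proj/mod-a/package.json"
echo 'code' > "$proj/mod-a/index.js"
echo 'code' > "$proj/mod-b/index.js"

stagify "$proj" "$staged"
find "$staged/stage1" -type f | sort
```

Because stage1 only changes when a package.json changes, a `COPY --from=source` of that directory followed by the install step can stay cached across ordinary source edits.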

Would also like some sort of exclude argument for copy. We have 20+ files and 10+ directories in the root directory. We code heavily on 2 of the directories and a few of the files. I'd like to split these into two COPY layers. One for the static files and directories which we never touch, and the other for the files and directories that we always touch.

It's very sad this is still getting ignored. This would have helped me save 5 minutes per build if I could exclude ONE directory so that it does not invalidate the cache.

With buildkit the cache is not dependent on the parent image like it is pre-buildkit.
So yes with the mentioned rsync solution you will take a hit in that you'll need to sync every time there is some change, but subsequent stages will be cached based on content, and if the content of what is transferred is not changed then... at least in my complete on the spot theory those stages should use the cache.

It's sad that adding a simple --exclude flag to COPY is such a hard sell. It's in the TOP30 most upvoted tickets, and relatively easy implementation-wise compared to other TOP30 tickets. :(

It's not controversial, it requires work.

@cpuguy83 Yay. It looked like controversial / somewhat rejected. Does it mean a proper PR with COPY --exclude would likely be accepted, if it passes quality standards?

I can't speak for every maintainer, but I spoke to @tonistiigi a month or so ago about this and IIRC the biggest hurdle is how this relates to dockerignore, the syntax, etc. (and the fact that dockerignore is insufficient syntactically).

The change would need to go into buildkit.

Upvoting COPY --exclude=... --exclude=... - also needed in my case of a monolith repo

Upvoting! I tried COPY !(excludedfile) . which should work in Bash, but it doesn't in Docker.

I don't like the suggestions to have to repeat everything inside the .dockerignore file for every COPY statement in the Dockerfile. Being able to remain DRY with what's going to be a part of the image and not should be a priority, imho.

Looking at #33923, I don't think it's coincidental that what you want to exclude from the build context is exactly the same stuff you want to be excluded from COPY statements. I believe something like this would be a good solution:

COPY --use-dockerignore <source> <target>

Or perhaps even something like this:

COPY --use-ignorefile=".gitignore" <source> <target>

Seeing how .dockerignore is usually a 90% reproduction of .gitignore already, it feels extra annoying having to repeat every ignored file and folder yet again for each and every COPY statement. There has to be a better way.

@asbjornu .gitignore and .dockerignore are not the same things at all. Especially for multistage builds where artifacts are generated on a build stage and not present in git at all, nevertheless should be included in the resulting image.
And yes, with multistage builds introduced THERE SHOULD BE an ability to use different .dockerignore files per stage - absolutely.

I often want to copy outside of "docker build". In these cases, .dockerignore does nothing. We need an amendment to "docker cp"; it's the only sensible solution.

It's been 5 years since this issue was opened. In September 2020, I still want this. A lot of people have suggested hacks to work around it, but almost all of them, and others, have requested an exclude flag in some form or another. Please don't let this issue go unresolved any longer.

If you want something, you need to work on it or find someone to work on it.

First we need to know whether upstream wants this.

After a source code review, I think we should first extend the copy function here: https://github.com/tonistiigi/fsutil/blob/master/copy/copy.go . After that, we can extend backend.go in llbsolver, and only then will it be possible to extend the AST and frontend of buildkit.
But after that, the copy will be closer to rsync semantics than to Unix cp.

UPDATE: yes, after extending copy.go everything will be close to https://github.com/moby/buildkit/pull/1492 plus parsing list of excludes.
