Zfs: Feature Request - online split clone

Created on 4 Feb 2014  ·  18 Comments  ·  Source: openzfs/zfs

Hello Everyone,

I wasn't sure of the correct place to post a request, since this isn't really an issue, so feel free to close this if it doesn't belong here.

I have a feature request that might be useful to others. I am looking for the capability to split a clone while it is online, much like NetApp's vol clone split.

There are times when a clone has completely diverged from its parent and it no longer makes sense to keep the two filesystems linked. The only way I can think of to do this today is to perform a zfs send/recv, but that will likely require some downtime to ensure consistency.

What I am proposing is that, since ZFS knows which blocks the clone still shares with the parent filesystem, it could copy those blocks to a new area and repoint the clone at the copies instead (hopefully I have explained that properly). The end state would be a split clone, achieved while the filesystem is online and active.
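Purely to illustrate the requested end state, a hypothetical invocation might look like the sketch below (there is no zfs split subcommand today; the subcommand name and dataset are made up):

       # hypothetical syntax - no such subcommand exists in ZFS
       zfs split tank/clone
       zfs get origin tank/clone    # would then report "-", i.e. no origin left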

Labels: Documentation, Feature, Question

Most helpful comment

To comment on this, it would be nice if there were functionality to transform origin-clone relationships into a deduplicated kind: removing the logical links that keep the datasets from being destroyed individually at will, while maintaining only one copy of the still-shared data.

All 18 comments

It sounds like zfs promote may already do what you need.

       zfs promote clone-filesystem

           Promotes a clone file system to no longer be dependent on its
           "origin" snapshot. This makes it possible to destroy the file
           system that the clone was created from. The clone parent-child
           dependency relationship is reversed, so that the origin file
           system becomes a clone of the specified file system.

           The snapshot that was cloned, and any snapshots previous to this
           snapshot, are now owned by the promoted clone. The space they use
           moves from the origin file system to the promoted clone, so enough
           space must be available to accommodate these snapshots. No new
           space is consumed by this operation, but the space accounting is
           adjusted. The promoted clone must not have any conflicting snapshot
           names of its own. The rename subcommand can be used to rename any
           conflicting snapshots.
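For illustration, here is roughly how the origin relationship flips under promote (the pool and dataset names below are placeholders):

       zfs snapshot tank/original@base
       zfs clone tank/original@base tank/clone
       zfs get -o value origin tank/clone      # tank/original@base

       # promote reverses the dependency rather than removing it
       zfs promote tank/clone
       zfs get -o value origin tank/original   # tank/clone@base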

I was looking at zfs promote, but this appears to just flip the parent-child relationship...

What I was thinking of is an end state where both file systems are completely independent of each other...

Some use cases for this could be:
cloning VM templates - having a base image that is cloned to create other VMs, which are in turn split from the template so the template can be updated/destroyed/recreated
database clones - cloning a prod database for dev, which will undergo a lot of changes and might in turn be the base for a testing clone itself; in that case it would be nice to split dev from prod, as the base snapshot might grow larger than an independent file system for dev would be

After you clone original@snapshot you can modify both the dataset and the clone freely; they won't affect each other, except that they share whatever data is still common to both on disk.

If you want to destroy/recreate the template (original) you can simply destroy all snapshots on it (except the one(s) used as origins of clones), zfs rename the original, and zfs create a new one with the same name (the origin property of clones isn't bound to the name of the original dataset, so you can rename both freely).

The only downside to that is that all _unique_ data held in original@snapshot (= the base of the clone) can't be released unless you are willing to destroy either the clone(s) or (after a promote of the clone) the original.
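A minimal sketch of that rename/recreate flow, assuming tank/template@gold is the snapshot the clones were created from (all names here are placeholders):

       # drop every snapshot except the clone origin(s)
       zfs destroy tank/template@daily1
       zfs destroy tank/template@daily2
       # move the old template aside and create a fresh one under the same name
       zfs rename tank/template tank/template-retired
       zfs create tank/template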

@greg-hydrogen in the end did you determine whether zfs promote meets your needs? Or is there still a possible feature request here?

To comment on this, it would be nice if there were functionality to transform origin-clone relationships into a deduplicated kind: removing the logical links that keep the datasets from being destroyed individually at will, while maintaining only one copy of the still-shared data.

@behlendorf: it almost certainly doesn't meet the need.
http://jrs-s.net/2017/03/15/zfs-clones-probably-not-what-you-really-want/
does a good job of explaining the problem.

Here's what I'm trying to do conceptually:

user@backup:

  1. generate a date-based snapshot
  2. send it from backup to test as a mirror

       zfs snapshot backup/prod@20170901
       zfs send -R backup/prod@20170901 | ssh test zfs recv ... test/mirror

user@test:

  1. create a place to sanitize the mirror
  2. sanitize it
  3. snapshot it
  4. clone the sanitized version for use
  5. use it

       zfs clone test/mirror@20170901 test/sanitizing
       sanitize sanitizing
       zfs snapshot test/sanitizing@sanitized
       zfs clone test/sanitizing@sanitized test/test
       dirty test/test

user@backup:

  1. having used production further...
  2. create an updated snapshot
  3. send the incremental changes from prod to test
  4. delete the previous incremental marker (which in my case frees 70GB)

       dirty prod/prod
       zfs snapshot backup/prod@20170908
       zfs send -I backup/prod@20170901 backup/prod@20170908 | ssh test zfs recv test/mirror
       zfs destroy backup/prod@20170901

user@test:

  • this is where problems appear.
  • with some amount of cajoling, one can destroy the sanitizing volumes.
  • But, I'm left with test/mirror@20170901 which is the origin for the two remaining things: test/mirror@20170908 and test/test.
  • I could destroy the updated mirror (test/mirror@20170908) if I wanted to, but that doesn't do me any good (since my goal is to use that data).

In order for me to make progress, I actually have to run through sanitize, stop the thing that's using test, destroy test (completely), clone mirror as test, restart the thing using test, and then I can finally try to destroy the original snapshot. Or, I can decide to take a pass, trigger a new snapshot on backup later, send its increment over, delete the snapshot that was never mirrored to test, and try again.
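A sketch of that rotation, using the dataset names from this thread (the stop/restart of whatever consumes test/test is left as comments):

       # stop the thing that's using test/test
       zfs destroy -r test/test
       zfs clone test/mirror@20170908 test/test
       # restart the thing that's using test/test
       zfs destroy test/mirror@20170901   # the old origin is now unreferenced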

Fwiw, to get a taste of this...:

zfs list -t all -o used,refer,name,origin -r test/mirror test/test
 USED   REFER  NAME                              ORIGIN
 161G   1.57M  test/mirror                       -
  65G   82.8G  test/mirror@2017081710            -
    0   82.4G  test/mirror@201709141737          -
3.25G   82.8G  test/test                         test/mirror@2017081710

(the numbers are really wrong, I actually have 1 volume with 4 contained volumes, hence the recursive flags...)

Now, I understand that I can use zfs send | zfs recv to break dependencies, and for small things that's fine. But this portion of my pool is roughly twice the available space in the pool, and one half is probably larger than that, which means performing that operation is problematic. It's also a huge amount of bytes to reprocess. My hope in using snapshots was to be able to benefit from COW, but instead, I'm being charged for COW because the branch point which will eventually have data used by neither side of my branching tree must still be paid for.

@behlendorf Hi, any progress on this? Splitting a clone from its original filesystem would be really great for VM templates and/or big file-level restores. See the link @jsoref pasted above for a practical example.

@kpande: the goal is to pay (in space and data transfer) for what has changed (COW), not for the entire dataset (each time this operation happens).

If I had a 10TB moving dataset, and a variation of the dataset that I want to establish, sure, I could copy the 10TB, apply the variation, and pay for 20TB (if I have 20TB available). But if my variation is really only 10MB different from the original 10TB, why shouldn't I be able to pay for 10TB+10MB? Snapshots + clones give me that, until the 10TB moves sufficiently that I'm now paying for 30TB (10TB live + 10TB snapshot + 10TB diverged) and my 10MB variation moves so that it's now its own 10TB (diverged from both live and snapshot). In the interim, to "fix" my 30TB problem, I have to spend another 10TB (=40TB -- via your zfs send+zfs recv). That isn't ideal. Sure, it will "work", but it is neither "fast" nor remotely space efficient.

Redacted send/recv sounds interesting (since it more or less matches my use case) -- but while I can find it mentioned in a bunch of places, I can't find any useful explanation of what it's actually redacting.

Fwiw, for our system, I switched so that the sanitizing happens on the sending side (which is also better from a privacy perspective), which mostly got us out of the woods.

There are instances where the data variation isn't "redacting" and where the system has the resources for zfs snapshot+zfs send but doesn't really want to allocate the resources to host a second database to do the "mutation" -- and doesn't want to have to pay to send the entire volume between primary and secondary (i.e. it would rather send an incremental snapshot to a system which already has the previous snapshot).

Yes, I'm aware I could use dedup. We're paying for our cpus / ram, so dedicating constant cpu+ram to make a rare task (refresh mutated clone) fast felt like a poor tradeoff (I'd rather pay for a bit more disk space).
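(For reference, dedup is a per-dataset property, so it only applies to data written while it is enabled on that dataset; the dataset name below is a placeholder. The dedup table still costs pool-wide RAM, which is exactly the tradeoff above.)

       zfs set dedup=on tank/db-clone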

@kpande this link quite clearly shows the problem with current clones. After all, if a clone has diverged that much from the base snapshot, the permanent parent->child relation between the two is a source of confusion. Splitting the clone would be a clear indication that they have diverged too much to be considered tied together anymore.

But let me do a more practical example.

Let kvm/vmimages be a datastore container for multiple virtual disk images, with snapshots taken on a daily basis. I know the default answer would be "use a dataset for each disk", but libvirt pools do not play well with that. So we have something like:

kvm/vmimages
kvm/vmimages@snap1
kvm/vmimages@snap2
kvm/vmimages@snap3

At some point, something bad happens to a VM disk (i.e. serious guest filesystem corruption), but in the meantime other users are actively storing new, important data on the other disks. You basically have some conflicting requirements: a) to revert to yesterday's old, uncorrupted data, b) to preserve any newly uploaded data, which is not found in any snapshot, and c) to cause minimal service interruption.

Clones come to mind as a possible solution: you can clone kvm/vmimages@snap3 as kvm/restored to immediately restore service for the affected VM. So you now have:

kvm/vmimages
kvm/vmimages@snap1
kvm/vmimages@snap2
kvm/vmimages@snap3
kvm/restored   # it is a clone of snap3
kvm/restored@snap1
...
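For concreteness, the restore step described above is just a clone of the last good snapshot; reconfiguring the affected VM to use the new path happens outside ZFS:

       zfs clone kvm/vmimages@snap3 kvm/restored
       # then point the affected VM at the disk image under kvm/restored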

The affected VM runs from kvm/restored, while all the others remain on kvm/vmimages. At this point, you delete all the extra disks from kvm/restored and the original, corrupted disk from kvm/vmimages. All seems well, until you realize that the old corrupted disk image is still using real disk space, and any overwrite in kvm/restored consumes additional space due to the old, undeletable kvm/vmimages@snap3. You cannot remove this old snapshot without also removing your clone, and you cannot simply promote kvm/restored and delete kvm/vmimages, because neither is the single "authoritative" data source (i.e. real data is stored inside both datasets).

Splitting a clone from its source would completely solve the problem above. It is not clear to me how redacted send/recv would help in this case.

@kpande first, thanks for sharing your view and your solution (which is interesting!). I totally agree that a careful, and very specific, guest configuration (and host dataset tree) can avoid the problem exposed above.

That said, libvirt (and its implementation of storage pools) does not play very well with this approach, especially when managing mixed environments with Windows virtual machines. Moreover, this was only a single example. Splittable clones would be very useful, for example, to create a "gold master / base image" which can be instantiated at will to create "real" virtual machines.

With the current state of affairs, doing that will tax you heavily in allocated space, as you will never be able to remove the original, potentially obsolete, snapshot. What surprises me is that, ZFS being a CoW filesystem, this should be a relatively simple operation: when deleting the original snapshot, "simply" mark any non-referenced blocks as free and remove the parent/child relation. In other words, let the clone be a real filesystem, untangled from any source snapshot.

Note that I put the word "simply" inside quotes: while it is indeed a simple logical operation, I am not sure if/how well it maps to the underlying ZFS internals.

@kpande ok, fair enough - if a real technical problem exists, I must accept it. But that is different from stating that a specific use case is invalid.

If this view (i.e. the impossibility of splitting a clone from its original parent snapshot without involving the "mythical" BPR) is shared by the ZFS developers, I think this FR can be closed.

Thanks.

+1 on needing this feature. Yes, send/recv could be used, but that would require downtime of whatever is using that dataset to switch from the old (clone) to the new dataset.

I've run into situations with LXD where a container is copied (cloned), but that causes problems with my separately managed snapshotting.

@kpande: again, my use case has the entire dataset being a database, and a couple of variations of the database.

From what I've seen, it doesn't look like overlayfs plays nicely w/ zfs as the file system (it seems happy w/ zvols and ext4/xfs according to your notes). It _sounds_ like this approach would cover most cases, in which case documentation explaining how to set up overlayfs w/ ext4/xfs would be welcome.

That said, some of us are using zfs not just for the volume management but also for the acl/allow/snapshot behavior/browsing, and would like to be able to use overlayfs w/ zfs instead of ext4/xfs. If that isn't possible, is there a bug for it? If there is, it'd be good if it was highlighted from here; if not, and you're endorsing the overlayfs approach, maybe you could file it (if you insist, I could probably write it, but I don't know anything about overlayfs, and that seems like a key technology in the writeup).
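For reference, a generic overlayfs mount over an ext4/xfs lower directory looks roughly like the following; the paths are placeholders and this is not a ZFS-specific recipe:

       # lowerdir is the read-only base, upperdir captures writes, workdir must be an
       # empty directory on the same filesystem as upperdir
       mount -t overlay overlay \
           -o lowerdir=/srv/base,upperdir=/srv/vm1/upper,workdir=/srv/vm1/work \
           /srv/vm1/merged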

From what I've seen, it doesn't look like overlayfs plays nicely w/ zfs as the file system (it seems happy w/ zvols and ext4/xfs according to your notes). It _sounds_ like this approach would cover most cases, in which case documentation explaining how to set up overlayfs w/ ext4/xfs would be welcome.

The overlayfs approach will not work for an extremely important, and common, use case: cloning a virtual image starting from another one (or a "gold master" template). In such a case, splitting the clone would be key to avoid wasted space as the original/cloned images diverge.

@ptx0 this only works if the guest OS supports overlayfs (so no support for Windows VMs) and if the end users (i.e. our customers) are willing to significantly change their VM image provisioning/installation. As a side note, while I completely understand - and accept - this FR being closed on a technical basis (e.g. if it involves BPR), it is quite frustrating to have a legitimate use case stamped as "invalid". If it is not your use case, fine. But please do not assume that no one has a valid use case for this feature.

Windows doesn't need overlayfs; it has built-in folder redirection and roaming profiles.

Folder redirection, while existing since NT, doesn't always work reliably, as software exists that (for obscure reasons) doesn't handle redirected folders correctly and simply fails when confronted with a redirected Desktop or Documents folder. Apart from that, clones of Windows installations diverge from their origin massively and quite quickly all by themselves, courtesy of Windows Update - having different users logging on and off only speeds this up.
