Linux - understanding the mount namespace & clone

Posted 2019-02-04 15:57

I am reading the mount and clone man pages. I want to clarify how CLONE_NEWNS affects the child process's view of the file system.

(File hierarchy)

Let's consider this tree to be the directory hierarchy. Let's say 5 & 6 are mount points in the parent process. I clarified mount points in another question.

So my understanding is: 5 & 6 being mount points means that the mount command was previously used to 'mount' file systems (directory hierarchies) at 5 & 6 (which means there must be directory trees under 5 & 6 as well).

From the mount man page:

 A mount namespace is the set of filesystem mounts that are visible to a process. 

From the clone man page:

Every process lives in a mount namespace.  The namespace of a process is the data 
(the set of mounts) describing the file hierarchy as seen by that process.  After 
a fork(2) or clone() where the CLONE_NEWNS flag is not set, the child lives in the 
same mount namespace as the parent.

Also :

After a clone() where the CLONE_NEWNS flag is set, the cloned child is started in a 
new mount namespace, initialized with a copy of the namespace of the parent.

Now if I use clone() with CLONE_NEWNS to create a child process, does this mean that the child will get an exact copy of the mount points in the tree (5 & 6) and still be able to access the rest of the original tree? Does it also mean that the child could mount 5 & 6 at its will, without affecting what's mounted at 5 or 6 in its parent process's mount namespace?

If yes, does it also mean that the child could mount / unmount a different directory than 5 or 6 and affect what's visible to the parent process?

Thanks.

2 answers
走好不送
#2 · 2019-02-04 16:11

The “mount namespace” of a process is just the set of mounted filesystems that it sees. Once you go from the traditional situation of having one global mount namespace to having per-process mount namespaces, you must decide what to do when creating a child process with clone().

Traditionally, mounting or unmounting a filesystem changed the filesystem as seen by all processes: there was one global mount namespace, seen by all processes, and if any change was made (e.g. using the mount command) all processes would immediately see that change irrespective of their relationship to the mount command.

With per-process mount namespaces, a child process can now have a different mount namespace from its parent. The question now arises:

Should changes to the mount namespace made by the child propagate back to the parent?

Clearly, this functionality must at least be supported and, indeed, must probably be the default. Otherwise, launching the mount command itself would effect no change (since the filesystem as seen by the parent shell would be unaffected).

Equally clearly, it must also be possible for this necessary propagation to be suppressed, otherwise we can never create a child process whose mount namespace differs from its parent, and we have one global mount namespace again (the filesystem as seen by init).

Thus, when creating a child process with clone(), we must decide whether the child gets its own copy of the parent's data about mounted filesystems, which it can change without affecting the parent, or a pointer to the same data structures as the parent, so that its changes are seen by the parent too (necessary for changes to propagate back, as when you launch mount from the shell).

If the CLONE_NEWNS flag is passed to clone(), the child gets a copy of its parent's mounted filesystem data, which it can change without affecting the parent's mount namespace. Otherwise, it gets a pointer to the parent's mount data structures, where changes made by the child will be seen by the parent (so the mount command itself can work).
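To make the two cases concrete, here is a minimal sketch (not taken from the man pages) that clones a child with CLONE_NEWNS and compares the inode of /proc/self/ns/mnt in parent and child; that inode identifies the mount namespace. The function names, the pipe used for reporting, and the stack size are all assumptions made for illustration. Creating a mount namespace needs CAP_SYS_ADMIN, so without it clone() is expected to fail with EPERM, which the sketch reports as -1.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (1024 * 1024)

/* Inode of /proc/self/ns/mnt, which identifies the caller's mount
 * namespace; 0 if it cannot be read (hypothetical helper). */
static ino_t mount_ns_inode(void)
{
    struct stat st;
    return stat("/proc/self/ns/mnt", &st) == 0 ? st.st_ino : 0;
}

/* Child entry point: report our namespace inode through the pipe. */
static int child_main(void *arg)
{
    int write_fd = *(int *)arg;
    ino_t id = mount_ns_inode();
    return write(write_fd, &id, sizeof id) == (ssize_t)sizeof id ? 0 : 1;
}

/* Returns 1 if the clone()d child is in a different mount namespace,
 * 0 if it shares the parent's, -1 if clone() or the check failed
 * (for example EPERM without CAP_SYS_ADMIN). */
static int clone_gets_new_mount_ns(void)
{
    int fds[2];
    ino_t parent_id = mount_ns_inode();
    ino_t child_id = 0;

    if (parent_id == 0 || pipe(fds) != 0)
        return -1;

    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* The stack grows down on most architectures, so pass its top.
     * Without CLONE_VM the child works on its own copy of memory. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE,
                      CLONE_NEWNS | SIGCHLD, &fds[1]);
    close(fds[1]);
    if (pid < 0) {
        close(fds[0]);
        free(stack);
        return -1;
    }

    ssize_t n = read(fds[0], &child_id, sizeof child_id);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    free(stack);

    if (n != (ssize_t)sizeof child_id || child_id == 0)
        return -1;
    return child_id != parent_id;
}
```

Called from main() as root, clone_gets_new_mount_ns() should report that the child's namespace differs; dropping CLONE_NEWNS from the flags should make the two inodes match.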

Now if I use clone with CLONE_NEWNS to create a child process, does this mean that child will get an exact copy of the mount points in the tree (5 & 6) and still be able to access the rest of the original tree ?

Yes. It sees the exact same tree as its parent after the call to clone().

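Does it also mean that the child could mount 5 & 6 at its will, without affecting what's mounted at 5 or 6 in its parent process's mount namespace?

Yes. Since you've used CLONE_NEWNS, the child can unmount one device from 5 and mount another device there, and only it (and its children) can see the changes. No other process can see the changes made by the child in this case, as long as the mounts involved have private propagation.

If yes, does it also mean that the child could mount / unmount a different directory than 5 or 6 and affect what's visible to the parent process?

No. If you've used CLONE_NEWNS, changes made in the child do not propagate back to the parent, with one caveat: the copy preserves each mount's propagation type, so mounts marked shared (the default on systemd-based systems) still forward mount and unmount events between the two namespaces until the child makes them private, e.g. with mount --make-rprivate /.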
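One point worth sketching: a new mount namespace inherits each mount's propagation type, so where mounts are marked shared, mount events can still travel between the copy and the original. A process that wants full isolation usually marks everything private right after entering the new namespace. The sketch below uses unshare(CLONE_NEWNS), which does for the calling process what clone() with CLONE_NEWNS does for a new child; the function names and the fork-based wrapper are assumptions made for illustration, and the calls need CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Move the caller into a new mount namespace and mark every mount in it
 * private, so mount events no longer propagate to or from the original
 * namespace. Returns 0 on success, -1 with errno set. */
static int enter_private_mount_ns(void)
{
    if (unshare(CLONE_NEWNS) != 0)
        return -1;
    /* Only the propagation type changes here; the source, file system
     * type and data arguments are ignored for this kind of call. */
    return mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
}

/* Try the isolation in a forked child so the caller's own namespace is
 * left alone. Returns 1 if the child isolated itself, 0 if the kernel
 * refused (typically EPERM without CAP_SYS_ADMIN), -1 if fork or wait
 * failed. */
static int try_isolation_in_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(enter_private_mount_ns() == 0 ? 0 : 1);
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

MS_SLAVE in place of MS_PRIVATE would be the choice if the child should keep receiving mounts from the parent while keeping its own changes to itself.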

If you haven't used CLONE_NEWNS, the child would have received a pointer to the same mount namespace data as its parent, and any changes made by the child would be seen by any process that shares those data structures, including the parent. (This is also the case when the new child is created using fork().)
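The fork() case can be checked without any privileges. A small sketch, assuming /proc is mounted and using hypothetical helper names: the child reads the inode of /proc/self/ns/mnt and sends it back over a pipe, and the parent compares it with its own.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Inode of /proc/self/ns/mnt for the calling process; 0 on failure. */
static ino_t current_mount_ns_inode(void)
{
    struct stat st;
    return stat("/proc/self/ns/mnt", &st) == 0 ? st.st_ino : 0;
}

/* Returns 1 if a fork()ed child shares the parent's mount namespace,
 * 0 if it does not, -1 if the check itself failed. */
static int fork_shares_mount_ns(void)
{
    int fds[2];
    ino_t parent_id = current_mount_ns_inode();
    ino_t child_id = 0;

    if (parent_id == 0 || pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: report the namespace we ended up in, then exit. */
        child_id = current_mount_ns_inode();
        ssize_t n = write(fds[1], &child_id, sizeof child_id);
        _exit(n == (ssize_t)sizeof child_id ? 0 : 1);
    }

    close(fds[1]);
    ssize_t n = read(fds[0], &child_id, sizeof child_id);
    close(fds[0]);
    waitpid(pid, NULL, 0);

    if (n != (ssize_t)sizeof child_id || child_id == 0)
        return -1;
    return child_id == parent_id;
}
```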

Emotional °昔
#3 · 2019-02-04 16:25

I don't have enough reputation points to add a comment, so I am adding this as an answer instead. It's just an add-on to Emmet's answer.

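AFAICU, there is a further restriction, which comes from user namespaces rather than from CLONE_NEWNS itself: if the process's mount namespace is owned by a user namespace other than the initial one (for example, it was created with CLONE_NEWUSER | CLONE_NEWNS), the process can only mount file systems whose type has the FS_USERNS_MOUNT flag set. Almost no disk-based file systems set this flag (for security reasons). In do_new_mount, there is this check: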

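        if (user_ns != &init_user_ns) {
                /* Outside the initial user namespace, only file system
                 * types flagged FS_USERNS_MOUNT may be mounted. */
                if (!(type->fs_flags & FS_USERNS_MOUNT)) {
                        put_filesystem(type);
                        return -EPERM;
                }
                /* ... rest of the block omitted in this excerpt ... */
        }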

Please correct me if I am wrong.
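Since the quoted check compares against init_user_ns, it applies when the mount namespace belongs to a non-initial user namespace. A rough sketch of that situation, under several assumptions: unprivileged user namespaces are enabled on the system, /tmp exists, and the helper names are made up for illustration. The child creates both namespaces at once, maps its own IDs to root inside the new user namespace, and then mounts tmpfs, which is one of the file system types that sets FS_USERNS_MOUNT.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write a short string to an existing file; 0 on success, -1 on failure. */
static int write_file(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, strlen(text));
    close(fd);
    return n == (ssize_t)strlen(text) ? 0 : -1;
}

/* Create user and mount namespaces for the caller, map the caller's IDs
 * to root inside the new user namespace, then mount a tmpfs on /tmp.
 * Returns 0 on success, -1 on failure. */
static int mount_tmpfs_in_userns(void)
{
    char map[64];
    uid_t uid = getuid();
    gid_t gid = getgid();

    if (unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0)
        return -1;

    /* setgroups must be denied before an unprivileged gid_map write;
     * a failure is tolerated here because old kernels lack the file. */
    (void)write_file("/proc/self/setgroups", "deny");
    snprintf(map, sizeof map, "0 %lu 1", (unsigned long)uid);
    if (write_file("/proc/self/uid_map", map) != 0)
        return -1;
    snprintf(map, sizeof map, "0 %lu 1", (unsigned long)gid);
    if (write_file("/proc/self/gid_map", map) != 0)
        return -1;

    /* Visible only inside the new mount namespace. */
    return mount("tmpfs", "/tmp", "tmpfs", 0, NULL);
}

/* Run the attempt in a forked child so the caller is unaffected.
 * Returns 1 if the mount succeeded, 0 if any step was refused,
 * -1 if fork or wait failed. */
static int try_userns_tmpfs_in_child(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(mount_tmpfs_in_userns() == 0 ? 0 : 1);
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

Swapping "tmpfs" for a disk-based type such as ext4 (with a block device as the source) is where the -EPERM from the check above would be expected.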
