Do You REALLY Know inotify?
General Overview
Whenever you want to monitor a file or directory with inotify, you call inotify_init() to get a file descriptor known as an “inotify instance”. You then create a watcher for your path and attach it to that instance with the inotify_add_watch() syscall. By doing so, you tell the fsnotify subsystem which events you want to monitor on the specified path. Then you can poll, or simply read, from the inotify instance and parse the buffer you get back as one inotify_event structure or a series of them. Finally, you can use inotify_rm_watch() to remove the watcher from the instance's list of watchers.
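The whole cycle described above can be sketched from userspace in a few lines. This is a minimal sketch rather than a robust watcher. The helper name first_event_after_create() is made up for illustration, and it uses inotify_init1(IN_NONBLOCK) instead of plain inotify_init() so that the read cannot hang. It also triggers the event itself, by creating a file inside a temporary directory that it watches.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Hypothetical helper: watch a fresh temporary directory for IN_CREATE,
 * create `name` inside it, then read one event back. Returns the event
 * mask (0 on any failure) and copies the reported file name into `seen`.
 */
uint32_t first_event_after_create(const char *name, char *seen, size_t seen_len)
{
    char dir[] = "/tmp/inotify-demo-XXXXXX";
    char path[4096];
    /* Buffer aligned for struct inotify_event, as inotify(7) suggests. */
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    uint32_t mask = 0;

    if (mkdtemp(dir) == NULL)
        return 0;

    int ifd = inotify_init1(IN_NONBLOCK);                 /* the instance */
    int wd = (ifd >= 0) ? inotify_add_watch(ifd, dir, IN_CREATE) : -1;

    /* Trigger an event ourselves so the example is deterministic. */
    snprintf(path, sizeof(path), "%s/%s", dir, name);
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd >= 0)
        close(fd);

    if (wd >= 0) {
        /* The buffer holds one or more variable-length records. */
        ssize_t n = read(ifd, buf, sizeof(buf));
        if (n >= (ssize_t)sizeof(struct inotify_event)) {
            const struct inotify_event *ev = (const struct inotify_event *)buf;
            mask = ev->mask;
            if (ev->len > 0)
                snprintf(seen, seen_len, "%s", ev->name);
        }
        inotify_rm_watch(ifd, wd);                        /* drop the watch */
    }

    if (ifd >= 0)
        close(ifd);
    unlink(path);
    rmdir(dir);
    return mask;
}
```

Calling first_event_after_create("hello.txt", seen, sizeof(seen)) should return a mask with IN_CREATE set and leave the file name in seen.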
inotify Instance Initialization
Definition of inotify_init
The definition of the inotify_init() system call is provided below:
SYSCALL_DEFINE0(inotify_init)
{
    return do_inotify_init(0);
}
Seems simple enough. Let us check what happens inside do_inotify_init(). I’ll show only the key lines here:
/* inotify syscalls */
static int do_inotify_init(int flags)
{
    struct fsnotify_group *group;
    int ret;

    group = inotify_new_group(inotify_max_queued_events);
    ret = anon_inode_getfd("inotify", &inotify_fops, group,
                           O_RDONLY | flags);
    return ret;
}
A group is a “thing” that wants to be notified of filesystem events. Each group must provide its own fsnotify_ops. All of inotify's fsnotify_ops can be found in inotify_fsnotify.c, and the operation that we care about the most is handle_inode_event, which points to inotify_handle_inode_event(). I will come back to this function shortly, but for now, let’s move on.
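To make the idea of a group with an ops table more concrete, here is a tiny userspace model. None of the names below are the kernel's: model_group, model_ops and model_notify() are invented for illustration, and the real fsnotify_group and fsnotify_ops carry far more state.

```c
#include <stdint.h>

struct model_group;

/* Invented stand-in for fsnotify_ops: a table of callbacks. */
struct model_ops {
    int (*handle_inode_event)(struct model_group *group, uint32_t mask);
};

/* Invented stand-in for fsnotify_group. */
struct model_group {
    const struct model_ops *ops;
    unsigned int queued;   /* how many events this group has queued */
    uint32_t last_mask;    /* mask of the most recent event */
};

/* What an inotify-like backend might do with an event: just queue it. */
static int model_inotify_handle(struct model_group *group, uint32_t mask)
{
    group->queued++;
    group->last_mask = mask;
    return 0;
}

static const struct model_ops model_inotify_ops = {
    .handle_inode_event = model_inotify_handle,
};

/* The core does not know which backend it talks to; it only calls the op. */
int model_notify(struct model_group *group, uint32_t mask)
{
    return group->ops->handle_inode_event(group, mask);
}
```

The point of the indirection is that the notifying side stays the same no matter which listener sits behind the group.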
inotify_new_group() allocates a group and sets its ops to inotify_fsnotify_ops. It then allocates kernel memory for an inotify_event_info structure. This is quite similar to the inotify_event structure that we usually use in userspace, but it has an additional member named fse, which is a fsnotify_event. That member embeds a list_head, which lets the kernel link events together and iterate over all of the information about the original object that has to be sent to a group.
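The embedding trick is easier to see in a small userspace model. The sketch below is not kernel code: the model_ prefixed types are simplified stand-ins I made up, and the list helpers are reduced to the bare minimum. It only shows how a queue of list nodes can be walked and turned back into the outer event structures with container_of.

```c
#include <stddef.h>

/* Minimal doubly linked list node, in the spirit of the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-ins for fsnotify_event and inotify_event_info. */
struct model_fsnotify_event { struct list_head list; };

struct model_inotify_event_info {
    struct model_fsnotify_event fse;   /* generic part, linked into a queue */
    unsigned int mask;                 /* inotify-specific part */
    int wd;
};

static void model_list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

static void model_list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Walk the queue and recover each outer structure from its list node. */
int sum_queued_wds(struct list_head *queue)
{
    int sum = 0;
    for (struct list_head *p = queue->next; p != queue; p = p->next) {
        struct model_fsnotify_event *fse =
            container_of(p, struct model_fsnotify_event, list);
        struct model_inotify_event_info *info =
            container_of(fse, struct model_inotify_event_info, fse);
        sum += info->wd;
    }
    return sum;
}
```

The queue itself only knows about list nodes. The code that consumes it decides which outer type to recover.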
The anon_inode_getfd() function brings up a mysterious topic, which I’ll discuss in detail. I wrote a simple program that uses inotify and placed a breakpoint at its inotify_init() call. Once the call had returned, I listed the process's file descriptors and used lsof to figure out what was going on. The output is shown below:
(gdb) frame
#0 main () at notify.c:10
10 inotify_fd = inotify_init();
(gdb) n
(gdb) info inferiors
Num Description Connection Executable
* 1 process 12865 1 (native) /tmp/notify
(gdb) !ls -l /proc/12865/fd
total 0
lrwx------ 1 iman iman 64 Aug 28 12:59 0 -> /dev/pts/8
lrwx------ 1 iman iman 64 Aug 28 12:59 1 -> /dev/pts/8
lrwx------ 1 iman iman 64 Aug 28 12:59 2 -> /dev/pts/8
lr-x------ 1 iman iman 64 Aug 28 12:59 3 -> anon_inode:inotify
(gdb) !lsof -p 12865 | grep a_inode
notify 12865 iman 3r a_inode 0,14 0 1048 inotify
It appears that fd number 3 is linked to a file called anon_inode:inotify. The following quote is from the proc_pid_fd(5) manual page:
For file descriptors that have no corresponding inode (e.g., file descriptors produced by bpf(2), epoll_create(2), eventfd(2), inotify_init(2), perf_event_open(2), signalfd(2), timerfd_create(2), and userfaultfd(2)), the entry will be a symbolic link with contents of the form
anon_inode:<file-type>
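You do not need gdb to see this link. A program can read its own /proc/self/fd entry with readlink(). The helper below is a small sketch of that; fd_link_target() is a name I chose for this example, not an existing API.

```c
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copies the target of /proc/self/fd/<fd> into `out`.
 * Returns the length of the target, or -1 on error.
 */
ssize_t fd_link_target(int fd, char *out, size_t out_len)
{
    char link[64];

    if (out_len == 0)
        return -1;
    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);

    ssize_t n = readlink(link, out, out_len - 1);
    if (n >= 0)
        out[n] = '\0';   /* readlink() does not NUL-terminate */
    return n;
}
```

For a descriptor returned by inotify_init(), the target should be the same anon_inode:inotify string that ls -l printed in the gdb session.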
The lsof output also indicates that our inotify instance lives on some device, and that this device has a major number of zero. The kernel's devices.txt file shows that major number 0 is reserved for unnamed devices, but what exactly is this anon_inode?
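The zero major number reported by lsof can also be checked from inside the program with fstat(). This is a minimal sketch and fd_device_numbers() is a hypothetical helper name. It assumes glibc's major() and minor() macros from <sys/sysmacros.h>.

```c
#include <sys/inotify.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

/*
 * Fills *maj and *min with the numbers of the device that holds the
 * inode behind `fd`. Returns 0 on success, -1 on error.
 */
int fd_device_numbers(int fd, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    *maj = major(st.st_dev);
    *min = minor(st.st_dev);
    return 0;
}
```

On an inotify descriptor I would expect a major number of 0, matching the 0,14 pair in the lsof output above. The minor number will differ from system to system.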
Demystifying anon_inode
There is a pseudo-filesystem called anon_inodefs that sits on a virtual (unnamed) block device and handles all of its operations in memory. Like every filesystem, it has a superblock that describes it. As expected, inotify defines a file_operations for its files on anon_inodefs and passes it to anon_inode_getfd(). The file_system_type of anon_inodefs provides init_fs_context and kill_sb, which allow the kernel to create or free the context and the virtual block device of anon_inodefs as needed.
static struct file_system_type anon_inode_fs_type = {
    .name            = "anon_inodefs",
    .init_fs_context = anon_inodefs_init_fs_context,
    .kill_sb         = kill_anon_super,
};
There is also a dentry_operations. Do you remember the command ls -l /proc/<pid>/fd? The code below is what the kernel runs to build the link target, and it is rather straightforward. In this scenario, dentry->d_name.name equals “inotify”:
/*
 * anon_inodefs_dname() is called from d_path().
 */
static char *anon_inodefs_dname(struct dentry *dentry, char *buffer, int buflen)
{
    return dynamic_dname(buffer, buflen, "anon_inode:%s",
                         dentry->d_name.name);
}

static const struct dentry_operations anon_inodefs_dentry_operations = {
    .d_dname = anon_inodefs_dname,
};
Paths in this filesystem look like anon_inode:<name>, whereas in other filesystems, such as ext4, they look like ‘/path/to/file’. Either way, they are just strings, with no magic involved. This dentry_operations gets assigned to the anon_inodefs context in the anon_inodefs_init_fs_context() function.
static int anon_inodefs_init_fs_context(struct fs_context *fc)
{
    // [...]
    struct pseudo_fs_context *ctx = init_pseudo(fc, ANON_INODE_FS_MAGIC);
    ctx->dops = &anon_inodefs_dentry_operations;
    // [...]
}
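The ANON_INODE_FS_MAGIC value passed to init_pseudo() should be visible from userspace as well. The sketch below assumes that fstatfs() on an inotify descriptor reports the superblock magic in f_type, which is what I would expect for a pseudo-filesystem that uses the default superblock operations. The helper name fd_is_on_anon_inodefs() is invented for this example.

```c
#include <linux/magic.h>   /* ANON_INODE_FS_MAGIC */
#include <sys/inotify.h>
#include <sys/vfs.h>
#include <unistd.h>

/* Returns 1 if `fd` lives on anon_inodefs, 0 if it does not, -1 on error. */
int fd_is_on_anon_inodefs(int fd)
{
    struct statfs sfs;

    if (fstatfs(fd, &sfs) != 0)
        return -1;
    /* f_type carries the magic number stored in the superblock. */
    return sfs.f_type == ANON_INODE_FS_MAGIC;
}
```

A descriptor from inotify_init() should give 1, while a pipe or a regular file should give 0.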
The init_pseudo() function is a common helper for pseudo-filesystems. It is used by filesystems that cannot be mounted from userspace, such as sockfs and pipefs, to set up their filesystem context. It simply allocates a pseudo_fs_context and fills in the context for us. If you follow pseudo_fs_get_tree(), the get_tree operation of pseudo_fs_context_ops, you will find the function that actually sets up the block device for anon_inodefs: get_anon_bdev(dev_t *p).
int get_anon_bdev(dev_t *p)
{
    int dev;

    dev = ida_alloc_range(&unnamed_dev_ida, 1, (1 << MINORBITS) - 1,
                          GFP_ATOMIC);
    // [...]
    *p = MKDEV(0, dev);
    return 0;
}
It allocates a unique ID from unnamed_dev_ida (an IDA, which is built on top of the XArray) by calling ida_alloc_range(), uses that ID as the minor number, sets the major number to zero, and stores MKDEV(0, minor) through the pointer p. As you may already know, dev_t is ultimately a typedef of u32 that stores a device’s major and minor numbers. The super_block structure stores this dev_t for us, and lsof can find out these numbers by asking the kernel for them.
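To see how the two numbers share one u32, here is a userspace copy of the packing helpers. The macros mirror the ones in the kernel's include/linux/kdev_t.h as I remember them (a 20-bit minor in the low bits). model_dev_t and model_anon_dev() are names invented for this sketch.

```c
#include <stdint.h>

/* Userspace copies of the kernel-internal dev_t helpers. */
#define MINORBITS     20
#define MINORMASK     ((1U << MINORBITS) - 1)
#define MAJOR(dev)    ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev)    ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

typedef uint32_t model_dev_t;   /* stand-in for the kernel's dev_t */

/* What get_anon_bdev() stores through its pointer, given an allocated ID. */
model_dev_t model_anon_dev(unsigned int minor_id)
{
    return MKDEV(0, minor_id);
}
```

With a major number of zero, the packed value is simply the minor number, so the device shown by lsof as 0,14 would be stored as 14. Note that this is the in-kernel encoding; the st_dev value that userspace gets from stat() is encoded differently.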
anon_inode_getfd Function
Now that we know what anon_inodefs actually is, we can finally talk about anon_inode_getfd(). It returns a file descriptor for a file on the anon_inodefs filesystem whose file_operations is inotify_fops and whose private data is our group.
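inotify is not the only user of this mechanism. The manual page quoted earlier lists several other syscalls whose descriptors show up as anon_inode links. The sketch below checks only the anon_inode: prefix, since the exact names differ per subsystem. fd_is_anon_inode() is a helper name made up for this example.

```c
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Returns 1 if /proc/self/fd/<fd> points at an "anon_inode:" name,
 * 0 if it points elsewhere, -1 on error.
 */
int fd_is_anon_inode(int fd)
{
    char link[64], target[128];

    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    ssize_t n = readlink(link, target, sizeof(target) - 1);
    if (n < 0)
        return -1;
    target[n] = '\0';
    return strncmp(target, "anon_inode:", 11) == 0;
}
```

Descriptors from eventfd() and epoll_create1() should pass this check, while a pipe end should not, because pipes live on pipefs instead.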
inotify_add_watch() System Call
Again, I’ll share only the important lines (with some comments):
SYSCALL_DEFINE3(inotify_add_watch, int fd, const char __user * pathname,
                u32 mask)
{
    struct fsnotify_group *group;
    struct inode *inode;
    struct path path;
    struct fd f;
    int ret;
    unsigned flags = 0;

    f = fdget(fd);

    if (!(mask & IN_DONT_FOLLOW))
        flags |= LOOKUP_FOLLOW; // Dereference pathname if it is a symbolic link.
    if (mask & IN_ONLYDIR)
        flags |= LOOKUP_DIRECTORY; // Find inode of *pathname* only if it is a directory

    ret = inotify_find_inode(pathname, &path, flags, (mask & IN_ALL_EVENTS));

    /* inode held in place by reference to path; group by fget on fd */
    inode = path.dentry->d_inode;
    group = f.file->private_data;

    /* create/update an inode mark */
    ret = inotify_update_watch(group, inode, mask);
    return ret;
}
fdget() converts an fd number into an fd structure, which gives us access to the file structure. This lets inotify_add_watch() read the private_data of the file, which in our case is the group.
The inotify_find_inode() function, as its name suggests, resolves the user-provided path to its inode.
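The effect of the lookup flags is visible from userspace. With IN_ONLYDIR set, the lookup insists on a directory, so watching a regular file fails with ENOTDIR, as inotify(7) documents. The helper try_watch() below is a hypothetical name for this sketch.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Returns 0 if a watch on `path` could be added, otherwise the errno. */
int try_watch(const char *path, uint32_t mask)
{
    int ifd = inotify_init();
    if (ifd < 0)
        return errno;

    int wd = inotify_add_watch(ifd, path, mask);
    int err = (wd < 0) ? errno : 0;   /* save errno before close() */

    close(ifd);
    return err;
}
```

For example, try_watch("/proc/self/exe", IN_MODIFY) should succeed, since the symbolic link is followed to a regular file, while adding IN_ONLYDIR to the same call should return ENOTDIR.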
inotify_update_watch() updates the watch and returns its watch descriptor; if you’re not watching that path already, it creates a new watch for you. Marks (fsnotify_mark objects, to be precise) are attached to in-core inodes and allow fsnotify listeners to include or exclude events that match a specific mask. Each group has its own mark on an inode that it watches.
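The update-or-create behaviour can be observed from userspace: adding a watch for a path that the instance already watches returns the same watch descriptor instead of a new one. A minimal sketch, with second_add_reuses_watch() being a name I made up:

```c
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Adds a watch for `path` twice on the same instance.
 * Returns 1 if both calls yield the same watch descriptor,
 * 0 if they differ, -1 on error.
 */
int second_add_reuses_watch(const char *path)
{
    int ifd = inotify_init();
    if (ifd < 0)
        return -1;

    int wd1 = inotify_add_watch(ifd, path, IN_CREATE);
    /* Same inode, different mask: this updates the existing watch. */
    int wd2 = inotify_add_watch(ifd, path, IN_CREATE | IN_DELETE);
    int result = (wd1 < 0 || wd2 < 0) ? -1 : (wd1 == wd2);

    close(ifd);
    return result;
}
```

Note that the second call replaces the mask of the watch unless IN_MASK_ADD is set.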
Relation Between VFS and fsnotify
Whenever you call the write() or read() syscalls in your program, the kernel eventually calls vfs_write() or vfs_read() for you. If vfs_read() reads more than 0 bytes, fsnotify_access(file) gets called on that file, and in vfs_write(), fsnotify_modify(file) notifies the fsnotify subsystem. If the file is the root (e.g., you’ve marked a file only), fsnotify() gets called immediately; otherwise, the kernel looks up the parent and then calls fsnotify() on that dentry. The real magic happens in the fsnotify() function. This is where fsnotify delivers the event to the group through the send_to_group() function. Remember the inotify_handle_inode_event() that I mentioned earlier? It comes in handy here: it inserts the event into your group->notification_list. The user won’t be aware of this event until they read from the inotify instance. Ignoring all of the locking and waitqueue handling, inotify_read() looks like this:
static ssize_t inotify_read(struct file *file, char __user *buf,
                            size_t count, loff_t *pos)
{
    struct fsnotify_group *group;
    struct fsnotify_event *kevent;
    char __user *start;
    int ret;

    start = buf;
    group = file->private_data;

    while (1) {
        kevent = get_one_event(group, count);
        if (kevent) {
            ret = copy_event_to_user(group, kevent, buf);
            fsnotify_destroy_event(group, kevent);
            buf += ret;
            count -= ret;
            continue;
        }
        /* Nothing more to copy: stop once at least one event was read. */
        if (start != buf)
            break;
        /* (otherwise the real code sleeps on the waitqueue here) */
    }
    /* The real function reports the total number of bytes copied. */
    ret = buf - start;
    return ret;
}
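Seen from userspace, one read() on the instance can therefore return several records back to back. The sketch below watches a scratch file, writes to it and reads from it, and then walks the returned buffer record by record. It is a minimal sketch with an invented helper name, masks_after_write_and_read(), and it uses a non-blocking instance so that it cannot hang.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Writes to and reads from a scratch file while watching it.
 * Returns the OR of all event masks that were read, or 0 on failure.
 */
uint32_t masks_after_write_and_read(void)
{
    char path[] = "/tmp/inotify-rw-XXXXXX";
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    char data[5];
    uint32_t seen = 0;

    int fd = mkstemp(path);
    if (fd < 0)
        return 0;

    int ifd = inotify_init1(IN_NONBLOCK);
    if (ifd >= 0 && inotify_add_watch(ifd, path, IN_MODIFY | IN_ACCESS) >= 0) {
        /* Both calls move more than 0 bytes, so both should notify. */
        if (write(fd, "hello", 5) == 5 && lseek(fd, 0, SEEK_SET) == 0 &&
            read(fd, data, sizeof(data)) == 5) {
            ssize_t n = read(ifd, buf, sizeof(buf));
            /* Each record is a header followed by ev->len name bytes. */
            for (char *p = buf; n > 0 && p < buf + n; ) {
                const struct inotify_event *ev =
                    (const struct inotify_event *)p;
                seen |= ev->mask;
                p += sizeof(*ev) + ev->len;
            }
        }
    }

    if (ifd >= 0)
        close(ifd);
    close(fd);
    unlink(path);
    return seen;
}
```

The returned value should contain both IN_MODIFY (from the write) and IN_ACCESS (from the read).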
get_one_event() fetches the next event from the group if it is small enough to fit in count, and copy_event_to_user() writes that event to the userspace buffer that you passed to your read() call.
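The size check in get_one_event() has a visible consequence: if the buffer you pass to read() cannot hold the next event, nothing is copied. According to inotify(7), read() fails with EINVAL in that case on kernels since 2.6.21. The sketch below provokes this on purpose; errno_for_short_read() is a hypothetical helper name, and the event is produced by a chmod() on a watched temporary directory.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Queues one event and tries to read it into a buffer that is too small.
 * Returns the errno from read(), 0 if read() unexpectedly succeeded,
 * or -1 if the setup failed.
 */
int errno_for_short_read(void)
{
    char dir[] = "/tmp/inotify-short-XXXXXX";
    char tiny[sizeof(struct inotify_event) - 1];   /* can never fit an event */
    int err = -1;

    if (mkdtemp(dir) == NULL)
        return -1;

    int ifd = inotify_init1(IN_NONBLOCK);
    if (ifd >= 0 && inotify_add_watch(ifd, dir, IN_ATTRIB) >= 0 &&
        chmod(dir, 0755) == 0) {                   /* queues an IN_ATTRIB event */
        ssize_t n = read(ifd, tiny, sizeof(tiny));
        err = (n < 0) ? errno : 0;
    }

    if (ifd >= 0)
        close(ifd);
    rmdir(dir);
    return err;
}
```

This is why the examples in inotify(7) use a buffer of several kilobytes: it leaves room for the header plus a name of up to NAME_MAX bytes.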