From the book "Hadoop The Definitive Guide", under the topic Namenodes and Datanodes it is mentioned that:
The namenode manages the filesystem namespace. It maintains the
filesystem tree and the metadata for all the files and directories in
the tree. This information is stored persistently on the local disk in
the form of two files: the namespace image and the edit log.
secondary namenode, which despite its name does not act as a namenode.
Its main role is to periodically merge the namespace image with the
edit log to prevent the edit log from becoming too large.
I am confused about these two files, the namespace image and the edit log.
As I understand it, the namespace image stores the metadata.
So, my questions are
- What is the edit log? And what is its role?
- Can you explain the statement "Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large."?
Can anyone please explain what the edit log is and what role this log file plays?
When the NameNode first starts up, the fsimage file is empty. Whenever the NameNode receives a create/update/delete request, the request is first recorded in the edits file for durability. Once it has been persisted there, the in-memory metadata is updated as well, because all read requests are served from the in-memory snapshot of the metadata.
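The write path described above can be sketched roughly as follows. This is a toy model, not Hadoop code: the class `ToyNameNode`, its methods, and the JSON-lines log format are all assumptions made for illustration.

```python
import json
import os
import tempfile


class ToyNameNode:
    """Toy model of the write path: log first, then update memory."""

    def __init__(self, edits_path):
        self.edits_path = edits_path
        self.namespace = {}  # in-memory metadata: path -> attributes

    def _log(self, op):
        # Append the operation to the edits file and force it to disk
        # before the in-memory state is touched.
        with open(self.edits_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def create(self, path, replication=3):
        self._log({"op": "create", "path": path, "replication": replication})
        self.namespace[path] = {"replication": replication}

    def delete(self, path):
        self._log({"op": "delete", "path": path})
        self.namespace.pop(path, None)

    def get(self, path):
        # Reads are served from memory only; the edits file is never read here.
        return self.namespace.get(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        nn = ToyNameNode(os.path.join(d, "edits"))
        nn.create("/user/a.txt")
        nn.create("/user/b.txt", replication=2)
        nn.delete("/user/a.txt")
        print(nn.get("/user/b.txt"))
```

Note that every mutation appends one record, so the log only ever grows, even when a later operation cancels an earlier one.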
Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
So the edits file keeps growing without bound at this point. If the NameNode is restarted, or goes down for some reason and is brought back up, it has no in-memory representation of the metadata, so it has to read the edits file and rebuild the in-memory snapshot, which can take a while depending on the size of the edits file.
Because edits is a WAL (write-ahead log), all events have to be written one after another (append-only). There are no in-place updates to the file, which avoids random disk seeks.
To prevent this overhead (or to keep the edits file manageable), the SecondaryNameNode (SNN) was introduced. The sole purpose of the SNN is to make sure the edits file does not grow without bound. By default, the SNN triggers a process called checkpointing whenever the edits file reaches 64 MB or every hour, whichever comes first.
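The trigger rule amounts to a simple "either threshold" check. Here is a minimal sketch, with hypothetical constant and function names. In Hadoop 1.x these thresholds were, as far as I know, set through `fs.checkpoint.size` and `fs.checkpoint.period`, but verify the property names against the documentation for your version.

```python
CHECKPOINT_SIZE_BYTES = 64 * 1024 * 1024  # 64 MB, the default size mentioned above
CHECKPOINT_PERIOD_SECONDS = 60 * 60       # one hour, the default period mentioned above


def should_checkpoint(edits_size_bytes, seconds_since_last_checkpoint):
    """Return True when either threshold has been reached."""
    return (
        edits_size_bytes >= CHECKPOINT_SIZE_BYTES
        or seconds_since_last_checkpoint >= CHECKPOINT_PERIOD_SECONDS
    )


if __name__ == "__main__":
    # Small log, recent checkpoint: nothing to do yet.
    print(should_checkpoint(10 * 1024 * 1024, 600))
    # Log has hit the size limit even though little time has passed.
    print(should_checkpoint(64 * 1024 * 1024, 600))
```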
The checkpointing process itself is simple. The SNN tells the NN to roll its current edits log and create a new edits file called edits.new. The SNN then copies the fsimage and edits files from the NN and applies the events in the edits file to the existing fsimage (the copy brought over from the NN). Once this completes, the new fsimage file is sent back to the NN, and the NN replaces its existing fsimage with the one sent by the SNN and renames edits.new to edits. The NN now has a current fsimage that includes the events from the old edits file.
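The checkpoint steps can be modeled in memory like this. It is only a sketch under simplifying assumptions: `ToyNN`, `apply_edits`, and `checkpoint` are hypothetical names, the fsimage is a plain dict, the edits log is a list, and the network copies are ordinary Python copies.

```python
def apply_edits(image, edits):
    """Return a new image with every logged operation applied in order."""
    merged = dict(image)
    for op in edits:
        if op["op"] == "create":
            merged[op["path"]] = {"replication": op["replication"]}
        elif op["op"] == "delete":
            merged.pop(op["path"], None)
    return merged


class ToyNN:
    def __init__(self):
        self.fsimage = {}      # last checkpointed namespace
        self.edits = []        # operations logged since that checkpoint
        self.edits_new = None  # exists only while a checkpoint is running

    def log(self, op):
        # While a checkpoint is in progress, new operations go to edits.new.
        target = self.edits_new if self.edits_new is not None else self.edits
        target.append(op)

    def roll_edits(self):
        self.edits_new = []

    def install_checkpoint(self, new_image):
        # Replace the fsimage and rename edits.new to edits.
        self.fsimage = new_image
        self.edits = self.edits_new
        self.edits_new = None


def checkpoint(nn):
    nn.roll_edits()                                            # NN starts edits.new
    image_copy, edits_copy = dict(nn.fsimage), list(nn.edits)  # SNN fetches both files
    new_image = apply_edits(image_copy, edits_copy)            # SNN merges them
    nn.install_checkpoint(new_image)                           # NN installs the result


if __name__ == "__main__":
    nn = ToyNN()
    nn.log({"op": "create", "path": "/a", "replication": 3})
    nn.log({"op": "create", "path": "/b", "replication": 3})
    nn.log({"op": "delete", "path": "/a"})
    checkpoint(nn)
    print(nn.fsimage, nn.edits)
```

Rolling the log first is what lets the NN keep accepting writes while the merge runs: anything logged in the meantime lands in edits.new and survives the swap.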
So if the NameNode is restarted after checkpointing has completed, it only has to load the fsimage into memory and apply the recent updates from the edits log (those written after the checkpoint) to get an up-to-date view of the namespace, which is much more efficient.
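To make the restart saving concrete, here is a small illustration with made-up numbers (1000 logged operations, a checkpoint taken after the first 990). The function name `replay` and the tuple-based log format are assumptions for this example only.

```python
def replay(image, edits):
    """Rebuild the namespace from an image plus a list of (op, path) edits.

    Returns the namespace and the number of operations replayed.
    """
    namespace = dict(image)
    replayed = 0
    for op, path in edits:
        if op == "create":
            namespace[path] = {}
        elif op == "delete":
            namespace.pop(path, None)
        replayed += 1
    return namespace, replayed


if __name__ == "__main__":
    all_edits = [("create", f"/data/file{i}") for i in range(1000)]

    # Restart without any checkpoint: empty fsimage, full log replay.
    ns_full, n_full = replay({}, all_edits)

    # Restart after a checkpoint that covered the first 990 operations.
    fsimage, _ = replay({}, all_edits[:990])
    ns_ckpt, n_ckpt = replay(fsimage, all_edits[990:])

    print(n_full, n_ckpt, ns_full == ns_ckpt)  # prints: 1000 10 True
```

Both restarts end with the same namespace; the checkpointed one just replays far fewer operations on top of the loaded image.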