So I created i3.large with NVME disk on each nodes, here was my process :
- lsblk -> nvme0n1 (check if nvme isn't yet mounted)
- sudo mkfs.ext4 -E nodiscard /dev/nvme0n1
- sudo mount -o discard /dev/nvme0n1 /mnt/my-data
- /dev/nvme0n1 /mnt/my-data ext4 defaults,nofail,discard 0 2
- sudo mount -a (check if everything is OK)
- sudo reboot
So all of this works, I can connect back to the instance. I have 500 Go on my new partition.
But after I stop and restart the EC2 machines, some of them randomly became inaccessible (AWS warning only 1/2 test status checked)
When I watch the logs of why it is inaccessible it tells me, it's about the nvme partition (but I did sudo mount -a to check if this was ok, so I don't understand)
I don't have the AWS logs exactly, but I got some lines of it :
Bad magic number in super-block while trying to open
then the superblock is corrupt, and you might try running e2fsck with an alternate superblock:
/dev/fd/9: line 2: plymouth: command not found
Stopping and starting an instance erases the ephemeral disks, moves the instance to new host hardware, and gives you new empty disks... so the ephemeral disks will always be blank after stop/start. When an instance is stopped, it doesn't exist on any physical host -- the resources are freed.
So, the best approach, if you are going to be stopping and starting instances is not to add them to
/etc/fstab
but rather to just format them on first boot and mount them after that. One way of testing whether a filesystem is already present is using thefile
utility andgrep
its output. If grep doesn't find a match, it returns false.The NVMe SSD on the i3 instance class is an example of an Instance Store Volume, also known as an Ephemeral [ Disk | Volume | Drive ]. They are physically inside the instance and extremely fast, but not redundant and not intended for persistent data... hence, "ephemeral." Persistent data needs to be on an Elastic Block Store (EBS) volume or an Elastic File System (EFS), both of which survive instance stop/start, hardware failures, and maintenance.
It isn't clear why your instances are failing to boot, but
nofail
may not be doing what you expect when a volume is present but has no filesystem. My impression has been that eventually it should succeed.But, you may need to
apt-get install linux-aws
if running Ubuntu 16.04. Ubuntu 14.04 NVMe support is not really stable and not recommended.Each of these three storage solutions has its advantages and disadvantages.
The Instance Store is local, so it's quite fast... but, it's ephemeral. It survives hard and soft reboots, but not stop/start cycles. If your instance suffers a hardware failure, or is scheduled for retirement, as eventually happens to all hardware, you will have to stop and start the instance to move it to new hardware. Reserved and dedicated instances don't change ephemeral disk behavior.
EBS is persistent, redundant storage, that can be detached from one instance and moved to another (and this happens automatically across a stop/start). EBS supports point-in-time snapshots, and these are incremental at the block level, so you don't pay for storing the data that didn't change across snapshots... but through some excellent witchcraft, you also don't have to keep track of "full" vs. "incremental" snapshots -- the snapshots are only logical containers of pointers to the backed-up data blocks, so they are in essence, all "full" snapshots, but only billed as incrememental. When you delete a snapshot, only the blocks no longer needed to restore either that snapshot and any other snapshot are purged from the back-end storage system (which, transparent to you, actually uses Amazon S3).
EBS volumes are available as both SSD and spinning platter magnetic volumes, again with tradeoffs in cost, performance, and appropriate applications. See EBS Volume Types. EBS volumes mimic ordinary hard drives, except that their capacity can be manually increased on demand (but not decreased), and can be converted from one volume type to another without shutting down the system. EBS does all of the data migration on the fly, with a reduction in performance but no disruption. This is a relatively recent innovation.
EFS uses NFS, so you can mount an EFS filesystem on as many instances as you like, even across availability zones within one region. The size limit for any one file in EFS is 52 terabytes, and your instance will actually report 8 exabytes of free space. The actual free space is for all practical purposes unlimited, but EFS is also the most expensive -- if you did have a 52 TiB file stored there for one month, that storage would cost over $15,000. The most I ever stored was about 20 TiB for 2 weeks, cost me about $5k but if you need the space, the space is there. It's billed hourly, so if you stored the 52 TiB file for just a couple of hours and then deleted it, you'd pay maybe $50. The "Elastic" in EFS refers to the capacity and the price. You don't pre-provision space on EFS. You use what you need and delete what you don't, and the billable size is calculated hourly.
A discussion of storage wouldn't be complete without S3. It's not a filesystem, it's an object store. At about 1/10 the price of EFS, S3 also has effectively infinite capacity, and a maximum object size of 5TB. Some applications would be better designed using S3 objects, instead of files.
S3 can also be easily used by systems outside of AWS, whether in your data center or in another cloud. The other storage technologies are intended for use inside EC2, though there is an undocumented workaround that allows EFS to be used externally or across regions, with proxies and tunnels.
You may find useful new EC2 instance family equipped with local NVMe storage: C5d.
See announcement blog post: https://aws.amazon.com/blogs/aws/ec2-instance-update-c5-instances-with-local-nvme-storage-c5d/
Some excerpts from the blog post:
I just had a similar experience! My C5.xlarge instance detects an EBS as nvme1n1. I have added this line in fstab.
After a couple of rebooting, it looked working. It kept running for weeks. But today, I just got alert that instance was unable to be connected. I tried rebooting it from AWS console, no luck looks the culprit is the fstab. The disk mount is failed.
I raised the ticket to AWS support, no feedback yet. I have to start a new instance to recover my service.
In another test instance, I try to use UUID(get by command blkid) instead of /dev/nvme1n1. So far looks still working... will see if it cause any issue.
I will update here if any AWS support feedback.
================ EDIT with my fix ===========
AWS doesn't give me feedback yet, but I found the issue. Actually, in fstab, whatever you mount /dev/nvme1n1 or UUID, it doesn't matter. My issue is, my ESB has some errors in file system. I attached it to an instance then run
After fixes a couple of file system error, put it in fstab, reboot, no problem anymore!