Friday, November 4, 2016

VMware ESXi - I/O Block Size in Virtual Environments

This blog article is about I/O sizes and block sizes in virtualized environments. I am sure you have come across this topic if you deal with databases or similar systems. Do you remember that it would be best to keep a Microsoft SQL DB on a 64 KB formatted volume? Or was that the NTFS allocation size? And wait, was the storage system involved here as well? If you still believe that a Microsoft SQL Server operates only with 64 KB blocks, I can tell you that this is not true: it uses many different block sizes depending on what the SQL DB is doing. There is clearly confusion between the I/O size and the NTFS allocation unit size, the VMFS block size or the NFS block size. On top of that you have an underlying storage system which is organised in volumes and uses metadata structures to organise the underlying physical disks or flash. This blog article hopefully sheds a little light on this. The following figure shows a 64 KiB write I/O traversing the different levels we use in a virtualized environment.

Figure 1: I/O workflow

Sectors and Clusters

Before we get into the Windows NTFS file system it is important to understand what a sector and a cluster are. A sector is the smallest physical storage unit on a disk. The standard sector size is 512 bytes and was introduced with the inception of hard disk drives. There are drives on the market with 4096-byte sectors, but most spinning disks today still use 512 bytes as their sector size.

To reduce the overhead of on-disk data structures, contiguous groups of sectors called clusters were introduced. A cluster, also called an allocation unit, is the unit of disk space allocation for files and directories. Logically, a 4 KiB (4096 bytes) cluster contains 8 sectors (8 x 512 bytes).

Flash devices are organised in sector-like pages (for example 8 KiB in size), which are grouped into blocks and planes instead of the platters and spindles of the physical disk world. For backward compatibility, flash devices include a Flash Translation Layer (FTL) which translates a Logical Block Address (LBA) to a physical page number (PPN). With 8 KiB pages, a block of 256 pages contains 2 MiB of data.
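The arithmetic behind these units is easy to check in a few lines. The following Python sketch uses the sizes quoted above; the flash page and block sizes are only the example values from this article and differ between devices, and the function name is purely illustrative.

```python
KIB = 1024
MIB = 1024 * KIB

def units_per_group(group_bytes, unit_bytes):
    """Return how many units fit into one group; the sizes must divide evenly."""
    if group_bytes % unit_bytes != 0:
        raise ValueError("group size is not a multiple of the unit size")
    return group_bytes // unit_bytes

# 512-byte sectors in a 4 KiB cluster
sectors_per_cluster = units_per_group(4 * KIB, 512)
# 8 KiB pages in a 2 MiB flash block (example values, device dependent)
pages_per_block = units_per_group(2 * MIB, 8 * KIB)

print(sectors_per_cluster)  # 8
print(pages_per_block)      # 256
```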

Windows NTFS

The Windows file system NTFS organises the underlying disk in clusters, whose size is called the “Allocation Unit Size”. The cluster size is the smallest amount of space a file can use. The default size is 4 KiB, and the Allocation Unit Size can be set when formatting a disk to 512, 1024, 2048, 4096, 8192, 16384, 32768 or 65536 bytes. Please follow this link if you are interested in what Microsoft recommends for the default cluster size. The last three options are shown as 16 K, 32 K and 64 K, which is another way of writing kibibytes, so please be aware that 16 K is not 16 KB (16,000 bytes) but 16384 bytes or 16 KiB (2^14). Take an example where your application constantly writes very small 512-byte files. The result would be wasted space on your NTFS file system. Let’s say your disk is formatted with an allocation unit size of 4 KiB and 10,000 files are created, in one example with 512 bytes each and in the other with 4 KiB each.

10,000 x 512 byte files = 5,000 KiB of data / 40,000 KiB used with 4 KiB Allocation Unit Size
10,000 x 4 KiB files = 40,000 KiB of data / 40,000 KiB used with 4 KiB Allocation Unit Size

In the first example you utilise 40,000 KiB for just 5,000 KiB of data simply because of the 4 KiB allocation unit size, while in the second example the 4 KiB file size matches the allocation unit size perfectly.
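To make the rounding explicit, here is a minimal Python sketch that works in bytes and converts to KiB (1024 bytes) only at the end. It is a simplified model with hypothetical function names: it ignores file system metadata and the fact that NTFS can keep very small files inside the MFT record itself, so it only illustrates the cluster rounding discussed here.

```python
KIB = 1024

def allocated_bytes(file_size, cluster_size):
    """Round a file size up to whole clusters (simplified model)."""
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size

def usage_kib(file_count, file_size, cluster_size=4 * KIB):
    """Return (data KiB, allocated KiB) for file_count equally sized files."""
    data = file_count * file_size
    allocated = file_count * allocated_bytes(file_size, cluster_size)
    return data / KIB, allocated / KIB

print(usage_kib(10_000, 512))      # (5000.0, 40000.0)
print(usage_kib(10_000, 4 * KIB))  # (40000.0, 40000.0)
```

With a 64 KiB allocation unit size the same 10,000 small files would occupy 640,000 KiB in this model, which is why very large clusters only make sense for large files.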

From a performance perspective, spinning drives can give you a benefit when, for example, your database is mostly doing 64 KiB I/Os and your allocation unit size is also 64 KiB, as one block fits into one cluster and is not distributed over many small clusters, which could fragment a single 64 KiB I/O. It is also more efficient on metadata as there is less overhead. Flash devices should not suffer a performance penalty from a 4 KiB allocation unit size, but the amount of metadata is much higher on systems with big files. In general it should not make a huge performance difference with flash. Most of the customers I talk to use the standard allocation unit size, and I am also a big fan of sticking with standards as much as possible. In my opinion, if there is no special need to change the allocation unit size, leave it at 4 KiB. To find out the volume serial number, number of sectors, allocation unit size etc. on a Windows Server, use fsutil as shown below:

C:\Windows\system32>fsutil fsinfo ntfsinfo c:

NTFS Volume Serial Number : 0x7498f02e98efed14

NTFS Version : 3.1

LFS Version : 2.0

Number Sectors : 0x000000000634f7ff

Total Clusters : 0x0000000000c69eff

Free Clusters : 0x000000000001dae3

Total Reserved : 0x0000000000000fe0

Bytes Per Sector : 512

Bytes Per Physical Sector : 512

Bytes Per Cluster : 4096

Bytes Per FileRecord Segment : 1024

Clusters Per FileRecord Segment : 0

Mft Valid Data Length : 0x0000000015fc0000

Mft Start Lcn : 0x00000000000c0000

Mft2 Start Lcn : 0x0000000000000002

Mft Zone Start : 0x00000000004f2340

Mft Zone End : 0x00000000004f2b40

Resource Manager Identifier : BC106797-18B8-11E4-A61C-E413A45A8CC7
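The hexadecimal counters in this output become more meaningful once they are multiplied by the cluster size. The following Python sketch parses a few lines copied from the output above and derives the volume size and the free space; the parser is a hypothetical helper and simply assumes the "Name : value" layout shown here.

```python
def parse_fsutil(text):
    """Turn 'Name : value' lines into a dict of stripped strings."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info

# A subset of the fsutil output shown above
SAMPLE = """\
Number Sectors : 0x000000000634f7ff
Total Clusters : 0x0000000000c69eff
Free Clusters : 0x000000000001dae3
Bytes Per Sector : 512
Bytes Per Cluster : 4096
"""

info = parse_fsutil(SAMPLE)
cluster_size = int(info["Bytes Per Cluster"])
sectors_per_cluster = cluster_size // int(info["Bytes Per Sector"])
total_bytes = int(info["Total Clusters"], 16) * cluster_size
free_bytes = int(info["Free Clusters"], 16) * cluster_size

print("sectors per cluster:", sectors_per_cluster)
print("volume size in GiB:", round(total_bytes / 2**30, 2))
print("free space in GiB:", round(free_bytes / 2**30, 2))
```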

VMFS

The Virtual Machine File System (VMFS) is a highly scalable symmetric clustered file system for hosting virtual machines on block storage. VMFS is supported on DAS (Direct Attached Storage) using a SCSI controller and disks within the server, or on shared block storage using iSCSI (Internet Small Computer Systems Interface), FC (Fibre Channel) or FCoE (Fibre Channel over Ethernet). If you want to know more about the internals of VMFS (based on VMFS-3, but a lot of the basics are still the same) please follow this link from Satyam Vaghani (VMware’s ex-CTO and PernixData’s ex-CTO). I won’t go into detail about how VMFS-3 was structured, as VMFS-5 was introduced with ESXi 5.0. I am sure not everyone has upgraded their VMFS-3 to VMFS-5 yet, but if you have not, you really should because VMFS-3 has many limitations. An upgraded VMFS-3 will not get all features, but it gets the most important ones. Please follow VMware’s KB 2003813 if you would like to know more about VMFS-3 vs. VMFS-5. In short, these are the key new features of VMFS-5 (the configuration maximums for ESXi 6.0 you find here):

Unified 1 MiB block size. Previously with VMFS-3 the file system could be created with a block size of 1, 2, 4 or 8 MiB, and the chosen block size limited the maximum size of a VMDK.
Large Single Volumes. Since VMFS-5 there is support for 64 TiB per single VMFS file system (with a maximum VMDK size of 62 TiB) vs. 2 TiB (minus 512 bytes) before.
Smaller Sub-Blocks. The sub-block is now 8 KiB rather than 64 KiB with VMFS-3, and the number of sub-blocks has been increased from 4,000 to 32,000.
Increased file count. The current version of VMFS-5 supports 130,000 files compared to about 30,000 with VMFS-3.
ATS enhancements. VMFS-5 uses ATS (Atomic Test & Set), which is part of VAAI (vSphere Storage APIs for Array Integration), for locking. Thanks to its atomic behaviour this improves locking compared to the SCSI-2 reservations used with VMFS-3.

As you can see above, since VMFS-5 the file system uses a unified 1 MiB block size which is no longer configurable, with a maximum VMDK size of 62 TiB. Very small files of less than 1 KiB are stored in the file descriptor location (also known as inode) in the metadata rather than in file blocks. Once the 1 KiB limit is reached, sub-blocks with a maximum size of 8 KiB are used, and once a file outgrows 8 KiB it is migrated to normal 1 MiB blocks. Please keep in mind that the number of sub-blocks is limited to 32,000 (4,000 with VMFS-3). Typical examples of such small files are .VMSD, .VMXF, .VMX, .NVRAM and .LOG files. There is a lot of confusion about what happens with VMDKs and whether they use 1 MiB blocks by default. Please keep in mind that the file system itself does not care what name or type a file has; it just sees the size and handles the file accordingly. Obviously most VMDKs will use 1 MiB blocks right from creation, but think about the descriptor VMDK for the flat file. This file should not get bigger than 1024 bytes, so it really makes sense that the VMDK descriptor file is stored in an inode.
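The size thresholds described above can be sketched as a small Python function. This is only an illustration: the function name is hypothetical, and how VMFS-5 treats a file of exactly 1024 or 8192 bytes is an assumption, because the text only gives the approximate limits.

```python
def vmfs5_placement(file_size):
    """Return where a file of file_size bytes would be stored on VMFS-5.

    Boundary handling at exactly 1024 and 8192 bytes is an assumption.
    """
    if file_size < 1024:
        return "inode"          # stored in the file descriptor itself
    if file_size <= 8192:
        return "sub-block"      # one 8 KiB sub-block
    return "file block"         # regular 1 MiB file blocks

for size in (608, 4000, 2621952):
    print(size, vmfs5_placement(size))
```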

So the process would be:

Smaller than 1024 bytes = file descriptor location (inode)
Between 1024 and 8192 bytes = sub-blocks
Larger than 8192 bytes = 1 MiB blocks

With vmkfstools you can find out about the usage of files and sub-blocks and many more things:

~ # vmkfstools -Pv 10 /vmfs/volumes/<your_vmfs_volume_name>/

VMFS-5.60 file system spanning 1 partitions.

File system label (if any): <your_vmfs_volume_name>

Mode: public ATS-only

Capacity 805037932544 (767744 file blocks * 1048576), 468339130368 (446643 blocks) avail, max supported file size 69201586814976

Volume Creation Time: Mon Jun 22 16:38:25 2015

Files (max/free): 130000/129472

Ptr Blocks (max/free): 64512/64009

Sub Blocks (max/free): 32000/31668

Secondary Ptr Blocks (max/free): 256/256

File Blocks (overcommit/used/overcommit %): 0/321101/0

Ptr Blocks (overcommit/used/overcommit %): 0/503/0

Sub Blocks (overcommit/used/overcommit %): 0/332/0

Volume Metadata size: 807567360

UUID: 55883a01-dd413d6a-ccee-001b21857010

Logical device: 55883a00-77a0316d-8c4d-001b21857010

Partitions spanned (on "lvm"):

naa.6001405ee3d0593d61f4d3873da453d5:1

Is Native Snapshot Capable: YES

OBJLIB-LIB: ObjLib cleanup done.

WORKER: asyncOps=0 maxActiveOps=0 maxPending=0 maxCompleted=0
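The capacity line already contains everything needed to confirm the 1 MiB block size and to work out how full the datastore is. Here is a minimal Python sketch that parses that line; the regular expression is written against the exact output format shown above and is an assumption for any other ESXi version.

```python
import re

# Pattern for the "Capacity ..." line as it appears in the output above
CAPACITY_RE = re.compile(
    r"Capacity (\d+) \((\d+) file blocks \* (\d+)\), (\d+) \((\d+) blocks\) avail"
)

def parse_capacity(line):
    """Extract capacity figures (bytes and blocks) from the capacity line."""
    match = CAPACITY_RE.search(line)
    if match is None:
        raise ValueError("not a capacity line")
    capacity, blocks, block_size, avail, avail_blocks = map(int, match.groups())
    return {
        "capacity_bytes": capacity,
        "block_size": block_size,
        "total_blocks": blocks,
        "used_blocks": blocks - avail_blocks,
        "used_percent": round(100 * (capacity - avail) / capacity, 1),
    }

line = ("Capacity 805037932544 (767744 file blocks * 1048576), "
        "468339130368 (446643 blocks) avail, max supported file size 69201586814976")
info = parse_capacity(line)
print(info["block_size"], info["used_blocks"], info["used_percent"])
```

The number of used blocks computed this way matches the "used" value in the "File Blocks" line of the same output.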

With the command find you can show the number of files and directories:

Files bigger than 1024 bytes and smaller than 8 KiB: ~ # find -size +1024c -size -8192c | wc -l
Files smaller than 1 KiB: ~ # find -size -1024c | wc -l
Directories: ~ # find -type d | wc -l

With vmkfstools -D (cd to the directory of the VM first) you can also find out the actual block size of an individual file (a non-zero owner means a host is holding a lock on the file; an owner of all zeroes means the file is currently not locked). Below you see the three files vm-flat.vmdk (flat disk), vm-ctk.vmdk (changed block tracking) and vm.vmdk (descriptor file). The flat file has a size of 40 GiB, the ctk file about 2.5 MiB and the VMDK descriptor file 608 bytes. You see different values here, but for this explanation the most important ones are “nb”, which shows the number of blocks allocated to the file, and “bs”, which stands for block size. The flat file has 17425 blocks and a block size of 1 MiB (about 17425 x 1 MiB allocated), the ctk file 3 blocks (2621952 bytes fit into 3 x 1 MiB blocks) and the VMDK descriptor file 0 blocks. Wait, why 0 blocks? Because files smaller than 1 KiB are stored in the inode itself.

~ # ls -lat *.vmdk*
-rw-------    1 root     root   42949672960 Nov 7 17:20 am1ifdc001-flat.vmdk
-rw-------    1 root     root       2621952 Nov 1 13:32 am1ifdc001-ctk.vmdk
-rw-------    1 root     root           608 Nov 1 13:32 am1ifdc001.vmdk

~ # vmkfstools -D am1ifdc001-flat.vmdk

Lock [type 10c00001 offset 189634560 v 45926, hb offset 3723264

gen 3447, mode 1, owner 5811dc4e-4f97b2d6-8112-001b21857010 mtime 282067

num 0 gblnum 0 gblgen 0 gblbrk 0]

Addr <4, 438, 131>, gen 45883, links 1, type reg, flags 0, uid 0, gid 0, mode 600 len 42949672960, nb 17425 tbz 0, cow 0, newSinceEpoch 17425, zla 3, bs 1048576

~ # vmkfstools -D am1ifdc001-ctk.vmdk

Lock [type 10c00001 offset 189646848 v 46049, hb offset 3723264

gen 3447, mode 1, owner 5811dc4e-4f97b2d6-8112-001b21857010 mtime 282071

num 0 gblnum 0 gblgen 0 gblbrk 0]

Addr <4, 438, 137>, gen 45888, links 1, type reg, flags 0, uid 0, gid 0, mode 600 len 2621952, nb 3 tbz 0, cow 0, newSinceEpoch 3, zla 1, bs 1048576

~ # vmkfstools -D am1ifdc001.vmdk

Lock [type 10c00001 offset 189636608 v 45998, hb offset 3723264

gen 3447, mode 0, owner 00000000-00000000-0000-000000000000 mtime 406842

num 0 gblnum 0 gblgen 0 gblbrk 0]

Addr <4, 438, 132>, gen 45884, links 1, type reg, flags 0, uid 0, gid 0, mode 600 len 608, nb 0 tbz 0, cow 0, newSinceEpoch 0, zla 4305, bs 8192
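To compare the three files side by side, the relevant fields can be pulled out of the last line of each vmkfstools -D output. The Python sketch below uses shortened copies of those lines and assumes that "nb" counts the blocks allocated to the file and that "bs" is the block size in bytes; the helper name is hypothetical.

```python
import re

def parse_file_info(addr_line):
    """Pull len (file size), nb (blocks) and bs (block size) out of one line."""
    values = {}
    for key in ("len", "nb", "bs"):
        match = re.search(r"\b" + key + r" (\d+)", addr_line)
        if match is None:
            raise ValueError("field %r not found" % key)
        values[key] = int(match.group(1))
    # Allocated space, assuming nb is the number of allocated blocks
    values["allocated"] = values["nb"] * values["bs"]
    return values

# Tail of the last output line for each file shown above
SAMPLES = {
    "flat": "mode 600 len 42949672960, nb 17425 tbz 0, cow 0, newSinceEpoch 17425, zla 3, bs 1048576",
    "ctk": "mode 600 len 2621952, nb 3 tbz 0, cow 0, newSinceEpoch 3, zla 1, bs 1048576",
    "descriptor": "mode 600 len 608, nb 0 tbz 0, cow 0, newSinceEpoch 0, zla 4305, bs 8192",
}

results = {name: parse_file_info(line) for name, line in SAMPLES.items()}
for name, r in results.items():
    print(name, r["len"], r["nb"], r["bs"], r["allocated"])
```

Under that assumption the 40 GiB flat file has far fewer than the 40,960 blocks a fully allocated disk would need, and the 608-byte descriptor file has no blocks allocated at all.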

It is important to understand that the I/O size the VM is using, e.g. 4 KiB, does not dictate the block size of the VMFS file system. The file descriptor has a fixed number of address slots for data blocks. As soon as the file grows beyond what a file descriptor can address, the file descriptor switches to pointer blocks and indirect addressing. Each pointer block is 4 KiB in size and holds 1024 addresses, so with a block size of 1 MiB one pointer block gives the host access to 1 GiB of data. Underneath the VMFS file system you then have a volume-based structure as well as the physical media, as you see in figure 1 at the beginning of this article. As this works differently with almost every storage vendor, I won’t go into more detail about it.
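The pointer block numbers can be verified with a little arithmetic. The following Python sketch is a simplified model with hypothetical names: it treats every pointer block as directly addressing 1024 file blocks and ignores both the address slots in the file descriptor itself and any further levels of indirection.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

POINTER_BLOCK_SIZE = 4 * KIB        # size of one pointer block
ADDRESSES_PER_POINTER_BLOCK = 1024  # addresses held by one pointer block
FILE_BLOCK_SIZE = 1 * MIB           # VMFS-5 file block size

# Each address therefore occupies 4 bytes
bytes_per_address = POINTER_BLOCK_SIZE // ADDRESSES_PER_POINTER_BLOCK
# Data reachable through a single pointer block
reach_per_pointer_block = ADDRESSES_PER_POINTER_BLOCK * FILE_BLOCK_SIZE

def pointer_blocks_needed(allocated_bytes):
    """Pointer blocks required to address allocated_bytes (simplified model)."""
    return -(-allocated_bytes // reach_per_pointer_block)  # ceiling division

print(bytes_per_address)                # 4
print(reach_per_pointer_block // GIB)   # 1
print(pointer_blocks_needed(40 * GIB))  # 40
```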

NFS

There are several ways to store virtual machine data on a given storage backend. NFS is a solid, mature, highly available and high-performing storage implementation. It was adopted rapidly in customer environments due to its combination of cost, performance and ease of management. The feature set has become similar to that of VMFS, so missing features are no longer a reason to avoid NFS. Obviously there is also nothing against using VMFS and NFS in parallel on an ESXi host or ESXi cluster. NFS is a distributed file system protocol which was originally developed by Sun Microsystems in 1984. It was a very easy way to let a system connect to storage via the network without introducing a completely new infrastructure, which would be necessary for FC-based systems. vSphere 6.0 supports two versions of NFS, the older NFS 3 and NFS 4.1, but the majority of customers are still using NFS 3 because its feature set is more complete and the main reason to use NFS 4.1 is security. NFS networks in ESXi are usually layer 2 VLANs with no direct access from outside, which is another reason to stick with NFS 3. To find out about the differences please follow this link to the VMware vSphere 6.0 Documentation Center or a very good article by vmguru.com about NFS best practices.

But as this article is about block sizes and I/O, let’s switch over to block sizes on NFS-based systems. The difference to VMFS is that VMware itself does not format the file system: the file system implementation comes from the storage vendor, and the block size is based on the native implementation of the NFS server or NFS array. As with VMFS, the block size has no dependency on the guest VM because the VMDK is simply a file on the NFS server/array. There are also no sub-blocks on NFS. As with VMFS, you can find out the block size with vmkfstools. Below you see an example where the NFS server is using a block size of 4 KiB:

~ # vmkfstools -Pv 10 /vmfs/volumes/<your_nfs_volume_name>/

NFS-1.00 file system spanning 1 partitions.

File system label (if any): <your_nfs_volume_name>

Mode: public

Capacity 536870912000 (131072000 file blocks * 4096), 194154864640 (47401090 blocks) avail, max supported file size 18446744073709551615

UUID: fcf60a16-17ceaadb-0000-000000000000

Logical device: 10.14.5.21 /mnt/<your_nfs_mount>/<your_nfs_volume_name>

Partitions spanned (on "notDCS"):

nfs:<your_nfs_volume_name>

NAS VAAI Supported: NO

Is Native Snapshot Capable: NO

OBJLIB-LIB: ObjLib cleanup done.

WORKER: asyncOps=0 maxActiveOps=0 maxPending=0 maxCompleted=0

Conclusion

I hope this blog article makes sense to you and shows that there are different levels of block sizes, that the allocation unit size really has nothing to do with the I/O a given application is doing, and that the VM itself is totally unaware of the block size of VMFS. In my opinion it makes sense to keep your environment at the default settings as much as possible and not to try to save the last few GB on your VMs by using different allocation unit sizes depending on the application you are running. In the end, standards make more sense than the last 1 % you get out of a different configuration. As always, if you have any question, recommendation or concern please contact me.