Thursday, October 20, 2016

VMware ESXi locking and how to kill a frozen VM

This blog article is about how to kill a frozen VM. Working in technical support, you often get cases where, let’s call it, “something went wrong”. The cause can be anything from storage issues to another process on the particular ESXi host still holding a lock file, as well as many other reasons. I start with the different ways to kill a VM, followed by several troubleshooting techniques: which host holds the lock, and is there possibly an APD (All Paths Down) or even a PDL (Permanent Device Loss) condition.


Let me first get into what a VM is. The following figure explains certain relevant processes within ESXi:




Figure 1: Processes in ESXi


You can classify these processes into different groups, as they come from different contexts. In the upper left you see the virtual machine running in the user world, followed by several host processes, of which just two are shown here as examples: hostd and the Shell.

ESXi Shell

It is important to understand that ESXi is not based on Linux. I haven’t heard this recently, but not so long ago quite a number of people tried to tell me that ESXi is based on Linux, just because the typical shell commands (grep, less, more, ps, cat, etc.) are available and because it behaves similarly, for example by offering SSH for remote access. If ESXi is not Linux based, where do these Linux/Unix commands come from? This is easy to explain: VMware decided to use a software implementation named “BusyBox”, a lightweight implementation of the typical shell commands. BusyBox runs as a user world process on top of the VMkernel, so it is basically similar to Cygwin on Windows. Whenever you see BusyBox in your shell, you now know why.


The command "ps" (process status) gives you a good idea of which subcomponents a particular VM has. When you have an issue with a frozen VM, it is good to know the following three things:


  • What is the status of the VM?
  • Who holds the lock on the VM?
  • Are there Commands in Flight (CIF) because of an issue in the kernel or storage stack? (I will discuss this in one of my next blog articles.)

Kill a frozen VM

There are many ways to kill a frozen VM. The following four examples show different ways to do it and help you identify the data you need to stop or kill a VM. No tool is better or worse than the others; I just want to show the variety of what is possible. After the killing part of this blog article, I proceed with some explanation of locking on VMFS and NFS.


  • Shell tools (ps & kill)
  • esxcli
  • vim-cmd
  • ESXtop
  1. Use “ps” in combination with “kill”

To find the status of a VM, use the following "ps" command (in this example the VM name is am1ifvh029). As you can see, the VM has 8 vCPUs, which explains the 8 vmm worlds (vmm0 to vmm7) and the 8 vmx-vcpu worlds (vmx-vcpu-0 to vmx-vcpu-7), in addition to vthread, mks and svga. The last column shows the Group ID (GID), which identifies the main process of the VM.


~ # ps -jv  | egrep "WID|am1ifvh029"


WID      CID      WorldName                     GID
645172   0        vmm0:am1ifvh029               645137
645174   0        vmm1:am1ifvh029               645137
645175   0        vmm2:am1ifvh029               645137
645176   0        vmm3:am1ifvh029               645137
645177   0        vmm4:am1ifvh029               645137
645178   0        vmm5:am1ifvh029               645137
645180   0        vmm6:am1ifvh029               645137
645181   0        vmm7:am1ifvh029               645137
645219   645137   vmx-vthread-13:am1ifvh029     645137
645220   645137   vmx-mks:am1ifvh029            645137
645221   645137   vmx-svga:am1ifvh029           645137
645222   645137   vmx-vcpu-0:am1ifvh029         645137
645223   645137   vmx-vcpu-1:am1ifvh029         645137
645224   645137   vmx-vcpu-2:am1ifvh029         645137
645225   645137   vmx-vcpu-3:am1ifvh029         645137
645226   645137   vmx-vcpu-4:am1ifvh029         645137
645227   645137   vmx-vcpu-5:am1ifvh029         645137
645228   645137   vmx-vcpu-6:am1ifvh029         645137
645229   645137   vmx-vcpu-7:am1ifvh029         645137
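
If you need this lookup in a script, the GID can be picked out of the ps output with awk. The following is only a minimal sketch: the function name vm_gid is my own invention, and it assumes the four-column layout (WID, CID, WorldName, GID) shown above.

```shell
# vm_gid (hypothetical helper): reads "ps -jv" output on stdin and prints
# the GID of the first world that belongs to the given VM name.
vm_gid() {
    awk -v vm="$1" '
        # WorldName looks like "vmm0:am1ifvh029"; compare the part after ":"
        { n = split($3, part, ":") }
        n == 2 && part[2] == vm { print $4; exit }
    '
}
```

Usage would be "ps -jv | vm_gid am1ifvh029". Matching the name exactly avoids picking up a second VM whose name merely starts with the same characters.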


To kill this process, use kill with the -9 option. Basically, -9 (SIGKILL) means that the kernel terminates the process without giving the process itself a chance to react. Depending on what the process is doing, this could theoretically result in data loss, and it is the hardest way of killing. You can always try “kill -1” (sends a hang-up signal to the process) or “kill -2” (comparable to pressing CTRL+C during execution) first.


~ # kill -9 645137
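
The order described above (hang-up first, then interrupt, and -9 only as a last resort) can be sketched as a small shell function. This is an illustration, not a VMware tool: the name kill_escalate, the KILL_WAIT variable and its default of 5 seconds are my own assumptions, and the sketch relies on the shell's kill accepting signal 0 as an existence check.

```shell
# kill_escalate (hypothetical helper): sends SIGHUP, then SIGINT, then SIGKILL
# to the given ID and stops as soon as the process is gone.
kill_escalate() {
    id="$1"
    for sig in 1 2 9; do
        # signal 0 sends nothing, it only checks whether the process exists
        kill -0 "$id" 2>/dev/null || return 0
        kill "-$sig" "$id" 2>/dev/null
        # give the process some time to exit (assumed default: 5 seconds)
        sleep "${KILL_WAIT:-5}"
    done
    # succeed only if the process is really gone after the last signal
    ! kill -0 "$id" 2>/dev/null
}
```

With the example above you would call "kill_escalate 645137". The function returns a non-zero status if the process survives even -9, which is usually the point where the locking section further down becomes relevant.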


Just be aware that the “kill” command does not ask for confirmation and simply kills these processes if possible. If you run the ps command again afterwards, the process ID should no longer be listed.
  2. Use “esxcli vm process”

The esxcli command is available on ESXi to manage the hypervisor at the low-level infrastructure layer. Unlike the vim-cmd command explained further down, it focuses purely on the underlying infrastructure of ESXi. It looks like a single command (esxcli), but it has a broad variety of subcommands in different namespaces. A nice thing, and a big advantage over the previously used esxcfg- commands, is that they are organized in a tree hierarchy: you can type a partial command into the Shell at any level and see all options available below it. VMware KB 2012964 lists a good selection of commands and shows how esxcli compares to vim-cmd and PowerCLI. Steve Jin has also written a very good blog article about esxcli. Before you can stop a VM process, you have to look up its World ID:


~ # esxcli vm process list | grep -i -A 4 am1ifvh029


am1ifvh029
 World ID: 645172
 Process ID: 0
 VMX Cartel ID: 645137
 UUID: 42 21 23 10 79 c5 62 80-9b 06 74 21 81 9a fc 57
 Display Name: am1ifvh029
 Config File: /vmfs/volumes/55883a14-21a51000-d5e9-001b21857010/am1ifvh029/am1ifvh029.vmx
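
For scripting, the World ID can be extracted from this output directly. This is a minimal sketch under the assumption that the listing looks like the one above (the VM name on a line of its own, followed by indented detail lines); vm_world_id is a name I made up.

```shell
# vm_world_id (hypothetical helper): reads "esxcli vm process list" output
# on stdin and prints the World ID of the VM with exactly the given name.
vm_world_id() {
    awk -v vm="$1" '
        $0 == vm             { found = 1; next }   # unindented line = VM name
        found && /World ID:/ { print $NF; exit }   # first World ID after it
    '
}
```

Usage would be "esxcli vm process list | vm_world_id am1ifvh029".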


To kill the VM you have to use the “World ID”. There are different types (--type or -t) of killing a world:


  • soft
  • hard
  • force


~ # esxcli vm process kill --type=soft --world-id=645172


If nothing happens with the “soft” kill, try “hard” and then “force”. The ps output in example 1 also shows the primary World ID; it is always the WID of vmm0.
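
The soft, hard, force order can be sketched as a loop. Treat this as an illustration under assumptions: the function names and the KILL_WAIT variable (default 10 seconds) are mine, and the check simply looks for the World ID in "esxcli vm process list" in the format shown above.

```shell
# world_running (hypothetical helper): succeeds while the World ID is listed
world_running() {
    esxcli vm process list |
        awk -v w="$1" '/World ID:/ && $NF == w { f = 1 } END { exit !f }'
}

# esxcli_kill_escalate (hypothetical helper): tries soft, hard and force
# in that order and stops as soon as the world has disappeared.
esxcli_kill_escalate() {
    for type in soft hard force; do
        world_running "$1" || return 0
        esxcli vm process kill --type="$type" --world-id="$1"
        # wait before checking again (assumed default: 10 seconds)
        sleep "${KILL_WAIT:-10}"
    done
    # succeed only if the world is gone after the last attempt
    ! world_running "$1"
}
```

For the example VM you would call "esxcli_kill_escalate 645172".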
  3. Use “vim-cmd” to kill a VM

Another tool to get the status of a VM and stop it is vim-cmd. It is built on top of “hostd”, feels very much like hostd to use, and is basically the command-line integration of the ESXi API. You can use vim-cmd for many operational tasks. Steve Jin has written another great blog article about it, like the one about esxcli mentioned above.


On ESXi, vim-cmd lives at /bin/vim-cmd, but it is actually a symbolic link to hostd itself:


~ # ls -l /bin/vim-cmd
lrwxrwxrwx    1 root     root    10 Mar  4  2016 /bin/vim-cmd -> /bin/hostd


vim-cmd has a few sub-commands. To find out what they are, just type vim-cmd in the shell:


~ # vim-cmd
Commands available under /:
hbrsvc/       internalsvc/  solo/         vmsvc/
hostsvc/      proxysvc/     vimsvc/       help


As you can see, there are 7 sub-commands (plus help). You can imagine what they are for, given how many options and how much functionality ESXi includes. If you remove the svc (service) suffix, there are basically commands for: hbr, internal, solo, vm, host, proxy, vim and help. Please keep in mind that internalsvc does not really mean that this is an internal API of ESXi.


As we want to do something with VMs, we need the “vmsvc” commands. When you run just vim-cmd vmsvc you get the following:


~ # vim-cmd vmsvc


Commands available under vmsvc/:
acquiremksticket                 get.snapshotinfo
acquireticket                    get.spaceNeededForConsolidation
connect                          get.summary
convert.toTemplate               get.tasklist
convert.toVm                     getallvms
createdummyvm                    gethostconstraints
destroy                          login
device.connection                logout
device.connusbdev                message
device.ctlradd                   power.getstate
device.ctlrremove                power.hibernate
device.disconnusbdev             power.off
device.diskadd                   power.on
device.diskaddexisting           power.reboot
device.diskremove                power.reset
device.getdevices                power.shutdown
device.toolsSyncSet              power.suspend
device.vmiadd                    power.suspendResume
device.vmiremove                 queryftcompat
devices.createnic                reload
get.capability                   setscreenres
get.config                       snapshot.create
get.config.cpuidmask             snapshot.dumpoption
get.configoption                 snapshot.get
get.datastores                   snapshot.remove
get.disabledmethods              snapshot.removeall
get.environment                  snapshot.revert
get.filelayout                   snapshot.setoption
get.filelayoutex                 tools.cancelinstall
get.guest                        tools.install
get.guestheartbeatStatus         tools.upgrade
get.managedentitystatus          unregister
get.networks                     upgrade
get.runtime


As we want to focus on how to kill a VM, we now need the Vmid of the VM:
~ # vim-cmd vmsvc/getallvms | grep -i 'vmid\|am1ifvh028' | awk '{print $1,$2}'


Vmid Name
4    am1ifvh028
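
The grep above matches substrings, so a VM called am1ifvh0281 would show up as well. A stricter lookup can be sketched with awk. This assumes that the Vmid is in the first and the name in the second column, as shown above, and that the VM name contains no spaces; the function name vm_id is hypothetical.

```shell
# vm_id (hypothetical helper): reads "vim-cmd vmsvc/getallvms" output on
# stdin and prints the Vmid of the VM whose name matches exactly.
vm_id() {
    awk -v vm="$1" '$2 == vm { print $1 }'
}
```

Usage would be "vim-cmd vmsvc/getallvms | vm_id am1ifvh028", and the result can be passed to power.getstate or power.off.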


To get the current power state please use the following command:
~ # vim-cmd vmsvc/power.getstate 4


The output shows that the VM is running:
Retrieved runtime info
Powered on


To power off a VM with vim-cmd run this command:
~ # vim-cmd vmsvc/power.off 4
  4. Use ESXtop to kill a VM

The esxtop utility lets you kill a VM interactively:
  1. Run esxtop (esxtop always starts in the CPU view. If you are in a different view please press “c” to switch to the CPU resource utilization screen).
  2. Press “Shift+v” to limit the view to virtual machines. This makes the view much easier to read, as otherwise you see all processes and might not see your VM at all.
  3. Press “f” to display the list of fields.
  4. Press “c” to add the column for the “Leader World ID”. This is needed to identify the ID you need to kill the VM.
  5. Identify the target virtual machine by its Name and Leader World ID (LWID).
  6. Press “k”.
  7. Now you see: World to kill (WID): <WID>. Type in the LWID from step 5 and press Enter.
  8. A few seconds later the process should be gone.

Locking in ESXi

But what can you do if none of the above examples help? Is the VM possibly locked by another host? If the VM is not stalled but shows an inaccessible state, it is likely that a lock prevents the VM from running on the host it is currently on. In this case none of the above examples will help, because another host is possibly not giving up its lock on the VM. It is always good to know where the VM lived in the past. With the following commands you find where the VM was registered, using the vmware.log files.


~ # find /vmfs/volumes -name <vmname>
/vmfs/volumes/<DatastoreUUID>/<vmname>


After the find, change to that directory, or run the grep with the full path from wherever you currently are:


~ # grep -i hostname vmware*
vmware-188.log:2016-08-11T14:45:26.065Z| vmx| I120: Hostname=am1ifvh004
vmware-189.log:2016-08-25T14:10:15.054Z| vmx| I120: Hostname=am1ifvh003
vmware-190.log:2016-09-02T01:39:45.934Z| vmx| I120: Hostname=am1ifvh003
vmware-191.log:2016-09-13T05:31:17.699Z| vmx| I120: Hostname=am1ifvh003
vmware-192.log:2016-09-13T15:55:42.495Z| vmx| I120: Hostname=am1ifvh003
vmware-193.log:2016-10-07T15:59:35.317Z| vmx| I120: Hostname=am1ifvh004
vmware.log:2016-10-10T17:04:38.627Z| vmx| I120: Hostname=am1ifvh003


You can see that the VM has been running on host am1ifvh003 since 2016-10-10T17:04:38.627Z.
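
If there are many log files, picking the most recent entry by eye gets tedious. A minimal sketch, assuming the grep output format shown above ("file:timestamp| vmx| ...: Hostname=name") and using a function name of my own, last_host:

```shell
# last_host (hypothetical helper): reads the grep output on stdin and prints
# the hostname belonging to the newest timestamp.
last_host() {
    # reduce each line to "<timestamp> <hostname>"; ISO timestamps sort as text
    sed -n 's/^[^:]*:\([^|]*\)|.*Hostname=\(.*\)$/\1 \2/p' |
        sort | tail -n 1 | awk '{ print $2 }'
}
```

Usage would be "grep -i hostname vmware* | last_host". Sorting by timestamp instead of file name avoids relying on the order in which the shell expands vmware*.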


Another way to find out on which datastore the VM lives is the esxcli command you already know. Use the following example on one of the hosts in your cluster that has access to the datastore and where the VM is registered. In case vCenter is down, you can use the example above to find out where the VM was registered last. There are two ways to find out where the .vmx file lives, which matters because this is the file ESXi protects with the .lck file:


  • Use esxcli to find out where the config file lives:


~ # esxcli vm process list | grep -i -A 4 <vmname> | grep -i 'Config File' | awk '{print $3}'


--> /vmfs/volumes/<DatastoreUUID>/<vmname>/<vmname>.vmx


  • When the process is already half dead, you possibly won’t get anything useful, so the lsof command may help you a bit better here.


~ # lsof | grep -i <vmname>.vmx.lck | awk '{print $NF}'


--> /vmfs/volumes/<DatastoreUUID>/<vmname>/<vmname>.vmx.lck


VMFS lock instructions

  1. The first thing you have to do is to change to the directory of the VM whose lock holder you want to find.


~# cd /vmfs/volumes/<DatastoreName or UUID>/<vmname>/


  2. Then use vmkfstools -D to find out two things:


  • Which MAC address is holding the lock
  • Which heartbeat (hb) offset the lock has


~# vmkfstools -D <vmname>.vmx.lck
Lock [type 10c00001 offset 189607936 v 46492, hb offset 3723264
gen 3377, mode 1, owner 57f7c8e2-8f5d86e3-efc8-001b21857010 mtime 110695
num 0 gblnum 0 gblgen 0 gblbrk 0]
Addr <4, 438, 118>, gen 46491, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 0, nb 0 tbz 0, cow 0, newSinceEpoch 0, zla 4305, bs 8192
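
Both values can be pulled out of this output in one go. This is only a sketch that assumes the output format shown above (the "hb offset" and "owner" keywords, wrapped over several lines); the function name lock_info is hypothetical.

```shell
# lock_info (hypothetical helper): reads "vmkfstools -D" output on stdin and
# prints "<hb offset> <owner ID>" on one line.
lock_info() {
    # join the wrapped lines first, then capture the two values
    { tr '\n' ' '; echo; } |
        sed -n 's/.*hb offset \([0-9]*\).*owner \([0-9a-f-]*\).*/\1 \2/p'
}
```

Usage would be "vmkfstools -D <vmname>.vmx.lck | lock_info". The first value is the offset you need if the owner ID turns out to be zeroed.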


  3. The first and easier option is to take the last part of the owner ID, 001b21857010, which is the MAC address of one NIC of the host holding the lock. Use "esxcli network nic list" on the Shell of each candidate host, or the NIC list in the C# vSphere Client or Web Client, to find the host that owns this MAC address and therefore currently holds the lock on the <vmname>.vmx.lck file.
~# esxcli network nic list | awk '{print $1,$8}'


Name Status
------ -----------------
vmnic0 38:63:bb:3f:19:48
vmnic1 38:63:bb:3f:19:49
vmnic2 38:63:bb:3f:19:4a
vmnic3 38:63:bb:3f:19:4b
vmnic4 00:1b:21:85:70:10
vmnic5 00:1b:21:85:70:11
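
To compare the owner ID with this list, it helps to turn its last part into the usual colon notation. A minimal sketch; the function name owner_to_mac is my own:

```shell
# owner_to_mac (hypothetical helper): prints the MAC address contained in
# the last segment of a VMFS lock owner ID.
owner_to_mac() {
    # keep only the part after the last "-", then put ":" between byte pairs
    echo "${1##*-}" | sed 's/../&:/g; s/:$//'
}
```

Usage would be "esxcli network nic list | grep -i $(owner_to_mac 57f7c8e2-8f5d86e3-efc8-001b21857010)", which in the example above matches vmnic4.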


  4. You have to use the second option if the owner ID is zeroed out. In this case we use the heartbeat (hb) offset reported for <vmname>.vmx.lck. Try the following command:


~# hexdump -C /vmfs/volumes/<datastore>/.vh.sf -n 512 -s <offset>


The datastore is the datastore where the VM resides, so please remember to go one level up to the datastore level. The offset is the hb offset value from the vmkfstools -D output (3723264 above).


  5. With the HB offset from the vmkfstools -D output filled in, the command returns the MAC address of the ESX/ESXi host holding the lock:


~# hexdump -C /vmfs/volumes/<datastore>/.vh.sf -n 512 -s 3723264


0038d000  02 ef cd ab 00 d0 38 00  00 00 00 00 31 0d 00 00  |......8.....1...|
0038d010  00 00 00 00 fa 0f e1 f5  ee 00 00 00 e2 c8 f7 57  |...............W|
0038d020  e3 86 5d 8f c8 ef 00 1b  21 85 70 10 81 d1 0c 01  |..].....!.p.....|
0038d030  0e 00 00 00 3d 04 00 00  00 00 00 00 00 00 00 00  |....=...........|
0038d040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0038d200


In the third line of the output (0038d020), the 7th to 12th bytes contain the MAC address: 00 1b 21 85 70 10.
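
Reading the bytes off the dump can also be scripted. The sketch below assumes exactly the layout shown above, with the MAC address in bytes 7 to 12 of the third line of the "hexdump -C" output; hb_mac is a hypothetical name.

```shell
# hb_mac (hypothetical helper): reads "hexdump -C" output of the heartbeat
# slot on stdin and prints bytes 7 to 12 of the third line as a MAC address.
hb_mac() {
    # field 1 is the address column, so byte 7 is field 8
    awk 'NR == 3 { printf "%s:%s:%s:%s:%s:%s\n", $8, $9, $10, $11, $12, $13 }'
}
```

Usage would be "hexdump -C /vmfs/volumes/<datastore>/.vh.sf -n 512 -s 3723264 | hb_mac".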


Then use "esxcli network nic list" again to find out which host owns this MAC address; it can be any of the physical NICs of a particular host. If you find that the VM is registered on a different host than the one holding the lock, it is advisable to migrate the VM back to the host that holds the lock and then try to start the VM. Please keep in mind that you should set DRS to manual so the VM won’t get started on another host by mistake. If none of this is possible, your last chance is to reboot the host holding the lock.


NFS lock instructions

Finding out who holds the lock works a little differently with NFS, because NFS is a file-based protocol.


  1. Navigate to the directory of the VM (you can use the same "esxcli vm process list" command as before to find out where the VM lives).


  2. Unlike on VMFS, every file in use also has a corresponding .lck file. A VM with quite a few VMDKs therefore has a good number of .lck files. So how do you find out which one is the lock for the .vmx.lck? Let’s take ".lck-3409000000000000" as an example.
~# ls -lA | grep .lck-


-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-3409000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-3d01000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-4801000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-5301000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-5e01000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-6901000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-7401000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-7f01000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-8a01000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-9501000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-a001000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-ab01000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-e201000000000000
-rwxrwxr-x    1 root     root            84 Oct 19 13:29 .lck-f208000000000000


  3. Use the hexdump command to interrogate the .lck file for the hostname:
~# hexdump -C .lck-3409000000000000


00000000  fd 79 97 00 00 00 00 00  23 01 cd ab ff ff ff ff  |.y......#.......|
00000010  01 00 00 00 61 6d 31 69  66 76 68 30 30 33 00 00  |....am1ifvh003..|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000030  00 00 00 00 57 ac 79 10  71 18 c3 9e f3 16 00 1b  |....W.y.q.......|
00000040  21 85 70 10 00 00 00 00  00 00 00 00 00 00 00 00  |!.p.............|
00000050  00 00 00 ff                                       |....|
00000054


  4. In the above example, we can see that this lock is held by am1ifvh003. It is great that we now know which host this .lck file belongs to, but we still don’t know which file ".lck-3409000000000000" protects. For that, the byte order (endianness) of the hex string in the .lck file name has to be reversed, as shown in the figure below.
Figure 2: Big Endian to Little Endian translation


  5. The next step is to convert hex to decimal. In our example, the bytes 34 09 reversed give 09 34; the rest of the name is padding zeroes, which you don’t need. So the translation is: 0x934 = 2356 (decimal).
  6. Now you can use the following command to find out which inode this refers to:
~# stat * | grep -B2 2356 | grep File
File: am1ifpt002.vmx.lck


So the first .lck- file refers to the <vmname>.vmx.lck file.


  7. You may also use this command to do the same thing automatically on a live ESX host (here the command reverses the byte order for you):


~# stat * | grep -B2 `v2=$(v1=.lck-3409000000000000;echo ${v1:13:2}${v1:11:2}${v1:9:2}${v1:7:2}${v1:5:2});printf "%d\n" 0x$v2` | grep File


File: am1ifpt002.vmx.lck
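
The byte-order reversal and the hex to decimal conversion from steps 4 and 5 can also be written out step by step, which is easier to follow than the one-liner. This is a sketch with a function name of my own, lck_to_inode; it assumes that the part after ".lck-" is an even number of hex digits and uses only plain POSIX shell features.

```shell
# lck_to_inode (hypothetical helper): converts an NFS lock file name such as
# .lck-3409000000000000 into the decimal inode number it refers to.
lck_to_inode() {
    hex="${1#.lck-}"                      # strip the ".lck-" prefix
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    rev=""
    while [ -n "$hex" ]; do
        rest="${hex#??}"                  # everything after the first byte
        byte="${hex%$rest}"               # the first byte (two hex digits)
        rev="$byte$rev"                   # prepend it, reversing the order
        hex="$rest"
    done
    printf '%d\n' "0x$rev"                # hex to decimal
}
```

Running "lck_to_inode .lck-3409000000000000" prints 2356, the same value as in step 5, which you can then feed into the stat and grep command from step 6.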

Conclusion

There are many reasons why a VM can be frozen. With the ability to find out who holds the lock, as well as different methods to kill a VM, you should be able to fix a frozen VM in many cases. Obviously there are other causes, as I explained at the beginning: a problem with the storage system, issues with SCSI reservations, bogus inode numbers caused by a bug in your storage system, etc. As always, please let me know if you like my blog, and if you have any suggestions or recommendations, please don’t hesitate to contact me.

8 comments:

  1. You forgot one other way: Reboot the ESXi server :-)

    Replies
    1. Hey Nelson :). True I concluded that people would do that in that case anyway.