I decided I should start writing about some of my findings in the CRL or in my homelab when they are relevant to what we do here. These posts should be useful to anyone looking to learn more about working with server software and hardware.
This is NOT meant to scare anyone off, even if sometimes these turn into miniature rants. I’ll try my best to explain my thought process and troubleshooting.
I will try to cross-post some of these on my own website at https://jtcressy.net.
Somehow, the vCenter server magically stopped starting the vpxd service. I found this out after suddenly getting 503 errors for the vpxd webserver pipe when trying to access the main web client. Starting the service manually just resulted in it crashing repeatedly. (You can SSH into the vCenter appliance with the root account and run 'service-control --start vmware-vpxd', or get its status by running 'service-control --status vmware-vpxd'.)
What I needed to do was go look at the logs. After a bit of googling, I found out that they were located in /var/log/vmware/vpxd/vpxd.log. I would recommend looking at the log with either 'less' or 'vim' so that you can scroll up as well as down. You can also look for the word 'error' by typing '?error' in vim or less; that searches backwards, so jump to the end of the file first. The error should be right at the bottom of the log, since vpxd dumps a crash report and dies. I found some key words in the error such as "pk_vpx_vm_virtual_device" right after the words "duplicate key value violates unique constraint".
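If you would rather not scroll, the same search can be done with grep. This is only a rough sketch: the function name 'find_constraint_errors' is made up for illustration, and the three lines of trailing context are a guess at how much of the crash report is useful.

```shell
# Hypothetical helper (not part of vCenter): print the most recent log lines
# that mention a unique-constraint violation, plus a little trailing context.
find_constraint_errors() {
    # Default to the vpxd log path mentioned above; pass another path to override.
    log_file="${1:-/var/log/vmware/vpxd/vpxd.log}"
    # -n prefixes line numbers, -i ignores case, -A 3 prints 3 lines after each match.
    grep -n -i -A 3 'duplicate key value violates unique constraint' "$log_file" | tail -n 20
}
```

Running 'find_constraint_errors' with no arguments on the appliance should show the constraint name (such as pk_vpx_vm_virtual_device) without opening the whole log.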
Googling this for a little while gave me this:
- This is a bug in 6.5 that is fixed if you are updated to 6.5b (I wasn't quite up to date yet, lol)
- vCenter is trying to add its inventory to the database and is failing because an entry is already there
- It is usually a result of changing a VM while vCenter is down or offline. It could even be something as small as powering a VM on or off.
- The workaround is to remove the offending VM from the inventory of the host it's residing on and re-add it later once vCenter is working
One small problem: I have no idea what VM it is.
So I looked a little further up in the log to find out what data it was trying to write to the table. I saw somewhat useless information, like what type of device it was and other variables for the device, but I wanted to find a field that linked it back to the main virtual machine. That turned out to be the ID field.
So, OK, we have a VM ID, but how do we find out what VM it is? The hosts don't keep a record of what ID a VM has in vCenter. This ID also pertains to what VMware calls a "Managed Object Reference". For example, VMRC URLs use "?moid=vm-xxx" to specify which VM to connect to over VMRC. Thus, I needed to query the database for the VM ID in that field.
While in the terminal of the appliance, you can invoke the postgres tool to manipulate the database. It allows you to connect without a password as long as you use the "postgres" username. The 'psql' binary is not in the PATH, so we have to invoke it by its full path. To connect as the postgres user, run '/opt/vmware/vpostgres/current/bin/psql -U postgres'.
Now that I'm in the interactive prompt, I can check that the VCDB database exists with '\l', then switch to it by typing '\connect VCDB'.
Query the table 'vpx_vm' (list tables with '\dt' and check my spelling) and filter for the VM's ID by running "SELECT * FROM vpx_vm WHERE id = '<vmid>';", replacing <vmid> with the ID you found in the vpxd log. In my case the ID was 784, and I was able to scroll through the output far enough to catch a glimpse of the VM name. I then checked all of my hosts for that specific VM, powered it off, and unregistered it. If you are doing this in a huge datacenter with hundreds of hosts, I feel sorry for you. Hopefully you'd have a sane HA setup that allows you to afford the time to fix it. You'd be better off calling VMware support in a deployment like that.
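The same lookup can also be run in one shot from the appliance shell instead of the interactive prompt. Treat this as a sketch under assumptions: the function names 'vm_lookup_sql' and 'lookup_vm' are made up for illustration, and it assumes the ID from the log is a plain number like the 784 in my case. The '-x' flag turns on psql's expanded display, which prints one column per line so you don't have to scroll sideways to find the VM name.

```shell
# Path to psql on the appliance, as used above.
PSQL_BIN="/opt/vmware/vpostgres/current/bin/psql"

# Hypothetical helper: build the lookup query, refusing anything non-numeric
# so that a mistyped ID never ends up inside the SQL.
vm_lookup_sql() {
    case "$1" in
        ''|*[!0-9]*) echo "VM ID must be numeric" >&2; return 1 ;;
    esac
    printf 'SELECT * FROM vpx_vm WHERE id = %s;\n' "$1"
}

# Hypothetical helper: run the query against VCDB as the postgres user.
# -d selects the database, -x enables expanded output, -c runs a single command.
lookup_vm() {
    sql="$(vm_lookup_sql "$1")" || return 1
    "$PSQL_BIN" -U postgres -d VCDB -x -c "$sql"
}
```

For example, 'lookup_vm 784' would print the row for the VM from my log.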
DNS Failover – NOT!
vCenter appears to ignore the secondary DNS server in its configuration. This means that if your first DNS server goes down, vCenter will flip out. Hosts will disconnect, services will fail to start, etc. A workaround, besides fixing that broken DNS server, is to go to the configuration at https://<yourvcenter>:5480, set the primary DNS to your other working DNS server, and FULLY REBOOT vCenter.
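Before changing anything, it helps to confirm which DNS server is actually the dead one. Here is a minimal sketch, assuming 'nslookup' is available wherever you run it and that it exits non-zero when a server does not answer; the function name 'check_dns_servers' and the addresses in the example are placeholders, not anything from my setup.

```shell
# Hypothetical helper: ask each DNS server in turn to resolve one name and
# report which servers answered.
# Usage: check_dns_servers <hostname> <server> [<server> ...]
check_dns_servers() {
    name="$1"
    shift
    for server in "$@"; do
        # nslookup takes the name to resolve first and the server to ask second.
        if nslookup "$name" "$server" >/dev/null 2>&1; then
            printf 'OK   %s\n' "$server"
        else
            printf 'FAIL %s\n' "$server"
        fi
    done
}
```

Something like 'check_dns_servers vcenter.example.com 192.0.2.1 192.0.2.2' prints one line per server, so you know which one to promote to primary.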
Another way to remedy this would be to run a number of DNS servers in a failover IP configuration, where a single IP address points to all of the servers in a cluster.
Shoutout to VMware for making us use complicated workaround setups.