VMware: vSAN Disk Group Cache Drive Dead or Error (VSAN Absent Disk)

Summary:
A cache disk failed in my host, taking the disk group along with it.  This is expected behavior, but for some reason the disk group also disappeared from the GUI, so I couldn't decommission the disk group to replace the cache drive.  So, I had to do it through PowerCLI/esxcli.  Wish I'd taken a screenshot, cause it was kind of annoying.

PowerCLI Example:
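Since the original screenshot is gone, here's a minimal sketch of what the removal looked like via Get-EsxCli.  The host name and disk group UUID are placeholders; list the disk groups first so you can confirm the UUID of the dead cache device before removing anything.

$VMHost = Get-VMHost "esxi01.example.com"
$EsxCli = Get-EsxCli -VMHost $VMHost -V2
#List vSAN claimed storage to identify the failed cache disk's vSAN UUID
$EsxCli.vsan.storage.list.Invoke()
#Remove the disk group by referencing the failed cache disk's vSAN UUID (destructive!)
$EsxCli.vsan.storage.remove.Invoke(@{uuid = "52xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"})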

Once you've deleted the offending disk group, you can create a new disk group utilizing the replacement cache disk and the former capacity disks.
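Creating the new disk group can be done from PowerCLI as well; here's a quick sketch with placeholder canonical names (pull your actual device names from the host's storage view first):

#Canonical names below are placeholders - substitute your replacement cache SSD and capacity disks
New-VsanDiskGroup -VMHost (Get-VMHost "esxi01.example.com") `
    -SsdCanonicalName "naa.5000000000000001" `
    -DataDiskCanonicalName "naa.5000000000000002", "naa.5000000000000003"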

VMware: vxlan to vxlan traffic randomly fails or only works on the same ESXi host...

Summary:
Here are the basics:

  • Leaf/Spine Architecture (basic illustration, for explanation purposes, only shows the ToRs)
  • vSphere 6.5U1 / vSAN 6.6
  • NSX 6.3.3
    • Multi-VTEP Deployment w/ LoadBalance-SRCID
    • Standard VLAN for VTEP connections.
  • 2x Nexus 9K ToRs
  • Dell R630's
Long story short, the switch vPCs were stripping the VLAN ID info before sending traffic to the peer ToR and then on to the ESXi host.  The ESXi host dropped it, causing these strange issues.  LoadBalance-SRCID w/ Multi-VTEP made this especially difficult to figure out because of the apparent randomness.  A switch vPC link has a configuration advantage, so in order to keep it, we ran additional links between the switches to make some standard trunk connections.  Once done, we configured our NSX VTEP VLAN network to traverse those trunk connections rather than the vPC.  This resolved our stripping issue.

See past the page break for tools and more details on what we (mostly VMware NSX senior support staff) did to figure this out.
[FYI: Cisco's recommendation appears to be to use vPC between switches only if the downstream host links utilize port channel (LACP) as well.  There are other factors in play in the larger scheme of the network fabric, but this is from the viewpoint of a compute engineer.]

VMware: Integrating OpenLDAP into SSO/PSC over LDAPS

Summary:
Quite simply, I was trying to get an OpenLDAP identity source added to SSO/PSC.  It would work fine using non-secure LDAP, but seemed to have issues when attempting to utilize secure LDAPS.  The error was simple and nondescript, basically just stating that it failed.

Heres what happened in my case:

  1. I had two server URLs defined for my target LDAP servers.
    • OpenLDAP Config Screen
  2. Since I had the "Protect LDAP communications..." box checked, the next step requires me to either upload the target systems' certs and their authoritative chains (think Root Certificate Authority (CA) and Intermediate CAs) or have the wizard pull them down for me.
    • If you can, uploading the needed certs would save some time; otherwise you can continue w/ my outlined steps below, assuming the spyglass icon works in the same fashion for you.
  3. The cert upload screen has a little spyglass icon that'll pull it down for you, but in my case it would only pull the primary server's cert and associated CA certs.  It would not pull the secondary for some reason.
  4. If I went forward anyway at this point, it would fail.  So I went back a screen, and flipped primary and secondary URL entries, then back to the cert upload screen and hit the spyglass icon again.
  5. Interestingly, it pulled the secondary's cert (now that it was primary) w/ the same associated CA.
  6. I deleted the duplicate CA entry, went back and flipped the primary and secondary back, and finished the wizard successfully.
    • Cert Upload Screen
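If the spyglass flat-out refuses to pull a server's cert, you can also grab it yourself and upload it manually.  Here's a minimal PowerShell sketch, assuming a placeholder hostname; the validation callback deliberately returns $true since we only want to export the cert, not trust it yet:

#Placeholder hostname - point this at the LDAP server whose cert won't pull
$LdapHost = "ldap2.mydomain.local"
$Tcp = New-Object System.Net.Sockets.TcpClient($LdapHost, 636)
#Callback returns $true so an untrusted chain won't block the handshake
$Ssl = New-Object System.Net.Security.SslStream($Tcp.GetStream(), $false, {$true})
$Ssl.AuthenticateAsClient($LdapHost)
#Export the server cert to a .cer file you can upload in the wizard
$Cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($Ssl.RemoteCertificate)
[System.IO.File]::WriteAllBytes("$PWD\$LdapHost.cer", $Cert.Export("Cert"))
$Ssl.Dispose(); $Tcp.Close()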



Misc: Fire TV Stick 2 Screen Cut Off, no display adjustment option


Summary:
Long story short, the Fire TV Stick (2nd Generation) doesn't allow you to calibrate the screen software-wise, forcing you to use your TV settings (if available) to fit the screen's content within its borders.  Super annoying for some apps, but you can fix this dumb issue as I was able to.

If your TV allows you to adjust its settings, then you'd be fine as well, but the Vizio I have has no such setting.

Workaround:
  • First you have to enable ADB debugging (step 1 here) on your fire stick.
  • Next you'll need the ADB provided w/ Android Studio, or you can install w/ brew on the Mac.
    • I prefer brew cause it's easier; keep following the steps described on the Amazon page if you are using Windows.
    • Open Terminal
    • brew cask install android-platform-tools
  • Next find the IP address of your Firestick
    • Settings --> Device --> About --> Network
Steps after ADB is installed:
  1. Connect to your fire stick using adb
    • adb connect <IPAddressofYourFireStick>
    • For example: adb connect 192.168.20.35
  2. If successful, you should see a return of something like this:
    • connected to <IPAddressofYourFireStick>:5555
  3. Now to adjust, these settings worked for my Vizio VX32L:
    • adb shell wm overscan 65,40,60,28
      • The values are the LEFT, TOP, RIGHT, and BOTTOM margins, in pixels:
      • wm overscan LEFT,TOP,RIGHT,BOTTOM
  4. To see the changes, you have to reboot the stick:
    • adb shell reboot
  5. To verify after reboot go to:
    • Settings --> Display & Sounds --> Display --> Calibrate Display
  6. Repeat steps 1-5 until your display is calibrated to your particular TV's personality.
Reference:
Last comment by AmazingNick is what helped get me on the right path.

VMware: vSphere Scheduled Tasks w/ PowerCLI (not to be confused w/ Windows scheduled tasks)


Summary:
A question was posted in the communities on how to find scheduled tasks configured against a VM.  I remembered doing it long ago, but I never posted about it.  I also found it weirdly hard to find via Google, so I'm posting it here for my own reference, or for anyone else needing it for that matter.

Example:
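The original snippet is long gone, so here's a minimal sketch of how I recall doing it via the ScheduledTaskManager view (the VM name is a placeholder):

#Grab the ScheduledTaskManager from the ServiceInstance
$SI = Get-View ServiceInstance
$STM = Get-View $SI.Content.ScheduledTaskManager
#Placeholder VM name - pull scheduled tasks registered against that entity
$VM = Get-VM "TargetVM" | Get-View
$TaskRefs = $STM.RetrieveEntityScheduledTask($VM.MoRef)
#Expand each task's info (name, next/previous run)
Get-View $TaskRefs | Select-Object @{N='Name';E={$_.Info.Name}},
                                   @{N='NextRun';E={$_.Info.NextRunTime}},
                                   @{N='PrevRun';E={$_.Info.PrevRunTime}}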

VMware: Migrating Management(Mgmt) vmk to DVS/VDS fails when moving both vmnic and vmk at the same time.

Summary:
Quite simple: I had a script to move physical nics to the DVS/VDS w/ the management vmk at the same time.  Typically this works w/o issue, but for some reason it kept failing.  The answer was dead simple...

Resolution/Workaround:
  1. Spanning Tree enabled?
    • Enable portfast on the switch ports.
  • Or
  1. Spanning Tree not available?
    1. Move one physical link at a time (assuming more than one physical link is available)
    2. Wait for the uplink on the DVS to come online, then move the management/mgmt vmk
Explanation:
Basically, the switch ports that the ESXi servers were uplinked to did not have 'portfast' (a physical switch-side config) enabled.  Without 'portfast', when moving a physical nic from a standard vSwitch to a DVS (or vice versa), the host incurs a negotiation downtime as the switch/host essentially renegotiates the connectivity.  It's a short window (5-10 sec) that the port goes 'offline', but it's enough for a simultaneous migration of vmk and physical nics to fail.

Example PowerCLI Snippet:
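The snippet itself didn't survive, so here's a minimal sketch of the one-link-at-a-time approach (host, switch, portgroup, and nic names are placeholders):

$VDS    = Get-VDSwitch "Target-VDS"
$VMHost = Get-VMHost "esxi01.example.com"
$MgmtPG = Get-VDPortgroup -VDSwitch $VDS -Name "Mgmt-PG"
$Vmk0   = Get-VMHostNetworkAdapter -VMHost $VMHost -VMKernel -Name vmk0
$Pnic0  = Get-VMHostNetworkAdapter -VMHost $VMHost -Physical -Name vmnic0
$Pnic1  = Get-VMHostNetworkAdapter -VMHost $VMHost -Physical -Name vmnic1
#Step 1: move the free uplink to the VDS by itself
Add-VDSwitchPhysicalNetworkAdapter -DistributedSwitch $VDS -VMHostPhysicalNic $Pnic1 -Confirm:$false
#Give the switch port time to renegotiate if portfast isn't enabled switch-side
Start-Sleep -Seconds 30
#Step 2: move the remaining uplink together with the management vmk
Add-VDSwitchPhysicalNetworkAdapter -DistributedSwitch $VDS -VMHostPhysicalNic $Pnic0 `
    -VMHostVirtualNic $Vmk0 -VirtualNicPortgroup $MgmtPG -Confirm:$false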

VMware: vSAN 6.6 not showing all available disks when attempting to claim...

Summary:
Was going through and attempting to set up a new vSAN cluster, but noticed that the wizard was only showing 3 of 4 disks from 3 of 4 hosts and 0 disks from another host.  This appears to be by design: the setup wizard will only target disks that have 0 partitions.  Makes sense.

This, however, is not obvious in the setup.

Solution:
Simply delete any partitions from the disks that you'd like to have vSAN claim.  You can do this en masse via PowerCLI or through the Web Client interface (as pictured below).
[Warning: This is a destructive process so be sure that you know absolutely for certain that you are targeting the correct storage devices.  This is especially true if you plan to script this process.]


Erase Partition in Web Client
The above process would suck if you were doing it against a large cluster, so learn to do it in PowerShell or some other automated method.

PowerCLI Method:
$TCluster = Get-Cluster TargetClusterName
$TVMHosts = $TCluster | Get-VMHost | Get-View
Foreach ($VMHost in $TVMHosts)
{
    $ConfigManager = Get-View $VMHost.ConfigManager.StorageSystem
    #Spec defined and left blank to clear partitions
    $Spec = New-Object VMware.Vim.HostDiskPartitionSpec
    #I'm simply targeting all naa devices that also report as local disks.
    #Reality is that you'd probably want a more in-depth filter on the devices you target.
    #My case was a new set of hosts, so this worked for me.
    $TargetDisks = $VMHost.Config.StorageDevice.ScsiLun | Where-Object {$_.DevicePath -match "naa\." -and $_.LocalDisk -eq $true}
    Foreach ($Disk in $TargetDisks)
    {
        #Applying the empty partition spec wipes the partition table on the device
        $ConfigManager.UpdateDiskPartitions($Disk.DevicePath, $Spec)
    }
}
Side Note:
vSAN-claimed disks have a protection mechanism against being erased via the above method.  Any partition it runs into that is claimed by vSAN will be met w/ an exception of "Cannot change the host configuration".
If you for some reason need to delete those partitions, then you'll likely have to try this method:

vSAN: Rebuilding an ESXi host that has vSAN claimed disks...

VMware/Security: Opvizor OpBot, cool, but scary too.

I've posted about OpBot in the past w/ a brief overview on how you can set it up and deploy it.  It's a very cool and immensely useful tool.  However, I must balance this with security: responsibly deployed, it can be a very useful tool, but there is a dark side to this from a security management perspective.  It also poses the very real risk of allowing generic internet access from within your datacenter.

First off, OpBot from Opvizor makes it very clear that you should only grant its integration account read-only access.  You can run 'destructive' PowerCLI commands by passing login info via Slack, but that's also not recommended.  As much as they have created an immensely useful tool, it is also somewhat of a Pandora's box.  It's brought to light a security hole that can be difficult to secure at scale.  Currently, Opvizor is the only one that I know of that makes this type of appliance, but that doesn't stop the many possible clones of this type of tech.

Basically, this is a method by which a malicious VMware admin could deploy said appliance, give it an elevated service account (AD or otherwise), and no one would be any the wiser.  Now to be clear, a VMware admin should never be deploying things into a datacenter w/o a proper change/audit control process.  At the very least, anything deployed should be well documented and known.

NSX helps in this aspect w/ micro-segmentation.  Everything placed into service receives a specific policy and can communicate w/ only what is needed.  However, it'll only help as far as the security is implemented.  If complete outbound internet access is open as a 'standard', then you've effectively given OpBot, or things like it, unfettered access.  The first knee-jerk reaction is likely blocking Slack connectivity unless specifically enabled for said purpose.  However, this only guards against Slack itself; it does not protect from Slack clones or the like.

Solution?:
It's not super simple, but here are some thoughts (for VMware solutions specifically):
  1. Audit/Change Control over Identity Management System (Active Directory) and whatnot.
    1. Any new service/shared account created should be immensely scrutinized.
    2. Change Auditor is a pretty good tool for this.
  2. Audit/Change Control to "Roles" in vCenter (Log Insight can help somewhat in this aspect, Hytrust CloudControl would give you a workflow engine in addition to audit capabilities.)
    1. Basically, any account granted an 'admin-type' role outside of a peer-reviewed change control system should be alerted upon.
    2. Any new role implemented should also be scrutinized for scope and alerting/monitoring put in place for 'high-risk' type roles.
    3. Any change to role permissions scrutinized as well.
  3. Audit/Change Control over passwords for 'service/shared' accounts. (Hytrust Cloud Control includes password vaulting for ESXi hosts)
    1. Password Repo such as LastPass/1Password/OneIdentity, etc.
    2. No single or group of people should actually EVER know by memory service/shared account passwords.
    3. Passwords should be changed based upon audit of password repo access when an employee leaves the company.
      • This would hopefully mitigate a time-consuming process of changing all passwords that said employee may or may not have used.
    4. Password Repo should have complete audit trail as well as alerts for specific types of access.
      1. More advanced, you could use the password repo system to change passwords automatically after a 'manual' checkout scenario.
      2. HyTrust does this for ESXi root passwords automatically.
  4. Network Security/Audit/Change Control (Palo Alto App ID Security)
    1. Subscribe to the mantra of trust nothing in or out.
    2. Peer Review all changes.
    3. Access to vCenter via NSX security policies audit/change workflow.
      1. Anything allowed access to vCenter should be audited.
    4. Palo Alto firewalls can add an extra layer of heuristics-type security to block anything not defined as allowable, beyond just ports, using something like App-ID.
Minimally, HyTrust CloudControl could mitigate a large amount of the risk of a Slack-type bot by using its workflow engine; however, none of this really matters if you don't have a proper process behind it.  It also doesn't substitute for proper Identity Management controls.

Bottom Line:
This is a trust problem, however, this is why security, auditing, and change control processes are essential.  It's not a matter of simply disallowing useful tools, such as Opbot, for the sake of security.  It's about being smart and 'knowing' what's happening in your environment so you can implement productive tools to move the business forward all while being secure and safe.

Visual Aid:

VMware: Invalid Configuration for device # when deploying OVF/OVA...

Summary:
Ran into this message when attempting to import an OVF/OVA to vCenter via Web Client from a Mac.  Not all OVA/OVF's have this issue.

Workaround(s):
  • Upload and deploy from a Windows system
    • OR
  • Upload and deploy to a local datastore if available.
    • OR
  • Use OVFTool to deploy
    • Example:
      • ovftool -ds=NameofTargetDatastore -n=NameYouWantVMtoBe --acceptAllEulas --net:bridged=NameofDVSorStdPortGroupYouWantVMattachedTo C:\Path\Turbonomic.ova vi://username%40mysubdomain.myrootdomain.suffix@vCenterNameorIP/virtualDatacenterName/host/ClusterName
        • %40 translates the @ symbol for the OVFTool if you need to authenticate using standard AD UPN or SSO domain user.
        • If Linux/Mac, replace C:\Path\Turbonomic.ova with /Path/Your.ova
        • The --net:bridged option is optional and can also be different depending on how the OVF has that parameter defined.
        • Targeting the cluster assumes DRS is enabled; go one level further down and put the hostname after the cluster name if DRS is not available.
    • OR
  • Use Import-vApp cmdlet from PowerCLI
    • Example:
      • $OVAConfig = Get-OVFConfiguration C:\Path\Turbonomic.ova
        • $OVAConfig.NetworkMapping.NAT.Config = "NameofVMPortGroup"
          • This particular setting is VERY specific to the Turbonomic OVA.  Other OVA's may have several other configurations/properties you may need to provide.
      • $TargetCluster = Get-Cluster NameofCluster
      • Import-vApp -Source C:\Path\Turbonomic.ova -VMHost ($TargetCluster | Get-VMHost | Select -First 1) -Datastore ($TargetCluster | Get-Datastore NameofDatastoreYouWant) -Name NameYouWantRegistered -OVFConfiguration $OVAConfig
        • This cmdlet requires a vmhost target, this example shows how you can target a cluster and have it deploy to first host in the cluster.
        • This demonstrates how you can target a datastore that belongs to the cluster you are targeting for deployment by name.
        • -Name is the name you want the VM registered as.
        • -OVFConfiguration passes the configuration object populated above.
Details:
Specifically, I ran into this deploying to a datastore backed by an FC array.  It ONLY fails when attempting to deploy from macOS to an FC-backed datastore using the web client.  Targeting a locally backed datastore worked fine, and I could deploy just fine from a Windows system to that same FC-backed datastore.  Seems to be a bug w/ the Mac VMware Client Integration Plugin, at least w/ the 6.0 version.


Turbonomic: Network keeps dying when using static IP...


Summary:
Deployed a new Turbonomic OVA 5.8.3 for some testing.  Logged into the appliance via console, ran 'ipsetup' as instructed w/ 'static' selected.  The VM stayed online for about 5 min. before the network died.

Workaround:
  1. Assuming DHCP is not an option, you simply need to change the 'BOOTPROTO' entry from 'static' to 'none' in the /etc/sysconfig/network-scripts/ifcfg-eth0 configuration file (see the example config after this list).
  2. You may also need to kill the dhcp client via killing network manager
    • systemctl stop NetworkManager
    • chkconfig NetworkManager off
      • "NetworkManager" is case sensitive
    • systemctl restart network
  • OR
  1. You can utilize nmtui to modify system eth0 configuration.
    • If issues persist, utilize above steps.
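For reference, here's roughly what a working ifcfg-eth0 looks like after the change; the address values are placeholders for whatever your environment uses:

#/etc/sysconfig/network-scripts/ifcfg-eth0 (example values)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
IPADDR=10.0.0.50
NETMASK=255.255.255.0
GATEWAY=10.0.0.1
DNS1=10.0.0.2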
Details:
At some point, Turbonomic likely upgraded their OS instance but failed to take into account that 'static' is no longer a valid BOOTPROTO value in the OS; it has been replaced w/ 'none'.  This seems to affect newer versions of Linux.  It will likely be fixed sooner rather than later.  Also, NetworkManager's DHCP client seems to start along with NetworkManager and kill whatever existing config there is.

Powershell: How to get REST API data in JSON format rather than XML using invoke-restmethod

Summary:
I was exploring a REST API interface for an internal tool being built.  Being that I'm so accustomed to PowerShell, I wanted to explore how I could get data from it.  The Invoke-RestMethod cmdlet is perfect for this, but I was having issues getting data back in straight JSON format.  Data kept coming back in ugly-as-hell XML format by default.

Details:
The short answer was that I needed to make a hash table to pass to the -Headers parameter of the Invoke-RestMethod cmdlet.  Basically, it looks like this:

$Headers = @{"Accept" = "application/json"}
Invoke-RestMethod -URI "https://myrestapi/endpoint" -Method:Get -Headers $Headers 

Once I did this, I received the data back in JSON format, and PowerShell automatically captures it as a System.Array object, making it immensely easier to work with than the XML return.  See the below pictures as examples of the difference.
Json returned data.

xml returned data
As you can see, the JSON return looks like any other object return from something like PowerCLI, whereas the XML return is this ridiculous mess.  Not all REST API endpoints work in the same fashion.  Some will return JSON by default, but listing "Accept" = "application/json" in your request header doesn't seem to hurt those either, unless you want a different type of return.
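If you want to poke at what came back, here's a quick sketch (the endpoint and property names are placeholders for whatever your API returns):

$Headers = @{"Accept" = "application/json"}
$Data = Invoke-RestMethod -URI "https://myrestapi/endpoint" -Method:Get -Headers $Headers
#JSON deserializes into PSCustomObjects, so normal pipeline work applies
$Data | Get-Member
$Data | Where-Object {$_.status -eq "active"} | Select-Object name, status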


VMware: ESXi 6 503 Service Unavailable endpoint: [N7Vmacore4Http16LocalServiceSpecE:0x1f098b08] _serverNamespace = / _isRedirect = false _port = 8309)

Summary:
Basically, I enabled SR-IOV on the only two pNICs I had in my ESXi host in my lab.  This doesn't necessarily cause a connectivity problem, but the ESXi management agents did not like this at all.  Meaning I could connect to my host, as evident in the error message, but the agents basically broke once SR-IOV was enabled on the only two physical uplinks I had.



Workaround:
Unfortunately, the only workaround I've found is to:

  • "Reset System Configuration" from DCUI
    • This basically brings ESXi back to the default install config.  The root password is blanked out, etc., etc.
or
  • Re-Deploy the host. 
For my testing though, I can enable it on one of the physical uplinks and work w/ that just fine, just not both in my case.  The other aspect that I didn't realize is that the CNA cards I was using lose their Fibre Channel connectivity as well.

SR-IOV effectively changed my CNA cards to NIC adapters only.

Config:
ESXi 6.0 Build 5050593
Dell FX2 - FC630 - Qlogic 57810 adapters

Chrome: On MacOS downloads .OVA files as OVF's.


Summary:
I must not download OVA's all that often.  When I do though, Chrome decides that "nah", you should rename that to OVF.  Seems to be a long-standing bug w/ Chrome and macOS.  Not really a huge problem until you try to import the OVA w/ the file extension of OVF.

Error when trying to import:
Basically the error returned when trying to import an OVA w/ OVF file extension:
"Failed to open OVF descriptor"

Workaround 1:
Stop using Chrome...Haha, just kidding.
On macOS, you simply have to enable "Show all filename extensions" in Finder.

  1. Open Finder --> Preferences --> Advanced and check "Show all filename extensions".
    • You'll usually find the Finder icon in your dock.  Probably the most under-stated/used icon in your dock.

Once done, Chrome will download OVA's as-is and not change them to OVF, like a proper browser would.  Honestly, if Chrome does it for OVA's, it "might" do it for other file extensions as well.  Firefox and Safari don't have this problem, so if you use either of those, bravo!

Workaround 2:
Uncheck the "Ask to save each file before downloading" box in Chrome's Advanced Settings.  This bypasses Chrome's macOS finder integration which seems to be the root of the problem.



Reference bug:
https://bugs.chromium.org/p/chromium/issues/detail?id=311218


PowerCLI/Powershell: vCenter Slack Bot

An OVF from Opvizor that gets deployed to any VMware environment for PowerCLI Slack integration.  Very simple deployment model.
Current Model:
  1. Appliance can currently only target one vCenter and one slack bot.
  2. Permissions are granted via account designated in OVF config. 
    • (read only recommended for obvious reasons)
    • All commands requested via the Slack bot run in the context of this account.
  3. Multiple appliances/vCenters can target one slack bot. 
    • (Ref.1 of two appliances/vcenter targeting one slack bot)
  4. Appliances can also be assigned to individually different slack bots. 
    • (Ref.2 of two different slack bots)
  5. Both models can be achieved by simply doubling up OVF deployments.  One that targets a singular bot, while the other targets a default/catch-all bot.
  6. Utilizes PowerCLI Core, so there are limitations to which PowerCLI cmdlets can be utilized, plus the same limits that PowerShell Core may have too.
References:
ref.1

ref.2
Links:


Neat Tidbit:
  1. Not yet in "Help", but there is a little neat 'alias' function.
    • @yourbotname alias badvms=posh get-view -ViewType VirtualMachine -Filter @{'RunTime.ConnectionState'='disconnected|inaccessible|invalid|orphaned'} | select name
    • @yourbotname badvms
  2. This will run your PowerCLI/Powershell line of code by simply passing your alias. (ref.3)
ref.3

Turbonomic/VMTurbo: Testing target port connectivity from appliance (network troubleshooting tools)


Summary:
Attempting to troubleshoot IP and port connectivity issues on a Turbonomic appliance is a bit difficult.  The target configuration 'target status' doesn't really give enough information, and the default toolset in the appliance's SSH session doesn't provide telnet, traceroute, or netcat.  As long as your appliance has internet access, you can install these tools fairly easily, though.

Details:
Quite simply, assuming nothing changes later, the appliance runs openSUSE.  You can make use of zypper to install the additional tools needed, such as netcat, telnet, and traceroute:
zypper install netcat-openbsd
Usage of netcat is the same as ESXi which you can reference here or use 'man netcat'.
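For example, a quick TCP port test against a hypothetical vCenter target would look like this (netcat-openbsd supports -z for scan-only and -v for verbose output):
nc -v -z 192.168.10.20 443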

Installation of telnet and traceroute is a bit more straightforward:
zypper install telnet
zypper install traceroute
Notes:
  1. These tools simply give you an idea of connectivity from the appliance's perspective.
  2. Traceroute can help you determine if anything in between is preventing your connection, such as a hardware firewall.
  3. Telnet provides the same functionality as netcat, albeit it's more obvious when a successful connection is made.
  4. I recommend uninstalling the tools once you are done, just to maintain the system's integrity.