
Hi! I wanted to put something together for using the Dell EMC VMAX technical add-on (TA) & front-end app for Splunk Enterprise 6.5 (and above) to give you all a bit more information about it, setting it up, getting the data into the front-end app and getting you ready for analysing your VMAX storage systems. I will try to go over all of the functionality so there are no unanswered questions, but if I do manage to miss something, by all means let me know in the comments or by private message and I will be sure to follow up on it! I will include most of the information from the VMAX TA User Guide for Splunk, with some extra bits and pieces here and there that I feel will benefit you in setting it all up on your end.

 

About the Splunk Technical Add-On & App for VMAX

The VMAX TA allows a Splunk admin to collect inventory, performance information, and summary information from VMAX storage arrays. You can then directly analyse the data or use it as a contextual data feed to correlate with other operational or security data in Splunk Enterprise. The VMAX App for Splunk Enterprise allows admins to take the data ingested into Splunk and analyse it to gain insight into VMAX Array inventory and performance data.

 

Currently the TA is at version 1.0.1 and the app is at version 1.0; both are configured to work with VMAX-3 and All-Flash arrays using Unisphere 8.3. Support for Unisphere 8.4 is coming in the near future! Watch this space for more information when it becomes available.

 

Note: I have created a script you can run in your environment to check connectivity to Unisphere, registered VMAX-3 series arrays, performance metrics registration & timestamp confirmation. Click here to get that script.


Data Collection & Source Types

The VMAX TA provides the index-time and search-time knowledge for inventory, performance metrics, and summary information. By default, all VMAX data is indexed into the default Splunk index; this is the ‘main’ index unless changed by the admin.


The Splunk VMAX TA is configured to report events in 5-minute intervals, which is the lowest possible granularity for performance metrics reporting. Event metric values represent the value recorded at that point in time on the VMAX; values shown for an event in Splunk at 10:00am represent their respective values at 10:00am on the VMAX.


The add-on collects many different kinds of events for VMAX including performance, inventory, and summary metrics. Depending on the activity of the Port Groups & Initiators in your environment, there may be events where there are no performance metrics collected. This can be confirmed if there is a metric present in the event named ‘perf_data’ with a value of ‘false’. To limit the amount of data collected and stored on a VMAX, only active Port Groups & Initiators are reported against, so it is intended behaviour to have no performance metrics for those which have been inactive for some time.


The source type used for the Splunk Add-on for VMAX is 'dellemc:vmax:rest'. All events are in key=value pair format. Each event has an assigned 'reporting_level' which indicates the level at which the event reports, along with the associated VMAX array ID and, if reporting at a lower level, the object ID, e.g. Storage Group, Director, Host.
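As a quick illustration of the key=value format, an event can be pulled apart with a few lines of Python. This is a sketch only; the sample event below is hypothetical, not a captured event from the TA:

```python
# Illustrative only: split a key=value formatted event into a dict.
# The sample field values below are made up for demonstration purposes.
def parse_event(raw):
    fields = {}
    for pair in raw.split():
        if "=" in pair:
            key, _, value = pair.partition("=")
            fields[key] = value.strip('"')
    return fields

sample = 'reporting_level="Array" array_id="000123456789" perf_data="true"'
event = parse_event(sample)
print(event["reporting_level"])  # Array
```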


Hardware and software requirements for the Splunk TA & App for VMAX

To install and configure the VMAX TA & App, you must have Splunk admin privileges. Because this add-on runs on Splunk Enterprise, all of the Splunk Enterprise system requirements apply.


There are no specific hardware or software requirements for the VMAX TA; it points at your existing environment and Unisphere to gather metrics. For the VMAX app you will need to install two additional Splunk apps: Splunk Status Indicator and Splunk Treemap. Both are available on Splunkbase or through the Splunk web UI.


Single Instance/Distributed Environment Installations

In a distributed deployment, install the Splunk VMAX TA on your search heads and heavy forwarders. The TA requires Python, so universal forwarders are not supported. Because parsing occurs on the heavy forwarder, the add-on does not need to be installed on your indexers. The app only needs to be installed on the search heads, and requires no additional configuration.

 

For detailed single-instance and distributed installation instructions, refer to Splunk's "Installing add-ons" documentation, which describes how to install an add-on in the following deployment scenarios:

  • Single-instance Splunk Enterprise
  • Distributed Splunk Enterprise
  • Splunk Cloud
  • Splunk Light


Note: I am aware that at present the Splunk TA only allows collection of metrics from a single instance of Unisphere. To get around this, use a distributed Splunk deployment where each forwarder is pointed at a different instance of Unisphere, allowing your indexers and search heads to collect data from more than one instance of Unisphere at a time. This functionality is set to change in future iterations of the Splunk TA.


VMAX TA Installation Considerations

The add-on does not require the ability to modify VMAX configuration. It is highly recommended that you create a read-only user account with proper read capabilities in Unisphere for VMAX.


The VMAX TA works through RESTful communications between Splunk and Unisphere, so it is necessary to have Unisphere set up and running in your environment with your arrays added. I won't go into detail about REST here, but if you would like to know more about it, my colleague Paul Martin has put together a great series of blog articles on REST & VMAX to get you started. The first article in that series is 'Getting Started with the REST API'.
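To give a flavour of what that communication looks like, here is a minimal sketch of the style of call the TA makes against Unisphere. The endpoint path ('/univmax/restapi/system/symmetrix') and the 'symmetrixId' JSON key are assumptions based on the Unisphere for VMAX 8.3 REST API, so verify them against your own Unisphere version:

```python
# Sketch only: list the arrays registered with a Unisphere instance over REST.
# The endpoint path and JSON key are assumptions from the Unisphere 8.3
# REST API documentation; check them against your environment.
import base64
import json
import ssl
import urllib.request

def unisphere_url(host, port, resource):
    """Build a Unisphere REST URL for a given resource path."""
    return "https://{}:{}/univmax/restapi/{}".format(host, port, resource)

def list_arrays(host, port, user, password):
    url = unisphere_url(host, port, "system/symmetrix")
    request = urllib.request.Request(url)
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    # Unisphere ships with a self-signed cert by default; skip verification
    # for lab use only, and verify certificates properly in production.
    context = ssl._create_unverified_context()
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response).get("symmetrixId", [])
```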


Performance of data collection is dependent on many factors, such as VMAX system load, Splunk Enterprise system load, and environmental factors such as network latency.  Before any metrics can be collected from a VMAX you must also ensure that the VMAX is registered to collect performance metrics. This is enabled from within the Unisphere for VMAX Web UI. For more information on enabling performance metrics collection, please see the ‘Registering Storage Systems’ in the ‘Performance Management > Settings’ section of the ‘Unisphere for VMAX Online Help’ guide.


Installing the VMAX TA for Splunk Enterprise

Once you have Splunk set up and running in your environment, there is very little required to get the VMAX TA set up and collecting information. There are no additional requirements or dependencies, so once you have the VMAX TA downloaded from the Splunkbase website or through the app store from within Splunk you are good to go with set up! I am going to go through the process of setting up the TA first, adding VMAX arrays as data inputs afterwards, then finally setting up the VMAX app to start viewing meaningful analysis of your environment through Splunk.


1. Within the Splunk Web UI navigate to Apps > Manage Apps. Click the ‘Install App from file’ button.


Install1.png


2. Click ‘Choose File’ and select the Splunk VMAX TA. Once selected, click ‘Upload’.


Install2.png

 

3. Once the upload is complete you will be prompted to restart Splunk to complete the installation, click ‘Restart now’. After Splunk has restarted and you have logged back in, you will get an ‘Install successful’ message. Click ‘Set up now’ to proceed to configuring the Splunk VMAX TA.


Install3.png

 

4. The Splunk VMAX TA configuration screen will ask you for the following environment details:

  1. Unisphere IP Address
  2. Unisphere Port
  3. Unisphere Username
  4. Unisphere Password

Enter these details and click ‘Save’ when complete.


Install4.png


5. To add VMAX data inputs to Splunk, navigate to Settings > Data inputs > Dell EMC VMAX REST, then click either on the add-on name or ‘Add new’ to add a new VMAX data input.

 

 

Install5.png

 

 

6. When adding a VMAX data input, you will be required to enter two values:

  1. Name (A Splunk Web UI name for your own reference)
  2. VMAX Array ID (VMAX Numerical ID)

When you are ready to add the data input, click ‘Next’ to continue.

 

 

Install6.png

 

 

7. Within the ‘Dell EMC VMAX REST’ data input you will now see your VMAX listed as an input. To add another input, repeat steps 5 & 6 until all desired inputs have been added (Note: Array IDs removed for this article)


Install7.png

 

 

8. Once your VMAX data inputs have been added to the TA they will start ingesting summary and performance metrics into your specified Splunk index. To start viewing this data straight away navigate to Splunk search and load the VMAX data index chosen when adding the data inputs.

 

 

Install8.PNG.png

 

 

Installing the VMAX App for Splunk

In addition to just having a TA which ingests VMAX data into Splunk, there is a front-end app for the Splunk UI which gives you a number of dashboards to analyse your data easily. You can also take the queries from these dashboards and use them as a basis from which to build your own full-featured dashboards to monitor your entire environment!


Installing and configuring the VMAX app for Splunk is just as easy as installing the VMAX TA...


1. Download the app from Splunkbase first, then install it in the same way as the TA: navigate to Apps > Manage Apps, click ‘Install App from file’, then click ‘Choose File’ and select the VMAX app for Splunk. Once selected, click ‘Upload’.


2. (Optional) If you are using an index for VMAX data other than the default index, you will need to tell the app where to look in order for it to start analysing the data and running queries against it. To do this, on your Splunk host navigate to:

{splunk_install_location}/etc/apps/Dell-EMC-app-VMAX/default/

 

Copy all of the settings from the macros.conf file within, then create a new macros.conf file in

{splunk_install_location}/etc/apps/Dell-EMC-app-VMAX/local/

 

The macros.conf file allows us to designate specific environment settings for our app (for a full breakdown of the macros.conf file click here). Within the macros.conf file you will see a number of VMAX configuration groups, each of which has an index= value. For each VMAX configuration group, change this value to the name of the index you are using for the VMAX performance data and restart Splunk. Defining these index values in the local directory will override the settings defined in the default directory.
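As an illustration, a configuration group in your local macros.conf might end up looking like this. The macro name 'vmax_array_search' and the index name below are hypothetical; keep the group names from the default file and change only the index value:

```ini
[vmax_array_search]
definition = index=vmax_metrics sourcetype=dellemc:vmax:rest
```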

 

app1.PNG.png

 

 

To restart Splunk, use the Splunk CLI or the web interface. Once restarted, the VMAX app will read the data from your chosen index and present all of the information in the various tables, charts, and info-graphics!

 

 

app2.PNG.png

 

app3.PNG.png

 

app4.PNG.png

 

 

Troubleshooting the VMAX TA

Note: I have created a script you can run in your environment to check connectivity to Unisphere, registered VMAX-3 series arrays, performance metrics registration & timestamp confirmation. Click here to get that script.

 

To diagnose problems with your Splunk & VMAX environment, the first place to look for answers is in the log files for the TA and for Splunk itself.  The two log files can be found in $SPLUNK_HOME/var/log/splunk under the names:

ta_DellEMC_vmax_DellEMC_vmax_rest.log

splunkd.log


Before the add-on successfully runs for the first time, error logs go to splunkd.log. After the add-on successfully runs, error logs go to ta_DellEMC_vmax_DellEMC_vmax_rest.log.


The Splunk VMAX TA has been developed to give the end-user as much detail as possible about the activity of the add-on in their environment. All add-on logged events will be marked as either ‘info’ or ‘error’, depending on the nature of the event. If you are having any issues with the add-on, the logs will be able to give you precise information as to the cause of the problem. These issues could be related to, but are not limited to:

  • Incorrect add-on configuration
  • Incorrect Array ID
  • VMAX is not performance registered
  • Performance metrics timestamp is not up-to-date


Performance Data Gaps

Depending on the activity of the Port Groups & Initiators in your environment, there may be events where there are no performance metrics collected. This can be confirmed if there is a metric present in the event named ‘perf_data’ with a value of ‘false’. To limit the amount of data collected and stored on a VMAX, only active Port Groups & Initiators are reported against, so it is intended behavior to have no performance metrics for those which have been inactive for some time.


Splunkd Timeout Issues

In environments where there are VMAX storage arrays which are moderately loaded with resources such as Storage Groups, there will be occasions where Splunk cannot gather all of the required data before the splunkd service times out.

By default, this timeout value is set to 30 seconds in the Splunk configuration file ‘web.conf’. To increase this default to something more realistic for VMAX data collection, please follow these steps:

  1. Navigate to $SPLUNK_HOME/etc/system/local/
  2. Create a file called 'web.conf' and enter the following (the value of 1200 is for example purposes; the more resources on a VMAX, the higher this number may need to be):
    [settings]
    splunkdConnectionTimeout = 1200
  3. Restart Splunk once the file has been created

 

Now it's up to you!

That covers all of the required functionality to get you up and running with Splunk in your VMAX environments. As long as you have Unisphere set up and configured beforehand, there is next to no effort required to get Splunk set up with it. Once you have your environment details specified and your data inputs added, the TA does the rest of the heavy lifting and the app displays all of the data in a very neat and tidy fashion!

 

What's Next....

Both the VMAX TA and app are in their very first iterations, still at the version 1.0.x stage, so at this point in time we are open to any and all suggestions on what you believe should be included, fixed, improved, or removed in future releases! Without input from the people that use these offerings day in, day out, it is a best guess as to what we think would work best, so any feedback is always welcomed. Even if you only installed the TA to ingest data and build your own SPL (Search Processing Language) queries and dashboards, we would still very much like to hear from you! You can send me a message through the community network or, even better, e-mail the VMAX Splunk support alias at vmax.splunk.support@emc.com. Many thanks and I hope to hear from you soon!

For the last time, welcome back! Here we are, the end of the series, with almost everything covered! What wasn't will be covered today in this article. For our last look at VMAX & OpenStack Ocata we are going to delve deeper into the topic of troubleshooting. I know, I have covered troubleshooting in all of the articles so far, but I felt it could still use a dedicated article. We won't be looking at troubleshooting individual features, but instead troubleshooting the setup and configuration of your environment, and where to look if something isn't working as expected.

 

How to properly troubleshoot issues in your environment

When I am troubleshooting any kind of issue with Cinder & VMAX I always follow the same series of tasks to determine what is wrong. Almost every time I find the issue is human error on my part and a quick fix is all that is needed to get things running smoothly. When beginning the process of determining what is wrong in my environment I follow these steps:

  1. Check the Cinder logs for any indication of warning or error statements (if you are seeing incorrect behaviour for operations involving attaches to instances, be sure to check the Nova logs also!).
  2. If nothing is apparent from the standard Cinder/Nova logs, enable debug mode for the service, restart the service, and attempt the operation again. Additional debug-level reporting may provide better insight into what is going on, and in the event that you need to escalate the issue, debug-level logs will be required from your environment.
  3. After you investigate the logs, whether or not you find an indication of what the problem may be, check the configuration of your VMAX back end(s) in cinder.conf to ensure all required parameters are included and correct, and that the back end(s) are included in the enabled_backends parameter in the [DEFAULT] section.
  4. If everything appears to be correct with your back end config in cinder.conf, check your associated XML configuration files: are all required tags included and their values correct?
  5. If all configuration seems correct, is your ECOM server accessible and running?
  6. If the ECOM is fine, is there successful connectivity in your storage network between your controller and VMAX?
  7. Lastly, check that your SSL certificates are valid and imported into your distro correctly. You can also specify the path to the certificate itself; I recommend also trying this to be doubly sure.

 

Following the steps above I am usually able to fix any problem myself. As mentioned previously in the series, almost every problem encountered is a configuration issue and can be isolated and fixed using the steps above.

 

Enabling debug mode in OpenStack

To enable debug mode for any service in OpenStack, navigate to that service's .conf file in its installation directory and set the debug flag to true in the [DEFAULT] configuration group. Restart the service to see the change reflected in the service log files.
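For Cinder, for example, this means the following in cinder.conf (a minimal sketch; the rest of the file is unchanged):

```ini
[DEFAULT]
# Enables verbose debug-level logging for this service.
debug = True
```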

 

Commonly encountered configuration errors

To give you an idea of where to look or what to do if you encounter some configuration issues with the VMAX Cinder drivers, I will go over some of the most commonly met issues, how to spot them, and how to fix them. An important thing to look for in the logs when trying to diagnose an issue is some indication of the specific resource that is the problem, be it a volume type referenced in the error logs, or a service or resource which is inaccessible. Identifying this will narrow your search for the problem dramatically.

 

Note: For each of the problems and solutions below, it is necessary to restart Cinder services after the changes so they propagate through the system. If the issue is with Nova, then Nova services need to be restarted and so on. To save me writing it every time it is safe for you to assume that after each config change you will need to restart the required services so the changes take effect.

 

Misconfigured back end stanza in cinder.conf

If the back end specified in the enabled_backends parameter in the [DEFAULT] section does not match the back end configuration group name, you will get a 'failed to initialize driver' error in the Cinder volume logs:

 

StanzaWrongSpelling.PNG.png

StanzaWrongSpelling2.PNG.png

 

The screenshots above hint that a configuration group could not be found, and then tell us the affected Cinder service (cinder-volume) and the associated problematic VMAX volume type. This makes sense, as there is an inconsistency between the back end specified in the enabled_backends parameter in the [DEFAULT] section and what is defined for the VMAX configuration group name.

 

StanzaWrongSpelling3.PNG.png


Fixing this issue is easy, just rename the configuration group so that it matches what is specified in the enabled_backends parameter. Restart the Cinder services after the change and everything should be fine!


Misconfigured VMAX back end XML configuration file

There are a few things which can go wrong here, so I will cover them all in one go. First up: incorrect spelling and case-sensitivity, and the impact they have. When you specify the XML configuration file in cinder.conf, if the path is incorrect or the XML filename is wrong, you will see the error in the screenshot below in the Cinder volume logs. To fix this, either fix the path to the XML file in cinder.conf or change the name of the XML file itself so it matches what is in cinder.conf. Restart the cinder-volume service after the change to clear the error.

 

WrongXMLName1.PNG.png

 

The next most commonly encountered issue is misspelled or missing XML tags within the file. When you are creating your back end XML configuration file, ensure that you have all required tags included, spelled correctly, and with correct values for each. After fixing any problems found in your XML file, restart the cinder-volume service to clear the error. For a full and complete breakdown of the XML tags in use, check part 1 of this series on 'installation & setup', section 7 - 'Create your VMAX volume type XML configuration file'
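For reference, a complete back end XML configuration file has a shape along these lines. The tag names are taken from the Ocata-era VMAX driver documentation and the values are placeholders, so double-check both against your own environment:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<EMC>
  <!-- Placeholder values; substitute your own environment details. -->
  <EcomServerIp>1.1.1.1</EcomServerIp>
  <EcomServerPort>5989</EcomServerPort>
  <EcomUserName>admin</EcomUserName>
  <EcomPassword>password</EcomPassword>
  <PortGroups>
    <PortGroup>OS-PORTGROUP1-PG</PortGroup>
  </PortGroups>
  <Array>000123456789</Array>
  <Pool>SRP_1</Pool>
</EMC>
```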

 

Misspelling in XML tags is a bit trickier to spot from the logs but thankfully once identified it is an easy fix.  I have highlighted a few different parts of this screenshot of the logs to make things a bit clearer.

 

WrongXMLName2.PNG.png

 

We know from the top of the error log that the issue is with the volume type VMAX_ISCSI_DIAMOND, but alone that isn't enough. The next parts let us know that a call was made to gather_info from the config_file. Just below this we get an important piece of info: the parseString error is related to an XML function. From these pieces of info alone we can deduce there is a problem with the volume type XML configuration file. The last line 'ExpatError: mismatched tag: line 4, column 28' tells us the exact line and position of the error. Let's go have a look...

 

WrongXMLParam.PNG.png

 

After opening the XML config file it is immediately obvious what the issue is: there is an incorrectly defined XML tag, which has been misspelt on this occasion. Fix the misspelling of the tag and restart Cinder services to clear the error and get expected behaviour. I have removed some values from the screenshot above for security reasons; these tags should all have respective values relating to your own environment.
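The parseString behaviour above is easy to reproduce. Here is a short sketch using the same XML parser Cinder relies on; the XML fragment below is a made-up example where the closing tag's case does not match the opening tag:

```python
# Reproduces the ExpatError raised when an XML tag is mismatched.
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical fragment: </EcomServerIP> does not match <EcomServerIp>.
bad_xml = "<EMC><EcomServerIp>1.1.1.1</EcomServerIP></EMC>"

try:
    parseString(bad_xml)
except ExpatError as exc:
    print(exc)  # reports the line and column of the mismatched tag
```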

 

If you happen to incorrectly specify a parameter value, say port groups for example, you might not realise the error until later when you attempt some operations in OpenStack. Sticking with port groups as the example, if you specify an incorrect or non-existent VMAX port group in your XML file, you won't know about it until you try to attach a volume from that volume type to an instance or copy an image to it; that's because the error won't occur until we hit an operation where the port group is necessary.

 

WrongXMLParam2.PNG.png

 

Thankfully the cinder-volume logs are very good, and in this case they will tell you exactly what is going on. Fix the incorrect value and restart services to clear the error.

 

SSL Certificate Troubleshooting

Starting with Solutions Enabler 8.3, SSL-encrypted communications are enabled by default, so SSL must be configured for use in our environment. There are only a few things which can go wrong when configuring SSL, so we won't have to look too far if we get errors back about it. The main issues with SSL are:

  • SSL parameters not included in back end in cinder.conf
  • Invalid SSL certificate
  • Certificate not imported into distro correctly
  • Path to SSL certificate is invalid

 

If you don't include the required SSL parameters in your back end stanza in cinder.conf, you will see an error like the one in the screenshot below in the cinder-volume logs. The indicator that the issue is with the SSL config for your back end is the line 'CIMError: (0, "The web server returned a bad status line ''''")'. The web server in this instance is our ECOM server, which is the server we are trying to connect to using SSL certs; the bad status line, although nondescript, tells us enough to know to look at the SSL config.

 

SSLerror1-NoSettings.PNG.png

 

When checking your SSL config in cinder.conf, make sure that you have the following included for each and every VMAX back end. The driver_ssl_cert_path parameter is optional; you only need to include the direct path if you do not import the certificates directly into your system.

driver_ssl_cert_verify = True

driver_use_ssl = True

driver_ssl_cert_path = /my_location/ca_cert.pem

 

If you are having issues with certs loaded into the system, you might encounter the error below in the cinder-volume logs. There is a known issue surrounding system certs and permissions, but luckily the optional parameter driver_ssl_cert_path will clear this error for you once the Cinder services are restarted.

 

SSL_unable_to_load_cert.PNG.png

 

If you are having issues with the cert itself being verified, you will see an error in the cinder-volume logs similar to the screenshot below. It is easy to determine the issue here: 'certificate verify failed' is self-explanatory, the cert could not be verified for use with the ECOM server. To get around this I would recommend generating a new cert from the ECOM server; if that still does not fix the issue, ensure that the ECOM server specified in your XML file is the same one that you are trying to pull the certs from. Once you get the new cert from the ECOM server, either update it in your system-loaded certs or update the path to the cert in cinder.conf, and restart Cinder services for the change to take effect.

 

SSL_wrong_cert.PNG.png

 

When specifying your ECOM host name in the associated XML configuration file of your VMAX back end, it is important to remember that it is the host name that is used here and not the fully qualified domain name (FQDN). To explain a bit better, an FQDN may look like a host name but it is actually a host name and a domain name together:

 

Host name: ecom_openstack

Domain name: openstack.prod.com

FQDN: ecom_openstack.openstack.prod.com

 

If you specify the FQDN in the XML file instead of the host name, you will get an error back that the host name supplied does not match the x509 certificate's common name (example screenshot below). To fix this issue, just remove the domain name part from the XML config file, leaving only the host name of the ECOM server, and, as always, restart Cinder services for the changes to take effect.
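The split between host name and domain name from the example above can be sketched in a couple of lines of Python:

```python
# Splitting an FQDN into its host name and domain name portions.
fqdn = "ecom_openstack.openstack.prod.com"
host_name, domain_name = fqdn.split(".", 1)
print(host_name)    # ecom_openstack
print(domain_name)  # openstack.prod.com
```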

 

SSL_fqdn.PNG.png

 


SMI-S/ECOM Server Troubleshooting

When installing Solutions Enabler and the SMI-S/ECOM components, the process is fairly self-explanatory; the prompts for user input only ask if you want to change values from their recommended defaults. There isn't anything complicated about the process from start to finish.

 

Note: When installing Solutions Enabler & SMI-S, SMI-S is not set to install by default, you must explicitly choose to install this component when installing Solutions Enabler!


The ECOM is usually installed at /opt/emc/ECIM/ECOM/bin on Linux and C:\Program Files\EMC\ECIM\ECOM\bin on Windows. After you install and configure the ECOM, go to that directory and run TestSmiProvider.exe on Windows or ./TestSmiProvider on Linux. Use the dv command in TestSmiProvider to ensure your VMAX arrays have been added.

 

Note: You must discover storage arrays on the SMI-S server before you can use the VMAX drivers. Follow instructions in the SMI-S release notes.  For detailed installation & configuration instructions please see the ‘Solutions Enabler 8.3.0 Installation & Configuration Guide’ and the ‘ECOM Deployment and Configuration Guide’.

 

What can happen is that the ECOM server becomes unresponsive or does not operate correctly; in this case it is best to restart the ECOM server to see if that helps. To restart the ECOM server, use the following command on the ECOM server itself (note: the command below assumes the ECOM server is installed and run from the default install location):

 

$ cd /opt/emc/ECIM/ECOM/bin ; ./ECOM -d -c /opt/emc/ECIM/ECOM/conf

 

If, when checking the cinder-volume logs, you see a reference to ECONNREFUSED, this is typically indicative of the ECOM server being inaccessible. Examples of this error are below; the error may vary from occurrence to occurrence, but the key part of the log remains the same: the error or exception at the bottom with ECONNREFUSED referenced. If you face this issue in your environment, restart the ECOM using the command above, then check that your network connections are still working as intended and that there is communication between your controller node and ECOM server; a simple ping request can confirm this. Both screenshots below show different ways the same error can be reported in the Cinder volume logs.

 

EcomDown.PNG.png

EcomDown2.PNG.png

 

 

PyWBEM Troubleshooting

PyWBEM is the client which allows the VMAX Cinder drivers to speak to the ECOM server in order to perform system management tasks. It is installed during the configuration of your OpenStack Cinder & VMAX environment, and although there is only one step required for setup, it can on occasion produce some problems.

 

How you install PyWBEM varies depending on which version of Python you are using on your OpenStack nodes. If you are using Python 2 in your environment, please install PyWBEM 0.7.0 natively using the command:

 

Ubuntu: $ sudo apt-get install python-pywbem==0.7.0

RHEL/CentOS/Fedora: $ sudo yum install pywbem==0.7.0

OpenSUSE: $ sudo zypper install python-pywbem==0.7.0

 

If you are using Python 3, please install PyWBEM versions 0.8.4 or 0.9.0 using pip, or 0.7.0 using native package installation:

 

All: $ sudo pip install pywbem=={0.9.0/0.8.4}

Ubuntu: $ sudo apt-get install python-pywbem==0.7.0

RHEL/CentOS/Fedora: $ sudo yum install pywbem==0.7.0

OpenSUSE: $ sudo zypper install python-pywbem==0.7.0

 

Note: At the time of Ocata's release, PyWBEM 0.9.0 was the most up-to-date version available. Since then, version 0.10.0 has been released; however, this version has not been verified for use with the VMAX Cinder drivers, so we would recommend using versions 0.7.0, 0.8.4 or 0.9.0 as outlined above.
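A quick sanity check against the verified versions above can be scripted. This is a sketch only; the supported-version list is taken from this article, and pywbem is only imported if it is present on the system:

```python
# Check whether the installed PyWBEM is one of the verified versions.
# Version list taken from this article (Ocata-era VMAX Cinder drivers).
SUPPORTED = {"0.7.0", "0.8.4", "0.9.0"}

def pywbem_version_ok(version):
    """Return True if this PyWBEM version has been verified for the drivers."""
    return version in SUPPORTED

try:
    import pywbem
    print(pywbem.__version__, pywbem_version_ok(pywbem.__version__))
except ImportError:
    print("pywbem is not installed")
```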

 

If you install an incorrect version of PyWBEM for your environment, you will see an error in the cinder-volume logs similar to that in the screenshot below. PyWBEM is installed but it is not the correct version; as a result the connection is closed without returning any data (the client times out). Although this looks like any other connection error, we know it is related to PyWBEM thanks to the last trace before the error message, which specifically references the PyWBEM package.

 

PyWbemWrong.PNG.png

 

To correct this issue, run the following commands (dependent on your previous installation method):

 

Ubuntu Native: $ sudo apt-get remove --purge -y python-pywbem

RHEL/CentOS/Fedora Native: $ sudo yum remove python-pywbem

OpenSUSE Native: $ sudo zypper remove --clean-deps python-pywbem

Pip: $ sudo pip uninstall pywbem

 

Reinstall PyWBEM afterwards using the correct installation method outlined in the installation & setup guide of this series of articles.

 

When installing PyWBEM on your system, there is another package dependency of PyWBEM that is installed at the same time: M2Crypto. There is no need to get into the specifics of what this package does; what is important in the context of this article is that from time to time this dependency does not install correctly and can cause issues with PyWBEM operations. Issues with M2Crypto manifest themselves in the Cinder volume logs in such a way that it looks like PyWBEM was not installed:

 

PyWbemMissing.PNG.png

 

Fixing this issue requires M2Crypto to be completely removed (purged) from the system and reinstalled through PyWBEM again (when purging M2Crypto from your system PyWBEM will be removed along with it). Depending on how you installed PyWBEM, the method of removal will vary (apt-get remove vs. pip uninstall):

 

 

Ubuntu:

$ sudo apt-get remove --purge -y python-m2crypto

$ sudo pip uninstall pywbem

$ sudo apt-get install python-pywbem

RHEL/CentOS/Fedora:

$ sudo yum remove python-m2crypto

$ sudo pip uninstall pywbem

$ sudo yum install pywbem

OpenSUSE:

$ sudo zypper remove --clean-deps python-m2crypto

$ sudo pip uninstall pywbem

$ sudo zypper install python-pywbem

 

 

Volume Type Troubleshooting

There is the possibility of human error during the creation of volume types in OpenStack, as it is a manual process: each volume type has to be created and then given a volume_backend_name property to tie it to the back end specified in cinder.conf. There are, of course, more properties you can associate with volume types to provide additional functionality such as QoS, but I am only going to focus on setting up the volume type at its simplest level. For troubleshooting specific functionality (which may involve adding new properties to a given volume type), please see the respective article where I discuss that piece of functionality.

 

One thing that might happen is a misspelling in either the key volume_backend_name or its associated value. If the value is misspelled or not specified in cinder.conf, then when you go to create a volume in OpenStack using that volume type you will get an immediate 'error' status on the volume. Whilst it might not be immediately obvious what has happened, and with no indication in the cinder-volume logs, we can look at the cinder-scheduler logs to see what has gone wrong. From the scheduler logs it is possible to determine first that no weighed back end was found for the volume (no weighed back end means no valid or usable back end), and then, from a subsequent error message, that no valid back end was found.

 

VolumeType_WrongNoneExistent.PNG.png

 

If we dig into the configuration of the back end in this case, we find that the volume_backend_name used for the volume type is not the same as the volume_backend_name specified in the back end stanza in cinder.conf, hence the 'no valid back end found' error. To fix this problem you have one of two options: either delete the volume type and create it again with the correct key/value pair, or change the value of volume_backend_name to what it should be as specified in cinder.conf. There is no need to restart Cinder services after either change to the volume type; both take effect immediately. The same error as above will appear in the cinder-scheduler logs if volume_backend_name is not included or is misspelt; an error will be thrown notifying you that no valid back end was found.
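As a sketch, the commands below recreate a volume type and tie it to a back end. The type name VMAX_ISCSI_GOLD and backend name VMAX_ISCSI_BACKEND are hypothetical; substitute the volume_backend_name value from your own cinder.conf:

```shell
# Create a volume type and set its volume_backend_name extra spec
# (names here are hypothetical; use the value from your cinder.conf).
if command -v cinder >/dev/null 2>&1; then
    cinder type-create VMAX_ISCSI_GOLD
    cinder type-key VMAX_ISCSI_GOLD set volume_backend_name=VMAX_ISCSI_BACKEND
    cinder extra-specs-list    # verify the key/value pair on each type
else
    echo "cinder CLI not available on this host"
fi
```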

 

Network Connectivity Troubleshooting

There are two network types supported by the VMAX drivers for Cinder: iSCSI & Fibre Channel. Whilst we recommend going to your storage admin to diagnose any issues with the storage network in your environment, there are a few checks we can do beforehand to determine if there is a connectivity issue or if more troubleshooting is required. As every environment is different, we will only be going through some basic checks, mostly to test connectivity, with some additional checks for iSCSI multipathing where it has been configured for use.

 

Note: I am assuming that at this point you have set up your port group for use with OpenStack and completed any other VMAX-related configuration.

 

iSCSI Troubleshooting

Troubleshooting iSCSI environments is not difficult at the host level, as we can use commands such as ping and iscsiadm to determine if we have connectivity and can discover/log in to iSCSI targets on the VMAX. We are going to simulate an error in an iSCSI environment whereby volumes are not accessible via the provided port group. In this scenario I am trying to create a bootable volume, but when it comes to copying the image to the volume, a number of messages appear that indicate a problem before we actually see the error/exception.

 

iSCSI_error1.PNG.png

 

There are a number of exceptions which are thrown afterwards but the most relevant of these is the first. The first error message lets us know that there is a problem with the iSCSI connector, indicating a connectivity issue between the Cinder controller node and the VMAX.

 

iSCSI_error2.PNG.png

 

With the errors pointing at a connectivity issue, the next place to look is the port groups designated for use by that back end: are the port groups valid? If so, is the port status of each port in the port group 'ON'? If the ports are marked as on, you can test connectivity to them by using ping commands against the IP interfaces assigned to each iSCSI target port.

 

Note: You can only ping the VMAX iSCSI target ports when there is a valid masking view. An attach operation creates this masking view, but if you are testing iSCSI connectivity before using OpenStack, ensure you have MVs set up in advance of testing.

 

iSCSI_error3.PNG.png

 

Ping commands show that both IP interfaces on the VMAX are accessible via the storage network from the Cinder controller node, so we can deduce at this point that it is not a wider network problem; if it were, it is very likely that the ping commands would fail here and no response would be returned to the controller. With network connectivity confirmed, the next step is to test connectivity to the iSCSI targets using iscsiadm commands. The first command to test is iscsiadm discovery, whereby we check whether the target is accessible over the network.
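For reference, a discovery check looks like the following sketch; the address 10.10.10.10 is a placeholder for one of your VMAX iSCSI IP interfaces:

```shell
# Discover iSCSI targets behind a VMAX IP interface; a healthy target
# returns its IP:port and IQN. The address below is a placeholder.
if command -v iscsiadm >/dev/null 2>&1; then
    sudo iscsiadm -m discovery -t sendtargets -p 10.10.10.10:3260
else
    echo "iscsiadm not installed on this host"
fi
```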

 

iSCSI_error6.PNG.png

 

With the iSCSI target inaccessible via the IP interface, we are starting to narrow in on our issue. With the ports marked as 'on' in Unisphere, we know it isn't an issue there, so we can check the iSCSI interfaces themselves through the iSCSI dashboard. A quick check can reveal that although the ports are up, as seen in the port group, the associated iSCSI targets are not attached to an IP interface. Once these targets are attached to an IP interface, we can run the same iscsiadm commands again to test connectivity.

 

iSCSI_error7.PNG.png

 

Successful iscsiadm discovery commands will return the IP address, port, and IQN of the iSCSI target (screenshot above). We can take this one step further again to see whether it is possible to log in to the iSCSI target, confirming the functionality needed for our VMAX/OpenStack attach operations.
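A login check then looks like the following sketch; the IQN and IP:port shown are placeholders for the values returned by your discovery command:

```shell
# Log in to a discovered iSCSI target and list active sessions.
# The IQN and IP:port below are placeholders from discovery output.
if command -v iscsiadm >/dev/null 2>&1; then
    sudo iscsiadm -m node -T iqn.1992-04.com.emc:target-placeholder \
        -p 10.10.10.10:3260 --login
    sudo iscsiadm -m session    # confirm the session is established
else
    echo "iscsiadm not installed on this host"
fi
```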

 

iSCSI_error8.PNG.png

 

When we return to OpenStack now to attempt another bootable volume creation, there are no issues this time with volume connectivity or launching an instance and attaching it to the bootable volume.

 

There are other errors that may be returned when running the iscsiadm commands; one example is where the IP interfaces are up but the ports behind them are offline (found after checking the port status in the associated port group).

 

iSCSI_error4.PNG.png

 

The important thing to remember when troubleshooting iSCSI connections is the point at which the testing steps outlined here fail; that point usually indicates the underlying issue:

  1. Are the ports in the port group online? If not, enable them and try the operation again
  2. Are the IP interfaces accessible via ping commands from the controller node? If not, check the IP interfaces in the iSCSI dashboard in Unisphere, create/enable them if necessary, and try the operation again
  3. Is it possible to discover the iSCSI targets behind the IP interface using iscsiadm discovery commands? Is the target IQN returned? If not, attach the iSCSI target to the IP interface and try the operation again
  4. Is it possible to log in to the iSCSI target? Do you get a successful login notification back? If not, check the iSCSI setup with your storage admin; if the IQN is discoverable there might be restrictions on logging in

 

iSCSI Multipath Troubleshooting

When troubleshooting iSCSI multipath, the process is similar to troubleshooting standard iSCSI connections; the only difference is that, instead of just checking connectivity between hosts in your environment, there are some additional configuration checks required. Setting up multipath requires a number of packages to be installed in the environment to support the functionality, along with extra flags set on the Cinder & Nova nodes and a VMAX-specific multipath configuration file on each Nova node. I will go over the configuration checks in this section; troubleshooting the connections is exactly the same as for standard iSCSI connections.

 

The first step in the process of troubleshooting iSCSI multipath is to ensure that all the required packages are installed. Checking the packages varies from distro to distro, but to just display the minimal required info use the commands below:

 

Ubuntu: $ sudo dpkg-query -W package_name

RedHat/SLES: $ sudo rpm -q --info package_name

mpio_1.PNG.png


If the package is shown in the output of the command, it is installed successfully on the node; if you get a 'package not found' error, there is a problem with the package installation or it is not installed. Reinstall the missing package and attempt the iSCSI multipath operation again. Also, each of the packages required for iSCSI multipath needs to be installed on every node in your environment, so it is imperative that you check each node; a single node without the packages will cause multipath operations to fail.
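As a sketch for a dpkg-based (Ubuntu) node, you can loop over the required package list; the package names below are examples only, so substitute the list from the installation & setup guide:

```shell
# Report whether each required multipath package is installed on a
# dpkg-based system; the package names are illustrative examples.
if command -v dpkg-query >/dev/null 2>&1; then
    for pkg in multipath-tools sysfsutils sg3-utils; do
        if dpkg-query -W "$pkg" >/dev/null 2>&1; then
            echo "$pkg: installed"
        else
            echo "$pkg: MISSING"
        fi
    done
else
    echo "not a dpkg-based system; use rpm -q instead"
fi
```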

 

If each node has all of the required packages installed, the next step is to check the /etc/multipath.conf file, which contains VMAX-specific configuration information for multipath functionality. The contents of the multipath file are detailed in the installation & setup guide from the start of this series; check this file to make sure it matches what is in the guide. The multipath.conf file needs to be present on all Nova compute nodes in your environment; a single node without the configuration file will cause multipath operations to fail.
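As an illustrative skeleton only (use the exact contents from the installation & setup guide, which include the VMAX-specific device settings), a multipath.conf generally takes this shape:

```
defaults {
    user_friendly_names no
    find_multipaths yes
}
```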

 

In addition to the multipath configuration file being required on all Nova compute nodes, there are extra flags which must be set on all Cinder and Nova nodes. On all Nova compute nodes, add the following flag in the [libvirt] section of /etc/nova/nova.conf:

 

iscsi_use_multipath = True

 

On all Cinder controller nodes, set the multipath flag to true in the [default] section of /etc/cinder/cinder.conf:

 

use_multipath_for_image_xfer = True

 

That is it in terms of required setup for multipath; at the end of all this you should have:

  1. The required packages installed on all nodes
  2. The VMAX multipath.conf file on all Nova compute nodes
  3. The iscsi_use_multipath = True flag set in the [libvirt] section of /etc/nova/nova.conf on all Nova compute nodes
  4. The use_multipath_for_image_xfer = True flag set in cinder.conf on all Cinder controller nodes

 

Once all the required steps are complete and you have checked packages and node configurations, restart the iSCSI & OpenStack services to ensure the changes are being propagated through the environment:

 

Ubuntu:

$ service open-iscsi restart

$ service multipath-tools restart

$ service nova-compute restart

$ service cinder-volume restart

RHEL/CentOS/SLES/openSUSE:

$ systemctl restart open-iscsi

$ systemctl restart multipath-tools

$ systemctl restart nova-compute

$ systemctl restart cinder-volume

 

Once all of the configuration checks are complete and services are restarted, try to perform an operation in OpenStack in which multipath is exercised with VMAX as the storage back end. If the operation is not successful and you are still getting errors back, work through the iSCSI troubleshooting section, which goes through pinging your iSCSI IP interfaces and running iscsiadm discovery and login commands. If you find that some paths work but others do not, then you need to investigate those individual paths to see why they are unusable.
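A quick way to confirm multipath is actually in play after an attach operation is to list the multipath devices; each attached VMAX volume should show multiple active paths:

```shell
# List multipath devices and their paths; each attached volume should
# show one active path per iSCSI target in the port group.
if command -v multipath >/dev/null 2>&1; then
    sudo multipath -ll
else
    echo "multipath tools not installed on this host"
fi
```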

 

Fibre Channel (FC) SAN Troubleshooting

FC SAN troubleshooting is more complicated than troubleshooting iSCSI environments; a lot of the setup and configuration is done by the SAN admin and is thus out of the scope of this guide, as each environment is inherently different from the next. If Zone Manager is used to manage your fabric, the official OpenStack Ocata documentation on Zone Manager might provide some useful information. Apart from that, all we can check is that the FC ports are up on the VMAX and that the HBAs are up on the host and logged in to the fabric. You can find detailed information about the FC HBAs in the directory /sys/class/fc_host/:

 

fc_1.PNG.png

 

The directories host2 and host4 in the example above contain information specific to each adapter, such as node name (WWN), port name (WWN), type, speed, state, etc. Using the directory host names, we can find detailed information about the HBAs using the systool command:

 

$ systool -c fc_host -v host2

fc_3.png

 

The most important parts of the output from the systool command are 'port_state' and 'fabric_name'. The port state indicates if the HBA is offline or online, and a value in the fabric name indicates the HBA is logged in to a SAN fabric.  If the port state is offline or there is no fabric name, you need to get your SAN admin to take a closer look to determine why.
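These checks can also be scripted as a quick loop over /sys/class/fc_host (a sketch; hosts without FC HBAs will simply report that none were found):

```shell
# Print port state and fabric name for every FC HBA on this host.
found=0
for h in /sys/class/fc_host/host*; do
    [ -e "$h" ] || continue
    found=1
    echo "$(basename "$h"): state=$(cat "$h/port_state") fabric=$(cat "$h/fabric_name")"
done
[ "$found" -eq 1 ] || echo "no FC HBAs found on this host"
```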

 

Miscellaneous Issues

Oslo rpc_response_timeout

OpenStack Oslo messaging uses an open standard for messaging middleware known as AMQP. This messaging middleware (the RPC messaging system) enables the OpenStack services that run on multiple servers to talk to each other.

 

By default, the RPC messaging client is set to time out after 60 seconds, meaning that if any operation you perform takes longer than 60 seconds to complete, it will time out and fail.

 

rpc_timeout.PNG.png

 

Changing this default is very straightforward in OpenStack: you only need to change the rpc_response_timeout value in cinder.conf and nova.conf on all Cinder and Nova nodes, then restart the services for the increased timeout to take effect.
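For example, to allow operations up to five minutes, add the following to the [DEFAULT] section of both cinder.conf and nova.conf (300 is an illustrative value; tune it for your environment):

```
[DEFAULT]
rpc_response_timeout = 300
```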

 

What to change this value to will depend entirely on your own environment; you might only need to increase it slightly, or, if your environment is under heavy network load, it could need considerably more time than normal. Fine-tuning is required here: change the value, then run intensive operations to determine whether your timeout value matches your environment's requirements.

 

Nova Block Device Allocation

Another operation in OpenStack with a timeout set by default is block device allocation in Nova. Similar to rpc_response_timeout, when an operation using block device allocation exceeds the default timeout of 60 seconds, it will fail. As we are working with block storage on VMAX, this timeout may be exceeded if your environment is under heavy load or if the block device being allocated is bigger than your normal block device. This error appears with the message 'block device mapping invalid'; looking in the Nova compute logs will provide more insight into what is going on. We can see from the screenshot below that there was an issue waiting for the allotted time for block device mapping (note: I changed these values to force this error, so it is not normal to see a failure after only 2 seconds and 2 attempts).

 

block_mapping1.PNG.png

 

To increase the block device allocation default times in Nova, change the values of the following flags in nova.conf on all Nova nodes and restart Nova services afterwards for the changes to take effect.

 

block_device_allocate_retries: Number of times to retry block device allocation on failures
block_device_allocate_retries_interval: Time interval (in seconds) between consecutive retries
block_device_creation_timeout: Time in seconds to wait for a block device to be created
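As an illustrative example (the values below are placeholders to be tuned for your environment), the flags go in the [DEFAULT] section of nova.conf:

```
[DEFAULT]
block_device_allocate_retries = 120
block_device_allocate_retries_interval = 3
block_device_creation_timeout = 300
```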

 

And it's a wrap!

I would like to thank you for joining me for this series I have put together over the last few weeks. I hope I have covered everything there is in the VMAX & OpenStack ecosphere, but if there is anything you believe I have missed then let me know in the comments below or via private message and I will see what I can do! Next time around it won't be Ocata I am looking at, but the next release up in the OpenStack cycle: Pike! There are lots, and I mean lots, of changes coming in the next release, each of which I am really excited to write about and share with the world. Expect more as the time grows closer! Until then, good day!