Version: Atlas v4.0

Using Data Utilization

Data Utilization

The Atlas Data Utilization Element gives Splunk owners comprehensive insight into how the data in their environment is used. Use Data Utilization to identify utilization trends, discover which data sources get the most or least usage from your Splunk users, and understand how utilization correlates to your daily data ingest. Data Utilization allows for a rapid assessment of the business value a specific Splunk dataset is providing, because you can see who is using it and how it is being used.

The information provided by the Data Utilization Element is crucial for supporting optimization efforts, implementing data hygiene best practices in your Splunk environment, and helping to ensure you're getting optimal value from your data. The goal of Data Utilization is to support critical decisions, such as when to stop ingesting data that is no longer used in order to save system resources, or how to find underutilized data that could support valuable new use cases. These are just some of the use cases supported by Atlas Data Utilization.

Data Utilization Capabilities

  • Analyzes Index and Source Type utilization across ad-hoc searches, scheduled searches, and dashboards

  • Provides key metrics that identify underutilized datasets along with the volume of Splunk license they represent

  • Provides a comprehensive list of all datasets by index:sourcetype or index:sourcetype:source and their specific utilization stats

  • Provides query counts, license usage, and queries/GB distribution metrics by dataset in multiple formats

  • Identifies which users, searches, and dashboards are utilizing data along with which installed Splunk app is impacted

  • Quickly view the SPL being run against data

  • Easily identify inefficient searches being run by your Splunk users

  • Provides a comprehensive view of data utilization across your Splunk environment in a single view

Utilization Dashboard

To begin, there are selections at the top of the page to filter on dataset type, dataset, utilization time range, and license usage time range. Use these filters to isolate the resulting data to only what you are interested in.

The power of the Atlas Data Utilization Element is in its ability to quickly identify underutilized datasets, analyze their usage, and accurately determine whether a dataset should be discontinued to save on storage and license usage or, just as importantly, whether it warrants more attention because it offers organizational value that is not being extracted.

What types of utilization can and cannot be detected by Atlas Data Utilization?

Splunk does not natively log information about data utilization, which makes this information very challenging to track when you need it. Atlas tracks utilization by using pattern matching and data correlations, but in some cases utilization cannot currently be captured.

  • For full accuracy, a search must return the sourcetype field in its results. For searches that do not contain this field, Atlas calculates a best guess. This guess is accurate in most cases.

  • If the index, sourcetype, or source fields of a search are contained within a macro or are part of a data model search, Atlas is unable to identify this as utilization.

Splunk's internal logs have changed from release to release. The older the Splunk version, the less granular the logs tend to be.

  • Prior to Splunk version 8.1, Atlas can identify that a query was run from a dashboard, but not which dashboard it was run from.

  • Prior to Splunk version 9.0, Atlas is unable to detect queries from Dashboard Studio 2.0 dashboards.

Top Bar

The filters available at the top of the Data Utilization page affect the results of the entire page. The filter options are as follows:

  • Dataset Type: Allows you to choose the level at which to view utilization metrics. The options are Index, Index:Source Type, and Index:Source Type:Source.
  • Dataset: Filters the results down to specific datasets. The default value is All.
  • Utilization Time Range: The time range for which to calculate data utilization. The default value is 24 hours.
  • License Usage Time Range: The time range for which to calculate license usage. The default value is 24 hours.

Key Metrics

Below the filtering options is the Data Utilization Key Metrics panel. This gives Admins a high-level overview of Data Utilization within their Splunk environment in the form of three Key Performance Indicators (KPIs) that provide comparative counts of underutilized datasets, and how much the Splunk license is under/over-utilized in both percentage and raw gigabytes.

There are two ways to modify the view of the KPIs. The Utilization Threshold and Query Threshold filters can be used to modify the way the KPIs are calculated.

  • Utilization Threshold: Allows you to choose which dimension of utilization you want to see in the results. The options are Total Queries, Queries/Day, and Queries/10GB. This field is used in coordination with the data input field that is next to it.

    • Total Queries includes all queries run against a dataset within the selected time range
    • Queries/Day bases the determination of underutilized on the number of queries per day
    • Queries/10GB bases the determination of underutilized on the number of queries per 10GB of license usage
  • Query Threshold: Allows you to set a threshold value that makes sense for your use case. The default value is 1, but you can change it to any number that you want. For example, if you decide that 10 Total Queries in a time period is your ideal threshold to identify an underutilized dataset, the KPIs will adjust to show you anything under 10 total queries in the selected time range.

Three KPIs are included in the Key Metrics section. Each KPI is based on the options chosen in the Dataset Type, Utilization Threshold, and Query Threshold filters and on the selected time ranges. The purpose of each KPI is described as follows:

  • Datasets Underutilized: Displays how many datasets fall into the underutilized category based on the selected options.

  • License Usage Underutilized (GB): Displays how much of your license usage in GB is considered underutilized based on the selected options.

  • License Usage Underutilized (%): Displays the percentage of your license usage that is considered underutilized based on the selected options.
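
The way the thresholds and KPIs fit together can be sketched in a few lines of Python. This is only an illustration of the logic described above, not Atlas's actual implementation: the `Dataset` class, the `metric` and `key_metrics` functions, the sample numbers, and the handling of datasets with zero license usage are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    total_queries: int  # ad hoc + scheduled + dashboard queries in the time range
    license_gb: float   # license usage in GB in the license usage time range


def metric(ds: Dataset, dimension: str, days: int) -> float:
    """Return the value that is compared against the Query Threshold."""
    if dimension == "Total Queries":
        return float(ds.total_queries)
    if dimension == "Queries/Day":
        return ds.total_queries / days
    if dimension == "Queries/10GB":
        if ds.license_gb == 0:
            # Assumption: a dataset with no license usage is never flagged here.
            return float("inf")
        return ds.total_queries / (ds.license_gb / 10)
    raise ValueError(f"unknown dimension: {dimension}")


def key_metrics(datasets, dimension: str, threshold: float, days: int) -> dict:
    """Compute the three KPIs; a dataset is underutilized when it is under the threshold."""
    under = [d for d in datasets if metric(d, dimension, days) < threshold]
    under_gb = sum(d.license_gb for d in under)
    total_gb = sum(d.license_gb for d in datasets)
    return {
        "datasets_underutilized": len(under),
        "datasets_total": len(datasets),
        "license_underutilized_gb": under_gb,
        "license_underutilized_pct": 100 * under_gb / total_gb if total_gb else 0.0,
    }


# Hypothetical sample data: three indexes over a 30-day time range.
sample = [
    Dataset("main", 250, 40.0),
    Dataset("legacy_fw", 3, 50.0),
    Dataset("test", 0, 10.0),
]
kpis = key_metrics(sample, "Total Queries", threshold=10, days=30)
print(kpis)
```

With these sample values, two of the three indexes fall under the threshold of 10 total queries, and together they account for 60 GB of the 100 GB of license usage.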

A practical example of how to interpret the results displayed in Data Utilization

Assume that you want to determine how many of your indexes in Splunk have had little or no utilization in the last 30 days. You know that some old Splunk apps in your environment are likely still running scheduled queries, so you need a Query Threshold higher than the default of 1 (which flags only datasets with 0 queries). You choose a Query Threshold of 10 because anything below 10 total queries is considered underutilized in your environment. You would use the following settings:

  • The Dataset Type would be set to Index
  • The Dataset would be set to the default value of All
  • The Utilization Time Range would be set to Last 30 Days
  • The License Usage Time Range would be set to Last 30 Days
  • The Utilization Threshold would be set to Total Queries
  • The Query Threshold would be set to 10

The results will show you the following:

  • Datasets Underutilized will show you the number of indexes underutilized out of the total number of indexes found in your environment.
  • License Usage Underutilized (GB) will show you the total amount of data in GB that is underutilized.
  • License Usage Underutilized (%) will show you the percentage of your license usage that is underutilized.

Utilization Overview

The Utilization Overview panel displays utilization by dataset. These results show the total number of ad hoc, scheduled, and dashboard queries within the selected time frame. The panel also shows how much license usage each dataset accounts for. The license usage values are color coded so users can quickly identify datasets that are consuming license but are not being searched very often. Administrators might consider deprecating this data, or seek to increase its utilization to get more value from it, making better use of resources.

The Utilization by Dataset table shows how data is currently being searched in three query categories, along with two license metrics:

  • Ad Hoc Queries: This column shows how many times a dataset was searched by users running ad hoc searches
  • Scheduled Queries: This column shows how many times a dataset was searched by Scheduled Searches
  • Dashboard Queries: This column shows how many times a dataset was searched by queries built into dashboard panels
  • License Usage: The amount of license in GB that is being used by the dataset
  • Queries/GB: The ratio of Splunk queries executed to GB of license consumed by the dataset
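
As a rough illustration of the Queries/GB column, the ratio can be computed as below. The sketch assumes that the query count is the sum of the three query columns, which the description above does not state explicitly; the function name `queries_per_gb` and the sample rows are hypothetical.

```python
from typing import Optional


def queries_per_gb(ad_hoc: int, scheduled: int, dashboard: int,
                   license_gb: float) -> Optional[float]:
    """Return queries per GB of license, or None when no license usage was recorded."""
    if license_gb <= 0:
        return None
    # Assumption: all three query categories count toward the ratio.
    return (ad_hoc + scheduled + dashboard) / license_gb


# Hypothetical rows: (dataset, ad hoc, scheduled, dashboard, license GB)
rows = [
    ("web", 12, 30, 8, 25.0),
    ("legacy_fw", 1, 0, 0, 200.0),
]
ratios = {name: queries_per_gb(a, s, d, gb) for name, a, s, d, gb in rows}
print(ratios)
```

A dataset such as the hypothetical `legacy_fw` above, with a large license footprint and a very low ratio, is the kind of entry the color coding is meant to draw attention to.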

The Utilization Overview section is concluded by three visuals that show the data distribution of the three types of queries that are tracked by Data Utilization.

In the Utilization by Dataset table, you can select an item that you would like to investigate further. Once selected, the bottom half of the dashboard will populate with more detailed information about the selected item.

Investigating Queries

After selecting a specific dataset that you want to investigate further, the Investigating Queries panel will appear below the Utilization Overview section. Here, you can inspect the queries being executed against the selected dataset in detail.

The query investigation section contains the following information:

  • Users Ad Hoc Searching On Data: The number of users performing ad-hoc searches against the selected dataset
  • Scheduled Searches Querying Data: The number of scheduled searches querying the selected dataset
  • Dashboard Searches Querying Data: The number of dashboard searches querying the selected dataset
  • Timeline: A timeline visual that shows the types of queries executed against the selected dataset over the selected time range
  • Query Type Distribution: Shows the distribution of query types executed against the selected dataset over the selected time range

Clicking the numeric value for Users Ad Hoc Searching On Data, Scheduled Searches Querying Data, or Dashboard Searches Querying Data displays a table listing the corresponding items that utilized the data in the selected time range.

The last component of the Investigating Queries section is a table that shows the Splunk searches that ran against the data being investigated. This table helps you determine exactly what activity is taking place on a dataset.

This table contains the following fields:

  • Host: The host where the search was executed from.
  • Time: The time that the search was executed.
  • Type: The type of search (Dashboard, Ad Hoc, or Scheduled).
  • Provenance: The provenance of the search.
  • App: The app that the search was run from.
  • Name: The name of the search.
  • User: The user that ran the search.
  • Definite (Yes/No): If a search's results include the sourcetype field, the log will list which specific source types have been queried. For searches which lack this information, Atlas evaluates the SPL to determine which datasets have been queried. A query is definite if Atlas can determine the index and sourcetype from the SPL.
  • Query: The SPL of the actual query.

Investigating Fields

The Investigating Fields panel provides a table of all the fields found in the selected dataset, the type of field, statistical metrics, and how often each field was utilized in the three types of searches analyzed. This information is important so that you can understand the fields being searched and what types of searches they are being included in.

The Fields table contains the following fields:

  • Field: The field name included in searches.
  • Field Type: The type of field that it is.
  • Times Present: The number of times the field was present in the searches.
  • Unique Values: The number of unique values presented in the search for the field.
  • Prevalence: The prevalence of the field.
  • Ad Hoc: The number of times the field appeared in an Ad Hoc search in the selected time range.
  • Scheduled: The number of times the field appeared in a Scheduled search in the selected time range.
  • Dashboard: The number of times the field appeared in a dashboard query in the selected time range.
  • Total: The total number of occurrences of the field appearing in any type of search.