Exploring social media data with ELK¶

The ELK (Elasticsearch, Logstash, Kibana) stack is a general-purpose framework for exploring data. It provides support for loading, querying, analysis, and visualization.

SFM provides an instance of ELK that has been customized for exploring social media data. It currently supports data from Twitter and Weibo.

One possible use for ELK is to monitor data that is being harvested to discover new seeds to select. For example, it may reveal new hashtags or users that are relevant to a collection.

Though you can use Logstash and Elasticsearch directly, in most cases you will interact exclusively with Kibana, which is the exploration interface.

Enabling ELK¶

ELK is not available by default; it must be enabled as described here.

An ELK instance is composed of three containers: an Elasticsearch container, a Logstash container, and a Kibana container. Each instance can be configured to load either all social media data or only the social media data for a single collection set.

To enable an ELK instance, add it to your docker-compose.yml and then start it with:

docker-compose up -d

Examples are provided in example.docker-compose.yml and example.prod.docker-compose.yml. They also show how to limit an instance to a single collection set by providing the collection set id.

By default, Kibana is available at http://your_hostname:5601/app/kibana. (Also, by default Elasticsearch is available on port 9200 and Logstash is available on port 5000.)

If enabling multiple ELK instances, add additional containers to your docker-compose.yml. Make sure to give each container a unique name (e.g., “elasticsearch2”), hostname: value (e.g., “sfm_es_2”), ports, cluster.name and node.name.

ELK requirements¶

For the host server:

Docker >= 1.12 is required.
The vm.max_map_count kernel setting must be at least 262144 for production use. For details on changing this setting, see the Elasticsearch documentation. If the value is too low, you will see an error like:
```
ERROR: bootstrap checks failed
max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
```
At the time of writing, there are problems running the ElasticSearch Docker container on OS X.
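
The vm.max_map_count requirement above can be checked before starting the containers. The following is a minimal sketch, assuming a Linux host that exposes the value at /proc/sys/vm/max_map_count; the function names are illustrative and not part of SFM.

```python
from pathlib import Path

# Minimum value named in the bootstrap check error message.
REQUIRED_MAX_MAP_COUNT = 262144

# Standard Linux location of this kernel setting (assumes a Linux host).
MAX_MAP_COUNT_PATH = Path("/proc/sys/vm/max_map_count")


def is_sufficient(value: int, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True if a vm.max_map_count value meets the required minimum."""
    return value >= required


def read_max_map_count(path: Path = MAX_MAP_COUNT_PATH) -> int:
    """Read the current vm.max_map_count value from the host."""
    return int(path.read_text().strip())
```

On the host, `is_sufficient(read_max_map_count())` returns False when the setting needs to be raised as described in the Elasticsearch documentation.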

Configuring ELK¶

For production use, there are a number of best practices for configuration to be aware of.

Elasticsearch¶

For a discussion of recommended configuration settings, see the ElasticSearch Docker documentation.

Use the ES_JAVA_OPTS environment variable to set the heap size; e.g., to use 2GB, set ES_JAVA_OPTS="-Xms2g -Xmx2g". It is also recommended to set a memory limit (mem_limit) for the container that is equal to or greater than the Java heap size. As a best practice, assign Elasticsearch ample memory (e.g., 6GB).
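
The relationship between the heap size and the container memory limit can be sketched as follows. This is only an illustration: heap_opts and mem_limit_ok are hypothetical helpers, not part of SFM or Elasticsearch, and sizes are whole gigabytes for simplicity.

```python
def heap_opts(heap_gb: int) -> str:
    """Build an ES_JAVA_OPTS value with equal minimum and maximum heap."""
    if heap_gb <= 0:
        raise ValueError("heap size must be a positive number of gigabytes")
    return f"-Xms{heap_gb}g -Xmx{heap_gb}g"


def mem_limit_ok(heap_gb: int, mem_limit_gb: int) -> bool:
    """Check that the container memory limit is at least the Java heap size."""
    return mem_limit_gb >= heap_gb


print(heap_opts(2))        # value for ES_JAVA_OPTS with a 2GB heap
print(mem_limit_ok(2, 6))  # a 6GB mem_limit covers a 2GB heap
```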

Kibana¶

Kibana waits for Elasticsearch to start. However, Elasticsearch may take a long time to start completely. A large wait time is set by default, but you may need to increase it. To change the wait time, set the WAIT_SECS value in the docker-compose.yml file.
For production use, set LOGGING_QUIET to true to suppress all logging output other than error messages. For development purposes, you can set the log level based on the following table:

With a large dataset, a query over a long time interval (e.g., 3 or 5 years) may fail. By default, Elasticsearch rejects search requests that would query more than 1000 shards. The error looks like:

To raise this limit, update the action.search.shard_count.limit cluster setting to a greater value, such as 2000 or more. To do this, go to the Dev Tools tab in Kibana and run the following request:

PUT _cluster/settings
{
  "persistent": {
    "action.search.shard_count.limit":2000
  }
}
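
The same setting can, in principle, be applied outside Kibana by sending the request to the Elasticsearch REST API. The sketch below uses only the Python standard library and assumes Elasticsearch is reachable on the default port 9200 without authentication; the host name and the function name are placeholders, not part of SFM.

```python
import json
import urllib.request


def build_shard_limit_request(base_url: str, limit: int) -> urllib.request.Request:
    """Build (but do not send) a PUT _cluster/settings request."""
    body = {"persistent": {"action.search.shard_count.limit": limit}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# "your_hostname" is a placeholder for the host running Elasticsearch.
request = build_shard_limit_request("http://your_hostname:9200", 2000)
print(request.get_method(), request.full_url)

# To actually apply the setting, send the request:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and body before changing a cluster-wide setting.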

Occasionally, you might encounter the following field error when opening a Kibana dashboard.

To solve this problem, click the Management tab, go to the Index Patterns page, and refresh the field list.

For details, see this discussion page.

Logstash¶

Logstash waits for Elasticsearch to start. However, Elasticsearch may take a long time to start completely. A large wait time is set by default, but you may need to increase it. To change the wait time, set the WAIT_SECS value in the docker-compose.yml file.
Limit loading to a single collection set by providing the collection set id.

X-Pack monitoring¶

To enable X-Pack monitoring, you will need to change the X-Pack environment variables to true in the configuration for the ElasticSearch and Kibana containers in docker-compose.yml.

The default value is false because enabling X-Pack involves license management, even though the monitoring feature is free under the basic license. The basic license expires after one month.

To update your license, please follow these instructions.

Loading data¶

ELK will automatically be loaded as new social media data is harvested. (Note, however, that there will be some latency between the harvest and the data being available in Kibana.)

Since only new social media data is added, it is recommended that you enable the ELK Docker container before beginning harvesting.

If you would like to load social media data that was harvested before the ELK Docker container was enabled, use the resendwarccreatedmsgs management command:

usage: manage.py resendwarccreatedmsgs [-h] [--version] [-v {0,1,2,3}]
                                       [--settings SETTINGS]
                                       [--pythonpath PYTHONPATH] [--traceback]
                                       [--no-color]
                                       [--collection-set COLLECTION_SET]
                                       [--harvest-type HARVEST_TYPE] [--test]
                                       routing_key

The resendwarccreatedmsgs command resends warc_created messages, which triggers the loading of data by ELK.

To use this command, you will need to know the routing key. The routing key is elk_loader_<hostname>.warc_created. The hostname is available as part of the definition of the ELK container in the docker-compose.yml file.

Loading can be limited by collection set (--collection-set) and/or harvest type (--harvest-type). You can get collection set ids from the collection set detail page. The available harvest types are twitter_search, twitter_filter, twitter_user_timeline, twitter_sample, and weibo_timeline.

This shows loading the data limited to a collection set:

docker exec sfm_ui_1 python sfm/manage.py resendwarccreatedmsgs --collection-set b438a62cbcf74ad0adc09be3b07f039e elk_loader_myproject_elk.warc_created
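
The pieces of this command can also be assembled programmatically, which may help when resending messages for several collection sets. This is a minimal sketch: routing_key and resend_command are hypothetical helpers, not part of SFM, and the container name sfm_ui_1 and hostname myproject_elk are taken from the example above and will differ per deployment.

```python
import shlex


def routing_key(elk_hostname: str) -> str:
    """Return the routing key for the ELK loader with the given hostname."""
    return f"elk_loader_{elk_hostname}.warc_created"


def resend_command(elk_hostname, collection_set=None, harvest_type=None,
                   ui_container="sfm_ui_1"):
    """Build the docker exec argument list for resendwarccreatedmsgs."""
    args = ["docker", "exec", ui_container, "python", "sfm/manage.py",
            "resendwarccreatedmsgs"]
    if collection_set:
        args += ["--collection-set", collection_set]
    if harvest_type:
        args += ["--harvest-type", harvest_type]
    # The routing key is the required positional argument and comes last.
    args.append(routing_key(elk_hostname))
    return args


command = resend_command("myproject_elk",
                         collection_set="b438a62cbcf74ad0adc09be3b07f039e")
print(shlex.join(command))
```

The argument list can be passed to subprocess.run directly, which avoids shell quoting issues.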

Another option, for loading data from line-oriented JSON files or WARC files, is to use a warc iterator. Warc iterators are command-line tools that can be used to prepare data for loading into ELK.

The warc iterator for the corresponding harvester has two relevant options, --elkwarc and --elkjson:

usage: twitter_rest_warc_iter.py [-h] [--pretty] [--dedupe]
                                 [--print-item-type] [--debug [DEBUG]]
                                 [--elkwarc [ELKWARC]] [--elkjson [ELKJSON]]
                                 filepaths [filepaths ...]

The elkwarc and elkjson options help you load data from WARC and JSON files directly into ELK. Here is a simple Twitter example:

Loading warc:

twitter_rest_warc_iter.py --elkwarc=true <your_warc_files> | /usr/share/logstash/bin/logstash -f stdin.conf

Loading json:

twitter_rest_warc_iter.py --elkjson=true <your_json_files> | /usr/share/logstash/bin/logstash -f stdin.conf

Overview of Kibana¶

The Kibana interface is extremely powerful. However, with that power comes complexity. The following provides an overview of some basic functions in Kibana. For more advanced usage, see the Kibana Reference or the Kibana 101: Getting Started with Visualizations video.

When you start Kibana, you probably won’t see any results.

This is because Kibana defaults to only showing data from the last 15 minutes. Use the date picker in the upper right corner to select a more appropriate time range.

Tip: At any time, you can change the date range for your query, visualization, or dashboard using the date picker.

Discover¶

The Discover tab allows you to query the social media data.

By default, all social media types are queried. To limit to a single type (e.g., tweets), click Open and select the appropriate filter.

You will now only see results for that social media type.

Notice that each social media item has a number of fields.

You can search against a field. For example, to find all tweets containing the term “archiving”:

or having the hashtag #SaveTheWeb:

or mentioning @SocialFeedMgr:

Visualize¶

The Visualize tab allows you to create visualizations of the social media data.

The types of visualizations that are supported include:

Area chart
Data table
Heatmap chart
Line chart
Markdown widget
Metric
Pie chart
Tag cloud
Tile map
Timeseries
Vertical bar chart

Describing how to create visualizations is beyond the scope of this overview.

A number of visualizations have already been created for social media data. (The available visualizations are listed at the bottom of the page.)

For example, here is the Top 10 hashtags visualization:

Dashboard¶

The Dashboard tab provides a summary view of data, bringing together multiple visualizations and searches on a single page.

A number of dashboards have already been created for social media data. To select a dashboard, click the folder icon and select the appropriate dashboard.

For example, the default Kibana dashboard is Twitter. Here is the top of the Twitter dashboard:

Caveats¶

This is experimental. We have not yet determined the level of development that will be performed in the future.
Approaches for administering and scaling ELK have not been considered.
No security or access restrictions have been put in place around ELK.
Adding X-Pack security and account management may be considered in the future.