Exploring social media data with ELK¶
The ELK (Elasticsearch, Logstash, Kibana) stack is a general-purpose framework for exploring data. It provides support for loading, querying, analysis, and visualization.
SFM provides an instance of ELK that has been customized for exploring social media data. It currently supports data from Twitter and Weibo.
One possible use for ELK is to monitor data that is being harvested to discover new seeds to select. For example, it may reveal new hashtags or users that are relevant to a collection.
Though you can use Logstash and Elasticsearch directly, in most cases you will interact exclusively with Kibana, which is the exploration interface.
Enabling ELK¶
ELK is not available by default; it must be enabled as described here.
An ELK instance is composed of 3 containers: an Elasticsearch container, a Logstash container, and a Kibana container. Each instance can be configured to be loaded with all social media data or only the social media data for a single collection set.
To enable an ELK instance, it must be added to your docker-compose.yml and then started by:
docker-compose up -d
An example is provided in example.docker-compose.yml and example.prod.docker-compose.yml. These examples also show how to limit to a single collection set by providing the collection set id.
By default, Kibana is available at http://your_hostname:5601/app/kibana. (Also, by default Elasticsearch is available on port 9200 and Logstash is available on port 5000.)
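After starting the containers, it can be useful to confirm that the three default ports listed above are accepting connections. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of SFM, and the hostname is a placeholder you would replace with your own.

```python
import socket

# Default ports as stated above; adjust if your docker-compose.yml maps them differently.
DEFAULT_PORTS = {"kibana": 5601, "elasticsearch": 9200, "logstash": 5000}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_elk(host, ports=None):
    """Map each service name to whether its port accepts connections."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling check_elk("your_hostname") returns a dictionary such as one entry per service with True or False. An open port only shows that the container is listening; Elasticsearch and Kibana may still be initializing.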
If enabling multiple ELK instances, add additional containers to your docker-compose.yml. Make sure to give each container a unique name (e.g., “elasticsearch2”), hostname: value (e.g., “sfm_es_2”), ports, cluster.name, and node.name.
ELK requirements¶
For the host server:
Docker >= 1.12 is required.
The vm.max_map_count kernel setting needs to be set to at least 262144 for production use. For details on changing this setting, please see the Elasticsearch documentation. If it is too low, you will see an error like:
ERROR: bootstrap checks failed max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
At the time of writing, there are problems running the ElasticSearch Docker container on OS X.
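The vm.max_map_count requirement can be checked on the host before starting the containers. This is a minimal sketch that assumes a Linux host, where the current value is exposed at /proc/sys/vm/max_map_count; the helper names are illustrative only.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # minimum stated in the requirements above
PROC_PATH = Path("/proc/sys/vm/max_map_count")  # Linux-only location of the setting


def parse_max_map_count(text):
    """Parse the contents of the proc file (a single integer plus newline)."""
    return int(text.strip())


def is_sufficient(value, required=REQUIRED_MAX_MAP_COUNT):
    """Return True if the value meets the Elasticsearch bootstrap check."""
    return value >= required


def read_max_map_count(path=PROC_PATH):
    """Read the current kernel setting from the host."""
    return parse_max_map_count(path.read_text())
```

If is_sufficient(read_max_map_count()) is False, the value can be raised on the host with sysctl -w vm.max_map_count=262144, run as root.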
Configuring ELK¶
For production use, there are a number of best practices for configuration to be aware of.
Elasticsearch¶
For a discussion of recommended configuration settings, see the ElasticSearch Docker documentation.
Use the ES_JAVA_OPTS environment variable to set the heap size, e.g., to use 2GB, set ES_JAVA_OPTS="-Xms2g -Xmx2g". It is also recommended to set a memory limit (mem_limit) for the container that is equal to or greater than the Java heap size. As a best practice, assign enough memory (e.g., 6GB) to Elasticsearch.
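The relationship between the heap size and the container memory limit can be sanity-checked before deploying. The sketch below is an illustration only: the helpers are hypothetical, and it assumes sizes are written with a k, m, or g suffix (optionally followed by b), as in the examples above.

```python
import re

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert sizes such as '2g', '512m', or '6GB' to a number of bytes."""
    match = re.fullmatch(r"(\d+)([kmg])b?", text.strip().lower())
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def heap_from_java_opts(java_opts):
    """Return the (Xms, Xmx) heap sizes in bytes from an ES_JAVA_OPTS string."""
    xms = re.search(r"-Xms(\S+)", java_opts)
    xmx = re.search(r"-Xmx(\S+)", java_opts)
    if not (xms and xmx):
        raise ValueError("both -Xms and -Xmx must be present")
    return parse_size(xms.group(1)), parse_size(xmx.group(1))


def mem_limit_ok(java_opts, mem_limit):
    """Return True if mem_limit is equal to or greater than the maximum heap."""
    _, xmx = heap_from_java_opts(java_opts)
    return parse_size(mem_limit) >= xmx
```

For example, mem_limit_ok("-Xms2g -Xmx2g", "6g") passes, while a 1g limit with the same heap setting does not.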
Kibana¶
- Kibana waits for Elasticsearch to start. However, it may take a long time for Elasticsearch to start completely. By default, a large wait time has been set, but you may find it necessary to make it even larger. To change the wait time, check the docker-compose.yml file and set WAIT_SECS to the desired value.
- For production use, set LOGGING_QUIET to true to suppress all logging output other than error messages. For development purposes, you can set the log level based on the following table:
- With a large dataset, you might encounter an error when running a query over a large time interval, e.g., 3 or 5 years. By default, Elasticsearch rejects search requests that would query more than 1000 shards. The error would be like:
To bypass this limit, update the action.search.shard_count.limit cluster setting to a greater value, such as 2000 or more. To do this, go to the Dev Tools tab in Kibana and run the following code:
PUT _cluster/settings
{
  "persistent": {
    "action.search.shard_count.limit": 2000
  }
}
- Occasionally, you might encounter the following field error when opening a Kibana dashboard.
To solve this problem, click the Management tab, go to the Index Patterns page, and refresh the field list. For details, see this discussion page.
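The shard limit change described above can also be sent to Elasticsearch directly instead of through the Kibana Dev Tools tab. This is a minimal sketch using the Python standard library; it assumes Elasticsearch is reachable on its default port without authentication, and the function names are illustrative.

```python
import json
import urllib.request


def build_shard_limit_request(es_url, limit):
    """Build a PUT request for the _cluster/settings endpoint."""
    body = json.dumps(
        {"persistent": {"action.search.shard_count.limit": limit}}
    ).encode("utf-8")
    return urllib.request.Request(
        url=es_url.rstrip("/") + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def apply_shard_limit(es_url, limit, timeout=10):
    """Send the request and return the parsed JSON response from Elasticsearch."""
    request = build_shard_limit_request(es_url, limit)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as apply_shard_limit("http://your_hostname:9200", 2000) would apply the same persistent setting as the Dev Tools request.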
Logstash¶
- Logstash waits for Elasticsearch to start. However, it may take a long time for Elasticsearch to start completely. By default, a large wait time has been set, but you may find it necessary to make it even larger. To change the wait time, check the docker-compose.yml file and set WAIT_SECS to the desired value.
- Limit to a single collection set by providing the collection set id.
X-Pack monitoring¶
To enable X-Pack monitoring, you will need to change the X-Pack environment variables to true in the configuration for the ElasticSearch and Kibana containers in docker-compose.yml.
The default value is false because enabling X-Pack involves license management, even though the monitoring feature is free under the basic license. The basic license will expire in one month.
To update your license, please follow these instructions.
Loading data¶
ELK will automatically be loaded as new social media data is harvested. (Note, however, that there will be some latency between the harvest and the data being available in Kibana.)
Since only new social media data is added, it is recommended that you enable the ELK Docker container before beginning harvesting.
If you would like to load social media data that was harvested before the ELK Docker container was enabled, use the resendwarccreatedmsgs management command:
usage: manage.py resendwarccreatedmsgs [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color]
[--collection-set COLLECTION_SET]
[--harvest-type HARVEST_TYPE] [--test]
routing_key
The resendwarccreatedmsgs command resends warc_created messages, which triggers the loading of data by ELK.
To use this command, you will need to know the routing key. The routing key is elk_loader_<hostname>.warc_created. The hostname is available as part of the definition of the ELK container in the docker-compose.yml file.
The loading can be limited by collection set (--collection-set) and/or harvest type (--harvest-type). You can get collection set ids from the collection set detail page. The available harvest types are twitter_search, twitter_filter, twitter_user_timeline, twitter_sample, and weibo_timeline.
This shows loading the data limited to a collection set:
docker exec sfm_ui_1 python sfm/manage.py resendwarccreatedmsgs --collection-set b438a62cbcf74ad0adc09be3b07f039e elk_loader_myproject_elk.warc_created
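Because the routing key and the optional flags follow a fixed pattern, the command line can be assembled programmatically, for example when resending messages for several collection sets. The sketch below is illustrative: the helper names are not part of SFM, and the sfm_ui_1 container name is taken from the example above and may differ in your deployment.

```python
# Harvest types listed in the documentation above.
HARVEST_TYPES = {
    "twitter_search",
    "twitter_filter",
    "twitter_user_timeline",
    "twitter_sample",
    "weibo_timeline",
}


def routing_key(elk_hostname):
    """Build the routing key elk_loader_<hostname>.warc_created."""
    return f"elk_loader_{elk_hostname}.warc_created"


def resend_command(elk_hostname, collection_set=None, harvest_type=None,
                   ui_container="sfm_ui_1"):
    """Return the docker exec argument list for resendwarccreatedmsgs."""
    if harvest_type is not None and harvest_type not in HARVEST_TYPES:
        raise ValueError(f"unknown harvest type: {harvest_type!r}")
    args = ["docker", "exec", ui_container,
            "python", "sfm/manage.py", "resendwarccreatedmsgs"]
    if collection_set:
        args += ["--collection-set", collection_set]
    if harvest_type:
        args += ["--harvest-type", harvest_type]
    args.append(routing_key(elk_hostname))
    return args
```

The returned list can be passed to subprocess.run, or joined with spaces to reproduce the command shown above.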
Another option for loading data from line-oriented JSON files or WARC files is to use a warc iterator. Warc iterators are command-line tools that can be used to prepare data for loading into ELK.
Each harvester's warc iterator has two options for this purpose:
usage: twitter_rest_warc_iter.py [-h] [--pretty] [--dedupe]
[--print-item-type] [--debug [DEBUG]]
[--elkwarc [ELKWARC]] [--elkjson [ELKJSON]]
filepaths [filepaths ...]
The elkwarc and elkjson options help you load data from WARC and JSON files directly into ELK. Here is a simple Twitter example:
Loading warc:
twitter_rest_warc_iter.py --elkwarc=true <your_warc_files> | /usr/share/logstash/bin/logstash -f stdin.conf
Loading json:
twitter_rest_warc_iter.py --elkjson=true <your_json_files> | /usr/share/logstash/bin/logstash -f stdin.conf
Overview of Kibana¶
The Kibana interface is extremely powerful. However, with that power comes complexity. The following provides an overview of some basic functions in Kibana. For more advanced usage, see the Kibana Reference or the Kibana 101: Getting Started with Visualizations video.
When you start Kibana, you probably won’t see any results.
This is because Kibana defaults to only showing data from the last 15 minutes. Use the date picker in the upper right corner to select a more appropriate time range.
Tip: At any time, you can change the date range for your query, visualization, or dashboard using the date picker.
Discover¶
The Discover tab allows you to query the social media data.
By default, all social media types are queried. To limit to a single type (e.g., tweets), click Open and select the appropriate filter.
You will now only see results for that social media type.
Notice that each social media item has a number of fields.
You can search against a field. For example, to find all tweets containing the term “archiving”:
or having the hashtag #SaveTheWeb:
or mentioning @SocialFeedMgr:
Visualize¶
The Visualize tab allows you to create visualizations of the social media data.
The types of visualizations that are supported include:
- Area chart
- Data table
- Heatmap chart
- Line chart
- Markdown widget
- Metric
- Pie chart
- Tag cloud
- Tile map
- Timeseries
- Vertical bar chart
Describing how to create visualizations is beyond the scope of this overview.
A number of visualizations have already been created for social media data. (The available visualizations are listed on the bottom of the page.)
For example, here is the Top 10 hashtags visualization:
Dashboard¶
The Dashboard tab provides a summary view of data, bringing together multiple visualizations and searches on a single page.
A number of dashboards have already been created for social media data. To select a dashboard, click the folder icon and select the appropriate dashboard.
For example, the default Kibana dashboard is Twitter. Here is the top of the Twitter dashboard:
Caveats¶
- This is experimental. We have not yet determined the level of development that will be performed in the future.
- Approaches for administering and scaling ELK have not been considered.
- No security or access restrictions have been put in place around ELK.
- Including the X-Pack security and account management may be considered in the future.