Log Aggregation with ELK + Kafka

Posted by Demitri Swan / Category: Development, Open Source, Operations, Scaling

At Moz we are re-evaluating how we aggregate logs. Log aggregation helps us troubleshoot systems and applications, and provides data points for trend analysis and capacity planning. As our use of bare-metal systems, virtual machines, and Docker containers continues to grow, we set out to find alternative, more performant ways to collect and present log data. With the recent release of ELK stack v5 (now the Elastic Stack), we took the opportunity to put the new software to the test.

ELK+K

In a perfect world, each component within the ELK stack performs optimally and service availability remains consistent and stable. In the real world, log-bursting events can cause saturation, log transmission latency, and sometimes log loss. To handle such situations, message queues are often employed to buffer logs while upstream components stabilize. We chose Kafka for its direct compatibility with Logstash and Rsyslog, impressive performance benchmarks, fault tolerance, and high availability.

We deployed Kafka version 0.10.0.0 and Kafka Manager via Ansible to 3 bare-metal systems. The brokers serve a single topic for log streams, configured with 3 partitions and a replication factor of 1. While increasing the replication factor is recommended to keep a topic highly available, we set it to 1 to establish a baseline for the test. Kafka Manager provides a web interface for interacting with the Kafka broker set and observing any lag between the brokers and the Logstash consumers.
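
For reference, a topic with this layout can be created with the stock Kafka CLI that ships with 0.10; the ZooKeeper address and topic name below are placeholders rather than our actual values.

    # Create the log topic used for the test: 3 partitions, replication factor 1.
    bin/kafka-topics.sh --create \
      --zookeeper zk1:2181 \
      --topic logs \
      --partitions 3 \
      --replication-factor 1

    # Verify the partition layout and replica assignment.
    bin/kafka-topics.sh --describe --zookeeper zk1:2181 --topic logs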

Adding X-Pack

We deployed the latest versions of Elasticsearch and Kibana to the same bare-metal systems via Ansible. Additionally, we installed the X-Pack Elasticsearch and Kibana plugins to provide insight into indexing stability and overall cluster health from the web interface. The X-Pack security plugin enables basic authentication by default, so we’ve enabled anonymous access to the cluster for the purposes of testing.
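
One documented way to enable anonymous access with X-Pack 5.x is via elasticsearch.yml; the following is a sketch of that approach, and the wide-open role used here is only reasonable for a throwaway test cluster.

    # elasticsearch.yml (sketch) -- grant unauthenticated requests a role.
    # "superuser" is used purely for testing; production would use a narrow role.
    xpack.security.authc:
      anonymous:
        username: anonymous_user
        roles: superuser
        authz_exception: true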

Logs on Demand

We wrote a Python log generation tool that logs strings of random lengths — between 30 and 800 characters — to Syslog. The generator sleeps between each log emission for a configured interval. Rsyslog templates the messages into JSON and ships them to 1 of the 3 configured Kafka brokers. The log generation tool and Rsyslog daemon are packaged as a Marathon service and deployed on Mesos to enable dynamic control of load generation.
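
A minimal sketch of such a generator, using the standard library SysLogHandler; the --interval flag and logger name are illustrative rather than the exact tool we run.

    #!/usr/bin/env python
    """Emit random-length log lines to the local syslog at a fixed interval."""
    import argparse
    import logging
    import logging.handlers
    import random
    import string
    import time

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--interval", type=float, default=0.01,
                            help="seconds to sleep between log emissions")
        args = parser.parse_args()

        logger = logging.getLogger("log-generator")
        logger.setLevel(logging.INFO)
        # /dev/log is the local syslog socket that rsyslog listens on.
        logger.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))

        while True:
            length = random.randint(30, 800)              # 30-800 characters
            message = "".join(random.choice(string.ascii_letters + " ")
                              for _ in range(length))
            logger.info(message)
            time.sleep(args.interval)

    if __name__ == "__main__":
        main()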

Dockerized Logstash

We chose the logstash:5 image from the official Logstash repository on Docker Hub as the base image for running Logstash as a Marathon service on Mesos. To enable log ingestion from Kafka and transmission to Elasticsearch, we added a custom logstash.conf file at /etc/logstash/conf.d/logstash.conf and updated the CMD declaration to point to that file. While a full-blown Logstash scheduler for Mesos does exist, a plain Marathon service gave us the simplicity and flexibility we wanted for this test.
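
The pipeline itself is short; a sketch of what that logstash.conf might look like follows, with the broker addresses, topic, group id, and index pattern as placeholders.

    # logstash.conf (sketch) -- hostnames, topic, and index pattern are placeholders.
    input {
      kafka {
        bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
        topics            => ["logs"]
        group_id          => "logstash"
        codec             => "json"
      }
    }

    output {
      elasticsearch {
        hosts => ["es1:9200", "es2:9200", "es3:9200"]
        index => "logstash-%{+YYYY.MM.dd}"
      }
    }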

The Setup

Three test machines running Kafka, Zookeeper, Elasticsearch, and Kibana:

  • Cisco Systems Inc UCSC-C240
  • 128 GB RAM
  • Data partition: 8 TB of 7200 RPM SATA drives
  • 32 CPU Cores

Mesos Agents running Logstash and Log-Generator:

  • Cisco Systems Inc UCSB-B200
  • 256 GB RAM
  • 32 CPU Cores

Putting it All Together

We gradually increased the number of log-generator Marathon instances and took note of the end-to-end latency, derived by subtracting the timestamp within the message from the timestamp reported by Elasticsearch. We also spot-checked the reported lag in Kafka Manager. Lag that fluctuates between positive and negative values indicates stability, while lag that steadily increases signifies that message consumption by the Logstash container set is falling behind the Kafka brokers.
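
A rough sketch of that end-to-end latency spot check from the Elasticsearch side; the index pattern and field names are assumptions about the rsyslog template rather than our exact mapping.

    # Sketch: compare the timestamp embedded in each message ("timestamp") to the
    # "@timestamp" assigned at ingest. Field names and formats are assumptions.
    from datetime import datetime
    from elasticsearch import Elasticsearch

    ISO = "%Y-%m-%dT%H:%M:%S.%fZ"   # assumes both fields are UTC ISO-8601 strings

    es = Elasticsearch(["es1:9200"])
    resp = es.search(index="logstash-*", size=100, sort="@timestamp:desc")

    for hit in resp["hits"]["hits"]:
        doc = hit["_source"]
        emitted = datetime.strptime(doc["timestamp"], ISO)
        indexed = datetime.strptime(doc["@timestamp"], ISO)
        print("end-to-end latency: %.3fs" % (indexed - emitted).total_seconds())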

The Results

We were able to achieve 1.6 million messages per minute with sub-second end-to-end latency and stable throughput within the Kafka queue. As we approached 2 million messages per minute, we began to observe degradation of end-to-end latency. Even so, this is roughly a 300% performance increase over a similar stack consisting of Redis, Elasticsearch, Logstash, and Kibana running older software (ELK v1.x) on similar hardware.

[Figure: Timelion graph of messages indexed per minute]
[Figure: default Kibana Discover tab displaying log data]

While our log aggregation research with Kafka and ELK is still in its early stages, we’ve learned the following:

  • Version 5 of the ELK stack demonstrated greater throughput capacity and faster indexing capability over version 1.
  • Logstash is easier to manage, deploy, and scale as a Marathon service on Mesos than as a traditional service on virtual machines.
  • There is additional room for performance improvement with the Kafka and ELK components through appropriate hardware allocation and performance tuning.

It is likely that we will continue to leverage Kafka and ELK v5 in pursuit of an updated log aggregation solution at Moz.


Moz’s Machine Learning Approach to Keyword Extraction from Web Pages

Posted by Matt Peters / Category: Data Science, Python

With thanks to Rutu Mulkar, Erin Renshaw, Chris Whitten, Jay Leary and many others!

Introduction

Keyword extraction is an important task for summarizing documents, and in this post I’ll explain some of the details and design decisions underlying Moz’s keyword extraction algorithm. Our scalable implementation processes a web page and returns a list of keyword phrases with relevance scores. It blends traditional natural language processing techniques with a machine learning ranking model applied to the web domain. (The machine learning pipeline has been in production in Moz Content and Moz Pro for more than a year and has proven to be robust and stable.)

At Moz, we have numerous product uses for a keyword extraction algorithm. We can use it to tag and summarize a page with its most salient topics, or to build a relationship graph between keywords. For instance, by looking at which topics co-occur frequently with “event driven programming” we can find related topics (e.g., “node js”). Moz Pro currently includes a feature based on this idea.

Product requirements, problem formulation and alternate approaches

There are many different approaches to topic extraction and summarization. In our case, we want granular, easily interpretable topics that can be computed efficiently.

Latent Dirichlet Allocation (LDA) is a well-known topic-extraction algorithm that defines a topic as a probability distribution over a vocabulary. As a result, its topics can sometimes be difficult to interpret and it tends to extract broad topics (e.g. “computer science” instead of “in-memory databases” or “event driven programming”).

In addition, we don’t want to restrict our extracted topics to a predefined vocabulary. Ideally, we’d be able to capture infrequently occurring phrases such as names or places and be able to extract important keywords like “Brexit” from web pages even if the algorithm has never seen the token before.

To overcome these issues and align with product goals, we decided to restrict our extracted “topics” to be “keyword phrases that occur on the page.” Assuming the keyword phrases we extract from the page are meaningful (e.g. “event driven programming,” not “instead of a”), this approach simultaneously solves the interpretability and out-of-vocabulary problems.

Our approach to keyword extraction

Key-phrase extraction has been well studied in the research literature (see Hasan and Ng, 2014, for a recent review). Many of these studies are domain-specific (e.g. keyword extraction from a technical paper, news article, etc.) and use small datasets with only hundreds of labeled examples. For example, Table 1 in Hasan and Ng (2014) lists 13 datasets used in prior studies, only four of which have more than 1,000 documents. The size of the datasets has mostly prevented researchers from building complex supervised models, and many of the published methods are purely unsupervised (with a few exceptions including Dunietz and Gillick, 2014, and Gamon et al., 2013). Additionally, since the notion of a “relevant keyword” is not well defined, collecting labeled data can be difficult and can result in noisy labels with low inter-annotator agreement.

Accordingly, we made two design decisions at the project outset. First, we decided to build a task-specific method instead of a general-purpose keyword extraction algorithm. This simplified the problem and focused our efforts. Second, to collect training data we decided to avoid the complexity of manually labeling examples. Instead, we devised a way to collect a large amount of data automatically. This allowed us to build a complex model with many different types of features.

[Figure: dataset generation overview]

To build the labeled data we…

  1. Selected a large number of high volume search queries
  2. Ran the queries through a web search engine and collected the top ten results for each query
  3. Fetched the web pages from all search results and cached the raw HTML
  4. Combined the raw HTML with the search query to make pairs (HTML, relevant keyword)

This resulted in a large data set of page-keyword pairs we used to train our machine learning model. We used search volume as a proxy for prevalence on the web so that we had some assurance that the search engine had a large number of pages to choose from when returning the top ten results (and they were therefore highly relevant to the seed query).
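
A condensed sketch of that collection loop is below; search_top_urls() is a hypothetical stand-in for whichever search API is used, requests does the fetching, and error handling is omitted for brevity.

    # Sketch of the labeled-data collection loop described above.
    import requests

    def search_top_urls(query, n=10):
        """Hypothetical: return the top-n result URLs for a search query."""
        raise NotImplementedError

    def build_pairs(queries):
        pairs = []  # list of (raw_html, relevant_keyword) tuples
        for query in queries:
            for url in search_top_urls(query, n=10):
                resp = requests.get(url, timeout=10)
                if resp.ok:
                    pairs.append((resp.text, query))
        return pairs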

Algorithm details

Below you’ll find an illustration of our algorithm. At a high level it extracts keywords from a web page through a two-step process. First, it generates a long list of potential candidates from the page. Then, it uses a machine learning ranking model to rank the candidates by relevance and assign a relevance score.

[Figure: algorithm overview]

Generating candidate phrases

To generate candidate keywords from the raw HTML, we first parse and dechrome the page (extract the main page content) using Dragnet, our content extraction algorithm. This step eliminates most of the text on the page that is irrelevant to the important topics (navigation links, copyright disclaimers, etc.). We also extract some additional structured information at this point: the title tag, the meta description tag, and H1/H2 tags.

Then, we normalize and tokenize the text. Since much of the text on the web is non-standard and most pages include tokens like URLs and email addresses, we took care to treat these as single tokens in the tokenizer.
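
A compressed sketch of those two steps; it assumes the extract_content entry point from a recent open source Dragnet release and uses a deliberately simple regular expression in place of our production tokenizer.

    # Sketch: dechrome with Dragnet, then tokenize while keeping URLs and email
    # addresses as single tokens. The regex is illustrative only.
    import re
    from dragnet import extract_content  # assumes a recent dragnet release

    TOKEN_RE = re.compile(
        r"https?://\S+"                  # URLs as single tokens
        r"|[\w.+-]+@[\w-]+\.[\w.]+"      # email addresses as single tokens
        r"|\w+(?:['-]\w+)*",             # ordinary words
        re.UNICODE)

    def tokenize_page(raw_html):
        text = extract_content(raw_html)  # main content, chrome removed
        return [token.lower() for token in TOKEN_RE.findall(text)]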

From the tokenized page content, we generate candidates in two ways. The first runs a part-of-speech tagger and noun-phrase chunker on the text and adds all noun-phrases to the candidate list (following Barker and Cornacchia, 2000). Limiting the candidates to noun phrases significantly reduces the set of candidates, while ensuring that the candidates are meaningful.
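
An illustrative version of this step built on NLTK (part-of-speech tagger plus a regular-expression chunker); our production system uses the homegrown mltk library, so treat this purely as a sketch.

    # Sketch: POS-tag the tokens and extract noun phrases as candidate keywords.
    # Requires the NLTK "averaged_perceptron_tagger" data package (nltk.download).
    import nltk

    # Simple grammar: optional determiner, any adjectives, one or more nouns.
    GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"
    CHUNKER = nltk.RegexpParser(GRAMMAR)

    def noun_phrase_candidates(tokens):
        tagged = nltk.pos_tag(tokens)
        tree = CHUNKER.parse(tagged)
        candidates = []
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            candidates.append(" ".join(word for word, tag in subtree.leaves()))
        return candidates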

The second method for generating candidates looks up potential phrases in a modified list of Wikipedia article titles to find important entities that the noun-phrase chunker missed. For example, our noun-phrase chunker will split “Statue of Liberty” into two different candidates, but this step will add it as a single candidate.
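
A sketch of that lookup, treating the title list as a plain set of normalized phrases; wikipedia_titles.txt is a placeholder for however the modified title list is actually stored.

    # Sketch: scan the token stream for n-grams that match normalized Wikipedia
    # article titles and add them as candidates.
    def load_titles(path="wikipedia_titles.txt"):
        """Placeholder loader: one lowercased, space-separated title per line."""
        with open(path) as fh:
            return {line.strip() for line in fh if line.strip()}

    def wikipedia_candidates(tokens, titles, max_len=5):
        candidates = set()
        for n in range(2, max_len + 1):       # single tokens are left to the chunker
            for i in range(len(tokens) - n + 1):
                phrase = " ".join(tokens[i:i + n])
                if phrase in titles:
                    candidates.add(phrase)    # e.g. "statue of liberty"
        return candidates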

Relevance ranking model

All of the candidates are passed to a machine learning ranking model that ranks them by relevance. Since we have a large dataset, we were able to include a wide variety of features in the model. They include:

  • Shallow: relative position in document, number of tokens, etc.
  • Occurrence: does the candidate occur in title, H1, meta description, etc.
  • Term frequency: count of occurrences, average token count in the candidate, etc.
  • QDR: information retrieval motivated “query-document relevance” ranking signals including TF-IDF (term frequency × inverse document frequency), probabilistic approaches, and language models. We used our open source library, qdr, to compute these scores.
  • POS tags: is the keyword a proper noun, etc.?
  • URL features: does the keyword appear in the URL, etc.?
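
To make the feature list concrete, here is a small sketch of how a few of the shallow, occurrence, and term-frequency features above might be computed for a single candidate; the helper and its arguments are illustrative, and the real extraction covers many more signals (QDR, POS tags, URL features, and so on).

    # Sketch: a handful of the simpler features for one candidate phrase.
    def shallow_features(candidate, tokens, title_tokens):
        cand_tokens = candidate.split()
        positions = [i for i in range(len(tokens) - len(cand_tokens) + 1)
                     if tokens[i:i + len(cand_tokens)] == cand_tokens]
        count = len(positions)
        return {
            "num_tokens": len(cand_tokens),
            "term_frequency": count,
            "relative_position": positions[0] / float(len(tokens)) if count else 1.0,
            "in_title": all(t in title_tokens for t in cand_tokens),
        }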

Since our data set consists of just a single relevant keyword for each page and many unlabeled candidates (some of which are relevant and many of which are not), we could not simply take an off-the-shelf classifier and use it. Instead, we took a tip from the literature on PU learning. This is an unfortunately named, but interesting and useful subset of machine learning where one is presented with only Positive and Unlabeled data (PU). Among the established approaches to solving PU problems, we chose the one in Learning classifiers from only positive and unlabeled data, Elkan and Noto, 2008. To apply it to our data, we labeled each relevant keyword as positive and all unlabeled candidates as negative, then trained a binary classifier. The resulting model predicts the probability that each keyword is relevant and ranks the candidates by the predicted probability. A re-scaled version of the model probability serves as our relevance score.
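
A condensed sketch of the Elkan and Noto correction with scikit-learn, assuming a feature matrix X and an indicator s marking which rows are the labeled positives; this illustrates the calibration step, not our production pipeline.

    # Sketch of the Elkan & Noto (2008) correction: train a classifier to separate
    # labeled positives from unlabeled examples, estimate the labeling frequency c
    # on held-out positives, then rescale the predicted probabilities by 1/c.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def fit_pu_model(X, s):
        """X: feature matrix; s: 1 for labeled-positive rows, 0 for unlabeled."""
        X, s = np.asarray(X), np.asarray(s)
        X_train, X_hold, s_train, s_hold = train_test_split(
            X, s, test_size=0.2, stratify=s, random_state=0)
        clf = LogisticRegression()
        clf.fit(X_train, s_train)
        # c estimates P(labeled | truly positive) on held-out labeled positives.
        c = clf.predict_proba(X_hold[s_hold == 1])[:, 1].mean()
        return clf, c

    def relevance_scores(clf, c, X_candidates):
        # Corrected probability that each candidate is truly relevant.
        return np.clip(clf.predict_proba(X_candidates)[:, 1] / c, 0.0, 1.0)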

The future

The current algorithm was developed in early 2015 and has been in production for more than a year. It has proven to be robust and reliable, but it could be improved in a few ways.

  • Most of the errors in extracted keywords are due to errors from the noun-phrase chunker when it extracts a phrase that is not meaningful. NLP technologies move quickly, and two high-quality open source parsers have been released since we developed our algorithm (Parsey McParseface and spaCy). If we were to start this project today, we’d likely choose one of these instead of our homegrown one (mltk).
  • The second area for improvement is in grouping related keywords. For example, when we run our algorithm on this TechCrunch article about people.ai, both “machine learning” and “machine learning algorithms” are extracted as key phrases. However, these could be grouped together, adding more diversity to the top-ranked keywords. We currently do some grouping of related keywords, but it could be more aggressive (a small sketch of one possible approach follows this list).
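
As a starting point, a naive version of more aggressive grouping might simply fold a phrase into any already-kept phrase that contains it, or that it contains, keeping one representative per group; this is an illustration, not our production grouping logic.

    # Sketch: collapse keywords that contain (or are contained in) an
    # already-kept keyword, e.g. "machine learning" and
    # "machine learning algorithms" end up in the same group.
    def group_keywords(ranked_keywords):
        """ranked_keywords: phrases ordered best-first; returns one per group."""
        representatives = []
        for phrase in ranked_keywords:
            padded = " %s " % phrase
            if any(padded in " %s " % rep or " %s " % rep in padded
                   for rep in representatives):
                continue                      # folded into an existing group
            representatives.append(phrase)
        return representatives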

Rather than iterating on the current algorithm, we are currently focusing our efforts on building “Topic Reach,” which learns topic-site authority associations. We have collected several large search indices internally that provide ideal data sets to analyze.