Your Ultimate Resource for Search Engine Optimization

SEO Journal




Extending and Augmenting Hadoop

Use the right tool for the job

In the last two years, the Apache Hadoop software library has emerged as a veritable Swiss Army knife of data management and analytical infrastructure. The Hadoop toolset has been positioned as a universal platform for all types of commercial and organizational analytical needs.

Hadoop is an ideal solution for use cases in which the data is easily partitioned and distributed. For example, consider keyword searches, a major component of SEO. Identifying and counting distinct words in text data is a central part of keyword-based search. No matter how many pieces of text you have, each document, article, blog post, or other piece of content is distinct from the others.

To enable keyword search, a program then counts the words in each distinct document or item of text. Clearly, this can be done in isolation: to count the number of times a word occurs in a given set of documents, you can count it within each document and add up the counts across all documents. Moreover, because the documents are distinct, you can count several of them at the same time. An Internet-scale search engine like Google essentially leverages this concept to distribute such processing across a large number of simple machines (a cluster).
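The count-then-add idea can be sketched in a few lines of Python. This is only an illustration on one machine, not Hadoop code: the function names and sample documents are assumptions made for the example, and on a real cluster each call to `count_words` could run on a different node.

```python
from collections import Counter
import re

def count_words(text):
    """Count word occurrences in one document, independently of all others."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def merge_counts(per_document_counts):
    """Add up the per-document counts to get totals for the whole collection."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)
    return total

# Hypothetical sample documents, invented for this sketch.
documents = [
    "Hadoop counts words in documents",
    "Search engines count words too",
    "Words, words, words",
]

per_document = [count_words(doc) for doc in documents]  # independent, parallelizable step
totals = merge_counts(per_document)                      # aggregation step
print(totals["words"])  # 5
```

Because no document's count depends on any other document, the first step can be split across as many machines as are available.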

Another example where Hadoop shines is when it comes to counting the number of times a specific user, as represented by an IP address, has visited a particular web page or website. Again, this can be broken up into a series of smaller problems and spread over multiple machines in a Hadoop cluster. Results from the smaller sets can then be aggregated to obtain the total count.

MapReduce, the programming model behind Hadoop, was designed to address problems in which an operation over a very large dataset can easily be broken into the same operation on smaller datasets. The promise of Hadoop is in the ability to use open source software relatively inexpensively to address this whole class of partitionable problems.
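As a rough sketch of that model, the page-visit example can be written as a map step, a grouping step, and a reduce step. The log format, function names, and sample lines below are assumptions for illustration; a real Hadoop job would use the framework's own APIs, and the grouping would be handled by the framework rather than by a local dictionary.

```python
from collections import defaultdict

def map_visit(log_line):
    """Map step: turn one log line into a ((ip, page), 1) pair."""
    ip, page = log_line.split()
    return (ip, page), 1

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_visits(key, values):
    """Reduce step: sum the partial counts for one (ip, page) key."""
    return key, sum(values)

# Hypothetical log lines in an assumed "<ip> <page>" format.
log_lines = [
    "10.0.0.1 /home",
    "10.0.0.2 /home",
    "10.0.0.1 /home",
    "10.0.0.1 /pricing",
]

mapped = [map_visit(line) for line in log_lines]
visit_counts = dict(reduce_visits(k, v) for k, v in shuffle(mapped).items())
print(visit_counts[("10.0.0.1", "/home")])  # 2
```

Each log line is mapped without reference to any other line, and each key is reduced without reference to any other key, which is what makes the problem partitionable.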

However, there are a number of analytical use cases for which Hadoop is inadequate. In such cases, the Hadoop toolset may need to be augmented and extended with other technologies to properly resolve these problems.

Understanding Graph Connections
In parallel with the emergence of Hadoop, the world of social media has exploded: as of 2012, the social media powerhouse Facebook had more than one billion monthly active users, according to CEO Mark Zuckerberg.

Social media networks such as Facebook and LinkedIn are driven by a fundamental focus on relationships and connections. For example, Facebook users can now use the service's Graph Search to find friends of friends who live in the same city or like the same baseball team, and the site frequently suggests "people you may know" based on the mutual connections that two unconnected individuals have established. LinkedIn focuses on helping business professionals grow their social networks by helping them find key contacts or prospects who are connected to existing friends or colleagues, and allowing users to leverage those existing relationships to form new connections. The use of such data connections is becoming ever more useful to individuals for enhancing their personal and business lives.

Likewise, the capacity to comprehend and assess such relationships is a key component driving the world of business analytics. For example, business managers frequently want to know the answers to questions such as:

  • What are all the ways in which a person of interest in a crime database may be related to another person of interest?
  • Based on known patterns of suspicious behavior in a corporate network, how can we identify malicious hacking attacks before they have a financial impact on our company?
  • Which of an organization's partners have a financial exposure to the failure of another company?

Take the question of how two people might be connected on Facebook. This may seem simple, but on closer inspection it is not so clear. They can be friends - a direct connection that is hard to miss. Or they might be friends of friends, which is a little murkier. The connections can be even more distant and harder to pinpoint. For instance, Person A may be married to someone whose brother is a friend of Person B. Or perhaps they have a shared affiliation, such as attending the same school, working at the same organization, or attending the same church.
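One common way to answer "how far apart are these two people?" is a breadth-first search over the friendship graph. The sketch below is a minimal illustration, not how any social network actually implements it; the graph, the names, and the function `degrees_of_separation` are invented for the example.

```python
from collections import deque

def degrees_of_separation(friends, start, target):
    """Return the number of hops on the shortest path, or None if unconnected."""
    if start == target:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        for friend in friends.get(person, ()):
            if friend == target:
                return distance + 1
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, distance + 1))
    return None

# Hypothetical graph: A is married to Spouse, whose Brother is a friend of B.
friends = {
    "A": ["Spouse"],
    "Spouse": ["A", "Brother"],
    "Brother": ["Spouse", "B"],
    "B": ["Brother"],
    "C": [],
}

print(degrees_of_separation(friends, "A", "B"))  # 3
print(degrees_of_separation(friends, "A", "C"))  # None
```

Note that every step of the search may jump to any other person in the graph, so the work cannot be confined to one slice of the data in advance.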

In some cases, two individuals' only connection may be sharing a few Likes. These shared affinities may be valuable information to a business if, for example, those Likes happen to be something your organization addresses. In that case, you may want to drill down to those specific people out of the entire billion users on Facebook, so that you can target your online advertising directly to them.

If you consider all the possible ways that one Facebook user can be connected to another, you face a very different kind of Big Data problem. You cannot simply break the problem into smaller, independent segments, because the answer depends on connections that may cross any boundary you draw between segments. This makes link analysis a problem that Hadoop isn't ideally suited to address.

Link analysis problems occur in many domains beyond social networks. The network of neurons in the brain and the pathways between these neurons is an example. A group of suspicious people and their connections (as observed by their interactions) is another. The network of genes and proteins and their interactions is yet another.

What do you do to solve problems that involve complex relationship patterns and require detailed link analysis? Enter graph analytics.

Graph Analytics
Essentially, graphs provide a way of organizing data that specifically highlights relationships. On such a foundation, it is possible to apply analytical techniques, from simple to complex, to understand groups of similar or related entities, to identify the central influencer in a social network, or to identify complex patterns of behavior indicative of fraud.

In fact, a key ingredient in Google's search engine success is a specific graph analytics technique called PageRank. Rather than focusing only on the prevalence of keywords in a web page, Google analyzed the link relationships between web pages on the World Wide Web and prioritized results from highly authoritative sites, which produced remarkably relevant results for keyword searches.
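To give a feel for the technique, here is a simplified PageRank computed by power iteration. It is a textbook-style sketch, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the tiny link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict of page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical four-page site, invented for this sketch.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(links)
best = max(ranks, key=ranks.get)
print(best)  # home
```

A page ranks highly when other well-ranked pages link to it, which is why "home" comes out on top here even though no keywords are examined at all.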

A common, standard way of representing data in this relationship-oriented format is RDF, a W3C standard, which is accompanied by a query language called SPARQL to specifically analyze such data. In the Life Sciences domain, companies and public consortia are increasingly representing data in this form, because this method provides a more comprehensive overview of the data relationships - whether it is gene/protein interactions, or diseases and their genetic characteristics.
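RDF describes data as subject-predicate-object statements ("triples"), and a SPARQL query is, at its core, a pattern matched against those triples. The sketch below imitates that idea with plain Python tuples. It is not an RDF library or a SPARQL engine; the helper `match`, the entity names, and the predicates are all hypothetical.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard,
    playing the role a variable plays in a SPARQL pattern."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Hypothetical life-sciences facts, invented for this sketch.
triples = {
    ("geneA", "encodes", "proteinX"),
    ("proteinX", "interactsWith", "proteinY"),
    ("geneB", "encodes", "proteinY"),
    ("geneB", "associatedWith", "diseaseZ"),
}

# Roughly the intent of: SELECT ?gene WHERE { ?gene :encodes :proteinY }
genes = [s for (s, _, _) in match(triples, predicate="encodes", obj="proteinY")]
print(genes)  # ['geneB']
```

Every fact has the same three-part shape, so relationships of any kind can sit side by side without a table schema for each one.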

Requires Secret Sauce
Since the nature of graphs makes them difficult to partition, Hadoop is not well suited to this class of analytical problems.

As a matter of fact, the problems are even deeper than that: Because of the unpredictable nature of data access while following and analyzing relationships, commodity hardware architectures are fundamentally challenged. Merely grouping machines together does not address these issues, because the challenges posed by graph analytics are at the network level and are not significantly addressed by the computing capacity of a single machine. What is the ideal approach for solving complex problems involving the analysis of relationships in data?

The secret sauce behind the best performing graph analytics tools is massive un-partitioned memory. One tool, for example, uses a memory pool of up to 512TB (half a petabyte) to perform continuous data and link analysis in real-time even as data continues to pour in. This eliminates latency problems and memory scalability issues while customized chips speed performance.

Comparison Table: Hadoop vs. Graph Analytics

                  Hadoop                       Graph Analytics
Operation mode    -                            -
(Hardware)        Any commodity hardware       Specialized hardware
(Partitioning)    Must be partitioned          Allows non-partitioned
Query types       Seek specific data answers   Discover relationships, connections
(Data model)      Tables of entities           Relationships between entities

Graph Analytics Use Cases
Graph analytics is a new player in the Big Data game (which, itself, is quite new). Still, the pioneers and early adopters are reporting promising results for graph analytics as an alternative for solving diverse types of problems. Several examples include:

  • Actionable intelligence: QinetiQ North America (QNA) delivers "actionable intelligence" to government customers interested in identifying threats through the detection of non-obvious patterns of relationships in big data. Graph analytics were the obvious approach, for which QinetiQ uses a purpose-built graph analytics appliance running graph-optimized hardware and a graph database. QinetiQ interacts with the appliance through the industry-standard RDF/SPARQL interface, as defined by the World Wide Web Consortium (W3C).
  • Life sciences: Oak Ridge National Laboratory (ORNL) opted for a graph analytics appliance to conduct research in healthcare fraud and analytics for a leading healthcare payer. In addition to the healthcare fraud detection program, researchers and scientists at ORNL will also apply the capabilities of the graph-analytics appliance to other areas of research where data discovery is vital. These potential use cases include healthcare treatment efficacy and outcome analysis, analyzing drugs and side effects, and the analysis of proteins and gene pathways.
  • Higher education: The Pittsburgh Supercomputing Center (PSC) turned to a graph analytics appliance called Sherlock (no relation to IBM's Watson) to provide researchers with the ability to search extremely large and complex bodies of information using a straightforward command similar to "find something important." Sherlock took advantage of specialized graph analytics hardware to run 128 threads per processor and to speed memory access across a terabyte of global shared memory. The appliance helped PSC win public recognition for extending graph analytics techniques to a wide range of scientific research projects.

The potential uses of graph analytics are just beginning to be explored. Already, the technology is being applied across a broad array of industries, including manufacturing, energy and gas exploration, earth sciences and meteorology, and government and defense.

Advantages Offered by Graph Analytics
A key advantage of graphs is the ease with which new sources of data and new relationships can be added. Graph databases using RDF to represent the graph can easily merge and unify diverse datasets without significant upfront investment in data modeling. Such an approach stands in stark contrast to "traditional" analytics, in which a great deal of time is spent organizing data, and the addition of new data sources requires time-consuming and error-prone effort by analysts.
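The reason merging is cheap can be illustrated with triples held as Python sets: combining two sources is a set union, with no shared table schema to design first. This is a sketch of the idea only, not of any particular graph database; the datasets, predicates, and the helper `facts_about` are invented for the example.

```python
# Two hypothetical sources describing overlapping entities.
crm_triples = {
    ("alice", "customerOf", "AcmeCorp"),
    ("alice", "livesIn", "Pittsburgh"),
}
social_triples = {
    ("alice", "follows", "bob"),
    ("bob", "customerOf", "AcmeCorp"),
}

# Merging is a set union; new relationship types need no schema change.
merged = crm_triples | social_triples

def facts_about(triples, entity):
    """All statements in which the entity appears as subject or object."""
    return sorted(t for t in triples if entity in (t[0], t[2]))

for fact in facts_about(merged, "alice"):
    print(fact)
```

After the union, a question about "alice" draws on both sources at once, even though neither source was modeled with the other in mind.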

The easy on-boarding of new data is particularly important when dealing with Big Data. Traditional analytics focus on finding answers to known questions. By contrast, many of the highest value applications, such as those identified above, are focused on discovery, where the questions to be answered are not known in advance. The ability to quickly and easily add new data sources or new relationships within the data when needed to support a new line of questioning is crucial for discovery, and graphs are uniquely well qualified to support these requirements.

Graph analytics also offer sophisticated capabilities for analyzing relationships, while traditional analytics focus on summarizing, aggregating and reporting on data. Use the right tool for the job. Some common graph analytic techniques include:

  1. Centrality analysis: To identify the most central entities in your network, a very useful capability for influencer marketing.
  2. Path analysis: To identify all the connections between a pair of entities, useful in understanding risks and exposure.
  3. Community detection: To identify clusters or communities, which is of great importance to understanding issues in sociology and biology.
  4. Sub-graph isomorphism: To search for a pattern of relationships, useful for validating hypotheses and searching for abnormal situations, such as hacker attacks.
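Two of the techniques listed above, centrality analysis and path analysis, are simple enough to sketch directly. The example uses degree centrality, which is only one of several centrality measures, and a plain recursive search for simple paths. The graph and function names are assumptions for illustration, and neither function is tuned for large graphs.

```python
def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}

def all_simple_paths(graph, start, end, path=None):
    """Yield every path from start to end that visits no node twice."""
    path = (path or []) + [start]
    if start == end:
        yield path
        return
    for neighbor in graph[start]:
        if neighbor not in path:
            yield from all_simple_paths(graph, neighbor, end, path)

# Hypothetical undirected network, stored as symmetric adjacency lists.
graph = {
    "ann": ["bob", "cat", "dan"],
    "bob": ["ann", "cat"],
    "cat": ["ann", "bob"],
    "dan": ["ann", "eve"],
    "eve": ["dan"],
}

centrality = degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
paths = list(all_simple_paths(graph, "bob", "eve"))

print(most_central)  # ann
print(len(paths))    # 2
```

Here "ann" is the most central entity because she touches three of the four other nodes, and every connection between "bob" and "eve" runs through her.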

Complementary to Hadoop
Interestingly, Hadoop and graph analytics complement each other perfectly. Hadoop is a scale-out solution, allowing independent items of work to be parceled out to the computers in a cluster. Graph analytics, on the other hand, excel at looking at the "big picture," analyzing complex networks of relationships that cannot be partitioned.

For example, consider risk analysis within a financial solution. Many documents will need to be independently analyzed, and the relationships between organizations extracted. This is a perfect job for Hadoop since each document is independent of the others. On the other hand, the complex network of relationships between organizations forms an un-partitionable graph, which is best analyzed as a single entity, in memory.
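That division of labor can be sketched as a two-stage pipeline: a per-document extraction step that could be parceled out, followed by a whole-graph step that needs all relationships in one place. Everything here is an assumption made for illustration, including the "X lent to Y" text pattern, the sample documents, and the function names.

```python
import re

# Hypothetical pattern: "<Lender> lent to <Borrower>" in a document's text.
EXPOSURE_PATTERN = re.compile(r"(\w+) lent to (\w+)")

def extract_exposures(document):
    """Per-document step: independent of other documents, so it can be parceled out."""
    return EXPOSURE_PATTERN.findall(document)

def build_graph(edge_lists):
    """Merge all extracted edges into one in-memory graph: borrower -> its lenders."""
    graph = {}
    for edges in edge_lists:
        for lender, borrower in edges:
            graph.setdefault(borrower, set()).add(lender)
    return graph

def exposed_to_failure(graph, failed):
    """Whole-graph step: everyone directly or indirectly exposed if `failed` defaults."""
    exposed, stack = set(), [failed]
    while stack:
        company = stack.pop()
        for lender in graph.get(company, ()):
            if lender not in exposed:
                exposed.add(lender)
                stack.append(lender)
    return exposed

# Hypothetical documents, invented for this sketch.
documents = [
    "BankA lent to FirmB in 2012.",
    "FirmB lent to FirmC. FundD lent to FirmC.",
    "BankE lent to FirmF.",
]

graph = build_graph(extract_exposures(doc) for doc in documents)
print(sorted(exposed_to_failure(graph, "FirmC")))  # ['BankA', 'FirmB', 'FundD']
```

BankA never appears in the same document as FirmC, so its exposure only becomes visible once the edges from all documents are analyzed together.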

Relationships and Connections
Analysts today have a tabular, "row-and-column" mindset when it comes to data and analytics - probably a byproduct of the spreadsheet's decades of success.

But don't you often think about problems and data in different ways?

Graph analytics explicitly model and reason about the relationships between different entities, and graph tools also display those relationships visually. The analyst can see all the relationships in which an entity participates, and intuitively assess which elements are close or important.

When it comes to customers, relationships, rather than tabular data, may be the most important element: they are more predictive of your likelihood of retaining or losing customers. The more connections customers have to your organization, its products, and its people, the more likely they will remain customers. Relationships, not tables, are also key to hacker and threat identification, risk and fraud analysis, influencer marketing and many other high value applications.

Graph analytics complement Hadoop and provide a level of immediate, deep insights that are not readily obtainable in any other way.

More Stories By Venkat Krishnamurthy

Venkat Krishnamurthy is the Product Management Director at YarcData, driving the direction and definition of YarcData products and solutions and working with customers to make them successful. Krishnamurthy has over a decade of experience in advanced analytics, including as a Director of Product Management at Oracle and as Vice President of Technology at Goldman Sachs. At Goldman, he conducted data analysis to assess risk controls across multiple trading desks/asset classes, algorithmic trading, market risk model validation, prime brokerage.
