Diving into nPath

A few days ago, Steve posted a concrete example of Aster nPath. Though the example in the above mentioned post was called “straightforward”, I think that for people unfamiliar with nPath syntax, it could have been a little difficult to digest in one glance. It certainly wasn’t immediately obvious to me. So I’ll try to break down the query in that post into little bits and hopefully clarify the syntax further.

First, some context. Here is the problem that Steve posed:

For example, suppose we are interested in the optimization of our website flow in order to retain and engage visitors driven to us by SEO/SEM. We want to answer the question: for SEO/SEM-driven traffic that stay on our site only for 5 or less pageviews and then leave our site and never return in the same session, what are the top referring search queries and what are the top path of navigated pages on our site? In traditional data warehouse solutions, this problem would require a five-way self-join of granular weblog data, which is simply unfeasible for large sites such as Myspace.

And here is the nPath query that answers the above problem (taken from Steve’s post):


SELECT entry_refquerystring, entry_page || “,” || onsite_pagepath as onsite_pagepath, count(*) as session_count

FROM nPath(
ON ( select * from clicks where year = 2009 )
PARTITION BY customerid, sessionid
ORDER BY timestamp
PATTERN ( ‘Entry.Onsite+.OffSite+$’ )
SYMBOLS (
domain ilike “mysite.com” and refdomain ~* “yahoo.com|google.com|msn.com|live.com” as Entry,
domain ilike “mysite.com” as OnSite,
domain not ilike “mysite.com” as OffSite
)
MODE( NONOVERLAPPING )
RESULT(
first(page of Entry) as entry_page,
first(refquerystring of Entry) as entry_refquerystring,
accumulate(page of Onsite) as onsite_pagepath,
count(* of Onsite) as onsitecount_minus1
)
)
WHERE onsitecount_minus1 < 4
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 1000;

Alright, so lets see whats going on here. It is important to always keep the big picture in mind — nPath scans groups of (sequential) rows at a time, searching for user specified patterns. Thus, the first thing that we need to specify is exactly what rows will nPath be operating on. This looks pretty much like a regular SQL query:


ON ( select * from clicks where year = 2009 )
PARTITION BY customerid, sessionid
ORDER BY timestamp

Next, we must specify what is it that we are looking for, or the search pattern. The search pattern is unimaginatively specified via the PATTERN clause:


PATTERN ( ‘Entry.Onsite+.OffSite+$’ )

The pattern description looks very much like any regular expression. The above pattern will find groups of rows where the first row matches “Entry”, followed by one or more “Onsite”s and ending with one or more “Offsite”(s). Next, we need to define what “Entry”, “Onsite” and “Offsite” mean. This is done via the SYMBOLS clause:


SYMBOLS (
domain ilike “mysite.com” and refdomain ~* “yahoo.com|google.com|msn.com|live.com” as Entry,
domain ilike “mysite.com” as OnSite,
domain not ilike “mysite.com” as OffSite
)

A row denotes an “Entry” if the corresponding page is on the domain mysite.com and the referer domain was one of the popular search engines. This makes sure we are not counting clicks from other links on mysite.com. “OnSite” and “OffSite” have similar descriptions.

Once we have identified the pattern and the symbols that make up the pattern, we can do some processing with them. Among other things, nPath allows you to dynamically control the columns of the output. These are specified via the RESULTS clause. Columns can be specified via existing column attributes, or using SQL/nPath aggregates:


RESULT(
first(page of Entry) as entry_page,
first(refquerystring of Entry) as entry_refquerystring,
accumulate(page of Onsite) as onsite_pagepath,
count(* of Onsite) as onsitecount_minus1
)

Finally, we can specify exactly what columns from the output we want to view, and how we want to view them. This is specified just like a regular SELECT clause, which operates on the output of an nPath query instead of regular SQL tables:


SELECT entry_refquerystring, entry_page || “,” || onsite_pagepath as onsite_pagepath, count(*) as session_count

And thats about it as far as the functionality of the query goes — there is of course some more syntactic sugar for specifying other constraints such as GROUP BY and ORDER BY. There is a LOT more to nPath than this, but hopefully it gives you some idea of the capabilities of this extremely powerful construct.

One comment

  1. dave

    Not a very deep analysis. How many rows can be returned per partition? One per matching sequence, or is there one row per matching sequence? If the latter, why is “first” needed? What does it do? How is the degree of parallelism of the function decided? I.e. on how are the nodes over which the input table is partition decided? Is this an optimizer decision or something that can be hinted at? What does the “mode” argument do? What oth

Leave a Reply