Ibis Project Blog

Python productivity framework for the Apache Hadoop ecosystem. Development updates, use cases, and internals.

Ibis expressions: Pythonic composability and evaluation

The R community, especially my friend Hadley Wickham, has done pioneering work in creating expressive and useful domain-specific languages (DSLs) embedded in the R programming language. This articles gives some technical detail on composability and evaluation semantics in R and related work in Ibis to solve the same kinds of problems in a Pythonic way.

Non-standard evaluation in dplyr

In Hadley's book Advanced R, he gives an extensive look at a tool available to R programmers called non-standard evaluation (NSE). Users of ggplot2 or dplyr (or other Hadleyverse packages) will be familiar with this — it is also used in base R for model formulas.

NB. For the programming language junkies, NSE originates from the homoiconic Lisp tradition of "code as data". The R language implementation is at its core a kind of Lisp interpreter!

Let's take an example from one of the dplyr vignettes on everyone's favorite airlines dataset.

airlines %>%
  filter(!is.na(dep_delay)) %>%
  group_by(year, month, day) %>%
  summarise(delay = mean(dep_delay)) %>%
  arrange(desc(delay)) %>%
  head(5)

As a Python programmer, you might ask "what in tarnation is going on here?". Indeed, very few of the symbols in this code block are valid variables in the local scope. The short story is that the code (or data would be more accurate) inside the function calls is evaluated in the context of the "calling" data frame. So the NSE code

airlines %>% 
  filter(!is.na(dep_delay))

Is equivalent to the more explicit non-NSE code (written here without the pipe %>%):

filter(airlines, !is.na(airlines$dep_delay))

What is really nice about NSE is that you can refer to new data frame columns that appear during a set of composed operations. For example, the step arrange(desc(delay)) refers to delay which appeared during the aggregation summarise(delay = mean(dep_delay)).

The sad thing about NSE is that it's very unpythonic. It goes directly against a core tenet: "Explicit is better than implicit." It also can make debugging more difficult with complex expressions involving many NSE variables and functions.

Doing NSE in Python actually is possible, but it won't make you very many friends. I wrote a blog post about it several years ago.

In the above example, the lack of NSE doesn't hurt us very much:

In [1]:
import ibis
ibis.options.interactive = True
con = ibis.impala.connect('bottou01.sjc.cloudera.com')

t = con.table('wes.airlines_parquet')

expr = (t[t.depdelay.notnull()]
        .group_by(['year', 'month', 'day'])
        .aggregate(t.depdelay.mean().name('delay'))
        .sort_by(ibis.desc('delay'))
        .limit(5))
expr
Out[1]:
   year  month  day      delay
0  2001      9   13  75.139792
1  2001      9   14  47.194577
2  1990     12   21  45.661782
3  1990     12   22  45.222249
4  1990     12   28  43.914432

Okay, so we're able to do the same thing using strings to reference columns, but let's take a more complex example where this will fail us and require writing down local variables. The trouble comes from needing to form some non-trivial expression during a set of composed operations:

In [2]:
expr = (t[t.depdelay.notnull()]
        .mutate(leg=ibis.literal('-').join([t.origin, t.dest]))
        ['year', 'month', 'day', 'depdelay', 'leg'])
expr.limit(10)
Out[2]:
   year  month  day  depdelay      leg
0  1999     12   22        -1  SDF-CVG
1  1999     12   23         2  SDF-CVG
2  1999     12   24         3  SDF-CVG
3  1999     12   25         0  SDF-CVG
4  1999     12   26         3  SDF-CVG
5  1999     12   27         2  SDF-CVG
6  1999     12   28         0  SDF-CVG
7  1999     12   29        -4  SDF-CVG
8  1999     12   30        -8  SDF-CVG
9  1999     12   31        -2  SDF-CVG

Suppose now we wanted to only look at a few legs of interest, like SFO-JFK and SFO-EWR:

In [3]:
expr[expr.leg.isin(['SFO-JFK', 'SFO-EWR'])].limit(10)
Out[3]:
   year  month  day  depdelay      leg
0  1995     12    1         0  SFO-JFK
1  1995     12    2        37  SFO-JFK
2  1995     12    3         1  SFO-JFK
3  1995     12    4         4  SFO-JFK
4  1995     12    5         0  SFO-JFK
5  1995     12    6         1  SFO-JFK
6  1995     12    7         0  SFO-JFK
7  1995     12    8         3  SFO-JFK
8  1995     12   10         2  SFO-JFK
9  1995     12   11        31  SFO-JFK

Hmm.

"Late" binding using functions

One Pythonic solution to the composability conundrum is to pass functions as parameters. This simulates NSE's "late binding" semantics in a way that doesn't break any rules about standard Python best practices.

In [4]:
expr = (t[t.depdelay.notnull()]
        .mutate(leg=ibis.literal('-').join([t.origin, t.dest]))
        .filter(lambda x: x.leg.isin(['SFO-JFK', 'SFO-EWR']))
        .group_by(['year', 'month', 'leg'])
        .aggregate([lambda x: x.depdelay.mean().name('avg_delay'),
                    lambda x: x.count().name('# flights')])
        .sort_by(ibis.desc('avg_delay'))
        .limit(10))
expr
Out[4]:
   year  month      leg  avg_delay  # flights
0  2008     12  SFO-EWR  37.526316        247
1  1987     12  SFO-EWR  34.297030        101
2  2008      6  SFO-EWR  31.103203        281
3  2008      1  SFO-EWR  29.906780        236
4  2008      6  SFO-JFK  29.676998        613
5  1988      1  SFO-EWR  28.398058        103
6  2008      3  SFO-EWR  24.672269        238
7  2000      5  SFO-EWR  23.857923        366
8  2008      7  SFO-JFK  23.512821        624
9  1989      3  SFO-EWR  23.114035        114

It's a bit of a mouthful with those extra lambda's, but it's also kind of nice! Note that Ibis doesn't force you to write one long set of composed operations (feel free to write down as many local variables saving intermediate expressions as you like), but it's nice to have the option! (Also, remind me to avoid Newark airport.)

For the curious, here's the SQL that's generated by this expression:

SELECT t0.`year`, t0.`month`, t0.`leg`, avg(t0.`depdelay`) AS `avg_delay`,
       count(*) AS `# flights`
FROM (
  SELECT *, concat_ws('-', `origin`, `dest`) AS `leg`
  FROM wes.airlines_parquet
  WHERE `depdelay` IS NOT NULL
) t0
WHERE t0.`leg` IN ('SFO-JFK', 'SFO-EWR')
GROUP BY 1, 2, 3
ORDER BY `avg_delay` DESC
LIMIT 10

The pipe method

Another tool to aid composability is the pipe table method, which was also added to pandas in version 0.16.2. Basically, pipe makes the statement

table.pipe(f, **f_kwargs).pipe(g, **g_kwargs)

equivalent to

g(f(table, **f_kwargs), **g_kwargs)

In R, you could write:

table %>% f(...) %>% g(...)

So in Ibis, you could write little helper functions that deal in common table idioms for your application. Let me give you an example that removes any rows containing nulls amongst an input set of columns:

In [5]:
def remove_nulls_in(*cols):
    def f(table):
        cond = None
        for c in cols:
            mask = table[c].notnull()
            cond = mask if cond is None else cond & mask
        return cond
    return f

filtered = t.pipe(remove_nulls_in('depdelay', 'arrdelay'))

Using these tools, you can augment the built-in Ibis DSL constructs to suit your needs, implementing analytics faster with less code.