The R community, especially my friend Hadley Wickham, has done pioneering work in creating expressive and useful domain-specific languages (DSLs) embedded in the R programming language. This article gives some technical detail on composability and evaluation semantics in R, and on related work in Ibis to solve the same kinds of problems in a Pythonic way.
Non-standard evaluation in dplyr
In Hadley's book Advanced R, he gives an extensive look at a tool available to R programmers called non-standard evaluation (NSE). Users of dplyr (or other Hadleyverse packages) will be familiar with this — it is also used in base R for model formulas.
NB. For the programming language junkies, NSE originates from the homoiconic Lisp tradition of "code as data". The R language implementation is at its core a kind of Lisp interpreter!
Let's take an example from one of the dplyr vignettes on everyone's favorite airlines dataset.
airlines %>%
  filter(!is.na(dep_delay)) %>%
  group_by(year, month, day) %>%
  summarise(delay = mean(dep_delay)) %>%
  arrange(desc(delay)) %>%
  head(5)
As a Python programmer, you might ask "what in tarnation is going on here?". Indeed, very few of the symbols in this code block are valid variables in the local scope. The short story is that the code (or data would be more accurate) inside the function calls is evaluated in the context of the "calling" data frame. So the NSE code
airlines %>% filter(!is.na(dep_delay))
is equivalent to the more explicit non-NSE code (written here without the pipe %>% operator):

filter(airlines, !is.na(airlines$dep_delay))
What is really nice about NSE is that you can refer to new data frame columns that appear during a set of composed operations. For example, the step arrange(desc(delay)) refers to delay, which appeared during the aggregation summarise(delay = mean(dep_delay)).
The sad thing about NSE is that it's very unpythonic. It goes directly against a core tenet of the Zen of Python: "Explicit is better than implicit." It can also make debugging more difficult when complex expressions involve many NSE variables and functions.
Doing NSE in Python is actually possible, but it won't make you very many friends. I wrote a blog post about it several years ago.
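To give a flavor of why this is possible (and why it's frowned upon), here is a toy sketch. The helper nse_filter is hypothetical, purely illustrative: it evaluates a string expression against the caller's scope with each row's fields injected as names, much like dplyr evaluates code in the context of a data frame.

```python
import inspect

def nse_filter(rows, expr):
    # Hypothetical NSE-style filter: evaluate `expr` with the caller's
    # variables visible, plus each row's columns as bare names.
    caller = inspect.currentframe().f_back
    out = []
    for row in rows:
        env = dict(caller.f_globals)
        env.update(caller.f_locals)
        env.update(row)  # row columns shadow the caller's variables
        if eval(expr, env):
            out.append(row)
    return out

threshold = 3  # a local variable, visible to the "NSE" expression
rows = [{'dep_delay': None}, {'dep_delay': 2}, {'dep_delay': 5}]
kept = nse_filter(rows, "dep_delay is not None and dep_delay > threshold")
```

Note how dep_delay and threshold are not valid names at the call site; the implicit frame inspection and eval are exactly the kind of magic that makes this hard to debug.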
In the above example, the lack of NSE doesn't hurt us very much:
import ibis
ibis.options.interactive = True

con = ibis.impala.connect('bottou01.sjc.cloudera.com')
t = con.table('wes.airlines_parquet')

expr = (t[t.depdelay.notnull()]
        .group_by(['year', 'month', 'day'])
        .aggregate(t.depdelay.mean().name('delay'))
        .sort_by(ibis.desc('delay'))
        .limit(5))
expr
   year  month  day      delay
0  2001      9   13  75.139792
1  2001      9   14  47.194577
2  1990     12   21  45.661782
3  1990     12   22  45.222249
4  1990     12   28  43.914432
Okay, so we're able to do the same thing using strings to reference columns, but let's take a more complex example where this will fail us and require writing down local variables. The trouble comes from needing to form some non-trivial expression during a set of composed operations:
expr = (t[t.depdelay.notnull()]
        .mutate(leg=ibis.literal('-').join([t.origin, t.dest]))
        ['year', 'month', 'day', 'depdelay', 'leg'])
expr.limit(10)
   year  month  day  depdelay      leg
0  1999     12   22        -1  SDF-CVG
1  1999     12   23         2  SDF-CVG
2  1999     12   24         3  SDF-CVG
3  1999     12   25         0  SDF-CVG
4  1999     12   26         3  SDF-CVG
5  1999     12   27         2  SDF-CVG
6  1999     12   28         0  SDF-CVG
7  1999     12   29        -4  SDF-CVG
8  1999     12   30        -8  SDF-CVG
9  1999     12   31        -2  SDF-CVG
Suppose now we wanted to look only at a few legs of interest, like SFO-JFK. The leg column doesn't exist until partway through the composed operations, so we have to break the chain and save the intermediate expression in a local variable:

expr = (t[t.depdelay.notnull()]
        .mutate(leg=ibis.literal('-').join([t.origin, t.dest]))
        ['year', 'month', 'day', 'depdelay', 'leg'])
expr2 = expr[expr.leg == 'SFO-JFK']
expr2.limit(10)
   year  month  day  depdelay      leg
0  1995     12    1         0  SFO-JFK
1  1995     12    2        37  SFO-JFK
2  1995     12    3         1  SFO-JFK
3  1995     12    4         4  SFO-JFK
4  1995     12    5         0  SFO-JFK
5  1995     12    6         1  SFO-JFK
6  1995     12    7         0  SFO-JFK
7  1995     12    8         3  SFO-JFK
8  1995     12   10         2  SFO-JFK
9  1995     12   11        31  SFO-JFK
"Late" binding using functions
One Pythonic solution to the composability conundrum is to pass functions as parameters. This simulates NSE's "late binding" semantics in a way that doesn't break any rules about standard Python best practices.
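To show the shape of this pattern outside of Ibis, here is a toy in-memory table (ToyTable and its methods are hypothetical, purely illustrative). The key move: filter accepts a function that is called with the intermediate table, so the function can reference columns created by an earlier mutate — all with standard Python evaluation.

```python
class ToyTable:
    def __init__(self, rows):
        self.rows = rows

    def __getitem__(self, col):
        # A "column" is just the list of values for that key
        return [row[col] for row in self.rows]

    def mutate(self, **new_cols):
        # Each new column is computed as a function of the existing row
        rows = [dict(row, **{name: func(row)
                             for name, func in new_cols.items()})
                for row in self.rows]
        return ToyTable(rows)

    def filter(self, f):
        # Late binding: f receives *this* intermediate table, so it can
        # see columns that mutate() created earlier in the chain
        mask = f(self)
        return ToyTable([row for row, keep in zip(self.rows, mask)
                         if keep])

toy = ToyTable([{'origin': 'SFO', 'dest': 'JFK'},
                {'origin': 'SDF', 'dest': 'CVG'}])
result = (toy.mutate(leg=lambda row: row['origin'] + '-' + row['dest'])
             .filter(lambda x: [leg == 'SFO-JFK' for leg in x['leg']]))
```

The lambda passed to filter never names the intermediate table in the calling scope; it is handed in when the chain executes, which is the same trick Ibis uses below.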
expr = (t[t.depdelay.notnull()]
        .mutate(leg=ibis.literal('-').join([t.origin, t.dest]))
        .filter(lambda x: x.leg.isin(['SFO-JFK', 'SFO-EWR']))
        .group_by(['year', 'month', 'leg'])
        .aggregate([lambda x: x.depdelay.mean().name('avg_delay'),
                    lambda x: x.count().name('# flights')])
        .sort_by(ibis.desc('avg_delay'))
        .limit(10))
expr
   year  month      leg  avg_delay  # flights
0  2008     12  SFO-EWR  37.526316        247
1  1987     12  SFO-EWR  34.297030        101
2  2008      6  SFO-EWR  31.103203        281
3  2008      1  SFO-EWR  29.906780        236
4  2008      6  SFO-JFK  29.676998        613
5  1988      1  SFO-EWR  28.398058        103
6  2008      3  SFO-EWR  24.672269        238
7  2000      5  SFO-EWR  23.857923        366
8  2008      7  SFO-JFK  23.512821        624
9  1989      3  SFO-EWR  23.114035        114
It's a bit of a mouthful with those extra lambdas, but it's also kind of nice! Note that Ibis doesn't force you to write one long chain of composed operations (feel free to save intermediate expressions in as many local variables as you like), but it's nice to have the option! (Also, remind me to avoid Newark airport.)
For the curious, here's the SQL that's generated by this expression:
SELECT t0.`year`, t0.`month`, t0.`leg`, avg(t0.`depdelay`) AS `avg_delay`,
       count(*) AS `# flights`
FROM (
  SELECT *, concat_ws('-', `origin`, `dest`) AS `leg`
  FROM wes.airlines_parquet
  WHERE `depdelay` IS NOT NULL
) t0
WHERE t0.`leg` IN ('SFO-JFK', 'SFO-EWR')
GROUP BY 1, 2, 3
ORDER BY `avg_delay` DESC
LIMIT 10
Another tool to aid composability is the pipe table method, which was also added to pandas in version 0.16.2. Basically, pipe makes the statement

table.pipe(f, **f_kwargs).pipe(g, **g_kwargs)

equivalent to

g(f(table, **f_kwargs), **g_kwargs)
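This equivalence can be sketched with a tiny stand-in class (hypothetical names; this is not the pandas or Ibis implementation, just the semantics):

```python
class MiniTable:
    def __init__(self, rows):
        self.rows = rows

    def pipe(self, func, **kwargs):
        # table.pipe(f, **kwargs) is simply f(table, **kwargs)
        return func(self, **kwargs)

def drop_below(table, col, floor):
    return MiniTable([r for r in table.rows if r[col] >= floor])

def count(table):
    return len(table.rows)

mt = MiniTable([{'delay': 5}, {'delay': -1}, {'delay': 12}])

# The piped and nested spellings produce the same result,
# but the piped version reads left to right
piped = mt.pipe(drop_below, col='delay', floor=0).pipe(count)
nested = count(drop_below(mt, col='delay', floor=0))
```

The win is purely syntactic: deeply nested calls read inside-out, while pipe reads in the order the operations happen.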
In R, you could write:
table %>% f(...) %>% g(...)
So in Ibis, you could write little helper functions that deal in common table idioms for your application. Let me give you an example that removes any rows containing nulls amongst an input set of columns:
def remove_nulls_in(*cols):
    def f(table):
        cond = None
        for c in cols:
            mask = table[c].notnull()
            cond = mask if cond is None else cond & mask
        # Return the filtered table (not just the predicate) so the
        # result of pipe() is itself a table expression
        return table[cond]
    return f

filtered = t.pipe(remove_nulls_in('depdelay', 'arrdelay'))
Using these tools, you can augment the built-in Ibis DSL constructs to suit your needs, implementing analytics faster with less code.