Ibis Project Blog

Python productivity framework for the Apache Hadoop ecosystem. Development updates, use cases, and internals.

Ibis Design: Expression types and operation mechanics

This blog post describes Ibis's expression type system and how its DSL (domain-specific language) operations work. It's intended for curious users or developers of the project.

I spent a lot of time thinking about how to create a deferred expression DSL that compiles to SQL and is fluid and intuitive to use. No small task.

The key things I want to address are:

  • Ibis's internal operation types and external user-facing types
  • How the DSL semantics help you avoid making common mistakes
  • How type checking and evaluation works

At the end, since several people have asked me about it, I show some contrasts between Ibis's API design and that of Blaze, another Python project that has its own embedded DSL.

Ibis expression basics

In general, all objects in Ibis that semantically reference real data fall into three categories:

  • Tables: Database tables, data in formats like Parquet and Avro, or derived table expressions
  • Arrays: Columns in tables or derived expressions thereof
  • Scalars: Literal values and results of aggregations or analytic functions

I'll give you a bunch of basic examples of these and then give more technical detail on the type system.

Let's just take the number 5:

In [1]:
import ibis

value = 5

Suppose you wanted to create an expression that uses this value. Ibis has the function literal that converts a Python scalar value to a well-typed expression. Most of the time this happens automatically, but let's do it explicitly:

In [2]:
int_expr = ibis.literal(5)
int_expr
Out[2]:
Literal[int8]
  5

This code didn't actually compute anything. It created a literal expression containing the value 5. What you see in brackets is the output type of the operation Literal. What is int_expr, then?

In [3]:
type(int_expr)
Out[3]:
ibis.expr.types.Int8Scalar

Okay, let's figure out this int_expr thing.

In [4]:
int_expr + 10
Out[4]:
Add[int8]
  Literal[int8]
    5
  Literal[int8]
    10
In [5]:
type(int_expr + 10)
Out[5]:
ibis.expr.types.Int8Scalar
In [6]:
expr = (int_expr + 10).sqrt()
expr
Out[6]:
Sqrt[double]
  Add[int8]
    Literal[int8]
      5
    Literal[int8]
      10

Read the expression tree inside out: 10 was automatically converted to a literal, the two values were Add-ed together, and finally Sqrt was applied. Note that:

In [7]:
type(expr)
Out[7]:
ibis.expr.types.DoubleScalar

Let's round it to 2 decimal places for good measure:

In [8]:
rounded = expr.round(2)

print('Type is {0!r}'.format(type(rounded)))

rounded
Type is <class 'ibis.expr.types.DoubleScalar'>

Out[8]:
Round[double]
  Sqrt[double]
    Add[int8]
      Literal[int8]
        5
      Literal[int8]
        10
  Literal[int8]
    2

Now, if you wanted to run this expression inside Impala, it would get translated to this SQL:

In [9]:
print(ibis.impala.compile(rounded))
SELECT round(sqrt(5 + 10), 2) AS `tmp`

Indeed:

In [10]:
con = ibis.impala.connect('bottou01.sjc.cloudera.com')
con.execute(rounded)
Out[10]:
3.8700000000000001

Let's look at a table now:

In [11]:
db = con.database('ibis_testing')
t = db.functional_alltypes
t
Out[11]:
DatabaseTable[table]
  name: ibis_testing.`functional_alltypes`
  schema:
    id : int32
    bool_col : boolean
    tinyint_col : int8
    smallint_col : int16
    int_col : int32
    bigint_col : int64
    float_col : float
    double_col : double
    date_string_col : string
    string_col : string
    timestamp_col : timestamp
    year : int32
    month : int32
In [12]:
type(t)
Out[12]:
ibis.impala.client.ImpalaTable

This says ImpalaTable, but it's really TableExpr with a few extra Impala-specific methods. Experienced Pythonistas will know that the mro method gives you the full inheritance hierarchy:

In [13]:
type(t).mro()
Out[13]:
[ibis.impala.client.ImpalaTable,
 ibis.expr.types.TableExpr,
 ibis.expr.types.Expr,
 ibis.client.DatabaseEntity,
 object]

Now let's compute the same expression from above on tinyint_col, and sum it for kicks. I'll write it with one line per operation and add comments that show the operation name and output type of each operation:

In [14]:
expr = (t             # DatabaseTable -> TableExpr
        .limit(20)    # Limit         -> TableExpr
        .tinyint_col  # TableColumn   -> Int8Array
        .add(10)      # Add           -> Int16Array
        .sqrt()       # Sqrt          -> DoubleArray
        .round(2)     # Round         -> DoubleArray
        .sum())       # Sum           -> DoubleScalar

expr
Out[14]:
ref_0
DatabaseTable[table]
  name: ibis_testing.`functional_alltypes`
  schema:
    id : int32
    bool_col : boolean
    tinyint_col : int8
    smallint_col : int16
    int_col : int32
    bigint_col : int64
    float_col : float
    double_col : double
    date_string_col : string
    string_col : string
    timestamp_col : timestamp
    year : int32
    month : int32

sum = Sum[double]
  Round[array(double)]
    Sqrt[array(double)]
      Add[array(int16)]
        tinyint_col = Column[array(int8)] 'tinyint_col' from table 
          Limit[table]
            Table: ref_0
            n:
              20
            offset:
              0
        Literal[int8]
          10
    Literal[int8]
      2
  None

As desired:

In [15]:
type(expr)
Out[15]:
ibis.expr.types.DoubleScalar

And:

In [16]:
con.execute(expr)
Out[16]:
75.760000000000005
In [17]:
print(expr.compile())
SELECT sum(round(sqrt(`tinyint_col` + 10), 2)) AS `sum`
FROM (
  SELECT *
  FROM ibis_testing.`functional_alltypes`
  LIMIT 20
) t0

Here's the important thing: the user never interacts with the actual operations. As a user, all you see are well-typed expressions having a particular value type (scalar, array, or table). This makes your mode of interaction consistent and predictable from step to step.
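
To make that concrete, here is a minimal sketch (reusing the literal expression from earlier; the assertion is mine, not part of the Ibis test suite) showing that each step of a method chain hands back an Expr subclass rather than an operation object:

import ibis
import ibis.expr.types as ir

start = ibis.literal(5)
steps = [start, start + 10, (start + 10).sqrt(), (start + 10).sqrt().round(2)]

# Every intermediate result is an expression; the operation objects
# (Literal, Add, Sqrt, Round) stay hidden behind them.
for step in steps:
    assert isinstance(step, ir.Expr)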

How operations yield expressions

To help understand how the DSL is constructed, let's deconstruct a simple string operation:

In [18]:
expr = ibis.literal('wes mckinney')
type(expr)
Out[18]:
ibis.expr.types.StringScalar
In [19]:
chopped = expr.substr(4)
chopped
Out[19]:
Substring[string]
  Literal[string]
    wes mckinney
  Literal[int8]
    4
  None
In [20]:
chopped.execute()
Out[20]:
'mckinney'

Here is the implementation of Substring:

class Substring(ValueOp):

    input_type = [string, integer(name='start'),
                  integer(name='length', optional=True)]
    output_type = rules.shape_like_arg(0, 'string')

What's going on when I call expr.substr(4) is, in fact:

In [21]:
import ibis.expr.operations as ops

op = ops.Substring(expr, 4)

type(op)
Out[21]:
ibis.expr.operations.Substring

Among other things, this validated that the arguments match the input_type signature.
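
As a quick sketch of that validation (using the ops import and expr from just above, and mirroring the Round example further down), constructing the operation with a mis-typed argument fails immediately rather than at execution time:

# 'start' must be an integer per the input_type signature above, so this
# raises Ibis's type error as soon as the operation is constructed.
try:
    ops.Substring(expr, 'not an integer')
except Exception as exc:
    print(type(exc).__name__, exc)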

The user will never directly interact with this object. All that matters is that its output is another string.

In [22]:
op.output_type()
Out[22]:
ibis.expr.types.StringScalar

Internally, Ibis calls the operation instance's to_expr method, which wraps the operation in a StringScalar expression:

In [23]:
chopped = op.to_expr()
chopped
Out[23]:
Substring[string]
  Literal[string]
    wes mckinney
  Literal[int8]
    4
  None
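
So, roughly speaking, an expression method like substr is just a thin construct-and-wrap helper. This is a simplified sketch rather than the actual Ibis source, but it captures the pattern (and matches the Round wrapper you'll see in a traceback below):

import ibis.expr.operations as ops

def substr(string_expr, start, length=None):
    # Build the operation node; its input_type signature validates the
    # arguments right here.
    op = ops.Substring(string_expr, start, length)
    # Wrap the operation back up as a well-typed expression.
    return op.to_expr()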

Helping you avoid invalid operations

NB. This addresses a frequent rough edge I experienced when using NumPy to build pandas; namely that there is a single ndarray type whose values are only distinguished by the dtype attribute. pandas is now full of workarounds to provide data type-specific functionality on Series objects.

Ibis tries to save you from forming invalid expressions in two ways:

  • All instance methods on an expression are guaranteed to be valid (assuming there is a translation rule for your target SQL engine).
  • Type-checking happens immediately, informing you of any mistakes.

Let's have a closer look at this. I'll use a simplified table (not referencing any real data) to make the output here more compact:

In [24]:
t2 = ibis.table([('v1', 'double'),
                 ('v2', 'int32'),
                 ('v3', 'string')], 'hamster')
t2
Out[24]:
UnboundTable[table]
  name: hamster
  schema:
    v1 : double
    v2 : int32
    v3 : string

Let's look at the string column v3:

In [25]:
t2.v3
Out[25]:
ref_0
UnboundTable[table]
  name: hamster
  schema:
    v1 : double
    v2 : int32
    v3 : string

v3 = Column[array(string)] 'v3' from table ref_0

The numeric methods we used above are not even available on v3:

In [26]:
t2.v3.sum()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-26-fbd74f99b121> in <module>()
----> 1 t2.v3.sum()

AttributeError: 'StringArray' object has no attribute 'sum'

This constraint applies to all expressions. Suppose we cast v1 to string; this would still be true:

In [27]:
t2.v1.sum()
Out[27]:
ref_0
UnboundTable[table]
  name: hamster
  schema:
    v1 : double
    v2 : int32
    v3 : string

sum = Sum[double]
  v1 = Column[array(double)] 'v1' from table ref_0
  None
In [28]:
t2.v1.cast('string').sum()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-28-b4e33ca864fd> in <module>()
----> 1 t2.v1.cast('string').sum()

AttributeError: 'StringArray' object has no attribute 'sum'

In addition to preventing you from calling invalid instance methods, Ibis type-checks all input arguments:

In [29]:
expr = t2.v1.round(2)  # valid

t2.v1.round('foo')  # invalid
---------------------------------------------------------------------------
IbisTypeError                             Traceback (most recent call last)
<ipython-input-29-844feacfdc97> in <module>()
      1 expr = t2.v1.round(2)  # valid
      2 
----> 3 t2.v1.round('foo')  # invalid

/Users/wes/code/ibis/ibis/expr/api.pyc in round(arg, digits)
    955         other numeric types: double
    956     """
--> 957     op = _ops.Round(arg, digits)
    958     return op.to_expr()
    959 

/Users/wes/code/ibis/ibis/expr/types.pyc in __init__(self, *args)
    269 
    270     def __init__(self, *args):
--> 271         args = self._validate_args(args)
    272         Node.__init__(self, args)
    273 

/Users/wes/code/ibis/ibis/expr/types.pyc in _validate_args(self, args)
    276             return args
    277 
--> 278         return self.input_type.validate(args)
    279 
    280     def root_tables(self):

/Users/wes/code/ibis/ibis/expr/rules.pyc in validate(self, args)
    345             args = list(args) + [t.default for t in self.types[n:]]
    346 
--> 347         return self._validate(args, self.types)
    348 
    349     def _validate(self, args, types):

/Users/wes/code/ibis/ibis/expr/rules.pyc in _validate(self, args, types)
    356                 msg = ('Argument {0}: {1}'.format(i, exc) +
    357                        '\nArgument was: {0}'.format(ir._safe_repr(args[i])))
--> 358                 raise IbisTypeError(msg)
    359 
    360         return clean_args

IbisTypeError: Argument 1: not integer
Argument was: 'foo'

Expression type hierarchies

The user expression types inherit from a particular value type along with a shape type (table, array, or scalar).

Methods are shared amongst type classes, and the classes assist with type validation, implicit casts, and other internal matters in the DSL and SQL compiler toolchain.
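
You can poke at this mixing of value types and shape types yourself with the mro trick from earlier. A small sketch (the starting class names in the comments follow the naming we've already seen, such as DoubleScalar; the intermediate classes may differ across Ibis versions):

import ibis

t2 = ibis.table([('v1', 'double'), ('v3', 'string')], 'hamster')

# Both expressions carry the 'double' value type but have different shapes:
# the column starts at DoubleArray, the aggregate at DoubleScalar, and both
# hierarchies bottom out at ibis.expr.types.Expr.
print(type(t2.v1).mro())
print(type(t2.v1.sum()).mro())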

Some Ibis and Blaze DSL comparisons

Since DSLs are inherently nuanced things, the only real way to evaluate them fully is to try to solve your problem using one of them.

Architectural similarities are interesting, but in my experience API design / usability and comprehensive solutions to use cases are the only things that users care about. Blaze today conflicts with my (definitely opinionated) API design preferences and my understanding of the needs of SQL users (the intended audience for Ibis). It was much simpler for me to design a solution from scratch (whose design can be quickly changed without project governance issues) that meets my target users' design and usability requirements.

In an earlier post here, I described how our priorities with Ibis are out of alignment with Blaze. I had a phone call with Continuum Analytics this past week to discuss Ibis and Blaze goals, use cases, and possible collaborations. They asked me for specific feedback on Blaze's API, so I'm digging in here to look closer.

In [30]:
import blaze as bz

t = bz.TableSymbol('hamster', '{v1: float64, v2: int32, v3: string}')
t
Out[30]:
hamster
In [31]:
type(t)
Out[31]:
blaze.expr.expressions.Symbol

Okay, how about the class hierarchy?

In [32]:
type(t).mro()
Out[32]:
[blaze.expr.expressions.Symbol,
 blaze.expr.expressions.Expr,
 blaze.expr.core.Node,
 object]

Alright. You can find out what kind of expression you're dealing with by looking at the dshape attribute:

In [33]:
t.dshape
Out[33]:
dshape("var * {v1: float64, v2: int32, v3: string}")

This is very NumPy-like, which is good and bad for reasons I've cited.

Let's look at a few simple expressions like the above:

In [34]:
expr = t.v1.sum()

In Ibis, this yields a DoubleScalar. Here it's a little different because there are no value types, only Expr:

In [35]:
expr
Out[35]:
sum(hamster.v1)
In [36]:
type(expr)
Out[36]:
blaze.expr.reductions.sum
In [37]:
type(expr).mro()
Out[37]:
[blaze.expr.reductions.sum,
 blaze.expr.reductions.Reduction,
 blaze.expr.expressions.Expr,
 blaze.expr.core.Node,
 object]
In [38]:
expr.dshape
Out[38]:
dshape("float64")

You also can't do sum on a string:

In [39]:
t.v3.sum()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-32d7bb6499a7> in <module>()
----> 1 t.v3.sum()

/Users/wes/anaconda/envs/ibis-blog/lib/python2.7/site-packages/blaze/expr/expressions.pyc in __getattr__(self, key)
    168             pass
    169         try:
--> 170             result = object.__getattribute__(self, key)
    171         except AttributeError:
    172             fields = dict(zip(map(valid_identifier, self.fields),

AttributeError: 'Field' object has no attribute 'sum'

Most methods in Blaze are not instance methods, and are instead found in blaze.expr.functions:

In [40]:
import blaze.expr.functions as bzfunc

Sadly, things start going poorly when I make mistakes using functions in bzfunc:

In [41]:
expr = bzfunc.sum(t.v3)
expr
Out[41]:
sum(hamster.v3)

Wait, wasn't v3 a string?

In [42]:
expr.dshape
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-42-b75fb756f813> in <module>()
----> 1 expr.dshape

/Users/wes/anaconda/envs/ibis-blog/lib/python2.7/site-packages/blaze/expr/reductions.pyc in dshape(self)
     64                           for i, d in enumerate(self._child.shape)
     65                           if i not in axis)
---> 66         return DataShape(*(shape + (self.schema,)))
     67 
     68     @property

/Users/wes/anaconda/envs/ibis-blog/lib/python2.7/site-packages/blaze/expr/reductions.pyc in schema(self)
    115     @property
    116     def schema(self):
--> 117         return DataShape(datashape.maxtype(super(sum, self).schema))
    118 
    119 

/Users/wes/anaconda/envs/ibis-blog/lib/python2.7/site-packages/datashape/typesets.pyc in maxtype(measure)
    182     if isoption:
    183         measure = measure.ty
--> 184     assert matches_typeset(measure, scalar), 'measure must be numeric'
    185 
    186     if measure == bool_:

AssertionError: measure must be numeric

Let me try something else, like sqrt:

In [43]:
expr = bzfunc.sqrt(t.v3)
expr.dshape
Out[43]:
dshape("var * float64")

So that's not good either. I couldn't find a round function. Let me try adding a double to a string:

In [44]:
expr = t.v1 + t.v3
expr
Out[44]:
hamster.v1 + hamster.v3

I don't think that should work either. Also:

In [45]:
expr.dshape
Out[45]:
dshape("var * object")

Invalid table-level expressions also go through without complaint:

In [46]:
expr = t[t.v3 > 5]

I also find the spread of functionality between blaze.expr.functions and instance methods to conflict with my sensibilities (which are admittedly highly subjective and pandas-influenced). For example, a float scalar expression has some methods attached to it, while others are found elsewhere.

In [47]:
expr = bzfunc.sum(t.v1)
print(expr.dshape)
[x for x in dir(expr) if not x.startswith('_')]
float64

Out[47]:
['apply',
 'axis',
 'coerce',
 'dshape',
 'fields',
 'isidentical',
 'isnan',
 'keepdims',
 'label',
 'map',
 'relabel',
 'schema',
 'symbol',
 'truncate',
 'utcfromtimestamp']

It's really just a difference in taste, but I prefer to have all my valid functions available as instance methods so that I can compose function chains more easily as seen above.

I wanted to reproduce the SQLite examples from my last blog post, but I found too much missing functionality. To name a couple of specific things (Ibis counterparts are sketched after this list):

  • case statements / if-else expressions. This also makes bucket not possible to implement
  • fillna function
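
For reference, here is roughly what the Ibis counterparts to those two items look like (the exact argument signatures here are from memory, so treat this as a sketch):

import ibis

t2 = ibis.table([('v1', 'double'), ('v2', 'int32')], 'hamster')

# if-else / case expression on a column
label = (t2.v2.case()
         .when(0, 'zero')
         .when(1, 'one')
         .else_('other')
         .end())

# bucket a continuous column into discrete bins
binned = t2.v1.bucket([0, 10, 25, 50, 100])

# replace nulls with a fixed value
filled = t2.v1.fillna(0)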

Also, the way that joins work in Blaze is very un-SQL-like (it replicates pandas, in essence). This makes many SQL queries with joins hard to port. Check out the Ibis tutorial for more on that.