Ibis Project Blog

Python analysis framework for SQL engines and Hadoop, focused on user productivity. Development updates, use cases, and internals.

Interactive Analytics on Dynamic Big Data in Python using Kudu, Impala, and Ibis

The new Apache Kudu (incubating) columnar storage engine together with Apache Impala (incubating) interactive SQL engine enable a new fully open source big data architecture for data that is arriving and changing very quickly. By integrating Kudu and Impala with Ibis, this functionality is now available to Python programmers with an easy-to-use pandas-like API.

I spent this last week expanding the Kudu Python client (a Cython wrapper for the C++ client API) and adding initial integration with Ibis. While my Kudu patch is still in code review, I will give you a preview here of how it all works.

Since Kudu, a native C++ storage engine, now builds on OS X, I'm writing this blog using the Kudu client on OS X, so now is a great time for developers on both OS X and Linux to get involved with developing Python tools for Kudu.

Using Impala on OS X is in the works, see the issue tracker for more. As soon as we can, we'll provide a combined Kudu/Impala DMG installer for installing the environment locally on Mac computers.

Read on for more about Kudu and a full stack demo.

Ibis 0.6: Expanded Impala support, better pandas integration

Ibis 0.6.0 has been released! This release brings substantial new functionality for Impala users; see the Using Ibis with Impala guide for a deep dive on these features. These enable Ibis to solve many data management problems beyond the analytical SQL capabilities exposed through Ibis expressions.

Some highlights of the new functionality include:

  • Create tables and insert into existing tables directly from pandas DataFrame objects
  • Inspect a variety of table and partition metadata in a user-friendly manner
  • Comprehensive partitioned table support
  • Modify table and partition metadata
  • Set session options such as Parquet compression codec

Several bugs were fixed as well.

The upcoming roadmap will feature Impala's nested types support and expanded SQL system coverage.

Ibis for SQL Programmers Guide

Yesterday, on the Ibis documentation I published the first version of Ibis for SQL Programmers, a practical guide to how SQL concepts map onto the Ibis DSL. To be clear, all of your data will stay in a SQL engine, but you can interact with it using Ibis using Python code that can be simpler to write and also more composable and reusable. Also, Ibis statically analyzes types and expressions as you go, catching a large class of common mistakes made during SQL programming.

This will be an evolving document, so if you find something that you use commonly in SQL and would like to see it added, please reach out to us on the project issue tracker.

Ibis Design: Expression types and operation mechanics

This blog post describes Ibis's expression type system and how its DSL (domain-specific language) operations work. It's intended for curious users or developers of the project.

I spent a lot of time thinking about how to create a deferred expression DSL that compiles to SQL and is fluid and intuitive to use. No small task.

The key things I want to address are:

  • Ibis's internal operation types and external user-facing types
  • How the DSL semantics help you avoid making common mistakes
  • How type checking and evaluation works

At the end, since several people have asked me about it, I show some contrasts in Ibis's API design with Blaze, another Python project, that has its own embedded DSL.

Quickstart on Crunchbase analysis using Ibis and SQLite

The latest 0.5 version of Ibis supports executing expressions on data in SQLite database files. Since SQLite ships with Python, this is the easiest way to get up and running using the project because you don't have to install or administer a database at all.

To get going, install Ibis if you have not already

pip install ibis-framework

And download the Crunchbase dataset found on the data collection page.

Now fire up a IPython / Jupyter notebook or shell and let's get started!

Ibis 0.5: SQLite, Python 3, and more

The next Ibis release is out, with some major new functionality:

  • SQLite client and support for most SQLite built-in functions
  • Python 3 compatibility (single codebase)
  • SQLAlchemy-based expression translation toolchain to enable easier internal code reuse amongst SQL engines and pave road for PostgreSQL, Redshift, Vertica, and other analytic SQL engine support in the near future.
  • Asynchronous query execution API (expr.execute(async=True)) for Impala supporting query status and cancellation. This is very helpful in building multithreaded applications.
  • Support for using Impala user-defined aggregate (UDA) functions

There's a lot more, of course. Check out the detailed release notes, and read on for more about the upcoming roadmap.

Check out this follow-up post for a quick start in using Ibis on SQLite with the newly posted Crunchbase dataset.

Install Ibis from PyPI with

pip install ibis-framework

Ibis expressions: Pythonic composability and evaluation

The R community, especially my friend Hadley Wickham, has done pioneering work in creating expressive and useful domain-specific languages (DSLs) embedded in the R programming language. This articles gives some technical detail on composability and evaluation semantics in R and related work in Ibis to solve the same kinds of problems in a Pythonic way.

Leveraging SQL window functions in Ibis

Window (also known as analytic) functions are a valuable technique in analytic SQL, but unfortunately they are generally considered to be an advanced skill among SQL programmers. Conceptually, they are relatively simple, and indeed many everyday pandas and R operations can be expressed in SQL through their use. Mechanically, they can be difficult to use, largely because of the SQL syntax.

Ibis has had comprehensive support for window functions from 0.3 onward, and I invested quite a bit of effort to design an API to make them available to users in a much simpler way. I also made sure that you don't have to be a SQL expert to use them.