Ibis @ LinkedIn

Portable Python DataFrames

Chloe He
Phillip Cloud

Who

Phillip Cloud

  • Principal engineer at Voltron Data
  • Python analytics for 10+ years
  • Open source
  • Tech lead for Ibis

Chloe He

  • Founding engineer at Claypot β†’ Senior engineer at Voltron Data
  • Infrastructure for real-time ML
  • Ibis streaming

What

Ibis is a Python library for:

  • Exploratory data analysis
  • General analytics
  • Data engineering
  • ML preprocessing
  • Library: (e.g., Google BigFrames)
  • Build your own … Ibis (??)

πŸ’‘ Development to production with few rewrites

Examples

https://ibis-project.org/tutorials/getting_started

from ibis.interactive import *

penguins = ibis.examples.penguins.fetch()
penguins
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┑━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
β”‚ string  β”‚ string    β”‚ float64        β”‚ float64       β”‚ int64             β”‚ int64       β”‚ string β”‚ int64 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Adelie  β”‚ Torgersen β”‚           39.1 β”‚          18.7 β”‚               181 β”‚        3750 β”‚ male   β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           39.5 β”‚          17.4 β”‚               186 β”‚        3800 β”‚ female β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           40.3 β”‚          18.0 β”‚               195 β”‚        3250 β”‚ female β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           NULL β”‚          NULL β”‚              NULL β”‚        NULL β”‚ NULL   β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           36.7 β”‚          19.3 β”‚               193 β”‚        3450 β”‚ female β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           39.3 β”‚          20.6 β”‚               190 β”‚        3650 β”‚ male   β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           38.9 β”‚          17.8 β”‚               181 β”‚        3625 β”‚ female β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           39.2 β”‚          19.6 β”‚               195 β”‚        4675 β”‚ male   β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           34.1 β”‚          18.1 β”‚               193 β”‚        3475 β”‚ NULL   β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           42.0 β”‚          20.2 β”‚               190 β”‚        4250 β”‚ NULL   β”‚  2007 β”‚
β”‚ …       β”‚ …         β”‚              … β”‚             … β”‚                 … β”‚           … β”‚ …      β”‚     … β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
penguins.group_by("species", "island").agg(
    n=penguins.count(),
    avg_bill_mm=penguins.bill_length_mm.mean(),
    med_flipper_mm=penguins.flipper_length_mm.median()
)
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ species   ┃ island    ┃ n     ┃ avg_bill_mm ┃ med_flipper_mm ┃
┑━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
β”‚ string    β”‚ string    β”‚ int64 β”‚ float64     β”‚ float64        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Adelie    β”‚ Torgersen β”‚    52 β”‚   38.950980 β”‚          191.0 β”‚
β”‚ Adelie    β”‚ Biscoe    β”‚    44 β”‚   38.975000 β”‚          189.5 β”‚
β”‚ Adelie    β”‚ Dream     β”‚    56 β”‚   38.501786 β”‚          190.0 β”‚
β”‚ Gentoo    β”‚ Biscoe    β”‚   124 β”‚   47.504878 β”‚          216.0 β”‚
β”‚ Chinstrap β”‚ Dream     β”‚    68 β”‚   48.833824 β”‚          196.0 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
cols = {
    c: penguins[c] - penguins[c].mean()
    for c in penguins.columns
    if penguins[c].type().is_numeric() and c != "year"
}
expr = penguins.group_by("species").mutate(**cols).head(5)
expr
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┑━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
β”‚ string  β”‚ string    β”‚ float64        β”‚ float64       β”‚ float64           β”‚ float64     β”‚ string β”‚ int64 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Adelie  β”‚ Torgersen β”‚       2.108609 β”‚     -1.546358 β”‚          1.046358 β”‚   -0.662252 β”‚ female β”‚  2008 β”‚
β”‚ Adelie  β”‚ Biscoe    β”‚       3.908609 β”‚     -0.046358 β”‚          6.046358 β”‚  374.337748 β”‚ male   β”‚  2009 β”‚
β”‚ Adelie  β”‚ Dream     β”‚       1.008609 β”‚      0.753642 β”‚         -5.953642 β”‚  949.337748 β”‚ male   β”‚  2007 β”‚
β”‚ Adelie  β”‚ Dream     β”‚      -2.291391 β”‚     -0.346358 β”‚         -7.953642 β”‚ -550.662252 β”‚ female β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚      -2.091391 β”‚      0.953642 β”‚          3.046358 β”‚ -250.662252 β”‚ female β”‚  2007 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

Let’s talk about SQL

SQL

cols = {
    c: penguins[c] - penguins[c].mean()
    for c in penguins.columns
    if penguins[c].type().is_numeric() and c != "year"
}
expr = penguins.group_by("species").mutate(**cols).head(5)
ibis.to_sql(expr)
SELECT
  "t0"."species",
  "t0"."island",
  "t0"."bill_length_mm" - AVG("t0"."bill_length_mm") OVER (PARTITION BY "t0"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "bill_length_mm",
  "t0"."bill_depth_mm" - AVG("t0"."bill_depth_mm") OVER (PARTITION BY "t0"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "bill_depth_mm",
  "t0"."flipper_length_mm" - AVG("t0"."flipper_length_mm") OVER (PARTITION BY "t0"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "flipper_length_mm",
  "t0"."body_mass_g" - AVG("t0"."body_mass_g") OVER (PARTITION BY "t0"."species" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "body_mass_g",
  "t0"."sex",
  "t0"."year"
FROM "penguins" AS "t0"
LIMIT 5

Back to examples…

ibis-analytics

https://ibis-analytics.streamlit.app

Why?

DataFrame lore

  • DataFrames appear in the S programming language, which evolves into the R calculator programming language.
  • pandas perfects the DataFrame in Python … or did it?
  • Dozens of Python DataFrame libraries appear and disappear…
  • pandas is the de facto standard for Python DataFrames. It still doesn’t scale.
  • Leads to data scientists throwing code over the wall to engineers.
  • What if Ibis were a new standard?

Ibis origins

From β€œApache Arrow and the β€˜10 Things I Hate About pandas’” by Wes McKinney

…in 2015, I started the Ibis project…to create a pandas-friendly deferred expression system for static analysis and compilation [of] these types of [query planned, multicore execution] operations. Since an efficient multithreaded in-memory engine for pandas was not available when I started Ibis, I instead focused on building compilers for SQL engines (Impala, PostgreSQL, SQLite), similar to the R dplyr package. Phillip Cloud from the pandas core team has been actively working on Ibis with me for quite a long time.

The two-world problem

What does Ibis solve?

SQL:

  • databases & tables
  • analytics
  • metrics
  • dashboards

Python:

  • files & DataFrames
  • data science
  • statistics
  • notebooks

Ibis bridges the gap.

Bridging the gap

import ibis

con = ibis.duckdb.connect()
penguins = con.table("penguins")
penguins.group_by("species", "island").agg(count=ibis._.count())

An embeddable, zero-dependency, C++ SQL database engine.

import ibis

con = ibis.datafusion.connect()
penguins = con.table("penguins")
penguins.group_by("species", "island").agg(count=ibis._.count())

A Rust SQL query engine.

import ibis

con = ibis.clickhouse.connect()
penguins = con.table("penguins")
penguins.group_by("species", "island").agg(count=ibis._.count())

A C++ column-oriented database management system.

import ibis

con = ibis.polars.connect()
penguins = con.table("penguins")
penguins.group_by("species", "island").agg(count=ibis._.count())

A Rust DataFrame library.

import ibis

con = ibis.bigquery.connect()
penguins = con.table("penguins")
penguins.group_by("species", "island").agg(count=ibis._.count())

A serverless, highly scalable, and cost-effective cloud data warehouse.

import ibis

con = ibis.snowflake.connect()
penguins = con.table("penguins")
penguins.group_by("species", "island").agg(count=ibis._.count())

A cloud data platform.

import ibis

con = ibis.oracle.connect()
penguins = con.table("penguins")
penguins.group_by("species", "island").agg(count=ibis._.count())

A relational database management system.

import ibis

con = ibis.pyspark.connect()
penguins = con.table("penguins")
penguins.group_by("species", "island").agg(count=ibis._.count())

A unified analytics engine for large-scale data processing.

import ibis

con = ibis.trino.connect()
penguins = con.table("penguins")
penguins.group_by("species", "island").agg(count=ibis._.count())

A distributed SQL query engine.

import ibis

con = ibis.flink.connect()
penguins = con.table("penguins")
penguins.group_by("species", "island").agg(count=ibis._.count())

A distributed streaming and batch SQL analytics engine.

How does Ibis work?

This is going to be very fast πŸƒπŸ’¨. I will happily answer questions about it πŸ˜‚.

Engineering themes

  • Immutability
  • Type checking
  • Separation of concerns
  • Extensibility
  • Focus on end-user experience
  • Avoid common denominator trap
  • Scale up and down

Components: expressions

Expressions: interface

  • StringScalar, IntegerColumn, Table
  • .sum(), .split(), .join()
  • No knowledge of specific operation

Operations: implementation

  • Specific action: e.g., StringSplit
  • Inputs + output dtype, shape
  • Used for compiler dispatch

Other goodies

  • Type system
  • Pattern matching
  • Graph manipulation/traversal

Goal: separate API from implementation.

Components: expressions

from ibis.expr.visualize import to_graph

expr = penguins.group_by("species").agg(
    avg_bill_mm=_.bill_length_mm.mean()
)

to_graph(expr)

Components: compiler

expr = penguins.group_by("species").agg(
    avg_bill_mm=_.bill_length_mm.mean()
)

graph BT
  classDef white color:white;

  %% graph definition
  DatabaseTable --> species
  DatabaseTable --> bill_length_mm
  bill_length_mm --> Mean
  species --> Aggregate
  Mean --> Aggregate

  %% style
  class DatabaseTable white;
  class species white;
  class bill_length_mm white;
  class Mean white;
  class Aggregate white;

graph BT
  classDef white color:white;

  DatabaseTable2[DatabaseTable] --> species2[species]
  species2 --> bill_length_mm2[bill_length_mm]
  bill_length_mm2 --> Mean2[Mean]
  Mean2 --> Aggregate2[Aggregate]

  %% style
  class DatabaseTable2 white;
  class species2 white;
  class bill_length_mm2 white;
  class Mean2 white;
  class Aggregate2 white;

Components: compiler

  • Rewrite operations
  • Bottom up compile storing intermediate outputs
  • Handoff output to sqlglot
SELECT
  "t0"."species",
  AVG("t0"."bill_length_mm") AS "avg_bill_mm"
FROM "penguins" AS "t0"
GROUP BY
  1

Components: drivers

Drivers

  • We have SQL at this point
  • Send to DB via DBAPI: cursor.execute(ibis_generated_sql)
  • (Heavily) massage the output

Ibis + Streaming

Growth of streaming

  • Over 70% of Fortune 500 companies have adopted Kafka
  • 54% of Databricks’ customers are using Spark Structured Streaming
  • The stream processing market is expected to grow at a compound annual growth rate (CAGR) of 21.5% from 2022 to 2028 (IDC)

Batch and streaming

graph LR
  subgraph " "
    direction LR
    A[data] --> B[batch processing] & C[stream processing] --> D[downstream]
  end

In the machine learning world…

graph TB
  proddata --> sampled
  model --> prodpipeline

  subgraph "local env"
    sampled[sampled data] --> local[local experimentation]
    local <--> iterate
    local --> model[finally, we have a production-ready model!]
  end

  subgraph "prod env"
    proddata[production data] --> prodpipeline[production pipelines]
  end

In the machine learning world…

graph TB
  proddata --> sampled
  model -- "code rewrite" --> prodpipeline

  linkStyle 1 color:white;

  subgraph "local env"
    sampled[sampled data] --> local[local experimentation]
    local <--> iterate
    local --> model[finally, we have a production-ready model!]
  end

  subgraph "prod env"
    proddata[production data] --> prodpipeline[production pipelines]
  end

A real-world example

pandas

return (
    clicks_df
    .groupby("user")
    .rolling("1h", on="click_time")["url"]
    .count()
)

Flink SQL

SELECT
  user,
  COUNT(url) OVER (
    PARTITION BY user
    ORDER BY click_time
    RANGE BETWEEN
      INTERVAL '1' HOUR PRECEDING
      AND CURRENT ROW
  ) AS one_hour_user_click_cnt
FROM clicks

Code rewrites

  • From batch to streaming
  • From local experimentation to production
  • Backfilling a streaming feature on a batch backing table
  • …

The solution…

Stream-batch unified API

  • Flink SQL
  • Spark DataFrame API
  • …

Stream-batch unification

pandas

return (
    clicks_df
    .groupby("user")
    .rolling("1h", on="click_time")["url"]
    .count()
)

Flink SQL

SELECT
  user,
  COUNT(url) OVER (
    PARTITION BY user
    ORDER BY click_time
    RANGE BETWEEN
      INTERVAL '1' HOUR PRECEDING
      AND CURRENT ROW
  ) AS one_hour_user_click_cnt
FROM clicks

Ibis

agged = clicks.select(
    _.user,
    one_hour_user_click_cnt=_.url.count().over(
        range=(-ibis.interval(hours=1), 0),
        group_by=_.user,
        order_by=_.click_time,
    ),
)

But it’s hard…

  • Streaming is different
    • Time semantics
    • Long-running queries
    • Sources and sinks
    • …
  • Less established standards in streaming syntax

Ibis streaming today

  • Flink backend and RisingWave backend launched in Ibis 8.0
  • Introduction of watermarks, windowed aggregations, etc in Ibis

What’s next?

  • Expand support of streaming operations and syntax
  • Continuously iterate on a stream-batch unified API
  • More streaming backends (Spark Structured Streaming)

Towards composable data systems

Try it out now!

Install:

pip install 'ibis-framework[duckdb]'

Then run:

from ibis.interactive import *

t = ibis.examples.penguins.fetch()
t.head()
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┑━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
β”‚ string  β”‚ string    β”‚ float64        β”‚ float64       β”‚ int64             β”‚ int64       β”‚ string β”‚ int64 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Adelie  β”‚ Torgersen β”‚           39.1 β”‚          18.7 β”‚               181 β”‚        3750 β”‚ male   β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           39.5 β”‚          17.4 β”‚               186 β”‚        3800 β”‚ female β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           40.3 β”‚          18.0 β”‚               195 β”‚        3250 β”‚ female β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           NULL β”‚          NULL β”‚              NULL β”‚        NULL β”‚ NULL   β”‚  2007 β”‚
β”‚ Adelie  β”‚ Torgersen β”‚           36.7 β”‚          19.3 β”‚               193 β”‚        3450 β”‚ female β”‚  2007 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

Questions?

Where to find us