Ibis Project Blog

Python productivity framework for the Apache Hadoop ecosystem. Development updates, use cases, and internals.

Quickstart on Crunchbase analysis using Ibis and SQLite

The latest 0.5 version of Ibis supports executing expressions on data in SQLite database files. Since SQLite ships with Python, this is the easiest way to get up and running using the project because you don't have to install or administer a database at all.

To get going, install Ibis if you have not already

pip install ibis-framework

And download the Crunchbase dataset found on the data collection page.

Now fire up a IPython / Jupyter notebook or shell and let's get started!

To get started, import ibis, set it to interactive mode, and open the SQLite file you just downloaded like so:

In [1]:
import ibis
ibis.options.interactive = True

con = ibis.sqlite.connect('crunchbase.db')

The list_tables method shows you what tables are in the file:

In [2]:
con.list_tables()
Out[2]:
[u'acquisitions', u'companies', u'investments', u'rounds']

Let's look at the rounds table:

In [3]:
rounds = con.table('rounds')
rounds.info()
Table rows: 87161

Column                   Type    Non-null #
------                   ----    ----------
company_permalink        string  87161     
company_name             string  87161     
company_category_list    string  83464     
company_market           string  80523     
company_country_code     string  81291     
company_state_code       string  59296     
company_region           string  81016     
company_city             string  80208     
funding_round_permalink  string  87161     
funding_round_type       string  87161     
funding_round_code       string  23648     
funded_at                string  87161     
funded_month             string  87147     
funded_quarter           string  87147     
funded_year              int64   87147     
raised_amount_usd        double  73406     

We can start to explore this data set using pandas-like syntax:

In [4]:
rounds.funding_round_type.value_counts()
Out[4]:
       funding_round_type  count
0                   angel   4602
1        convertible_note    838
2          debt_financing   5869
3     equity_crowdfunding   2401
4                   grant   1523
5   non_equity_assistance     69
6           post_ipo_debt     92
7         post_ipo_equity    425
8          private_equity   1864
9    product_crowdfunding    241
10       secondary_market     61
11                   seed  22053
12            undisclosed   4128
13                venture  42995

There's too little space here to give a comprehensive tutorial on the Ibis DSL, but it's designed for composability and operation chaining even more so than pandas.

For example, we could find out how many acquisitions had a known price as follows

In [5]:
acquisitions = con.table('acquisitions')
expr = (acquisitions.price_amount
        .isnull()
        .name('has_price')
        .value_counts())
expr
Out[5]:
   has_price  count
0          0   3816
1          1  51424

If you call the execute method on the expression, you'll get a pandas object back:

In [6]:
df = expr.execute()
df
Out[6]:
has_price count
0 0 3816
1 1 51424

Ibis has three basic data types:

  • Tables
  • Arrays
  • Scalars

Based on the type of the expression, you'll get back a pandas or Python object that is an in-memory counterpart

In [7]:
type(expr)
Out[7]:
ibis.expr.types.TableExpr
In [8]:
companies = con.table('companies')

expr = companies.funding_total_usd.mean()
type(expr)
Out[8]:
ibis.expr.types.DoubleScalar
In [9]:
expr.execute()
Out[9]:
16198405.416388474

A few funding metrics

Let's suppose we were interested in looking at the number of coded (Series A, Series B, etc.) venture investments in the dataset for each year.

First, a technical detail. SQLite stores all date and time values as strings, but Ibis lets you cast strings to timestamps, handling the rest of the details for you

In [10]:
funded_at = rounds.funded_at.cast('timestamp')
funded_at.year().value_counts()
Out[10]:
    unnamed  count
0         1      3
1         7      1
2        11      1
3        20      2
4        29      1
5        30      1
6       201      1
7       203      1
8       214      1
9      1921      1
10     1960      3
11     1973      1
12     1974      2
13     1979      1
14     1982      3
15     1983      1
16     1984      4
17     1985      6
18     1986      4
19     1987      6
20     1988      2
21     1989      4
22     1990     14
23     1991     10
24     1992     12
25     1993     15
26     1994     15
27     1995     21
28     1996     32
29     1997     30
30     1998     62
31     1999    178
32     2000    267
33     2001    200
34     2002    207
35     2003    271
36     2004    434
37     2005   1762
38     2006   2663
39     2007   3562
40     2008   4257
41     2009   6553
42     2010   8692
43     2011  10550
44     2012  12648
45     2013  16194
46     2014  17670
47     2015    790
48     2104      1
49     2105      1

This looks like some bad data, so I'll restrict the analysis to 2000 to 2015. Also, there are NULL values in funding_round_code:

In [11]:
rounds.funding_round_code.value_counts()
Out[11]:
  funding_round_code  count
0               None  63513
1                  A  11382
2                  B   6548
3                  C   3329
4                  D   1530
5                  E    608
6                  F    201
7                  G     45
8                  H      5
In [12]:
year = funded_at.year().name('year')

expr = (rounds[(rounds.funding_round_type == 'venture') &
               year.between(2000, 2015) &
               rounds.funding_round_code.notnull()]
        .group_by([year, 'funding_round_code'])
        .size())

results = expr.execute()
results[:10]
Out[12]:
year funding_round_code count
0 2000 A 79
1 2000 B 32
2 2000 C 10
3 2000 D 8
4 2000 E 2
5 2001 A 50
6 2001 B 30
7 2001 C 17
8 2001 D 4
9 2001 E 1

Since results is now a pandas DataFrame, it can be reshaped accordingly:

In [13]:
pivoted = (results.set_index(['year', 'funding_round_code'])
           .unstack('funding_round_code')
           .fillna(0))
pivoted
Out[13]:
count
funding_round_code A B C D E F G H
year
2000 79 32 10 8 2 0 0 0
2001 50 30 17 4 1 0 0 0
2002 35 39 25 5 2 2 0 0
2003 68 45 14 13 1 0 1 0
2004 146 76 32 15 3 1 0 0
2005 513 334 176 67 26 6 0 0
2006 717 465 226 91 35 7 1 0
2007 956 583 281 110 49 7 1 0
2008 979 653 308 120 54 17 1 0
2009 753 531 290 147 55 28 0 0
2010 1013 598 369 149 52 18 2 0
2011 1250 700 334 175 60 18 5 0
2012 1242 610 345 184 69 16 7 0
2013 1606 796 377 185 81 38 6 0
2014 1757 952 471 223 108 36 18 5
2015 88 71 34 28 8 5 3 0

Cool. Now let's look at total funding according to some arbitrary buckets. I'll use the bucket method which computes a discrete histogram according to bin edges you provide:

In [14]:
funding_buckets = [0, 1000000, 10000000, 50000000, 100000000, 
                   500000000, 1000000000]

bucket = (companies
          .funding_total_usd
          .bucket(funding_buckets, include_over=True))
bucket.value_counts()
Out[14]:
   unnamed  count
0      NaN  12055
1        0  15965
2        1  15754
3        2   7839
4        3   1532
5        4   1022
6        5     88
7        6     37

These are bucket numbers. I can provide names using the label method:

In [15]:
bucket_names = ['0 to 1m', '1m to 10m', '10m to 50m', 
                '50m to 100m', '100m to 500m',
                '500m to 1b', 'Over 1b']

counts = bucket.name('bucket').value_counts()
labeled = counts.bucket.label(bucket_names)
with_names = counts.mutate(bucket_name=labeled)
with_names
Out[15]:
   bucket  count   bucket_name
0     NaN  12055          None
1       0  15965       0 to 1m
2       1  15754     1m to 10m
3       2   7839    10m to 50m
4       3   1532   50m to 100m
5       4   1022  100m to 500m
6       5     88    500m to 1b
7       6     37       Over 1b

If you were interested in statistics: other than the count, you can use group_by and aggregate to compute multiple group summaries:

In [16]:
metrics = (companies.group_by(bucket.name('bucket'))
           .aggregate(count=companies.count(),
                      total_funding=companies.funding_total_usd.sum())
           .mutate(bucket_name=lambda x: x.bucket.label(bucket_names)))
metrics
Out[16]:
   bucket  count  total_funding   bucket_name
0     NaN  12055            NaN          None
1       0  15965   4.505177e+09       0 to 1m
2       1  15754   5.712283e+10     1m to 10m
3       2   7839   1.724166e+11    10m to 50m
4       3   1532   1.054132e+11   50m to 100m
5       4   1022   1.826600e+11  100m to 500m
6       5     88   5.804196e+10    500m to 1b
7       6     37   1.040123e+11       Over 1b

I'll embellish this by looking at the operation status of companies founded 2010 onward

In [17]:
joined = (companies.mutate(bucket=bucket,
                           status=companies.status.fillna('Unknown'))
          [(companies.founded_at > '2010-01-01') |
           companies.founded_at.isnull()]
          .group_by(['bucket', 'status'])
          .size()
          .mutate(bucket_name=lambda x: (x.bucket.label(bucket_names)
                                         .fillna('Unknown'))))

table = joined.execute()

table.set_index(['status', 'bucket', 'bucket_name'])['count'].unstack('status')
Out[17]:
status Unknown acquired closed operating
bucket bucket_name
NaN Unknown 3208 217 251 5731
0 0 to 1m 353 227 632 11195
1 1m to 10m 197 530 369 8166
2 10m to 50m 62 276 124 2572
3 50m to 100m 6 47 20 360
4 100m to 500m 16 14 11 255
5 500m to 1b NaN 2 1 23
6 Over 1b NaN 1 NaN 9

To explain what all these methods are:

  • mutate adds columns to a table
  • group_by creates a grouped table expression that can be aggregated, windowed, or a few other actions
  • .size() is a shortcut for .aggregate(companies.count())
  • fillna replaces NULL values with the indicated fill value

To learn more, check out the tutorial or the API reference. Much more documentation will get written over the coming months, but this gives you an idea of the kind of analysis that's made very easy with Ibis once you learn the DSL.

By the way, the SQL code that was effectively generated for SQLite by the last expression looks something like this:

In [18]:
print(ibis.impala.compile(joined))
SELECT *,
  isnull(CASE `bucket`
    WHEN 0 THEN '0 to 1m'
    WHEN 1 THEN '1m to 10m'
    WHEN 2 THEN '10m to 50m'
    WHEN 3 THEN '50m to 100m'
    WHEN 4 THEN '100m to 500m'
    WHEN 5 THEN '500m to 1b'
    WHEN 6 THEN 'Over 1b'
    ELSE NULL
  END, 'Unknown') AS `bucket_name`
FROM (
  SELECT `bucket`, `status`, count(*) AS `count`
  FROM (
    SELECT `permalink`, `name`, `homepage_url`, `category_list`, `market`,
           `funding_total_usd`, isnull(`status`, 'Unknown') AS `status`,
           `country_code`, `state_code`, `region`, `city`, `funding_rounds`,
           `founded_at`, `founded_month`, `founded_quarter`, `founded_year`,
           `first_funding_at`, `last_funding_at`,
      CASE
        WHEN (`funding_total_usd` >= 0) AND (`funding_total_usd` < 1000000) THEN 0
        WHEN (`funding_total_usd` >= 1000000) AND (`funding_total_usd` < 10000000) THEN 1
        WHEN (`funding_total_usd` >= 10000000) AND (`funding_total_usd` < 50000000) THEN 2
        WHEN (`funding_total_usd` >= 50000000) AND (`funding_total_usd` < 100000000) THEN 3
        WHEN (`funding_total_usd` >= 100000000) AND (`funding_total_usd` < 500000000) THEN 4
        WHEN (`funding_total_usd` >= 500000000) AND (`funding_total_usd` <= 1000000000) THEN 5
        WHEN `funding_total_usd` > 1000000000 THEN 6
        ELSE NULL
      END AS `bucket`
    FROM companies
    WHERE (`founded_at` > '2010-01-01') OR `founded_at` IS NULL
  ) t1
  GROUP BY 1, 2
) t0

Bet you are glad you didn't have to write it by hand!

Acquisition rate by investor

Suppose you were interested in what percentage of a particular venture capital firm's companies eventually get acquired. My friends at Mode Analytics solved this problem in SQL and I reproduce that query here:

SELECT *,
       acquisitions_ipos/number_of_investments::FLOAT AS acquitision_rate
  FROM (
SELECT CASE WHEN i.investor_name IS NULL THEN 'NO INVESTOR'
            ELSE i.investor_name END "Investor name",
        COUNT(DISTINCT c.permalink) AS number_of_investments,
        COUNT(DISTINCT
        CASE WHEN c.status IN ('ipo', 'acquired') THEN c.permalink END) 
        AS acquisitions_ipos
       FROM tutorial.crunchbase_companies c
       LEFT JOIN tutorial.crunchbase_investments i
       ON c.permalink = i.company_permalink
       GROUP BY 1
       ORDER BY 2 DESC
     ) a
LIMIT 100

Let's reproduce this piece by piece with Ibis. A few things first

In [19]:
i = con.table('investments')
c = con.table('companies')

clean_name = i.investor_name.fillna('NO INVESTOR').name('investor_name')
num_investments = c.permalink.nunique()

exited = c.status.isin(['ipo', 'acquired']).ifelse(c.permalink, ibis.NA)
num_exits = exited.nunique()

These are just the basic statistics. Now let's join the tables and compute the acquisition rate:

In [22]:
stats = (c.left_join(i, c.permalink == i.company_permalink)
         .group_by(clean_name)
         .aggregate(num_investments=num_investments, 
                    num_exits=num_exits))

stats = (stats.mutate(acq_rate=(stats.num_exits / 
                                stats.num_investments.cast('float'))))
stats.limit(10)
Out[22]:
          investor_name  num_exits  num_investments  acq_rate
0                     .          0                1  0.000000
1         .406 Ventures          4               27  0.148148
2            01Startups          0                1  0.000000
3     1-800-FLOWERS.COM          0                1  0.000000
4             10 Others          0                1  0.000000
5          101 Startups          0                5  0.000000
6       105 Capital LLC          0                1  0.000000
7           10X Fund LP          0                1  0.000000
8  10X Venture Partners          0                2  0.000000
9           10Xelerator          1               21  0.047619

Okay. Now let's look at the most successful investors who have made at least 100 investments:

In [23]:
great_success = (stats
                 [stats.num_investments > 100]
                 .sort_by(ibis.desc('acq_rate')))
top20 = great_success.limit(20)

top20
Out[23]:
                     investor_name  num_exits  num_investments  acq_rate
0                       Ron Conway         52              119  0.436975
1                  Summit Partners         94              260  0.361538
2                 Trinity Ventures         42              125  0.336000
3                 Felicis Ventures         56              173  0.323699
4                  Austin Ventures         36              113  0.318584
5                   Menlo Ventures         54              171  0.315789
6                    Mayfield Fund         40              131  0.305344
7                              CRV         51              168  0.303571
8               Foundation Capital         41              137  0.299270
9                        Benchmark         52              180  0.288889
10  Norwest Venture Partners - NVP         41              144  0.284722
11                 Shasta Ventures         28              101  0.277228
12                SoftBank Capital         35              127  0.275591
13                        SV Angel        139              516  0.269380
14                    DAG Ventures         32              119  0.268908
15               Greylock Partners         62              231  0.268398
16               Ignition Partners         30              112  0.267857
17                         Venrock         40              153  0.261438
18                   True Ventures         39              152  0.256579
19                 Canaan Partners         37              146  0.253425

Have fun with this data and look out for more blog posts on it to illustrate Ibis features.