Ibis Project Blog

Python productivity framework for the Apache Hadoop ecosystem. Development updates, use cases, and internals.

Announcing hs2client, a fast new C++ / Python Thrift client for Impala and Hive

This year, I collaborated with members of the Apache Impala (incubating) team at Cloudera to create a new C++ library intended to eventually become a faster, more memory-efficient replacement for impyla, PyHive, and other (largely pure Python) client libraries for talking to Apache Hive and Impala.

We are excited to release this effort, dubbed hs2client, as a new Apache-licensed open source project on GitHub. As you may guess from the name, this library implements the HiveServer2 Thrift API as a C++ library, with careful handling of result sets so that languages like Python can access data with high performance.

The initial alpha preview release contains:

  • A C++ library, libhs2client, which provides a clean C++ API for the HiveServer2 Thrift API. It can be built and linked, dynamically or statically, into C/C++ applications with no direct exposure to Apache Thrift.

  • Python bindings, with optimized reads into pandas.DataFrame objects

To try out the library, you can install a dev build of the project right now with conda:

conda install hs2client -c cloudera/channel/dev

Optimizing for speed and memory use

The Python bindings for libhs2client are careful about performance and memory in a few key ways:

  • Converting row batches directly into pandas-compatible NumPy arrays
  • Care with categorical data: string data is deduplicated while it's being converted for pandas. In the future, it would be straightforward to add an option to return all string data as pandas.Categorical arrays (see the sketch below for why this helps).
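
To illustrate why string deduplication and pandas.Categorical matter, here is a small pandas-only sketch (independent of hs2client, with made-up values) comparing the memory footprint of object-dtype strings with categoricals:

import pandas as pd

# Low-cardinality string column, typical of dimension-like columns in query results
values = ['shipped', 'pending', 'cancelled'] * 100000

as_objects = pd.Series(values)                       # one Python string object counted per row
as_categories = pd.Series(values, dtype='category')  # unique values stored once, plus small integer codes

print(as_objects.memory_usage(deep=True))     # much larger: per-row string objects plus pointers
print(as_categories.memory_usage(deep=True))  # much smaller: 1-byte codes plus three unique strings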

hs2client demo and simple benchmark

The initial API closely models the HiveServer2 protocol. Connecting to a cluster yields a Service instance:

In [1]:
import hs2client

service = hs2client.connect('localhost', 21050, 'wesm')

To execute queries, you open a session then use its execute method.

In [2]:
session = service.open_session()

op = session.execute("select 1")

Queries are asynchronous by default.

In [3]:
op.is_finished
Out[3]:
True
In [4]:
op.get_state()
Out[4]:
'finished'

Result sets can currently be fetched into pandas.DataFrame objects:

In [5]:
df = op.fetchall_pandas()
In [6]:
df
Out[6]:
   1
0  1

Compared with similar database clients, we've optimized the fetch path for excellent performance (I/O speed) and memory use.

In [10]:
import numpy as np
import pandas as pd

N, K = 1000000, 10

df = pd.DataFrame(np.random.randn(N, K),
                  columns=['data{0}'.format(i) for i in range(K)])

import ibis
con = ibis.impala.connect('localhost', 
                          hdfs_client=ibis.hdfs_connect(port=5070))
con.drop_database('hs2_test', force=True)
con.create_database('hs2_test')
con.create_table('test_data', df, database='hs2_test')
In [16]:
op = session.execute_sync('select * from hs2_test.test_data')
%time df = op.fetchall_pandas()
CPU times: user 200 ms, sys: 64 ms, total: 264 ms
Wall time: 682 ms

In [21]:
speed = df.memory_usage().sum() / 0.682 / 2**20
print('Speed: {0:.2f} MB/s'.format(speed))
Speed: 111.87 MB/s

This benchmark is being run on localhost, but it shows we can move over 110 MB/s through the Thrift protocol and convert to a fully-formed pandas DataFrame.
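
As a sanity check on that figure, the test table is all float64 data, so the payload size follows directly from its shape (the default integer index contributes nothing measurable here):

# Back-of-the-envelope check of the throughput above
nbytes = 1000000 * 10 * 8                   # 1M rows x 10 float64 columns x 8 bytes each
speed = nbytes / 0.682 / 2**20              # bytes per second, expressed in MiB
print('Speed: {0:.2f} MB/s'.format(speed))  # ~111.9 MB/s, matching the measurement above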

hs2client compared with impyla

In Ibis and other Python projects, we have been using impyla to execute queries and access result sets. impyla uses either Apache Thrift's official Python implementation (on Python 2) or thriftpy (on Python 3) to interact with the Impala or Hive Thrift service. Because that deserialization happens in pure Python, the main difference compared with hs2client is performance when fetching large result sets.
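
For reference, the impyla code path that Ibis drives under the hood is roughly the following DB API 2.0 style session (a sketch; it assumes the same localhost Impala daemon and test table as above):

from impala.dbapi import connect
from impala.util import as_pandas

# PEP 249 cursor; rows are deserialized by the pure-Python Thrift layer
conn = connect(host='localhost', port=21050)
cur = conn.cursor()
cur.execute('select * from hs2_test.test_data')
df = as_pandas(cur)  # builds a DataFrame from the fetched rows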

Here, I will perform an equivalent fetch using impyla via ibis:

In [29]:
# Connect to Impala via impyla (on Python 2)
con = ibis.impala.connect('localhost', port=21050)

expr = con.sql('select * from hs2_test.test_data')

%time df = expr.execute(limit=None)
CPU times: user 1.15 s, sys: 52 ms, total: 1.2 s
Wall time: 1.66 s

In [30]:
speed = df.memory_usage().sum() / 1.66 / 2**20
print('Speed: {0:.2f} MB/s'.format(speed))
Speed: 45.96 MB/s

In this particular use case with an all-numeric table (which is ideal for deserialization performance), the speed difference is a bit over 2x (roughly 112 MB/s versus 46 MB/s).

hs2client roadmap

The code is hosted on GitHub.

The C++ library does not yet implement some important features that are needed for it to be a drop-in replacement for impyla or other Hive or Impala drivers:

  • SSL transport (with certificate verification)
  • SASL Thrift transport for secure (i.e. Kerberos) clusters, or insecure clusters using SASL

Note that many Hive clusters, even unsecured ones, use the SASL transport, so using the library with Hive right now may be a challenge until that is completed.

On the Python side, we must implement a DB API 2.0 compatibility layer, since currently data can only be fetched to pandas, not Python tuples as with most Python database drivers.
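
To make the goal concrete, such a layer would allow the standard PEP 249 usage pattern shown below. This is purely a hypothetical sketch: none of these cursor methods exist in hs2client yet.

# Hypothetical PEP 249 (DB API 2.0) layer on top of hs2client; not implemented yet
conn = hs2client.connect('localhost', 21050, 'wesm')

cur = conn.cursor()                                      # hypothetical cursor API
cur.execute('select * from hs2_test.test_data limit 5')
for row in cur.fetchall():                               # rows as plain Python tuples
    print(row)
cur.close()
conn.close()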

We of course welcome contributions from the community to build out some of these features.