Skip to content

Commit bc9d7e0

Browse files
David KnuppImpala Public Jenkins
David Knupp
authored and
Impala Public Jenkins
committed
IMPALA-3343, IMPALA-9489: Make impala-shell compatible with python 3.
This is the main patch for making the the impala-shell cross-compatible with python 2 and python 3. The goal is wind up with a version of the shell that will pass python e2e tests irrepsective of the version of python used to launch the shell, under the assumption that the test framework itself will continue to run with python 2.7.x for the time being. Notable changes for reviewers to consider: - With regard to validating the patch, my assumption is that simply passing the existing set of e2e shell tests is sufficient to confirm that the shell is functioning properly. No new tests were added. - A new pytest command line option was added in conftest.py to enable a user to specify a path to an alternate impala-shell executable to test. It's possible to use this to point to an instance of the impala-shell that was installed as a standalone python package in a separate virtualenv. Example usage: USE_THRIFT11_GEN_PY=true impala-py.test --shell_executable=/<path to virtualenv>/bin/impala-shell -sv shell/test_shell_commandline.py The target virtualenv may be based on either python3 or python2. However, this has no effect on the version of python used to run the test framework, which remains tied to python 2.7.x for the foreseeable future. - The $IMPALA_HOME/bin/impala-shell.sh now sets up the impala-shell python environment independenty from bin/set-pythonpath.sh. The default version of thrift is thrift-0.11.0 (See IMPALA-9489). - The wording of the header changed a bit to include the python version used to run the shell. Starting Impala Shell with no authentication using Python 3.7.5 Opened TCP connection to localhost:21000 ... OR Starting Impala Shell with LDAP-based authentication using Python 2.7.12 Opened TCP connection to localhost:21000 ... - By far, the biggest hassle has been juggling str versus unicode versus bytes data types. Python 2.x was fairly loose and inconsistent in how it dealt with strings. As a quick demo of what I mean: Python 2.7.12 (default, Nov 12 2018, 14:36:49) [GCC 5.4.0 20160609] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> d = 'like a duck' >>> d == str(d) == bytes(d) == unicode(d) == d.encode('utf-8') == d.decode('utf-8') True ...and yet there are weird unexpected gotchas. >>> d.decode('utf-8') == d.encode('utf-8') True >>> d.encode('utf-8') == bytearray(d, 'utf-8') True >>> d.decode('utf-8') == bytearray(d, 'utf-8') # fails the eq property? False As a result, this was inconsistency was reflected in the way we handled strings in the impala-shell code, but things still just worked. In python3, there's a much clearer distinction between strings and bytes, and as such, much tighter type consistency is expected by standard libs like subprocess, re, sqlparse, prettytable, etc., which are used throughout the shell. Even simple calls that worked in python 2.x: >>> import re >>> re.findall('foo', b'foobar') ['foo'] ...can throw exceptions in python 3.x: >>> import re >>> re.findall('foo', b'foobar') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/data0/systest/venvs/py3/lib/python3.7/re.py", line 223, in findall return _compile(pattern, flags).findall(string) TypeError: cannot use a string pattern on a bytes-like object Exceptions like this resulted in a many, if not most shell tests failing under python 3. What ultimately seemed like a better approach was to try to weed out as many existing spurious str.encode() and str.decode() calls as I could, and try to implement what is has colloquially been called a "unicode sandwich" -- namely, "bytes on the outside, unicode on the inside, encode/decode at the edges." The primary spot in the shell where we call decode() now is when sanitising input... args = self.sanitise_input(args.decode('utf-8')) ...and also whenever a library like re required it. Similarly, str.encode() is primarily used where a library like readline or csv requires is. - PYTHONIOENCODING needs to be set to utf-8 to override the default setting for python 2. Without this, piping or redirecting stdout results in unicode errors. - from __future__ import unicode_literals was added throughout Testing: To test the changes, I ran the e2e shell tests the way we always do (against the normal build tarball), and then I set up a python 3 virtual env with the shell installed as a package, and manually ran the tests against that. No effort has been made at this point to come up with a way to integrate testing of the shell in a python3 environment into our automated test processes. Change-Id: Idb004d352fe230a890a6b6356496ba76c2fab615 Reviewed-on: http://gerrit.cloudera.org:8080/15524 Reviewed-by: Impala Public Jenkins <[email protected]> Tested-by: Impala Public Jenkins <[email protected]>
1 parent 6fcc758 commit bc9d7e0

19 files changed

+352
-149
lines changed

bin/impala-shell.sh

Lines changed: 32 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,38 @@
1717
# specific language governing permissions and limitations
1818
# under the License.
1919

20+
set -euo pipefail
2021

2122
# This script runs the impala shell from a dev environment.
23+
PYTHONPATH=${PYTHONPATH:-}
2224
SHELL_HOME=${IMPALA_SHELL_HOME:-${IMPALA_HOME}/shell}
23-
exec impala-python ${SHELL_HOME}/impala_shell.py "$@"
25+
26+
# ${IMPALA_HOME}/bin has bootstrap_toolchain.py, required by bootstrap_virtualenv.py
27+
PYTHONPATH=${PYTHONPATH}:${IMPALA_HOME}/bin
28+
29+
# Default version of thrift for the impala-shell is thrift >= 0.11.0.
30+
PYTHONPATH=${PYTHONPATH}:${SHELL_HOME}/build/thrift-11-gen/gen-py
31+
IMPALA_THRIFT_PY_VERSION=${IMPALA_THRIFT11_VERSION}
32+
33+
THRIFT_PY_ROOT="${IMPALA_TOOLCHAIN}/thrift-${IMPALA_THRIFT_PY_VERSION}"
34+
35+
LD_LIBRARY_PATH+=":$(PYTHONPATH=${PYTHONPATH} \
36+
python "$IMPALA_HOME/infra/python/bootstrap_virtualenv.py" \
37+
--print-ld-library-path)"
38+
39+
IMPALA_PY_DIR="$(dirname "$0")/../infra/python"
40+
IMPALA_PYTHON_EXECUTABLE="${IMPALA_PY_DIR}/env/bin/python"
41+
42+
for PYTHON_LIB_DIR in ${THRIFT_PY_ROOT}/python/lib{64,}; do
43+
[[ -d ${PYTHON_LIB_DIR} ]] || continue
44+
for PKG_DIR in ${PYTHON_LIB_DIR}/python*/site-packages; do
45+
PYTHONPATH=${PYTHONPATH}:${PKG_DIR}/
46+
done
47+
done
48+
49+
# Note that this uses the external system python executable
50+
PYTHONPATH=${PYTHONPATH} python "${IMPALA_PY_DIR}/bootstrap_virtualenv.py"
51+
52+
# This uses the python executable in the impala python env
53+
PYTHONIOENCODING='utf-8' PYTHONPATH=${PYTHONPATH} \
54+
exec "${IMPALA_PYTHON_EXECUTABLE}" ${SHELL_HOME}/impala_shell.py "$@"

bin/set-pythonpath.sh

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -27,8 +27,9 @@
2727
export PYTHONPATH=${IMPALA_HOME}:${IMPALA_HOME}/bin
2828

2929
# Generated Thrift files are used by tests and other scripts.
30-
if [ -n "${USE_THRIFT11_GEN_PY:-}" ]; then
30+
if [ "${USE_THRIFT11_GEN_PY:-}" == "true" ]; then
3131
PYTHONPATH=${PYTHONPATH}:${IMPALA_HOME}/shell/build/thrift-11-gen/gen-py
32+
THRIFT_HOME="${IMPALA_TOOLCHAIN}/thrift-${IMPALA_THRIFT11_VERSION}"
3233
else
3334
PYTHONPATH=${PYTHONPATH}:${IMPALA_HOME}/shell/gen-py
3435
fi

shell/TSSLSocketWithWildcardSAN.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@
1616
# KIND, either express or implied. See the License for the
1717
# specific language governing permissions and limitations
1818
# under the License.
19+
from __future__ import print_function, unicode_literals
1920

2021
import re
2122
import ssl

shell/compatibility.py

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,11 +19,20 @@
1919
# under the License.
2020
from __future__ import print_function, unicode_literals
2121

22-
2322
"""
2423
A module where we can aggregate python2 -> 3 code contortions.
2524
"""
2625

26+
import os
27+
import sys
28+
29+
30+
if sys.version_info.major == 2:
31+
# default is typically ASCII, but unicode_literals dictates UTF-8
32+
# See also https://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python # noqa
33+
os.environ['PYTHONIOENCODING'] = 'utf-8'
34+
35+
2736
try:
2837
_xrange = xrange
2938
except NameError:

shell/impala-shell

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818
# under the License.
1919

2020

21-
# This script runs the Impala shell. Python is required.
21+
# This script runs the Impala shell. Python is required.
2222
#
2323
# This script assumes that the supporting library files for the Impala shell are
2424
# rooted in either the same directory that this script is in, or in a directory
@@ -51,4 +51,4 @@ for EGG in $(ls ${SHELL_HOME}/ext-py/*.egg); do
5151
done
5252

5353
PYTHONPATH="${EGG_PATH}${SHELL_HOME}/gen-py:${SHELL_HOME}/lib:${PYTHONPATH}" \
54-
exec python ${SHELL_HOME}/impala_shell.py "$@"
54+
PYTHONIOENCODING='utf-8' exec python ${SHELL_HOME}/impala_shell.py "$@"

shell/impala_client.py

Lines changed: 10 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
#!/usr/bin/env python
2+
# -*- coding: utf-8 -*-
23
#
34
# Licensed to the Apache Software Foundation (ASF) under one
45
# or more contributor license agreements. See the NOTICE file
@@ -16,7 +17,7 @@
1617
# KIND, either express or implied. See the License for the
1718
# specific language governing permissions and limitations
1819
# under the License.
19-
from __future__ import print_function
20+
from __future__ import print_function, unicode_literals
2021
from compatibility import _xrange as xrange
2122

2223
from bitarray import bitarray
@@ -118,7 +119,7 @@ def __init__(self, impalad, kerberos_host_fqdn, use_kerberos=False,
118119
ldap_password=None, use_ldap=False, client_connect_timeout_ms=60000,
119120
verbose=True, use_http_base_transport=False, http_path=None):
120121
self.connected = False
121-
self.impalad_host = impalad[0].encode('ascii', 'ignore')
122+
self.impalad_host = impalad[0]
122123
self.impalad_port = int(impalad[1])
123124
self.kerberos_host_fqdn = kerberos_host_fqdn
124125
self.imp_service = None
@@ -378,8 +379,8 @@ def _get_http_transport(self, connect_timeout_ms):
378379

379380
if self.use_ldap:
380381
# Set the BASIC auth header
381-
auth = base64.encodestring(
382-
"{0}:{1}".format(self.user, self.ldap_password)).strip('\n')
382+
user_passwd = "{0}:{1}".format(self.user, self.ldap_password)
383+
auth = base64.encodestring(user_passwd.encode()).decode().strip('\n')
383384
transport.setCustomHeaders({"Authorization": "Basic {0}".format(auth)})
384385

385386
transport.open()
@@ -1005,12 +1006,12 @@ def _get_thrift_client(self, protocol):
10051006
return ImpalaService.Client(protocol)
10061007

10071008
def _options_to_string_list(self, set_query_options):
1008-
if sys.version_info.major < 3:
1009-
key_value_pairs = set_query_options.iteritems()
1010-
else:
1011-
key_value_pairs = set_query_options.items()
1009+
if sys.version_info.major < 3:
1010+
key_value_pairs = set_query_options.iteritems()
1011+
else:
1012+
key_value_pairs = set_query_options.items()
10121013

1013-
return ["%s=%s" % (k, v) for (k, v) in key_value_pairs]
1014+
return ["%s=%s" % (k, v) for (k, v) in key_value_pairs]
10141015

10151016
def _open_session(self):
10161017
# Beeswax doesn't have a "session" concept independent of connections, so
@@ -1241,4 +1242,3 @@ def _do_beeswax_rpc(self, rpc, suppress_error_on_cancel=True):
12411242
raise RPCException("ERROR: %s" % e.message)
12421243
if "QueryNotFoundException" in str(e):
12431244
raise QueryStateException('Error: Stale query handle')
1244-

0 commit comments

Comments
 (0)