Change String handling in Python 3 to byte type #54
The bytes type allows raw binary data held in a std::string to be passed to Python unchanged. In most cases the Python code then has to call .decode('utf-8') itself, but this lets the programmer choose when and where the text is decoded in Python, instead of the binding assuming it always needs to be. Use case: passing binary data in a std::string containing 0xff and other non-text bytes to Python, and passing a bytes object from Python to a C++ std::string.
My understanding is that in Python 2 the std::string -> Python string conversion already behaves this way. For a C++ string -> Python 3 unicode string conversion, wstring seems more appropriate than std::string. |
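To make the use case concrete, here is a minimal Python 3 sketch. It is not part of the patch, and the payload value and the helper name `try_decode_utf8` are made up for illustration. It shows why a buffer containing 0xff cannot be implicitly decoded as UTF-8, and how receiving bytes leaves the decoding decision to the caller.

```python
# Hypothetical payload, standing in for what a C++ std::string might hold.
raw = b"\xff\x00header"


def try_decode_utf8(data):
    """Return the decoded text, or None when the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# 0xff can never appear in well-formed UTF-8, so this payload is not text.
binary_result = try_decode_utf8(raw)

# Bytes that really are UTF-8 text decode fine when the caller asks for it.
text_result = try_decode_utf8("h\u00e9llo".encode("utf-8"))
```

With a bytes return type the caller decides which of the two paths applies; with an implicit UTF-8 conversion the first case would fail inside the binding.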
I have a C++ char* variable which is GBK encoded. It has the same issue when the C++ char* is returned to Python 3 as unicode. I want to return the char* as Python 3 bytes instead. How can I do that? |
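The GBK case can be sketched on the Python side as follows. This is only an illustration under the assumption that the binding hands the data over as bytes; the sample text is arbitrary and `gbk_bytes` stands in for the contents of the C++ char*.

```python
# Two CJK characters, encoded the way the C++ side is assumed to hold them.
text = "\u4e2d\u6587"
gbk_bytes = text.encode("gbk")

# An implicit UTF-8 conversion would fail on this data.
try:
    gbk_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# If the bytes reach Python untouched, the caller can pick the right codec.
decoded = gbk_bytes.decode("gbk")
```

The point is only that the correct codec is known to the application, not to the binding layer.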
I don't agree with this PR. It seems best to assume that See this comment: #85 (comment) |
If the user is using another encoding everywhere, then another option would be to allow the "default encoding" to be UTF-8, but customizable (something like calling |
We have been using But I can see why some people might adopt a different convention, for example to automatically convert Would be possible to add some support to Boost::Python so this can be overwritten by library authors? AFAIU boost's internals that would not be possible right, because boost's conversion registry is global. This is a problem if, for example, I'm using two libraries ( |
The C++ standard only defines std::string to be a string of char's and says nothing about encodings. It can hold any arbitrary information, not just text in utf-8. |
I agree, |
I am not positive, but I do believe the converters are part of the header portion of Boost.Python (not the compiled library portion). It has been so long that I can't remember for sure anymore. |
Thanks for the response @centerionware, those are all valid points. We chose to use But I agree that is not a general solution, and can break other APIs which don't follow that assumption. I've seen APIs which return text encoded in the system encoding, which can definitely be incompatible with I think returning raw
if api.get_name() == b'W1':
Instead of:
if api.get_name() == 'W1':
As there is no implicit conversion between
If that's the case, a library could be compiled with a proper |
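The comparison concern raised above can be checked directly in Python 3. In this small sketch `name` stands in for the return value of the `api.get_name()` call from the comment, assuming it returned bytes:

```python
# Stand-in for what a bytes-returning get_name() would produce.
name = b"W1"

# bytes and str never compare equal in Python 3, even for pure ASCII content.
matches_str = (name == "W1")
matches_bytes = (name == b"W1")

# Decoding explicitly restores the familiar str comparison.
matches_decoded = (name.decode("ascii") == "W1")

# Mixing the two types in an operation is an error rather than a conversion.
try:
    name + "suffix"
    concat_ok = True
except TypeError:
    concat_ok = False
```

So existing call sites that compare against str literals would silently stop matching, which is the porting cost being weighed in this thread.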
I don't agree, |
My point isn't about this patch, it is about the intent of |
@stefanseefeld I agree that I agree that this does not seem ideal, but also there does not seem to be an "ideal" solution if we consider practicality vs. purity: it really helps a lot to port code base (from ASCII to Unicode and from Py2 to Py3) that use |
I see what you mean, thanks. I agree the API assumes fixed-width characters, but I've seen Is there a standard recommendation on how to pass binary and text data around? Using |
Interestingly, pybind11 went with the approach of considering http://pybind11.readthedocs.io/en/master/advanced/cast/overview.html#conversion-table Here's the relevant code for those interested. |
Thanks for the reference, that's indeed very useful to know. Still, I maintain my point that |
@tadeu |
@wythend, this is our patch:

index c2e01c0..c39d5d7 100644
--- boost/python/converter/builtin_converters.hpp
+++ boost/python/converter/builtin_converters.hpp
@@ -13,6 +13,9 @@
# include <complex>
# include <boost/limits.hpp>
+#define BOOST_PYTHON_FORCE_UNICODE
+
// Since all we can use to decide how to convert an object to_python
// is its C++ type, there can be only one such converter for each
// type. Therefore, for built-in conversions we can bypass registry
@@ -156,6 +159,10 @@ BOOST_PYTHON_TO_PYTHON_BY_VALUE(unsigned BOOST_PYTHON_LONG_LONG, ::PyLong_FromUn
BOOST_PYTHON_TO_PYTHON_BY_VALUE(char, converter::do_return_to_python(x), &PyUnicode_Type)
BOOST_PYTHON_TO_PYTHON_BY_VALUE(char const*, converter::do_return_to_python(x), &PyUnicode_Type)
BOOST_PYTHON_TO_PYTHON_BY_VALUE(std::string, ::PyUnicode_FromStringAndSize(x.data(),implicit_cast<ssize_t>(x.size())), &PyUnicode_Type)
+#elif defined(BOOST_PYTHON_FORCE_UNICODE)
+BOOST_PYTHON_TO_PYTHON_BY_VALUE(char, converter::do_return_to_python(x), &PyString_Type)
+BOOST_PYTHON_TO_PYTHON_BY_VALUE(char const*, converter::do_return_to_python(x), &PyString_Type)
+BOOST_PYTHON_TO_PYTHON_BY_VALUE(std::string, ::PyUnicode_FromStringAndSize(x.data(),implicit_cast<ssize_t>(x.size())), &PyUnicode_Type)
#else
BOOST_PYTHON_TO_PYTHON_BY_VALUE(char, converter::do_return_to_python(x), &PyString_Type)
BOOST_PYTHON_TO_PYTHON_BY_VALUE(char const*, converter::do_return_to_python(x), &PyString_Type)
diff --git libs/python/src/converter/builtin_converters.cpp libs/python/src/converter/builtin_converters.cpp
index 1c28af7..cde46f8 100644
--- libs/python/src/converter/builtin_converters.cpp
+++ libs/python/src/converter/builtin_converters.cpp
@@ -366,7 +366,7 @@ namespace
static PyTypeObject const* get_pytype() { return &PyFloat_Type;}
};
-#if PY_VERSION_HEX >= 0x03000000
+#if PY_VERSION_HEX >= 0x03000000 || defined(BOOST_PYTHON_FORCE_UNICODE)
unaryfunc py_unicode_as_string_unaryfunc = PyUnicode_AsUTF8String;
#endif
@@ -379,14 +379,16 @@ namespace
#if PY_VERSION_HEX >= 0x03000000
return (PyUnicode_Check(obj)) ? &py_unicode_as_string_unaryfunc :
PyBytes_Check(obj) ? &py_object_identity : 0;
+#elif defined(BOOST_PYTHON_FORCE_UNICODE)
+ return (PyUnicode_Check(obj)) ? &py_unicode_as_string_unaryfunc : 0;
#else
return (PyString_Check(obj)) ? &obj->ob_type->tp_str : 0;
#endif
};
- // Remember that this will be used to construct the result object
-#if PY_VERSION_HEX >= 0x03000000
+ // Remember that this will be used to construct the result object
+#if PY_VERSION_HEX >= 0x03000000 || defined(BOOST_PYTHON_FORCE_UNICODE)
static std::string extract(PyObject* intermediate)
{
return std::string(PyBytes_AsString(intermediate),PyBytes_Size(intermediate));

but you'll probably have to use Also note that this is only for Python 2. |
THANK YOU VERY MUCH. I will try it and read it until i understand how types convert between c++ and python. |
The wrapper that allows us to use Python file-like objects as C++ streams was specific to Python 2 inasmuch as it wrapped data in a str() before sending it to the write() method. As of Boost 1.67, Boost.Python doesn't have a convenient wrapper for bytes the way it does str. We instantiate a bytes() object around the raw data ourselves. Some of the issues involved are discussed in this pull request: boostorg/python#54
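The stream issue described in that commit message can be illustrated from the Python side. This is a sketch using only the standard `io` module, not the wrapper from that project; the sample data is arbitrary.

```python
import io

# Stand-in for raw data coming from the C++ stream buffer.
raw = "caf\u00e9".encode("utf-8")

# A binary file-like object accepts bytes in its write() method.
binary_sink = io.BytesIO()
binary_sink.write(bytes(raw))

# A text file-like object rejects bytes, so wrapping data in str() no
# longer works for binary streams, and bytes no longer works for text ones.
text_sink = io.StringIO()
try:
    text_sink.write(raw)
    text_accepts_bytes = True
except TypeError:
    text_accepts_bytes = False

# Text streams need an explicit decode first.
text_sink.write(raw.decode("utf-8"))
```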
At the very least it should be easy to optionally return/receive/handle bytes. IE: built-in encode/decode from some boost/py bytes vector. |
+1 to this. I have a library with functions returning arbitrary binary in |
Let's talk about basic data types for a second. struct string { char* s; size_t si; } is the most basic form we can break a string down to, for example purposes only (obviously the standard library is more detailed). I have used this patch for a few years now, and others may use it or want to. It should be an #ifdef to change the behaviour, or some 'std::text' or 'std::unicode' (which curiously I've never heard of) should be used for text or unicode only. It sucks that wstring was created for 'wide strings' that can contain multi-byte characters and is implemented differently across OSes. But std::string::size() returns the number of bytes in a string, and std::string::c_str() returns it as char* (not null terminated). C++ strings are not made for text; they're made for arbitrary data that is often used for text. |
@centerionware minor correction: But other than that I agree with you. |
You're right, I had to look that one up. It's std::string::data() that doesn't guarantee a null termination, where c_str() does. |
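The "bytes, not characters" point has a direct analogue on the Python side. The following sketch (sample values chosen only for illustration) shows that the length of encoded data counts bytes, and that a bytes object keeps embedded NUL bytes, much as a std::string with an explicit size does:

```python
# Five characters, one of which needs two bytes in UTF-8.
text = "na\u00efve"
encoded = text.encode("utf-8")

char_count = len(text)      # counts code points
byte_count = len(encoded)   # counts bytes, like std::string::size()

# An embedded NUL does not terminate a bytes object.
with_nul = b"ab\x00cd"
nul_length = len(with_nul)
```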
The bytes type allows raw binary data to be passed in from std::strings.
In most cases the Python code then has to call .decode('utf-8') itself,
but the programmer gets to choose when and where the text is decoded in
Python, instead of the binding assuming it always needs to be.
Use case: passing binary data in a std::string containing 0xff and other
non-text bytes to Python, and passing a bytes object from Python to C++ as a std::string.