Handle Unicode regularizer norm

On Python 3, the default string type is Unicode, which broke the regularizer norm comparison. In particular, the length check was off because a Unicode string may occupy more bytes than the equivalent ASCII string. To fix this, encode Unicode strings as ASCII and convert them to C strings; this handles Python byte strings just as well. Comparing the string length and the string value is then straightforward. If the conversion fails (for example, because the object is not a string of any kind), we get a `NULL` value and raise the same exception as when the string does not match one of the expected values.
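
The check described above can be sketched in isolation, without the Python headers. This is only an illustration under stated assumptions: `Norm` and `parse_norm` are hypothetical names that do not appear in the codebase, `std::invalid_argument` stands in for `CMT::Exception`, and the `'1'` and `'2'` cases are assumed from the error message, since the diff does not show them. The input pointer plays the role of the converted C string and may be `NULL` when the conversion fails.

```cpp
#include <cstring>
#include <stdexcept>

// Hypothetical stand-in for Regularizer::Norm.
enum class Norm { L1, L2 };

// Hypothetical helper: maps a C string (possibly NULL) to a norm.
// Throws std::invalid_argument where the real code throws CMT::Exception.
Norm parse_norm(const char* s) {
	// Test for NULL first so that strlen never sees a null pointer.
	if(s == NULL || std::strlen(s) != 2)
		throw std::invalid_argument("Regularizer norm should be 'L1' or 'L2'.");

	// As in the diff, only the second character is inspected.
	switch(s[1]) {
		case '1': // assumed case, not shown in the diff
			return Norm::L1;
		case '2': // assumed case, not shown in the diff
			return Norm::L2;
		default:
			throw std::invalid_argument("Regularizer norm should be 'L1' or 'L2'.");
	}
}
```

The order of the two conditions matters: because `||` short-circuits, `strlen` is only evaluated once the pointer is known to be non-`NULL`.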
lucastheis · Jun 12, 2017 · d93778d
1 parent fa40d03
commit d93778d
Showing 1 changed file with 5 additions and 2 deletions.
diff --git a/code/cmt/python/src/pyutils.cpp b/code/cmt/python/src/pyutils.cpp
@@ -1,5 +1,6 @@
 #include "pyutils.h"
 #include <inttypes.h>
+#include <string.h>
 
 #include "cmt/utils"
 using CMT::Exception;
@@ -378,10 +379,12 @@ Regularizer PyObject_ToRegularizer(PyObject* regularizer) {
 		Regularizer::Norm norm = Regularizer::L2;
 
 		if(r_norm) {
-			if(PyString_Size(r_norm) != 2)
+			char* r_norm_str = PyString_AsString(r_norm);
+
+			if((r_norm_str == NULL) || (strlen(r_norm_str) != 2))
 				throw Exception("Regularizer norm should be 'L1' or 'L2'.");
 
-			switch(PyString_AsString(r_norm)[1]) {
+			switch(r_norm_str[1]) {
 				default:
 					throw Exception("Regularizer norm should be 'L1' or 'L2'.");