-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unicode crash #56
Comments
@ferd, any comments? |
I don't know what the original code seemed to be or why the output fails. Trying this:
Yields valid output when encoding, which tells me the conversions we build ourselves aren't an issue. However, it appears that the string itself has some invalid characters when we try to input it. A valid JSON string representation stops at character 21-22 of the string:
However, if we print it and let the Erlang drivers figure out whatever encoding:
has a valid string representation (in my shell. It looks broken here). This, to me, seems to point to a problem within jsx's JSON conversion when reading from unicode strings given the shell is able to figure something out that jsx doesn't seem able to. There is a big HOWEVER there. The unicode string given is quite a mess. I'm wondering if it's possible that some of the characters given are control characters (I strongly suspect some are). Such control characters must be encoded using \u and 4 hex digits. So if you're having control characters through that unicode string, it's quite normal for jsx to die on it -- they need to be properly escaped. dvarkin, can you check to see if you have any of these? |
hi! thanks for reply. This json has not any control symbols, only some cyrillic. I think, the problem is in incorrect calc size of unicode string, that socketio_data:json/2 receive as "Length" argument. Maybe this code, has an incorrect match: header(?FRAME ++ Rest=[|], Acc)-> %%% I have I don't know the best solution in this case, maybe working with unicode as binary? |
That could be something related to that. The way code points are seen, it sounds like a very good candidate for the issue where we'd need to read a binary character per character (pattern matching on a utf8 type). Sadly, everything related to unicode in Erlang has to be switched to binaries. The server (misultin) as it is uses lists by default and it then becomes rather unclear what we should do. For JSON strings, we can likely switch things to binary and accumulate data until we have the right length. For regular text though, we'd have no way to know what encoding the user had and then we risk interpreting them in a type they didn't intend. We could force utf8 by default, but I'm not sure it's the best of ideas. In any case, we need to resolve this. Any opinions? Sidenote: the fix dvarkin posted doesn't seem safe to me as it drops the parsed message size. The issue is that the socket.io client will sometimes concatenate two messages ( |
yes, it's not a fix. don't use that! We are using QuotedPrintable but this unsafe to, because of client side. |
I'll try to take some time during my lunch break to generate a good failing test case for the decoding to try and fix the lenght issue. I'll be waiting for comments, but for the meantime, it feels that assuming UTF8 is the safest option -- it'll play well with ASCII and ISO-8859-1 users, although it might be problematic for the UCS-2 and UCS-4 (Windows, Python), UTF16, UTF32, etc. We'll at least always be safe with JSON parsing. |
I studied the problem a bit. The character that causes problems is a unicode control character (U+0084 => 139). There are dozens of them, and they can be inserted, before, between and after any set of characters to modify them into a single visible character. To me, this would be the reason as to why the length of the string gets messed up. If I just take the subsequence [209,139], I get the following reasoning: The resulting character under a binary representation (ы) is the direct two list entries ([209,139]) as a 2-bytes binary (<<209,139>>). However, if I take [209,139] as their literal meaning, it is the "Ñ" character plus a hidden value. This means that while in both cases I need to get somewhat very clever to solve the issue, I first have to figure out if the faulty input you have was meant to be [209, 139] vs. <<209,139>>. Dvarkin, if you could tell me which one of the two it is, I can know the format in which we do things (choosing between unicode:characters_to_list(list_to_binary(Str)) vs. leaving it as it is vs. list_to_binary(Str) vs. unicode:characters_to_binary(Str), etc.) and try to fix stuff. Then I'll need to figure out how the hell we're supposed to find the length of a string based on its control characters. Usually, languages have functions for that, but it seems we don't in Erlang. I'll have to see whatever Javascript does and try to re-implement it here, given that the length given in the serialized string is likely based on that. This is going to prove challenging for bare messages, but I could just let jsx do some stream parsing if I find a json structure, which should be way, way easier. I'm not sure what the performance impacts might be. |
Hi @dvarkin, Check out my branch (https://github.com/ferd/socket.io-erlang) (or wait until it is merged) to see if the changes I brought fix your issues. Hopefully they will. |
Hi @ferd, I tried your fix to send the following message via xhr-polling transport: |
@sinnus, can you share that stack trace, please? |
=ERROR REPORT==== 26-Sep-2011::12:24:16 === =CRASH REPORT==== 26-Sep-2011::12:24:16 === =ERROR REPORT==== 26-Sep-2011::12:24:16 === =CRASH REPORT==== 26-Sep-2011::12:24:16 === =CRASH REPORT==== 26-Sep-2011::12:24:16 === =SUPERVISOR REPORT==== 26-Sep-2011::12:24:16 === =SUPERVISOR REPORT==== 26-Sep-2011::12:24:16 === =PROGRESS REPORT==== 26-Sep-2011::12:24:16 === =CRASH REPORT==== 26-Sep-2011::12:24:16 === =SUPERVISOR REPORT==== 26-Sep-2011::12:24:16 === =CRASH REPORT==== 26-Sep-2011::12:24:16 === =SUPERVISOR REPORT==== 26-Sep-2011::12:24:16 === =ERROR REPORT==== 26-Sep-2011::12:24:16 === =ERROR REPORT==== 26-Sep-2011::12:24:16 === =ERROR REPORT==== 26-Sep-2011::12:24:16 === =ERROR REPORT==== 26-Sep-2011::12:24:16 === |
My hotfix for the issue |
OK, looking at the stack trace, I see:
Socket.io-erlang handles unicode correctly. The problem has to do with how misultin does it. The issue, as far as I can see, is that the unicode is parsed as a binary, which turns all code points in bytes (0..255). Then the binaries are blindly turned into lists, but they don't have the same unicode format in Erlang -- you actually need to convert them to codepoints greater than 255, and then output them with a This is covered on our side, but visibly not on Ostinelli's side (misultin). We can likely hot-patch it either in the data parser the way you did, but I'll be filing a bug report with Ostinelli to see if he can make Misultin right in the first place instead. Here'S the issue on misultin: ostinelli/misultin#61 |
Yes, this is related to how utf-8 is encoded and the size of bytes it uses. |
To summarize my fixes: |
…e)serialization)
We we're try to send unicode data to websocket, but had a crash with this error
** Generic server <0.2107.0> terminating
** Last message in was {websocket,
[126,109,126,50,49,126,109,126,126,106,126,123,34,
116,101,120,116,34,58,34,209,139,209,132,208,178,
208,176,208,178,209,139,208,176,34,125],
{misultin_ws,
{ws,#Port<0.5164>,http,false,
{10,1,5,51},
61776,undefined,
{'draft-hixie',76},
"http://localhost:7879","10.1.5.51:7879",
"/socket.io/websocket",
[{'Upgrade',"WebSocket"},
{'Connection',"Upgrade"},
{'Host',"10.1.5.51:7879"},
{"Origin","http://localhost:7879"},
{"Sec-Websocket-Key1",
"3U4 06< <86 0O0 ) 48"},
{"Sec-Websocket-Key2",
"3 6 8g. )r2 1 4 W_2,370"}],
false},
<0.2105.0>}}
** When Server state == {state,"e95bc077-5f92-4bf4-9bbc-ba98e4636d5a",
undefined,socketio_http_misultin,
{websocket,
{misultin_ws,
{ws,#Port<0.5164>,http,false,
{10,1,5,51},
61776,undefined,
{'draft-hixie',76},
"http://localhost:7879","10.1.5.51:7879",
"/socket.io/websocket",
[{'Upgrade',"WebSocket"},
{'Connection',"Upgrade"},
{'Host',"10.1.5.51:7879"},
{"Origin","http://localhost:7879"},
{"Sec-Websocket-Key1",
"3U4 06< <86 0O0 ) 48"},
{"Sec-Websocket-Key2","3 6 8g. )r2 1 4 W_2,370"}],
false},
<0.2105.0>}},
1,
{#Ref<0.0.0.4154>,10000},
<0.2108.0>,<0.82.0>}
** Reason for termination ==
** {badarg,[{jsx_eep0018,collect,3},
{socketio_data,json,2},
{socketio_transport_websocket,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
After insert this stupid code, every thing goes fine:
json(_Length1, Body) ->
Length = erlang:length(Body),
io:format("~n++++ Length ~p Body
pn", [Length, Body]),{Object, Rest} = lists:split(Length, Body),
io:format("~n++++ Object ~p Rest
pn", [Object, Rest]),[#msg{content=jsx:json_to_term(list_to_binary(Object), [{strict,false}]), json=true} |
handle_rest(Rest)].
socketio_data:json/2
The text was updated successfully, but these errors were encountered: