-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[stdlib] Fix atof
precision issues and use SOTA algorithm
#3434
base: nightly
Are you sure you want to change the base?
[stdlib] Fix atof
precision issues and use SOTA algorithm
#3434
Conversation
Signed-off-by: gabrieldemarmiesse <[email protected]>
@lemire We use your work here. If you have the time, feel free to give feedback! And congratulation for the papers, they're really easy to read! |
Hi @gabrieldemarmiesse this is awesome, seems like you're on a crusade for performance in the stdlib and I love it 😆. ideas:
|
@gabrieldemarmiesse This looks like great work. Have you tried benchmarking it, if only privately (for yourself)? A file that I like to use for benchmarking is canada.txt: https://github.com/lemire/simple_fastfloat_benchmark/blob/master/data/canada.txt I can help with the benchmarking if needed. |
@martinvuyk thanks for your review!
I will hold off manually inlining things until I can prove with a benchmark that it's worth it.
I don't have opinions in general about code organization (where we put functions, constants and such). So I'll wait for a stdlib maintainer to tell me if they have something specific in mind regarding this. |
@lemire Thank you for taking a look at this! Concerning benchmarking, I haven't done anything myself, but I'd love to. If we want to benchmark we would have first to disable / adapt the functions that "malloc" in atof, notably change those functions: fn strip_unused_characters(x: String) -> String:
result = x.strip() # malloc + copy here
result = result.removesuffix("f") # malloc + copy here
result = result.removesuffix("F") # malloc + copy here
result = result.removeprefix("+") # malloc + copy here
return result
fn get_sign(x: String) -> Tuple[Float64, String]:
if x.startswith("-"):
return (-1.0, x[1:]) # malloc + copy here
return (1.0, x) # malloc + copy here That's because the Mojo If we do that, we could benchmark correctly. Though we wouldn't be able to compare to what's already in the stdlib as the current I think it's something that I will look into end of this week / next week as I'm currently busy with making the next community meeting talk and I have also this PR that I need to cleanup, benchmark and it also needs a good description: This PR is also based on your work by the way :) After I'm done with this, I'd be very happy to get some help with benchmarking. I'm sure there are things that I can improve in my method. |
@gabrieldemarmiesse We can do comparative benchmarking with other systems. That's why I was offering to help. We can know how good it can be, because we have other implementations. So if there are problems (performance problems), they can be identified. :-) |
Sure thing, that's also what I had in mind. I'll ping you when I'm done with the talk and the b64 PR :) |
:-/ |
Could you elaborate on this? |
I suspect that @gabrieldemarmiesse refers to I have not studied it carefully, but it seems to do a first pass which maps to data to stack buffers, and then it does some conversion ( I think that's one piece of code that might benefit from benchmarking and optimization. (Note that I am not saying it is slow or suboptimal. I do not know.) |
Indeed, @lemire you're right there. It's one piece of code that wasn't described in the paper, so I crafted something to parse the decimal exponent and significand using as much as possible the best practices (simd to reduce instructions, loop unrolling to avoid branches). That can also be seen as premature optimization in a way and would benefit from benchmarking/profiling, as the rest of the code. I was digging into benchmarking, the
Those optimizations are really related to |
My comments are not an objection to this PR, and I do recommend it gets merged... I was only elaborating on the point you raised in the PR comment, so that people reading this understand the issue better. (@soraros asked a question.) I think it is worth writing it out because folks might use this PR later for reference to understand where the implementation is at. The string parsing component can be a bottleneck and is trickier than it seems. In Section 4 of the following article, string processing is discussed... The C# implementation of this approach is at https://github.com/CarlVerret/csFastFloat/blob/3752dc9b41570e3d2b9ab98db54ef8921474ddce/csFastFloat/Structures/ParsedNumberString.cs#L17 It should probably be considered as it has proven to be quite fast. After merging this PR, I recommend trying other strategies for the string processing component with benchmarks. I'd be happy to contribute. |
Signed-off-by: gabrieldemarmiesse <[email protected]>
I have a version of the @always_inline("nodebug")
fn full_mul(lhs: UInt64, rhs: UInt64) -> UInt128:
l = __mlir_op.`pop.cast`[_type = __mlir_type.`!pop.scalar<ui128>`](lhs.value)
r = __mlir_op.`pop.cast`[_type = __mlir_type.`!pop.scalar<ui128>`](rhs.value)
mul = __mlir_op.`pop.mul`(l, r)
return __mlir_op.`pop.bitcast`[_type = __mlir_type.`!pop.simd<2, ui64>`](mul) LLVM IR: define dso_local <2 x i64> @full_mul(i64 noundef %0, i64 noundef %1) #0 {
%3 = zext i64 %0 to i128
%4 = zext i64 %1 to i128
%5 = mul i128 %3, %4
%6 = bitcast i128 %5 to <2 x i64>
ret <2 x i64> %6
} Assembly (basically single instruction): full_mul_llvm:
movq %rsi, %rdx
mulxq %rdi, %rcx, %rax
vmovq %rax, %xmm0
vmovq %rcx, %xmm1
vpunpcklqdq %xmm0, %xmm1, %xmm0
retq |
@soraros thank you for providing this code snippet! I believe UInt128 multiplication can be tackled in another PR and would deserve a public API (add a DType for it so that we can have simd support, etc...). Once this is done, we can use it in atof to improve the performance and that change can be the topic of another PR. |
Signed-off-by: gabrieldemarmiesse <[email protected]>
@jackos given your recent work with floats, do you mind taking a look at this one, please? |
!sync |
Goal
Fix the correctness problem of the
atof
function.Fixes the following issue:
atof
returns a wrong result for too big or too small exponents #3419Supersedes the following PR:
This pull request is a complete reimplementation of the
atof
function, which was returning incorrect numbers due to doing too many floating point operations and loosing accuracy in the process.The primary goal of this PR was to provide a correct implementation and raising an error when the function cannot handle the input. Since this require a full rewriting and a quite complexe parsing, let's use a state of the art algorithm while we're at it.
The rational for going with state of the art directly is also that this function will be used massively everywhere when parsing json, xml, yaml, user inputs, csv... Thus performance is very important here. While we can't provide a very fast
atof
right now, we want to lay out the groundwork to make this possible.As a side note, the error messages were improved.
Implementation
The algorithm chosen here was the one from the following papers:
The reference implementation used for this PR can be found here: https://github.com/CarlVerret/csFastFloat
The part about parsing digits was not taken from the paper and was written by myself, with a focus on using SIMD and avoid hard to predict branches.
Limitations
Performance
I'm not sure it would be interesting to talk about the performance benefits of this pull request because:
atof
was returning wrong values, and it's no use running a benchmark against an incorrect function.String.removesuffix()
which returns aString
instead of aStringSlice
and the fact thatString
doesn't have short string optimization yet, meaning that callingString.removesuffix()
do a memory allocation on the heap.full_multiplication
function) which takes two 64 bits integers and returns two 64 bits integers (high bits and low bits).UInt64
. Generating those constants at compile-time is not possible yet as Mojo's compile-time interpreter has some progress to do first.I expect that when more progress have been done on those issues, we can start benchmarking the
atof
function seriously and look into further optimization. The base algorithm is there, as well as the unit tests and so it should make the optimization step very easy.Correctness
We can't currently process floating point representation which have a decimal significand that doesn't fit in an
UInt64
. e.g.47421763.548648646474532187448684
because47421763548648646474532187448684
doesn't fit in aUInt64
.Rather than returning a wrong number, we produce an error.
This limitation can surely be removed in the future, but was deemed out-of-scope for this pull request.
Merging this PR