
[stdlib] Fix atof precision issues and use SOTA algorithm #3434

Open · wants to merge 8 commits into base: nightly

Conversation

@gabrieldemarmiesse (Contributor) commented Aug 30, 2024

Goal

Fix the correctness problem of the atof function.

Fixes the following issue:

Supersedes the following PR:

This pull request is a complete reimplementation of the atof function, which was returning incorrect numbers due to doing too many floating-point operations and losing accuracy in the process.

The primary goal of this PR was to provide a correct implementation and to raise an error when the function cannot handle the input. Since this requires a full rewrite and fairly complex parsing, let's use a state-of-the-art algorithm while we're at it.

The rationale for going with the state of the art directly is also that this function will be used massively everywhere when parsing JSON, XML, YAML, user inputs, CSV... Thus performance is very important here. While we can't provide a very fast atof right now, we want to lay the groundwork to make this possible.

As a side note, the error messages were improved.

Implementation

The algorithm chosen here was the one from the following papers:

The reference implementation used for this PR can be found here: https://github.com/CarlVerret/csFastFloat

The part about parsing digits was not taken from the paper and was written by myself, with a focus on using SIMD and avoiding hard-to-predict branches.

Limitations

Performance

I'm not sure it would be interesting to talk about the performance benefits of this pull request because:

  1. The original atof was returning wrong values, and it's no use running a benchmark against an incorrect function.
  2. A few issues in the stdlib tank the performance of this function, notably String.removesuffix() returning a String instead of a StringSlice, and the fact that String doesn't have short string optimization yet, meaning that calling String.removesuffix() does a memory allocation on the heap.
  3. To avoid many instructions, we need an intrinsic for full multiplication (a full_multiplication function) which takes two 64-bit integers and returns two 64-bit integers (the high bits and the low bits).
  4. Many constants were hardcoded in this pull request and could be adapted to different architectures and benchmarked independently. I'm notably thinking about the function converting a series of digit characters into a UInt64. Generating those constants at compile time is not possible yet, as Mojo's compile-time interpreter still has some progress to make first.
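As context for point 3 above: a full multiplication computes the exact 128-bit product of two 64-bit integers and returns it as two 64-bit halves. A minimal Python sketch of the semantics (Python integers are arbitrary-precision, so this only models what the single hardware instruction would do; the function name mirrors the full_multiplication mentioned above):

```python
def full_multiplication(a: int, b: int) -> tuple[int, int]:
    """64x64 -> 128-bit multiply, returned as (high, low) 64-bit halves.

    Models the semantics of a hardware widening multiply (e.g. x86 MULX);
    real code would use an intrinsic instead of arbitrary-precision ints.
    """
    assert 0 <= a < 2**64 and 0 <= b < 2**64
    product = a * b  # exact 128-bit product
    return product >> 64, product & (2**64 - 1)

# 2**63 * 4 == 2**65, i.e. high half 2, low half 0.
print(full_multiplication(2**63, 4))  # (2, 0)
```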

I expect that when more progress has been made on those issues, we can start benchmarking the atof function seriously and look into further optimization. The base algorithm is there, as well as the unit tests, so the optimization step should be very easy.

Correctness

We can't currently process floating-point representations whose decimal significand doesn't fit in a UInt64, e.g. 47421763.548648646474532187448684, because 47421763548648646474532187448684 doesn't fit in a UInt64.

Rather than returning a wrong number, we produce an error.

This limitation can surely be removed in the future, but was deemed out-of-scope for this pull request.
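To make the limitation concrete, here is a hedged Python sketch of the check (a hypothetical helper for illustration only; the PR's Mojo code raises an error instead of returning a bool, and a well-formed input string is assumed):

```python
def significand_fits_uint64(s: str) -> bool:
    """Return True if the decimal significand of s fits in a UInt64."""
    mantissa = s.split("e")[0].split("E")[0]         # drop any exponent part
    digits = mantissa.lstrip("+-").replace(".", "")  # keep only the digits
    return int(digits) < 2**64

print(significand_fits_uint64("3.14159"))                            # True
print(significand_fits_uint64("47421763.548648646474532187448684"))  # False
```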

Merging this PR

  • While the diff can seem big, many of the added lines are constants, comments, and licenses.
  • We can have an open discussion with the maintainers about how to split this PR. As you might know, I very much like small PRs too, but it is also good to have a bird's-eye view of the situation with this global PR. If you want to split it and have a specific split in mind, feel free to tell me.

@gabrieldemarmiesse gabrieldemarmiesse marked this pull request as ready for review August 30, 2024 13:47
@gabrieldemarmiesse gabrieldemarmiesse requested a review from a team as a code owner August 30, 2024 13:47
@gabrieldemarmiesse (Contributor Author)

@lemire We use your work here. If you have the time, feel free to give feedback! And congratulations on the papers, they're really easy to read!

@martinvuyk (Contributor)

Hi @gabrieldemarmiesse this is awesome, seems like you're on a crusade for performance in the stdlib and I love it 😆.

ideas:

  • Maybe add @always_inline in a couple of places, since these funcs are just helpers for one thing?
  • Could we throw this in the math package and just import atof in string.mojo?
    • Some of these constant LUTs seem like a better fit for the math module (since we are already spending the memory space for the tables, we might as well reuse them in other places; dunno what algos would use them though 🤷‍♂️).
    • Almost all of the functions are bit manipulation or number "characterization" rather than anything to do with String or collections.

@lemire commented Sep 2, 2024

@gabrieldemarmiesse This looks like great work.

Have you tried benchmarking it, if only privately (for yourself)?

A file that I like to use for benchmarking is canada.txt: https://github.com/lemire/simple_fastfloat_benchmark/blob/master/data/canada.txt

I can help with the benchmarking if needed.
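The shape of such a benchmark is simple: parse a large list of ASCII floats and report throughput. A self-contained Python sketch (tokens are generated here instead of read from canada.txt, and Python's built-in float() stands in for atof, so the numbers are only illustrative):

```python
import random
import time

def bench_float_parsing(n: int = 100_000) -> float:
    """Parse n ASCII float tokens and return throughput in MB/s."""
    rng = random.Random(42)
    tokens = [repr(rng.uniform(-1e6, 1e6)) for _ in range(n)]
    total_bytes = sum(len(t) for t in tokens)
    checksum = 0.0  # keep a live result so the loop isn't optimized away
    start = time.perf_counter()
    for tok in tokens:
        checksum += float(tok)  # the operation under test
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6

print(f"{bench_float_parsing():.1f} MB/s")
```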

@gabrieldemarmiesse (Contributor Author)

@martinvuyk thanks for your review!

Maybe add @always_inline in a couple of places, since these funcs are just helpers for one thing?

I will hold off on manually inlining things until I can prove with a benchmark that it's worth it.

Could we throw this in the math package and just import atof in string.mojo?

I don't have strong opinions in general about code organization (where we put functions, constants, and such), so I'll wait for a stdlib maintainer to tell me if they have something specific in mind.

@gabrieldemarmiesse (Contributor Author) commented Sep 3, 2024

@lemire Thank you for taking a look at this!

Concerning benchmarking, I haven't done anything myself, but I'd love to.

If we want to benchmark, we would first have to disable/adapt the functions that "malloc" in atof, notably changing these functions:

fn strip_unused_characters(x: String) -> String:
    result = x.strip()                 # malloc + copy here
    result = result.removesuffix("f")  # malloc + copy here
    result = result.removesuffix("F")  # malloc + copy here
    result = result.removeprefix("+")  # malloc + copy here
    return result


fn get_sign(x: String) -> Tuple[Float64, String]:
    if x.startswith("-"):
        return (-1.0, x[1:])  # malloc + copy here
    return (1.0, x)           # malloc + copy here

That's because the Mojo String type is not currently fleshed out, and it does a lot of copies where they're not needed.

If we do that, we could benchmark correctly. Though we wouldn't be able to compare against what's already in the stdlib, as the current atof() has precision issues for big powers of 10 and thus can't be considered a "working baseline".

I think it's something that I will look into at the end of this week / next week, as I'm currently busy preparing the next community meeting talk, and I also have this PR that I need to clean up, benchmark, and write a good description for:

This PR is also based on your work by the way :)

After I'm done with this, I'd be very happy to get some help with benchmarking. I'm sure there are things that I can improve in my method.

@lemire commented Sep 3, 2024

@gabrieldemarmiesse We can do comparative benchmarking against other systems; that's why I was offering to help. We can know how good it can be, because we have other implementations. So if there are performance problems, they can be identified. :-)

@gabrieldemarmiesse (Contributor Author)

Sure thing, that's also what I had in mind. I'll ping you when I'm done with the talk and the b64 PR :)

@gabrieldemarmiesse (Contributor Author)

@lemire FYI, benchmarking is on hold until #3460 is fixed, otherwise, loading canada.txt is way too slow (5 min+)

@lemire commented Sep 13, 2024

too slow (5 min+)

:-/

@soraros (Contributor) commented Sep 19, 2024

The part about parsing digits was not taken from the paper and was written by myself, with a focus on using SIMD and avoiding hard-to-predict branches.

Could you elaborate on this?

@lemire commented Sep 20, 2024

@soraros

I suspect that @gabrieldemarmiesse refers to _get_w_and_q_from_float_string. You need to profile this code, but it is possible that it could become a bottleneck.

I have not studied it carefully, but it seems to do a first pass which maps the data to stack buffers, and then it does some conversion (to_integer).

I think that's one piece of code that might benefit from benchmarking and optimization.

(Note that I am not saying it is slow or suboptimal. I do not know.)

@gabrieldemarmiesse (Contributor Author) commented Sep 20, 2024

Indeed, @lemire, you're right there. It's one piece of code that wasn't described in the paper, so I crafted something to parse the decimal exponent and significand using best practices as much as possible (SIMD to reduce the instruction count, loop unrolling to avoid branches). That can also be seen as premature optimization in a way, and it would benefit from benchmarking/profiling, like the rest of the code.
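For readers curious what branch-free digit parsing looks like, the fast_float family of parsers uses a well-known SWAR trick that turns 8 ASCII digits into an integer with three multiply-shift-mask steps instead of a per-digit loop. A Python sketch of that generic technique (this is not the Mojo code from this PR):

```python
def parse_eight_digits(chunk: bytes) -> int:
    """Branch-free parse of exactly 8 ASCII digits (SWAR technique).

    Loads the digits into one 64-bit word, then combines adjacent
    1-, 2-, and 4-digit lanes with multiply/shift/mask steps.
    """
    M = 0xFFFFFFFFFFFFFFFF  # emulate 64-bit wraparound
    val = int.from_bytes(chunk, "little")
    val = ((val & 0x0F0F0F0F0F0F0F0F) * 2561 & M) >> 8             # -> 2-digit lanes
    val = ((val & 0x00FF00FF00FF00FF) * 6553601 & M) >> 16         # -> 4-digit lanes
    val = ((val & 0x0000FFFF0000FFFF) * 42949672960001 & M) >> 32  # -> 8 digits
    return val

print(parse_eight_digits(b"12345678"))  # 12345678
```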

I was digging into benchmarking; the mallocs take a significant portion of the time, unsurprisingly. We can fix that by doing some optimization on the String methods, like returning a string view instead of a new String in removesuffix.

atof should never allocate on the heap unless it's for error messages; there is no reason to do so.

Those optimizations are really related to String and thus can be the subject of other PRs, I believe. The main contribution here is that this atof implementation is correct, unlike the previous one in the stdlib.

@lemire commented Sep 20, 2024

@gabrieldemarmiesse

My comments are not an objection to this PR, and I do recommend it gets merged...

I was only elaborating on the point you raised in the PR comment, so that people reading this understand the issue better. (@soraros asked a question.)

I think it is worth writing it out because folks might use this PR later for reference to understand where the implementation is at.

The string parsing component can be a bottleneck and is trickier than it seems. Section 4 of the following article discusses string processing:
https://arxiv.org/pdf/2101.11408

The C# implementation of this approach is at https://github.com/CarlVerret/csFastFloat/blob/3752dc9b41570e3d2b9ab98db54ef8921474ddce/csFastFloat/Structures/ParsedNumberString.cs#L17

It should probably be considered, as it has proven to be quite fast.

After merging this PR, I recommend trying other strategies for the string processing component with benchmarks. I'd be happy to contribute.

@soraros (Contributor) commented Oct 1, 2024

I have a version of full_mul that's super fast and cross-platform (notice it returns the (lo, hi) pair as a SIMD vector):

@always_inline("nodebug")
fn full_mul(lhs: UInt64, rhs: UInt64) -> UInt128:
  l = __mlir_op.`pop.cast`[_type = __mlir_type.`!pop.scalar<ui128>`](lhs.value)
  r = __mlir_op.`pop.cast`[_type = __mlir_type.`!pop.scalar<ui128>`](rhs.value)
  mul = __mlir_op.`pop.mul`(l, r)
  return __mlir_op.`pop.bitcast`[_type = __mlir_type.`!pop.simd<2, ui64>`](mul)

LLVM IR:

define dso_local <2 x i64> @full_mul(i64 noundef %0, i64 noundef %1) #0 {
  %3 = zext i64 %0 to i128
  %4 = zext i64 %1 to i128
  %5 = mul i128 %3, %4
  %6 = bitcast i128 %5 to <2 x i64>
  ret <2 x i64> %6
}

Assembly (basically a single instruction):

full_mul_llvm:
  movq	%rsi, %rdx
  mulxq	%rdi, %rcx, %rax
  vmovq	%rax, %xmm0
  vmovq	%rcx, %xmm1
  vpunpcklqdq	%xmm0, %xmm1, %xmm0
  retq

@gabrieldemarmiesse (Contributor Author)

@soraros Thank you for providing this code snippet! I believe UInt128 multiplication can be tackled separately and deserves a public API (adding a DType for it so that we can have SIMD support, etc.). Once that's done, we can use it in atof to improve performance, and that change can be the topic of another PR.

@JoeLoser (Collaborator)

@jackos given your recent work with floats, do you mind taking a look at this one, please?

@JoeLoser (Collaborator)

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Nov 21, 2024