
[stdlib] Fix atof precision issues and use SOTA algorithm #3434

Open · wants to merge 8 commits into base: nightly

Conversation

@gabrieldemarmiesse (Contributor) commented Aug 30, 2024

Goal

Fix the correctness problem of the atof function.

Fixes the following issue:

Supersedes the following PR:

This pull request is a complete reimplementation of the atof function, which was returning incorrect numbers due to doing too many floating-point operations and losing accuracy in the process.

The primary goal of this PR was to provide a correct implementation and to raise an error when the function cannot handle the input. Since this requires a full rewrite and fairly complex parsing, let's use a state-of-the-art algorithm while we're at it.

The rationale for going with the state of the art directly is also that this function will be used massively everywhere when parsing JSON, XML, YAML, user inputs, CSV... Thus performance is very important here. While we can't provide a very fast atof right now, we want to lay the groundwork to make this possible.

As a side note, the error messages were improved.

Implementation

The algorithm chosen here was the one from the following papers:

The reference implementation used for this PR can be found here: https://github.com/CarlVerret/csFastFloat

The part about parsing digits was not taken from the paper and was written by myself, with a focus on using SIMD and avoiding hard-to-predict branches.

Limitations

Performance

I'm not sure it would be interesting to talk about the performance benefits of this pull request because:

  1. The original atof was returning wrong values, and it's no use running a benchmark against an incorrect function.
  2. A few issues in the stdlib tank the performance of this function, notably String.removesuffix() returning a String instead of a StringSlice, and the fact that String doesn't have short string optimization yet, meaning that calling String.removesuffix() does a memory allocation on the heap.
  3. To avoid many instructions, we need an intrinsic for full multiplication (a full_multiplication function) which takes two 64-bit integers and returns two 64-bit integers (the high bits and the low bits).
  4. Many constants were hardcoded in this pull request and could be adapted to different architectures and benchmarked independently. I'm notably thinking about the function converting a series of digit characters into a UInt64. Generating those constants at compile time is not possible yet, as Mojo's compile-time interpreter still has some progress to make first.
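As context for point 3 above: a full multiplication computes the exact 128-bit product of two 64-bit integers and returns it as two 64-bit halves. A minimal Python sketch of the semantics (Python integers are arbitrary-precision, so this only models what the single hardware instruction would do; the function name mirrors the full_multiplication mentioned above):

```python
def full_multiplication(a: int, b: int) -> tuple[int, int]:
    """64x64 -> 128-bit multiply, returned as (high, low) 64-bit halves.

    Models the semantics of a hardware widening multiply (e.g. x86 MULX);
    real code would use an intrinsic instead of arbitrary-precision ints.
    """
    assert 0 <= a < 2**64 and 0 <= b < 2**64
    product = a * b  # exact 128-bit product
    return product >> 64, product & (2**64 - 1)

# 2**63 * 4 == 2**65, i.e. high half 2, low half 0.
print(full_multiplication(2**63, 4))  # (2, 0)
```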

I expect that when more progress has been made on those issues, we can start benchmarking the atof function seriously and look into further optimization. The base algorithm is there, as well as the unit tests, so the optimization step should be very easy.

Correctness

We can't currently process floating-point representations whose decimal significand doesn't fit in a UInt64, e.g. 47421763.548648646474532187448684, because 47421763548648646474532187448684 doesn't fit in a UInt64.

Rather than returning a wrong number, we produce an error.

This limitation can surely be removed in the future, but was deemed out-of-scope for this pull request.
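To make the limitation concrete, here is a hedged Python sketch of the check (a hypothetical helper for illustration only; the PR's Mojo code raises an error instead of returning a bool, and a well-formed input string is assumed):

```python
def significand_fits_uint64(s: str) -> bool:
    """Return True if the decimal significand of s fits in a UInt64."""
    mantissa = s.split("e")[0].split("E")[0]         # drop any exponent part
    digits = mantissa.lstrip("+-").replace(".", "")  # keep only the digits
    return int(digits) < 2**64

print(significand_fits_uint64("3.14159"))                            # True
print(significand_fits_uint64("47421763.548648646474532187448684"))  # False
```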

Merging this PR

  • While the diff can seem big, many of the added lines are constants, comments, and licenses.
  • We can have an open discussion with the maintainers about how to split this PR. As you might know, I very much like small PRs too, but it is also good to have a bird's-eye view of the situation with this global PR. If you want to split it and have a specific split in mind, feel free to tell me.

@gabrieldemarmiesse gabrieldemarmiesse marked this pull request as ready for review August 30, 2024 13:47
@gabrieldemarmiesse gabrieldemarmiesse requested a review from a team as a code owner August 30, 2024 13:47
@gabrieldemarmiesse (Contributor Author)

@lemire We use your work here. If you have the time, feel free to give feedback! And congratulations on the papers, they're really easy to read!

@martinvuyk (Contributor)

Hi @gabrieldemarmiesse this is awesome, seems like you're on a crusade for performance in the stdlib and I love it 😆.

ideas:

  • Maybe add @always_inline in a couple of places, since these funcs are just helpers for one thing?
  • Could we throw this in the math package and just import atof in string.mojo?
    • Some of these constant LUTs seem like a better fit for the math module (since we are already spending the memory space for the tables, we might as well reuse them in other places; dunno what algos would use them though 🤷‍♂️).
    • Almost all of the functions are bit manipulation or number "characterization" rather than anything to do with String or collections.

@lemire commented Sep 2, 2024

@gabrieldemarmiesse This looks like great work.

Have you tried benchmarking it, if only privately (for yourself)?

A file that I like to use for benchmarking is canada.txt: https://github.com/lemire/simple_fastfloat_benchmark/blob/master/data/canada.txt

I can help with the benchmarking if needed.
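The shape of such a benchmark is simple: parse a large list of ASCII floats and report throughput. A self-contained Python sketch (tokens are generated here instead of read from canada.txt, and Python's built-in float() stands in for atof, so the numbers are only illustrative):

```python
import random
import time

def bench_float_parsing(n: int = 100_000) -> float:
    """Parse n ASCII float tokens and return throughput in MB/s."""
    rng = random.Random(42)
    tokens = [repr(rng.uniform(-1e6, 1e6)) for _ in range(n)]
    total_bytes = sum(len(t) for t in tokens)
    checksum = 0.0  # keep a live result so the loop isn't optimized away
    start = time.perf_counter()
    for tok in tokens:
        checksum += float(tok)  # the operation under test
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6

print(f"{bench_float_parsing():.1f} MB/s")
```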

@gabrieldemarmiesse (Contributor Author)

@martinvuyk thanks for your review!

Maybe add @always_inline in a couple of places, since these funcs are just helpers for one thing?

I will hold off on manually inlining things until I can prove with a benchmark that it's worth it.

Could we throw this in the math package and just import atof in string.mojo?

I don't have strong opinions in general about code organization (where we put functions, constants, and such), so I'll wait for a stdlib maintainer to tell me if they have something specific in mind.

@gabrieldemarmiesse (Contributor Author) commented Sep 3, 2024

@lemire Thank you for taking a look at this!

Concerning benchmarking, I haven't done anything myself, but I'd love to.

If we want to benchmark, we would first have to disable/adapt the functions that "malloc" in atof, notably changing these functions:

fn strip_unused_characters(x: String) -> String:
    result = x.strip()                 # malloc + copy here
    result = result.removesuffix("f")  # malloc + copy here
    result = result.removesuffix("F")  # malloc + copy here
    result = result.removeprefix("+")  # malloc + copy here
    return result


fn get_sign(x: String) -> Tuple[Float64, String]:
    if x.startswith("-"):
        return (-1.0, x[1:])  # malloc + copy here
    return (1.0, x)           # malloc + copy here

That's because the Mojo String type is not currently fleshed out, and it does a lot of copies where they're not needed.

If we do that, we could benchmark correctly. Though we wouldn't be able to compare against what's already in the stdlib, as the current atof() has precision issues for big powers of 10 and thus can't be considered a "working baseline".

I think it's something that I will look into at the end of this week / next week, as I'm currently busy preparing the next community meeting talk, and I also have this PR that I need to clean up, benchmark, and write a good description for:

This PR is also based on your work by the way :)

After I'm done with this, I'd be very happy to get some help with benchmarking. I'm sure there are things that I can improve in my method.

@lemire commented Sep 3, 2024

@gabrieldemarmiesse We can do comparative benchmarking against other systems; that's why I was offering to help. We can know how good it can be, because we have other implementations. So if there are performance problems, they can be identified. :-)

@gabrieldemarmiesse (Contributor Author)

Sure thing, that's also what I had in mind. I'll ping you when I'm done with the talk and the b64 PR :)

@gabrieldemarmiesse (Contributor Author)

@lemire FYI, benchmarking is on hold until #3460 is fixed, otherwise, loading canada.txt is way too slow (5 min+)

@lemire commented Sep 13, 2024

too slow (5 min+)

:-/

@soraros (Contributor) commented Sep 19, 2024

The part about parsing digits was not taken from the paper and was written by myself, with a focus on using SIMD and avoiding hard-to-predict branches.

Could you elaborate on this?

@lemire commented Sep 20, 2024

@soraros

I suspect that @gabrieldemarmiesse refers to _get_w_and_q_from_float_string. You need to profile this code, but it is possible that it could become a bottleneck.

I have not studied it carefully, but it seems to do a first pass which maps the data to stack buffers, and then it does some conversion (to_integer).

I think that's one piece of code that might benefit from benchmarking and optimization.

(Note that I am not saying it is slow or suboptimal. I do not know.)

@gabrieldemarmiesse (Contributor Author) commented Sep 20, 2024

Indeed, @lemire, you're right there. It's one piece of code that wasn't described in the paper, so I crafted something to parse the decimal exponent and significand using best practices as much as possible (SIMD to reduce the instruction count, loop unrolling to avoid branches). That can also be seen as premature optimization in a way, and it would benefit from benchmarking/profiling, like the rest of the code.
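For readers curious what branch-free digit parsing looks like, the fast_float family of parsers uses a well-known SWAR trick that turns 8 ASCII digits into an integer with three multiply-shift-mask steps instead of a per-digit loop. A Python sketch of that generic technique (this is not the Mojo code from this PR):

```python
def parse_eight_digits(chunk: bytes) -> int:
    """Branch-free parse of exactly 8 ASCII digits (SWAR technique).

    Loads the digits into one 64-bit word, then combines adjacent
    1-, 2-, and 4-digit lanes with multiply/shift/mask steps.
    """
    M = 0xFFFFFFFFFFFFFFFF  # emulate 64-bit wraparound
    val = int.from_bytes(chunk, "little")
    val = ((val & 0x0F0F0F0F0F0F0F0F) * 2561 & M) >> 8             # -> 2-digit lanes
    val = ((val & 0x00FF00FF00FF00FF) * 6553601 & M) >> 16         # -> 4-digit lanes
    val = ((val & 0x0000FFFF0000FFFF) * 42949672960001 & M) >> 32  # -> 8 digits
    return val

print(parse_eight_digits(b"12345678"))  # 12345678
```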

I was digging into benchmarking; the mallocs take a significant portion of the time, unsurprisingly. We can fix that by doing some optimization on the String methods, like returning a string view instead of a new String in removesuffix.

atof should never allocate on the heap unless it's for error messages; there is no reason to do so.

Those optimizations are really related to String and thus can be the subject of other PRs, I believe. The main contribution here is that this atof implementation is correct, unlike the previous one in the stdlib.

@lemire commented Sep 20, 2024

@gabrieldemarmiesse

My comments are not an objection to this PR, and I do recommend it gets merged...

I was only elaborating on the point you raised in the PR comment, so that people reading this understand the issue better. (@soraros asked a question.)

I think it is worth writing it out because folks might use this PR later for reference to understand where the implementation is at.

The string parsing component can be a bottleneck and is trickier than it seems. Section 4 of the following article discusses string processing:
https://arxiv.org/pdf/2101.11408

The C# implementation of this approach is at https://github.com/CarlVerret/csFastFloat/blob/3752dc9b41570e3d2b9ab98db54ef8921474ddce/csFastFloat/Structures/ParsedNumberString.cs#L17

It should probably be considered, as it has proven to be quite fast.

After merging this PR, I recommend trying other strategies for the string processing component with benchmarks. I'd be happy to contribute.

@soraros (Contributor) commented Oct 1, 2024

I have a version of full_mul that's super fast and cross-platform (notice it returns the (lo, hi) pair as a SIMD vector):

@always_inline("nodebug")
fn full_mul(lhs: UInt64, rhs: UInt64) -> UInt128:
  l = __mlir_op.`pop.cast`[_type = __mlir_type.`!pop.scalar<ui128>`](lhs.value)
  r = __mlir_op.`pop.cast`[_type = __mlir_type.`!pop.scalar<ui128>`](rhs.value)
  mul = __mlir_op.`pop.mul`(l, r)
  return __mlir_op.`pop.bitcast`[_type = __mlir_type.`!pop.simd<2, ui64>`](mul)

LLVM IR:

define dso_local <2 x i64> @full_mul(i64 noundef %0, i64 noundef %1) #0 {
  %3 = zext i64 %0 to i128
  %4 = zext i64 %1 to i128
  %5 = mul i128 %3, %4
  %6 = bitcast i128 %5 to <2 x i64>
  ret <2 x i64> %6
}

Assembly (basically a single instruction):

full_mul_llvm:
  movq	%rsi, %rdx
  mulxq	%rdi, %rcx, %rax
  vmovq	%rax, %xmm0
  vmovq	%rcx, %xmm1
  vpunpcklqdq	%xmm0, %xmm1, %xmm0
  retq

@gabrieldemarmiesse (Contributor Author)

@soraros Thank you for providing this code snippet! I believe UInt128 multiplication can be tackled separately and deserves a public API (adding a DType for it so that we can have SIMD support, etc.). Once that's done, we can use it in atof to improve performance, and that change can be the topic of another PR.

@JoeLoser (Collaborator)

@jackos given your recent work with floats, do you mind taking a look at this one, please?

@JoeLoser (Collaborator)

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Nov 21, 2024