Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] [proposal] Make String.split() and .splitlines() return List[StringSlice] #3879

Open
1 task done
martinvuyk opened this issue Dec 13, 2024 · 8 comments
Open
1 task done
Assignees
Labels
enhancement New feature or request mojo-repo Tag all issues with this label

Comments

@martinvuyk
Copy link
Contributor

martinvuyk commented Dec 13, 2024

Review Mojo's priorities

What is your request?

Make String.split() and .splitlines() return List[StringSlice]

What is your motivation for this change?

String.split() and .splitlines() are normally used as an intermediary step for other logic, I think once we setup some more implicit casting infrastructure the negative impact to existing code would be very minimal. Especially given that edge cases like:

data = "hello world"
lines = data.split(" ")
for i, l in enumerate(lines):
    l += "!"
    lines[i] = l

are quite simple to solve if we have an implicit constructor from List[String].__init__(self, other: List[StringSlice[_]]) and only the line var lines: List[String] = data.split(" ") would need to be changed. This would also allow function signatures that receive List[String] to be passed a List[StringSlice].

Other than inplace mutation of a string, most code shouldn't be too impacted by this deviation. If anyone can think of more edge cases please do write them bellow, I wouldn't want us to break every codebase out there because of a couple of "innocent" assumptions, and I'm not sure how "bad" it is to deviate from Python in this manner.

Any other details?

In some benchmarks I've been doing, our current splitlines implementation is around 2x slower than similar code in Python. If the Mojo code is changed to do .as_string_slice().splitlines() it's actually 16% faster than Python's implementation. My hypothesis is that it's because strings in python are copy-on-write, and as such when building a string from another they are virtually the same as a StringSlice.

Another option would be to make String be like Python's where the underlying buffer can either be owning or non-owning, using a data structure like what was proposed in #3797 (reference implementation in #3807)

@martinvuyk martinvuyk added enhancement New feature or request mojo-repo Tag all issues with this label labels Dec 13, 2024
@owenhilyard
Copy link
Contributor

I think I'd rather see .split() and .splitlines() return a lazy iterator, for the sake of avoiding allocations and since you can then use functional style code (potentially MT) to process your input.

@martinvuyk
Copy link
Contributor Author

@owenhilyard I'd also like to have an iterator for them (see PR #3858), but we can't deviate from Python APIs too much in functions that are used so ubiquitously like these two

@owenhilyard
Copy link
Contributor

@owenhilyard I'd also like to have an iterator for them (see PR #3858), but we can't deviate from Python APIs too much in functions that are used so ubiquitously like these two

We absolutely can. We can make it return a BufferIterator which can be implicitly casted into List[UInt8], and which supports indexing without consuming the iterator. I'd make the argument that frequently used functions are something we need to be sure won't have too much of a perf impact, so we should leverage the stronger type system to make things better where we can. If we don't do that, then we end up duplicating the API and causing more confusion when people are told to use the lazy_* versions of functions instead. We can make a script which helps migrate known python -> mojo API differences, and all that needs to do is make "a,b".split(',') into List[StringSlice]("a,b".split(',')). Think lib2to3, but it's libpytomojo. This means we can do minor API breaks with python and not be stuck with every decision they've made.

@martinvuyk
Copy link
Contributor Author

martinvuyk commented Dec 15, 2024

I agree in principle, but how would we set it up? It's not just a matter of implicit constructors in many use-cases for these functions

e.g. some python code which should work the same in Mojo:

if __name__ == "__main__":
    data = "some variable user input with their name: MyName"
    words = data.split()
    idx = words.index("name:")
    print("hi", words[idx + 1], "welcome to Mojo 🔥")

Iterators in Mojo edit the iterator on every __next__ method call (e.g. what if someone wanted to do idx -1 in the above example?), and we'd also need to add practically every List API to those iterators

@owenhilyard
Copy link
Contributor

We can make a trait for iterators over things that support random access which requires indexing and len(). There's not much else we can use that syntax for, so we may as well use it for that.

@martinvuyk
Copy link
Contributor Author

Ok that's an interesting abstraction. But if we're talking about concrete steps then we would need to implement:

  • _SplitlinesIter.__getitem__(idx :Int) -> StringSlice
  • _SplitIter.__getitem__(idx :Int) -> StringSlice
  • _StringSliceIter.__getitem__(idx :Int) -> StringSlice
  • implement a strided _StringSliceIter, _SplitlinesIter, and _SplitIter
    • _StringSliceIter.__getitem__(slice: Slice) -> _StridedStringSliceIter
    • _SplitIter.__getitem__(slice: Slice) -> _StridedSplitIter
    • _SplitlinesIter.__getitem__(slice: Slice) -> _StridedSplitlinesIter

Then we'd need to add implicit constructors for List receiving all of those iterators.

IMO we should wait until we have flexible enough Iterator generics, until then this would quickly become a bloated mess. But, we can also start slowly building up the infrastructure to switch to iterators eventually, and communicate to users the intent to do so clearly.

@owenhilyard
Copy link
Contributor

IMO we should wait until we have flexible enough Iterator generics, until then this would quickly become a bloated mess. But, we can also start slowly building up the infrastructure to switch to iterators eventually, and communicate to users the intent to do so clearly.

+1, I think there's a lot of stuff better done after some type system work.

@ConnorGray ConnorGray self-assigned this Dec 16, 2024
@ConnorGray
Copy link
Collaborator

Hey folks, I just wanted to communicate outward the results of a Mojo standard library team discussion about this proposal:

  • We are +1 to this narrow part of the proposal, and would accept PRs implementing this change:

    • Make String.split() and .splitlines() return List[StringSlice]

    • We think this is a reasonable API evolution towards removing unnecessary allocations from APIs.
  • We want to postpone any addition of a List[String].__init(List[StringSlice]) constructor. While we recognize the ergonomics advantage it can have in certain circumstances (as shown in the demonstrated examples), we think we want to explore the design space here further before incorporating a design that can perform allocations implicitly.

    • Personal comment: I think a fn to_owned_strings(self: List[StringSlice]) -> List[String] explicit convenience method might make sense in the short term, to obviate the ergonomics problems currently present in "mapping" a List of one type to another. Medium term, I think a List[T].map[U: AnyType, Func: fn(owned T) -> U]() -> List[U] would be a better general purpose solution.
  • I think we should postpone the use of iterators return values in split()-like methods. Iterators have obvious advantages, but Mojo's iterator model is currently not a well explored part of the design space, so I think we should hold off on "QoL" changes that thread iterators throughout existing methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request mojo-repo Tag all issues with this label
Projects
None yet
Development

No branches or pull requests

3 participants