Status of this Project? #531
Comments
To the best of my knowledge, the answer is as follows. Short answer: yes, it seems nobody is actively working on the project at this time. Long answer: The library was created by @v0dro and, from some point on, evolved mostly through contributions from SciRuby's students. I was involved with the library's development at some point (mostly as a SciRuby mentor), but since I have retired from SciRuby, I no longer am. Unfortunately, I have lost touch with Sameer (@v0dro) since then; the last I know, he was working on Rubex, which also seems dormant now. Judging by his latest blog entries, it seems that he (like me before that) became disillusioned with the prospects of attracting the scientific community's attention to Ruby and switched to Python, but that is just my assumption.
Hey @zverok, thanks for the update, and all of your work on SciRuby. Since it sounds like Daru is in a holding pattern, I put together a take on data frames for Ruby called Rover. It uses Numo internally for performance (similar to how Pandas uses NumPy), and I've moved Prophet to it, as well as added support to XGBoost and LightGBM. I think there's a lot more to do, but hopefully it provides a good starting point. Would love to hear others' thoughts on the topic - feel free to open an issue to discuss.
@ankane I'm curious, would you mind giving any more background - where you see the Rover project going, why the start-from-scratch approach, and how you think it compares to Daru (and Pandas) currently?
@baarkerlounger The end goal is to make data analysis and machine learning as enjoyable as possible in Ruby. The main difference from Daru is that it's based on Numo, for both performance and easy integration with Rumale.
@ankane, |
@arbox I'm happy to chat on specific features / ideas, but want to give this a fresh look (and think there's limited benefit to reusing the codebase).
My favorite fancy DataFrame for Ruby looks like this. It's a bit like Daru at first sight, but it's much faster because it uses numo-narray or Apache Arrow on the backend. The features are simple and it looks clean. Let's consider modifying Daru. In that case, you would need to remove a lot of Daru's code, so it's reasonable that someone starts a new project. (Ruby is an object-oriented language. How do we integrate it with functions to create an easy-to-use data frame? I think this is the big mystery that the current version of Daru has left behind... There should be lots of different ideas and projects.)
Here is my take. There is a lot to like about this project. The code is very clean and has some nice abstractions. However, as I stated in the initial question, there are some challenges today in terms of dependencies and support for Ruby > 2.4. @ankane I'd disagree with you that there is little benefit to be had from the existing code base. There are a lot of features you don't have in Rover yet. Plus, the implementation you have at the moment is strongly coupled to Numo. @kojix2 We can solve the problems you mention. I don't think it's removing a lot of code (I share a desire for this direction but don't believe it's necessary: https://github.com/kojix2/chai). Also, I'm sure we can solve 'non-founder' ownership and breaking changes; it happens all the time.
I agree with this, but let's make sure we build a mountain and don't end up with a bunch of half-finished hills 😄 IMO there is value in a Ruby implementation of DataFrames. For me it's a convenient way to transport and mutate 'matrix'-like data. It's not required that it be blazingly fast or scale to huge data sets. Nor, IMO, is it a requirement that you be able to do everything in a Jupyter notebook. At a certain point, if you want all those features, the simpler leap is to go to Python and Pandas. However, I'd love to see this project get a bit more life into it and to update it to make it more usable. My wish list would be:
I don't think any of this is particularly difficult. There is a pretty long list of features to close the gap to Pandas, but they can be done one at a time with easy-to-review PRs. I'd be happy to contribute, and I'd be happy to lead, but I don't have the bandwidth to do it alone. If there are others willing to help, we can do it. @kojix2 @ankane @arbox @baarkerlounger and others, are you up for giving it a go?
Just to add one more variety to the already existing list of opinions! I believe there are two, not really related questions about "dataframe implementation" in Ruby:
My interest was always around (1): how may we create a generally useful dataframe, which would be obviously helpful for everyone in the community (in fact, my involvement with Daru started with this 4-year-old blog post, and that was the direction of my work in a maintainer role). "Generally useful", for me, also implies "naturally playing" with all Rubyist intuitions emerging from Enumerable, Hash, Array etc. I once dreamed of an ever-present "dataframe API" that could've suited "backends" as different as a SQL DB, a CSV file or Apache Arrow. And I imagine that this work might also lead to accepting some useful "micro-metaphors" as useful idioms for other collections and libraries. From what I see, most of the other people involved in the development of Daru or other dataframe libraries mostly focused on dataframes that would a) be fast for large data processing; b) include a vast set of features (with emphasis on "it is possible/done in one method" even at the price of API consistency and knowledgeability); c) frequently, look familiar to "those who already worked with Pandas" rather than "those who already worked with Ruby Array/Hash". I'd imagine some ideal several-layer project, with:
I can see that there are some parts of Daru that might be "scraped" for "layer 2" (various features), but in terms of API, I never really liked it, and I don't believe it can be "gradually improved" (I tried that for several months in 2018 and finally just gave up). TBH, I even have my own (unfinished and unpublished) "completely new DF project" with an API I really love, but ... I don't see an audience for it anymore.
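To make the "naturally playing with Enumerable, Hash, Array" idea above a little more concrete, here is a minimal sketch. `TinyFrame` is a hypothetical name invented for this illustration; it is not Daru's API, nor the unpublished project mentioned above. It only assumes that a frame stores columns as a Hash of Arrays and yields rows as Hashes.

```ruby
# Hypothetical sketch: TinyFrame is not part of Daru or any other library.
class TinyFrame
  include Enumerable

  attr_reader :columns

  # columns: Hash of column name => Array of values (all the same length)
  def initialize(columns)
    lengths = columns.values.map(&:size).uniq
    raise ArgumentError, "columns must have equal length" if lengths.size > 1

    @columns = columns
  end

  # Number of rows in the frame.
  def size
    first_column = @columns.values.first
    first_column ? first_column.size : 0
  end

  # Yield each row as a Hash, so every Enumerable method works on rows.
  def each
    return enum_for(:each) unless block_given?

    size.times do |i|
      yield @columns.transform_values { |values| values[i] }
    end
  end
end

frame = TinyFrame.new({ name: %w[a b c], score: [10, 25, 40] })

# Plain Enumerable idioms, no dataframe-specific vocabulary needed.
high  = frame.select { |row| row[:score] > 20 }.map { |row| row[:name] }
total = frame.sum { |row| row[:score] }
```

Because only `each` and `size` touch the storage, a different backend (a CSV reader, a database cursor) could in principle sit behind the same interface; whether that scales to a real library is exactly the open design question in this thread.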
@zverok Thanks for sharing your perspective. I share a very keen desire for #1 too. That seems fundamental. As I said, if you really care about performance/scale then IMO you're going to have better luck using Pandas. It's likely to always be faster etc. That doesn't stop us from having "fast" options in Ruby, but that speed shouldn't come at the cost of API clarity. I think there is a balance to be struck between having a similar API to Pandas and feeling like Ruby. There is huge value in a Ruby DataFrame that at a minimum has the same features, with the same names, as Pandas. Perhaps this is as simple as naming:
Having used Daru DataFrames in a project in anger, I haven't felt the same sense of "it's not Ruby" that you suggest. I'd like to understand that more. So two follow-up questions:
@jonspalmer It is hard for me to answer the questions in detail: both technically, because I haven't been involved with Daru for ~2 years now, and ethically, as a lot of its features were implemented either by Sameer, who is the initial author, or by students of SciRuby we were mentoring, so pointing at any particular feature would also be pointing at the person who invented/implemented it. Also, I have changed my mind about "what's best" several times, and I am currently not even sure whether a "really useful" solution will be one class (DataFrame) or some family of classes (like "a table which is first and foremost for navigation/presentation" and "a table that is first and foremost for complex math"). I suspect that the base API design decisions are very fundamental and it is hard to change them once settled. Among those, I'd say: indexing/enumerating; what "indexes" are (one or two of them, nesting of indexes and other behaviors); the desired level of "mathematicity" of the DF (whether
IMO the answer on this is very clear - one class. The Pandas API is strong in this regard. From my perspective there isn't any need for more than two main objects, DataFrame and Vector (and perhaps we call it Vector because that's what Ruby's Matrix class uses, vs Pandas' Series). Having the arithmetic and statistical functions as first-class methods on DataFrame and Vector is simple and doesn't get in the way if you don't need it. For things like WRT indexing - I'm not sure there is a Ruby right or wrong answer. You could argue that we should mimic the dual index treatment in the Ruby Matrix class, but honestly it's not obviously right either. Consistency with that API doesn't bother me too much. To summarize, I don't see these as big challenges. We could just decide and start moving things to a better place.
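As a rough illustration of "arithmetic and statistical functions as first-class methods" on a vector-like object, here is a small sketch. `SketchVector` is a hypothetical class written for this example; it is not `Daru::Vector` and not the standard library's `Vector`, and it assumes element-wise semantics for `+` and `/`.

```ruby
# Hypothetical sketch: SketchVector is illustrative only, not Daru::Vector.
class SketchVector
  include Enumerable

  def initialize(values)
    @values = values.dup.freeze
  end

  def each(&block)
    return enum_for(:each) unless block

    @values.each(&block)
    self
  end

  def size
    @values.size
  end

  # Element-wise arithmetic with a scalar or another SketchVector.
  def +(other)
    combine(other) { |a, b| a + b }
  end

  def /(other)
    combine(other) { |a, b| a / b }
  end

  # A statistical method living directly on the vector.
  def mean
    sum.fdiv(size)
  end

  private

  def combine(other)
    if other.is_a?(SketchVector)
      raise ArgumentError, "size mismatch" unless other.size == size

      SketchVector.new(@values.zip(other.to_a).map { |a, b| yield a, b })
    else
      SketchVector.new(@values.map { |a| yield a, other })
    end
  end
end

q = SketchVector.new([1, 2, 3])
shifted = (q + 10) / 2.0
```

Including `Enumerable` keeps `map`, `select`, `sum` and friends available, so the math methods sit alongside ordinary Ruby collection idioms rather than replacing them.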
@jonspalmer I believe you seriously underestimate the design space (and how design decisions affect the library's usability). Let's stick just to one example of two parts (it is just to illustrate the point). Imagine you have data shaped this way:
Question 1 (a simpler one, but the very first in design, which every DF designer handles somehow): how do you address the "Total" column and the "Total" row? Everybody tends to start with...

df['Total'] # column or row?

...and there are many different ways to handle it :) Now, to dive into some details. (For the sake of simplicity, let's say we decided that
I am honestly not sure that "whatever, let's design it some way and that will be it" is an approach that will lead anywhere useful. In fact, the several existing (and incompatible) dataframe libraries that Ruby has (besides Daru and the new kid Rover, there are ... some: 1, 2, 3 (at one point endorsed by Ruby Association), 4, 5, etc.) clearly demonstrate that.
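To make Question 1 concrete, the sketch below shows one of the many possible answers: refusing to guess and offering explicit `col` and `row` accessors instead of a bare `[]`. `LabeledTable`, the row labels and all the numbers are invented for this illustration; they are not the data from the example above and not the API of any existing library.

```ruby
# Hypothetical sketch: LabeledTable and its data are invented for illustration.
class LabeledTable
  # row_labels: Array of labels, one per row
  # columns:    Hash of column label => Array of values, one per row
  def initialize(row_labels, columns)
    @row_labels = row_labels
    @columns = columns
  end

  # Explicit accessors avoid the ambiguity of a bare table["Total"].
  def col(label)
    @columns.fetch(label)
  end

  def row(label)
    index = @row_labels.index(label)
    raise KeyError, "no row labelled #{label.inspect}" if index.nil?

    @columns.transform_values { |values| values[index] }
  end
end

table = LabeledTable.new(
  %w[A B Total],
  { "Q1" => [1, 2, 3], "Q2" => [10, 20, 30], "Total" => [11, 22, 33] }
)

total_column = table.col("Total") # values down the "Total" column
total_row    = table.row("Total") # values across the "Total" row
```

This choice has its own costs (more typing, and it still says nothing about whether the returned values are copies or views), which is part of why the design space is larger than it first appears.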
Answer:
a) most APIs would expect
we have the column copy problem again. Returning to your larger example:

col = df.col['Total']
new_col = ((col + 10) / 2).round(3).clamp(0..100)
# or
new_col = col.add(10).div(2).round(3).clamp(0..100)
# is this more efficient?
new_col = col.map { |v| ((v + 10) / 2).round(3).clamp(0..100) }

I don't know which would be faster/more efficient. Whatever the answer, it would need to be carefully measured. The DataFrame API can't anticipate all the use cases; it can only provide reasonable, well-named building blocks that give the consumer options for solving their particular problem. Are there cases where you really, really want to do things 'in place'? Perhaps, but the use cases are going to be very specific and the 'right way' to optimize them is going to be very subtle. To take your example of wanting to "add 10 to Total": you could argue that we really need to manipulate "Total" in place because it's more efficient. However, it's more likely that the situation is something like this:
So now you could say, "Hey, I really want to add 10 to "Total" in place. It sucks that this is 'so inefficient'":

df.col['Total'] = df.col['Total'] + 10

but that's the wrong problem to go after. It would be way better to simply fix it when you generate 'Total' the first time (it's user error, not API error):

df.col['Total'] = df.col[['Q1', 'Q2', 'Q3']].sum(axis: :column) + 10
df.row['Total'] = df.sum(axis: :row)

which doesn't require inefficient intermediate columns/rows. My argument isn't that we should blindly build an API and hope it works out. Instead, we should carefully build something clear, flexible and consistent that allows consumers to solve their problems. We cannot and should not expect the API to solve every corner case cleanly or naturally. Specific problems will require specific solutions. The more specific the problem, the less likely the solution will be "elegant", but that's an entirely normal and expected tradeoff. From my perspective, a lot of work and real-world use cases have gone into the Pandas API, and it is powerful, feature-rich and natural to use. The choices and design there would be very natural to replicate in Ruby (with the exception perhaps being the Python slice operator, which is a bit more flexible than Ruby ranges). Daru has made a great initial set of steps to get close to replicating that API. IMO we should continue that work and bring it up to date with the current state of Ruby and Pandas. I'm not totally clear what you are proposing as an alternative?
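The point that the alternatives "would need to be carefully measured" can be sketched with nothing but core Ruby. The example below assumes a plain Array stands in for a dataframe column (so it says nothing about Daru, Numo or Arrow performance), and `time_it` is a hypothetical helper defined here. It compares several chained passes, each allocating an intermediate array, with a single `map` pass.

```ruby
# Assumption: a plain Array stands in for a dataframe column.
col = Array.new(200_000) { |i| (i % 150).to_f }

# Hypothetical helper: returns [result, elapsed_seconds] for the given block.
def time_it
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

# Four passes, each building an intermediate array (like chained vector ops).
chained, chained_seconds = time_it do
  col.map { |v| v + 10 }
     .map { |v| v / 2 }
     .map { |v| v.round(3) }
     .map { |v| v.clamp(0, 100) }
end

# One pass with the whole expression applied per element.
single, single_seconds = time_it do
  col.map { |v| ((v + 10) / 2).round(3).clamp(0, 100) }
end

puts format("chained:     %.4fs", chained_seconds)
puts format("single pass: %.4fs", single_seconds)
puts "same result: #{chained == single}"
```

Timings from a single run like this are noisy; a real comparison would repeat the measurement and use the actual backing store, since a native backend may make the chained form competitive or faster.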
Status update on daru. I had a brief e-mail conversation with @jonspalmer and have promised him to update daru on the following points:
However, I am currently in grad school and using a lot of low-level C++/C/FORTRAN for my research, and have therefore lost touch with data analysis. I will be happy if someone else is willing to take over the project. As @kojix2 says, having a dataframe with support for Arrow would be great. Relying on Ruby for speed is a bad idea. However, all this will require a central point of contact (i.e. a maintainer willing to commit a few hours a week). My take on the future direction of daru is that we should forget the scientific computing audience and let them be happy with Julia/Python/R. Our real audience should lie in the Ruby community (web dev etc.). This was pointed out by @zverok much earlier, and I agree that his course of action would have been appropriate. BTW, Numo has some speed issues due to the data representation that it uses (I last checked more than 6 months ago) and I'm not sure if they've been resolved. @prasunanand can fill in on this better, I believe.
A new daru version has been released and all old PRs have been merged/closed.
@v0dro Following up on this, what is the new version that has been released? I don't see any new tags here.
My bad, I forgot to tag it. You can see it on rubygems here: https://rubygems.org/gems/daru Version 0.3
Can anyone share thoughts on the state of this project?
This project contains a lot of amazing work and I'd love to be able to use it in a bunch of Ruby/Rails projects. However, there are a few issues:
Questions:
I'd be happy to contribute ideas and time on plotting a path forward, but would love to be part of a team that is driving towards that.