Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle two warnings that pollute the output of sssom-py CLI #561

Merged
merged 9 commits into from
Nov 9, 2024

Conversation

matentzn
Copy link
Collaborator

@matentzn matentzn commented Nov 8, 2024

See commits for a more detailed description of the changes;

This PR should not introduce any functional changes.

Warning: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
When using the Copy-on-Write mode, such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.

I just to just not use "inplace" here and the warning does not happen.
Previously this warning happened:

FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df.replace("", np.nan, inplace=True)

setting the option stops Pandas from even trying to downcast (so it no longer has anything to warn about). I then force the downcast explicitly with infer_objects to ensure that the behaviour remains what it was, but not sure this is what I want here.
src/sssom/parsers.py Outdated Show resolved Hide resolved
df.infer_objects(copy=False) does not work in a platform independent manner; and I dont even know if I need it

# Context https://github.com/pandas-dev/pandas/issues/57734
try:
pd.set_option("future.no_silent_downcasting", True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather put that somewhere when the library is initialized, so that we can be sure the option is enabled everywhere at all times and not only when the code path goes through this particular function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it to sssom/init.py, not sure this is the right thing to do (can this effect other libraries that might not want that option to be set for some reason?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally be fine having it in sssom/init.py. This is forcing the behaviour to apply globally, which I believe is a good thing.

That being said, it is true that this could have cascading effects for people who are using sssom as a library rather than as a command-line tool (people who import sssom in their own Python projects, rather than calling sssom-py from the command line), if they also happen to be using pandas in the rest of their code (no problem if their only use of pandas is through sssom).

If you’re not comfortable forcing the “no downcasting” behaviour to all potential users of the sssom Python module¹, I can think of two options:

(A) Setting the option somewhere at the beginning of the main function in sssom/cli.py. This means the option will be enabled globally in all of sssom when it is used as a CLI tool, but not when it is used as an imported module in another Python project.

It will then be up to the people who imports sssom to decide whether they themselves want to enable the “no downcasting” behaviour, by setting (or not) the option.

(B) Enabling the option within sssom, but only where we need it and only for the duration we need it. That is, we enable it before calling replace here, and disable it afterwards.

A context manager would come in handy here:

class pandas_no_silent_downcasting():

    def __init__(self):
        try:
            self._already_set = pd.get_option('future.no_silent_downcasting')
            self._supported = True
        except pd.errors.OptionError:
            self._supported = False

    def __enter__(self):
        if self._supported and not self._already_set:
            # Entering the context, set the option
            pd.set_option('future.no_silent_downcasting', True)

    def __exit__(self, type, value, tb):
        if self._supported and not self._already_set:
            # Leaving the context, unset the option
            pd.set_option('future.no_silent_downcasting', False)

It could then be used whenever we have to perform an operation and need to locally make sure the option (if it is available and if it is not already set) is enabled:

with pandas_no_silent_downcasting():
    df.replace(...)

This is equivalent to:

pd.set_option('future.no_silent_downcasting', True)
df.replace(...)
pd.set_option('future.no_silent_downcasting', False)

except that it is more “pythonic” and it deals gracefully with the possible absence of that option or the possibility that the option has already been enabled elsewhere.


¹ But it should be noted that this behaviour will be forced upon them sooner or later anyway, since it will be the only behaviour available in Pandas 3.0… Better for those people to start dealing with it now.

df.replace("", np.nan, inplace=True)
# df.infer_objects(copy=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

infer_objects does not work “in place” (and it doesn’t have a inplace keyword to force it to do so), so that call does nothing since you’re not collecting the results.

You’d need

df = df.infer_objects(copy=False)

As for compatibility with various versions of Pandas: the copy parameter was introduced in Pandas 2.0. You could leave it out:

df = df.infer_objects()

Without that copy=False, the default behaviour of infer_objects would be to make a copy of all values in the frame, even if they are not casted to another type. But this is not a concern here since you are not keeping the original frame anyway.

As for whether the downcasting is necessary to begin with, well, that would depend on the rest of the code downstream from here. For now (before this PR), the data frame is downcasted, as a side-effect of replace. The question is, is there some code downstream that is depending on the dataframe columns having more precise types than object?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it back with a pandas version check, can you see what you think?

Copy link
Contributor

@gouttegd gouttegd Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but as I said you could also have just left the copy parameter out.

For users of Pandas 1.x, this does not change anything. A copy of the non-type-casted objects will be made no matter what, that’s the only available behaviour.

For users of Pandas 3.x, this also does not change anything, since copy-on-write means that even if a copy of the non-type-casted objects is made, the actual copy would never actually take place (since we’re re-assigning the inferred data frame to the original variable).

For users of Pandas 2.x, no copy=False parameter means that the non-type-casted objects would be copied for nothing, which strictly speaking is a waste of time and memory. But I strongly suspect the effects on performances would be negligible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Side note: I love the “should GitHub still exist when you read this”. :D )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I just want to do the thing that changes least about the existing behaviour - which I thought I understand form your explanations would be copy=False for Pandas 2.x?

Copy link
Contributor

@gouttegd gouttegd Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do

df = df.infer_objects(copy=False)

then copy=False makes absolutely no difference to the result. All it does is making the operation slightly faster (which is why I included it in my example on Slack, but I now regret having done that if it led you to believe it was absolutely necessary), but the data frame you get at the end will be exactly the same as if you omitted the copy parameter.

What does the copy parameter does exactly?

Consider the following data frame:

>>> df
         A   B
0  alice     1
1  bob       2
2  charlie     

Let’s create df1 where we replace the empty cell (B2) with np.nan:

>>> pd.set_option('future.no_silent_downcasting', True)
>>> df1 = df.replace("", np.nan)
>>> df1
         A    B
0  alice      1
1  bob        2
2  charlie  NaN

Now let’s downcast, first with copy=True (default behaviour, same as no copy parameter at all):

>>> df2_copy_true = df1.infer_objects(copy=True)
>>> df2_copy_true
         A     B
0  alice     1.0
1  bob       2.0
2  charlie   NaN

Then with copy=False:

>>> df2_copy_false = df1.infer_objects(copy=False)
>>> df2_copy_false
         A     B
0  alice     1.0
1  bob       2.0
2  charlie   NaN

Observe that df2_copy_true and df2_copy_false are the same. So what’s the difference?

The difference is that the A column in df2_copy_false share its data with the A column in the original df1 frame: A copy was not made, since there was no downcasting needed for that column.

This means that if you modify in place a value in the A column of df2_copy_false, this will also modify the original df1:

>>> df2_copy_false.replace("alice", "arthur", inplace=True)
>>> df2_copy_false
         A     B
0  arthur    1.0
1  bob       2.0
2  charlie   NaN
>>> df1
         A     B
0  arthur      1
1  bob         2
2  charlie   NaN

By contrast, df2_copy_true is completely distinct from the original df1. Even if the A column needed no downcast, infer_objects(copy=True) made a complete copy of it.

From the above, you should understand that you do not need to worry about getting different results depending on the presence of copy=False if all you do is re-assigning the output of infer_objects() to the original variable – after the assignment the original data frame is no longer there anyway.

So when you do this:

df = df.infer_objects()

the only impact of the absence of copy=False is that you will needlessly make a temporary copy of columns that didn’t need to be copied, so it’s slightly less efficient than

df = df.infer_objects(copy=False)

Whether you should be concerned about this inefficiency is up to you. I am personally not concerned enough to think that it warrants making the code more complicated by adding this _safe_infer_objects intermediate, but it’s your call.

In any case, the inefficiency will go away with Pandas 3.0 and its copy-on-write mechanism.

Copy link
Contributor

@gouttegd gouttegd Nov 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, the inefficiency will go away with Pandas 3.0 and its copy-on-write mechanism.

I would in fact be keen to enable copy-on-write globally (pd.set_option('mode.copy_on_write', True)) right now, both for the performance improvement and to start making sure that we are ready to deal with it since it will become mandatory with Pandas 3.0.

But the concerns stated above for the people who use sssom as a Python library would obviously also apply here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would in fact be keen to enable copy-on-write globally

Wait, we are in fact already enabling copy-on-write:

pd.options.mode.copy_on_write = True

So there’s definitely no need for copy=False when we call infer_objects() here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for taking the time to explaining this! Really much appreciated!

As @gouttegd suggested:

> I would rather put that somewhere when the library is initialized, so that we can be sure the option is enabled everywhere at all times and not only when the code path goes through this particular function.

I moved that code to __init__.py hopeing that is the right place for it.
To ensure the old behaviour is retained, @gouttegd convinced me to not be lazy and actually add the "df.infer_objects" functionality when replacing empty values with nan's. Should github still exist when you read this, you can find his feedback here: https://github.com/mapping-commons/sssom-py/pull/561/files#r1835148927

As he questions, it is not clear this is _strictly needed_, but its probably safer to so if we dont know if its needed.
The rationale is to not force this global option on all users of the library. The main point for the fix is so that sssom-py's CLI print to stdout functionality is not compromised by this output!
As explained by @gouttegd in #561 (comment), we do not actually benefit from the small performance gain enabled by copy=False here, so we decided to drop it (which means a considerable simplification of the code as well)
@matentzn matentzn enabled auto-merge (squash) November 9, 2024 15:00
@matentzn matentzn merged commit 6ce01a8 into master Nov 9, 2024
8 checks passed
@matentzn matentzn deleted the rm-warnings-sssom branch November 9, 2024 15:04
@gouttegd
Copy link
Contributor

gouttegd commented Nov 9, 2024

@matentzn I really appreciate that you took the time to write meaningful commit messages in this PR, but you must realize that your commit messages are all lost if you do a squash merge after that!

All that is now left in the history is a single commit with a decidedly non-meaningful message:

See commits for a more detailed description of the changes;

This PR should not introduce any functional changes.

When you do a squash merge, you should ensure that the commit message of the single resulting commit accurately describes the final changes.

@matentzn
Copy link
Collaborator Author

matentzn commented Nov 9, 2024

See slack...

matentzn added a commit that referenced this pull request Nov 9, 2024
matentzn added a commit that referenced this pull request Nov 9, 2024
See commits for a more detailed description of the changes;

This PR should not introduce any functional changes.
matentzn added a commit that referenced this pull request Nov 9, 2024
In this commit we

1. introduce pd.set_option("future.no_silent_downcasting", True) to heed the warning pandas started recently producing:

FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  df.replace("", np.nan, inplace=True)

2. Fix a case where we were using inplace=True when updating a pandas data frame in the wrong way, leading to the following warning. We used the opportunity to simplify that piece of code as well, thanks @gouttegd

Warning: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
When using the Copy-on-Write mode, such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.

3. lastly, we simplified some code where we introduced pandas.DataFrame.infer_objects(), which previously had a complicated conditional depending on the specific pandas version on whether the copy=False parameter is needed. It is not needed, because we also set pandas.mode.copy_on_write  to true in #4cad7d6d8b905728f14a90d3d4c6d591b520cd3b.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants