Handle two warnings that pollute the output of sssom-py CLI #561

matentzn · 2024-11-08T21:45:22Z

See commits for a more detailed description of the changes;

This PR should not introduce any functional changes.

Warning: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. When using the Copy-on-Write mode, such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy. I just to just not use "inplace" here and the warning does not happen.

Previously this warning happened: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)` df.replace("", np.nan, inplace=True) setting the option stops Pandas from even trying to downcast (so it no longer has anything to warn about). I then force the downcast explicitly with infer_objects to ensure that the behaviour remains what it was, but not sure this is what I want here.

src/sssom/parsers.py

df.infer_objects(copy=False) does not work in a platform independent manner; and I dont even know if I need it

gouttegd · 2024-11-08T23:03:07Z

src/sssom/util.py

+
+        # Context https://github.com/pandas-dev/pandas/issues/57734
+        try:
+            pd.set_option("future.no_silent_downcasting", True)


I would rather put that somewhere when the library is initialized, so that we can be sure the option is enabled everywhere at all times and not only when the code path goes through this particular function.

I added it to sssom/init.py, not sure this is the right thing to do (can this effect other libraries that might not want that option to be set for some reason?)

I would personally be fine having it in sssom/init.py. This is forcing the behaviour to apply globally, which I believe is a good thing.

That being said, it is true that this could have cascading effects for people who are using sssom as a library rather than as a command-line tool (people who import sssom in their own Python projects, rather than calling sssom-py from the command line), if they also happen to be using pandas in the rest of their code (no problem if their only use of pandas is through sssom).

If you’re not comfortable forcing the “no downcasting” behaviour to all potential users of the sssom Python module¹, I can think of two options:

(A) Setting the option somewhere at the beginning of the main function in sssom/cli.py. This means the option will be enabled globally in all of sssom when it is used as a CLI tool, but not when it is used as an imported module in another Python project.

It will then be up to the people who imports sssom to decide whether they themselves want to enable the “no downcasting” behaviour, by setting (or not) the option.

(B) Enabling the option within sssom, but only where we need it and only for the duration we need it. That is, we enable it before calling replace here, and disable it afterwards.

A context manager would come in handy here:

class pandas_no_silent_downcasting(): def __init__(self): try: self._already_set = pd.get_option('future.no_silent_downcasting') self._supported = True except pd.errors.OptionError: self._supported = False def __enter__(self): if self._supported and not self._already_set: # Entering the context, set the option pd.set_option('future.no_silent_downcasting', True) def __exit__(self, type, value, tb): if self._supported and not self._already_set: # Leaving the context, unset the option pd.set_option('future.no_silent_downcasting', False)

It could then be used whenever we have to perform an operation and need to locally make sure the option (if it is available and if it is not already set) is enabled:

with pandas_no_silent_downcasting(): df.replace(...)

This is equivalent to:

pd.set_option('future.no_silent_downcasting', True) df.replace(...) pd.set_option('future.no_silent_downcasting', False)

except that it is more “pythonic” and it deals gracefully with the possible absence of that option or the possibility that the option has already been enabled elsewhere.

¹ But it should be noted that this behaviour will be forced upon them sooner or later anyway, since it will be the only behaviour available in Pandas 3.0… Better for those people to start dealing with it now.

gouttegd · 2024-11-08T23:18:32Z

src/sssom/util.py

        df.replace("", np.nan, inplace=True)
+        # df.infer_objects(copy=False)


infer_objects does not work “in place” (and it doesn’t have a inplace keyword to force it to do so), so that call does nothing since you’re not collecting the results.

You’d need

df = df.infer_objects(copy=False)

As for compatibility with various versions of Pandas: the copy parameter was introduced in Pandas 2.0. You could leave it out:

df = df.infer_objects()

Without that copy=False, the default behaviour of infer_objects would be to make a copy of all values in the frame, even if they are not casted to another type. But this is not a concern here since you are not keeping the original frame anyway.

As for whether the downcasting is necessary to begin with, well, that would depend on the rest of the code downstream from here. For now (before this PR), the data frame is downcasted, as a side-effect of replace. The question is, is there some code downstream that is depending on the dataframe columns having more precise types than object?

I added it back with a pandas version check, can you see what you think?

This is fine, but as I said you could also have just left the copy parameter out.

For users of Pandas 1.x, this does not change anything. A copy of the non-type-casted objects will be made no matter what, that’s the only available behaviour.

For users of Pandas 3.x, this also does not change anything, since copy-on-write means that even if a copy of the non-type-casted objects is made, the actual copy would never actually take place (since we’re re-assigning the inferred data frame to the original variable).

For users of Pandas 2.x, no copy=False parameter means that the non-type-casted objects would be copied for nothing, which strictly speaking is a waste of time and memory. But I strongly suspect the effects on performances would be negligible.

(Side note: I love the “should GitHub still exist when you read this”. :D )

For now I just want to do the thing that changes least about the existing behaviour - which I thought I understand form your explanations would be copy=False for Pandas 2.x?

If you do

df = df.infer_objects(copy=False)

then copy=False makes absolutely no difference to the result. All it does is making the operation slightly faster (which is why I included it in my example on Slack, but I now regret having done that if it led you to believe it was absolutely necessary), but the data frame you get at the end will be exactly the same as if you omitted the copy parameter.

What does the copy parameter does exactly?

Consider the following data frame:

>>> df A B 0 alice 1 1 bob 2 2 charlie

Let’s create df1 where we replace the empty cell (B2) with np.nan:

>>> pd.set_option('future.no_silent_downcasting', True) >>> df1 = df.replace("", np.nan) >>> df1 A B 0 alice 1 1 bob 2 2 charlie NaN

Now let’s downcast, first with copy=True (default behaviour, same as no copy parameter at all):

>>> df2_copy_true = df1.infer_objects(copy=True) >>> df2_copy_true A B 0 alice 1.0 1 bob 2.0 2 charlie NaN

Then with copy=False:

>>> df2_copy_false = df1.infer_objects(copy=False) >>> df2_copy_false A B 0 alice 1.0 1 bob 2.0 2 charlie NaN

Observe that df2_copy_true and df2_copy_false are the same. So what’s the difference?

The difference is that the A column in df2_copy_false share its data with the A column in the original df1 frame: A copy was not made, since there was no downcasting needed for that column.

This means that if you modify in place a value in the A column of df2_copy_false, this will also modify the original df1:

>>> df2_copy_false.replace("alice", "arthur", inplace=True) >>> df2_copy_false A B 0 arthur 1.0 1 bob 2.0 2 charlie NaN >>> df1 A B 0 arthur 1 1 bob 2 2 charlie NaN

By contrast, df2_copy_true is completely distinct from the original df1. Even if the A column needed no downcast, infer_objects(copy=True) made a complete copy of it.

From the above, you should understand that you do not need to worry about getting different results depending on the presence of copy=False if all you do is re-assigning the output of infer_objects() to the original variable – after the assignment the original data frame is no longer there anyway.

So when you do this:

df = df.infer_objects()

the only impact of the absence of copy=False is that you will needlessly make a temporary copy of columns that didn’t need to be copied, so it’s slightly less efficient than

df = df.infer_objects(copy=False)

Whether you should be concerned about this inefficiency is up to you. I am personally not concerned enough to think that it warrants making the code more complicated by adding this _safe_infer_objects intermediate, but it’s your call.

In any case, the inefficiency will go away with Pandas 3.0 and its copy-on-write mechanism.

In any case, the inefficiency will go away with Pandas 3.0 and its copy-on-write mechanism.

I would in fact be keen to enable copy-on-write globally (pd.set_option('mode.copy_on_write', True)) right now, both for the performance improvement and to start making sure that we are ready to deal with it since it will become mandatory with Pandas 3.0.

But the concerns stated above for the people who use sssom as a Python library would obviously also apply here.

I would in fact be keen to enable copy-on-write globally

Wait, we are in fact already enabling copy-on-write:

pd.options.mode.copy_on_write = True

So there’s definitely no need for copy=False when we call infer_objects() here.

Thanks so much for taking the time to explaining this! Really much appreciated!

@gouttegd

As @gouttegd suggested: > I would rather put that somewhere when the library is initialized, so that we can be sure the option is enabled everywhere at all times and not only when the code path goes through this particular function. I moved that code to __init__.py hopeing that is the right place for it.

@gouttegd

As per @gouttegd suggestion here https://github.com/mapping-commons/sssom-py/pull/561/files#r1835130707, the codeblock has been refactored and simplified.

@gouttegd

To ensure the old behaviour is retained, @gouttegd convinced me to not be lazy and actually add the "df.infer_objects" functionality when replacing empty values with nan's. Should github still exist when you read this, you can find his feedback here: https://github.com/mapping-commons/sssom-py/pull/561/files#r1835148927 As he questions, it is not clear this is _strictly needed_, but its probably safer to so if we dont know if its needed.

The rationale is to not force this global option on all users of the library. The main point for the fix is so that sssom-py's CLI print to stdout functionality is not compromised by this output!

@gouttegd

As explained by @gouttegd in #561 (comment), we do not actually benefit from the small performance gain enabled by copy=False here, so we decided to drop it (which means a considerable simplification of the code as well)

gouttegd · 2024-11-09T15:23:57Z

@matentzn I really appreciate that you took the time to write meaningful commit messages in this PR, but you must realize that your commit messages are all lost if you do a squash merge after that!

All that is now left in the history is a single commit with a decidedly non-meaningful message:

See commits for a more detailed description of the changes;

This PR should not introduce any functional changes.

When you do a squash merge, you should ensure that the commit message of the single resulting commit accurately describes the final changes.

matentzn · 2024-11-09T15:26:05Z

See slack...

…561)" This reverts commit 6ce01a8.

See commits for a more detailed description of the changes; This PR should not introduce any functional changes.

@gouttegd

In this commit we 1. introduce pd.set_option("future.no_silent_downcasting", True) to heed the warning pandas started recently producing: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)` df.replace("", np.nan, inplace=True) 2. Fix a case where we were using inplace=True when updating a pandas data frame in the wrong way, leading to the following warning. We used the opportunity to simplify that piece of code as well, thanks @gouttegd Warning: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method. When using the Copy-on-Write mode, such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy. 3. lastly, we simplified some code where we introduced pandas.DataFrame.infer_objects(), which previously had a complicated conditional depending on the specific pandas version on whether the copy=False parameter is needed. It is not needed, because we also set pandas.mode.copy_on_write to true in #4cad7d6d8b905728f14a90d3d4c6d591b520cd3b.

matentzn added 2 commits November 8, 2024 23:38

matentzn mentioned this pull request Nov 8, 2024

suppress the FutureWarning: Downcasting behavior in replace is depr… #560

Closed

joeflack4 reviewed Nov 8, 2024

View reviewed changes

src/sssom/parsers.py Outdated Show resolved Hide resolved

comment out df.infer_objects(copy=False)

a13e224

df.infer_objects(copy=False) does not work in a platform independent manner; and I dont even know if I need it

gouttegd reviewed Nov 8, 2024

View reviewed changes

matentzn added 3 commits November 9, 2024 11:29

Simplify confidence preprocessing code in from_sssom_dataframe

d0051bc

As per @gouttegd suggestion here https://github.com/mapping-commons/sssom-py/pull/561/files#r1835130707, the codeblock has been refactored and simplified.

matentzn requested a review from gouttegd November 9, 2024 09:38

matentzn added 3 commits November 9, 2024 16:33

Move future.no_silent_downcasting to cli.py

634b1d9

The rationale is to not force this global option on all users of the library. The main point for the fix is so that sssom-py's CLI print to stdout functionality is not compromised by this output!

Dropping copy=False from pd.DataFrame.infer_objects

d2b244e

As explained by @gouttegd in #561 (comment), we do not actually benefit from the small performance gain enabled by copy=False here, so we decided to drop it (which means a considerable simplification of the code as well)

Undo change to a data file which does not belong in this PR

fbee866

gouttegd approved these changes Nov 9, 2024

View reviewed changes

matentzn requested a review from gouttegd November 9, 2024 14:49

matentzn enabled auto-merge (squash) November 9, 2024 15:00

matentzn disabled auto-merge November 9, 2024 15:00

matentzn merged commit 6ce01a8 into master Nov 9, 2024
8 checks passed

matentzn deleted the rm-warnings-sssom branch November 9, 2024 15:04

matentzn added a commit that referenced this pull request Nov 9, 2024

Revert "Handle two warnings that pollute the output of sssom-py CLI (#…

368f7f5

…561)" This reverts commit 6ce01a8.

matentzn added a commit that referenced this pull request Nov 9, 2024

Handle two warnings that pollute the output of sssom-py CLI (#561)

40b97ac

See commits for a more detailed description of the changes; This PR should not introduce any functional changes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle two warnings that pollute the output of sssom-py CLI #561

Handle two warnings that pollute the output of sssom-py CLI #561

matentzn commented Nov 8, 2024

gouttegd Nov 8, 2024

matentzn Nov 9, 2024

gouttegd Nov 9, 2024

gouttegd Nov 8, 2024

matentzn Nov 9, 2024

gouttegd Nov 9, 2024 •

edited

Loading

gouttegd Nov 9, 2024

matentzn Nov 9, 2024

gouttegd Nov 9, 2024 •

edited

Loading

gouttegd Nov 9, 2024 •

edited

Loading

gouttegd Nov 9, 2024

matentzn Nov 9, 2024

gouttegd commented Nov 9, 2024

matentzn commented Nov 9, 2024

		df.replace("", np.nan, inplace=True)
		# df.infer_objects(copy=False)

Handle two warnings that pollute the output of sssom-py CLI #561

Handle two warnings that pollute the output of sssom-py CLI #561

Conversation

matentzn commented Nov 8, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gouttegd Nov 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gouttegd Nov 9, 2024 • edited Loading

Choose a reason for hiding this comment

gouttegd Nov 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gouttegd commented Nov 9, 2024

matentzn commented Nov 9, 2024

gouttegd Nov 9, 2024 •

edited

Loading

gouttegd Nov 9, 2024 •

edited

Loading

gouttegd Nov 9, 2024 •

edited

Loading