Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Package returns NaN when it probably should return missing #20

Open
nilshg opened this issue Aug 13, 2024 · 3 comments
Open

Package returns NaN when it probably should return missing #20

nilshg opened this issue Aug 13, 2024 · 3 comments

Comments

@nilshg
Copy link

nilshg commented Aug 13, 2024

julia> get_prices("KICL.BO"; startdt = Date(2022, 11, 3), enddt = Date(2022, 11, 8))
OrderedCollections.OrderedDict{String, Any} with 8 entries:
  "ticker"    => "KICL.BO"
  "timestamp" => [DateTime("2022-11-03T03:45:00"), DateTime("2022-11-04T03:45:00"), DateTime("2022-11-07T03:45:00")]
  "open"      => [1900.0, NaN, 1878.0]
  "high"      => [1900.0, NaN, 1881.0]
  "low"       => [1876.0, NaN, 1871.85]
  "close"     => [1876.0, NaN, 1873.55]
  "adjclose"  => [1876.0, NaN, 1873.55]
  "vol"       => [23.0, NaN, 112.0]

Looking at the YFinance website I see:

image

So it looks like data is missing here for some reason. In Julia this is normally indicated by missing values - there is in principle a price for the security on that day, but for some reason it isn't reported in YFinance.

NaN should be reserved for results of computations that yield NaN like 0/0 (see e.g. this old discussion).

NaN also has the unhelpful side effect that it silently propagates through many operations, leading to often surprising errors way down a call stack because unsuspecting users (like myself) see that the data returned doesn't have missing values and therefore assume that it is complete.

@eohne
Copy link
Owner

eohne commented Aug 13, 2024

Hi, I understand why missing makes more sense but after some discussion a while back I/we settled on returning NaN instead.
The issue with the earlier discussion: #5 (comment)
Note also that TimeSeries.jl does not allow for missing values and only supports NaN.

Happy to discuss further though.

Edit: This behaviour is documented in the docs.

@nilshg
Copy link
Author

nilshg commented Aug 13, 2024

Ah sorry, I had skimmed the existing issues but it wasn't obvious to me that this was discussed before. Your package your rules of course, but I would say it's definitely unusual in the Julia data community to use NaN for missing values, and ironically in that thread you argue for using NaN over 0` to signal to the user that the number is... missing :)

Generally, missing is just much more of a warning flag than NaN, if only because it directly changes the element type of your vector to Union{Missing, <:Number} and will lead to errors in common data wrangling work flows much more easily, e.g. things like

julia> 0 < NaN
false

julia> 0 < missing
missing

will prevent you from accidentally subsetting your data with nonsensical values etc.

Anyway, I use the function as gp(x; args...) = DataFrame(get_prices(x; args...)) for convenience anyway so I might as well do a replace(df, NaN => missing) while I'm at it.

@eohne
Copy link
Owner

eohne commented Aug 13, 2024

I completely agree that missing would be less controversial - also raised this back in the Issue 5 discussion - as I read the discourse discussion you linked already before the discussion in issue 5 but suggested NaN from the start instead because that's the way TimeSeries.jl handles missing values - also I think returning missing would have been a breaking change and I saw no need for that.

I guess the deciding factor though was to allow easy integration with TimeSeries.jl and back then if I remember correctly either TSFrames.jl (which allow missing values) was not a thing yet or I just didn't know about it yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants