This repository has been archived by the owner on Jul 13, 2021. It is now read-only.

support datalimits and respect area in violin #730

Merged (6 commits) on May 16, 2021

piever
Member

@piever piever commented May 5, 2021

From a discussion on Discourse (https://discourse.julialang.org/t/side-by-side-violin-plots-with-vegalite-jl/60523/2?u=piever) as well as a conversation on Slack (copied below to avoid losing it), it seems that users expect violin plots to be normalized together, in that all densities are rescaled by the same factor to fit in the allocated width.

This PR allows passing both sides together so that they are normalized correctly, and adds "trimming options" (as in datalimits=(0, Inf) or datalimits=extrema).
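
To make the difference between the two normalization strategies concrete, here is a minimal sketch in plain Julia (no Makie required). The helpers `scale_per_violin` and `scale_jointly` are hypothetical names chosen for illustration, not part of AbstractPlotting, and the densities are toy vectors rather than real kernel density estimates.

```julia
# Hypothetical helpers, for illustration only (not AbstractPlotting API).

# Per-violin normalization: every curve is stretched so that its own
# peak reaches `width`.
scale_per_violin(densities, width) = [d .* (width / maximum(d)) for d in densities]

# Joint normalization: one common factor for all curves, chosen so that
# the tallest peak reaches `width`. Area ratios between curves are preserved.
function scale_jointly(densities, width)
    factor = width / maximum(maximum, densities)
    return [d .* factor for d in densities]
end

narrow = [0.0, 2.0, 0.0]   # concentrated density, tall peak
wide   = [0.5, 1.0, 0.5]   # spread-out density, low peak

per_violin = scale_per_violin([narrow, wide], 0.4)
joint      = scale_jointly([narrow, wide], 0.4)
```

With per-violin scaling both curves reach the full width, so the spread-out density ends up with a larger drawn area. With joint scaling only the tallest curve reaches the full width, and equal probability mass keeps equal area.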

The main drawback is that both sides of the violin now have to be plotted in the same call (so that the correct rescaling factor can be computed).

Other issues so far:

  • Cairo drawing poly! with a vector of colors (as many as there are polygons) seems very slow.
  • GLMakie somehow has trouble drawing the poly! with different colors (one per polygon), and draws them transparent, really not sure why (oops, I had the wrong AbstractPlotting checked out; things actually work fine in GLMakie)

cc: @sethaxen @yakir12

Slack chat for posterity

I have another "renormalization" question about statistical data visualization. It seems that both StatsPlots and Makie plot violins by renormalizing the density of each category so that the maximum is the maximum of the "violin width" https://user-images.githubusercontent.com/16589944/98538614-506c3780-228b-11eb-881c-158c2f781798.png, whereas it looks to me as if ggplot2 renormalizes by a global quantity, so that different categories can be shown with different widths: https://ggplot2.tidyverse.org/reference/geom_violin-1.png
question 1) Are both renormalization strategies acceptable, or are there strong reasons in favor of one or the other?
question 2) I was planning to implement ridge density plots, and was curious: if we keep the renormalization by category approach for violins, should we also renormalize by category in ridge plots?

Seth Axen 16 hours ago
My two cents: it's important that one unit area corresponds to the same amount of probability mass for each violin, which allows easy visual comparison of the violins. So I guess I prefer ggplot's approach. How would StatsPlots and Makie plot the second version?

Pietro Vertechi 15 hours ago
so I guess even in a side by side violin (this just popped up https://discourse.julialang.org/t/side-by-side-violin-plots-with-vegalite-jl/60523/2?u=piever) both sides should have equal area? We may need to change the implementation a bit then, because one can no longer plot left and right side separately

Link preview (JuliaLang): Side by side violin plots with VegaLite.jl
A promising alternative is using split violins in CairoMakie

using CairoMakie
xs1 = rand(["a", "b", "c"], 1000)
ys1 = randn(1000)
dodge1 = rand(1:2, 1000)
xs2 = rand(["a", "b", "c"], 1000)
ys2 = randn(1000)
dodge2 = rand(1:2, 1000)
fig = Figure()
ax = Axis(fig[1, 1])
violin!(ax, xs1, ys1, dodge = dodge1, side = :left, color = "orange")
violin!(ax, xs2, ys2, dodge = dodge2, side = :right, color = "teal")
fig

which produces [image]. Looks nice, I think. However, with my actual data, it looks like ...
Yesterday at 5:15 PM

Pietro Vertechi 15 hours ago
(looks like ggplot trims the violin by default, not 100% sure that's a good idea)

Pietro Vertechi 15 hours ago
(related question: should the area / probability mass ratio be respected also across different subplots of the same facet?)

Yakir Gagnon 4 hours ago
Just plotted violins, I also find it irritating that the smoothing function extends the data across the y-axis which makes it look like there is data below and above the data's extrema. This is problematic if the data is by definition all positive (e.g. body height) and due to some data close to zero the violin spills below zero: impossible.

Seth Axen 3 hours ago
> so I guess even in a side by side violin (this just popped up https://discourse.julialang.org/t/side-by-side-violin-plots-with-vegalite-jl/60523/2?u=piever) both sides should have equal area?
This gets to a deeper issue: what is the area that a violin plot is representing? E.g. if it's representing probability mass, and two categories are shown on the two sides, then yes, the two sides should have the same area. But if the area is meant to represent proportion of a population, and the categories have two different total populations, then it might make sense to use different areas for the two sides. This is something that ggplot doesn't do. The first 3 pages of http://www.mjskay.com/papers/chi2020-pgog.pdf are a nice read on this.

Seth Axen 3 hours ago
> (looks like ggplot trims the violin by default, not 100% sure that's a good idea)
> Just plotted violins, I also find it irritating that the smoothing function extends the data across the y-axis which makes it look like there is data below and above the data's extrema.
I also prefer truncated violins by default. The tails can easily be misleading, since they extend outside the data range. This is why both the plots implemented in ArviZ.jl and MCMCChains.jl default to truncating to the data range.

Yakir Gagnon 2 hours ago
So ideally, a histogram-related function should allow for controlling all these options... i.e.:
truncated = true, mass = true, ...
I thought one of the hist-plotting functions did that somewhere, I remember this being discussed somewhere already...

Pietro Vertechi 2 hours ago
I asked about this in the case of normalized stack histogram, where for now we conform to ggplot (normalize by class), which seems to be in disagreement with the reference above. There, it is even trickier, because for stacked bars, the reference above makes sense (color the whole by category), but for dodged bars (or bars in different subplots of a facet) it is less intuitive.
The API will have to be trickier than mass = true: you basically need to specify how to group for renormalization purposes. The simplest option is probably to base it on the data format: different columns are normalized separately, while the same column split by a categorical variable is normalized together.

Yakir Gagnon 2 hours ago
omg

Yakir Gagnon 2 hours ago
amazing work 🙂

Pietro Vertechi 2 hours ago
Re: truncation, there is a trim = false setting that accidentally no longer works, but can be easily fixed. I confess I think I'm not in favor of trimming, as it becomes less clear what you are plotting. It is no longer "the convolution of my data distribution with a Gaussian", but becomes "the convolution of my data distribution with a Gaussian, unless the point is outside the empirical extrema of my distribution, in which case return zero".
It becomes even messier when you normalize together different categories: should we now trim according to the extrema of the whole data or just one category? Maybe a good compromise is to allow the user to pass explicit trimming values (eg, trim = (0, Inf) if you know the variable is positive).
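
The trade-off described here can be sketched in plain Julia. This is a hypothetical illustration, not the PR's implementation: `gaussian_kde`, `resolve_limits`, and `trimmed_density` are made-up names, and the fixed `bandwidth` is an arbitrary assumption. The limits are accepted either as an explicit tuple or as a function of the data, mirroring the two forms datalimits=(0, Inf) and datalimits=extrema mentioned in the PR description.

```julia
# Hypothetical sketch, not the PR's implementation.

# Gaussian kernel density estimate of `data`, evaluated at the points in `grid`.
function gaussian_kde(data, grid; bandwidth = 0.3)
    norm = length(data) * bandwidth * sqrt(2π)
    return [sum(exp(-0.5 * ((g - x) / bandwidth)^2) for x in data) / norm for g in grid]
end

# Limits can be given as an explicit tuple or as a function of the data.
resolve_limits(limits::Tuple, data) = limits
resolve_limits(limits::Function, data) = limits(data)

# "Convolution + rules": evaluate the KDE, then return zero outside the limits.
function trimmed_density(data, grid; datalimits = (-Inf, Inf), bandwidth = 0.3)
    lo, hi = resolve_limits(datalimits, data)
    dens = gaussian_kde(data, grid; bandwidth = bandwidth)
    return [lo <= g <= hi ? d : zero(d) for (g, d) in zip(grid, dens)]
end

heights = [0.1, 0.2, 0.5, 0.9]      # strictly positive data
grid = range(-1, 2, length = 31)

untrimmed = trimmed_density(heights, grid)
positive  = trimmed_density(heights, grid; datalimits = (0, Inf))
empirical = trimmed_density(heights, grid; datalimits = extrema)
```

Here `untrimmed` spills below zero, `positive` is cut at the known physical bound, and `empirical` is cut at the observed extrema. Trimming discards the mass in the tails, so the trimmed curves no longer integrate to one; this sketch does not renormalize them.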

Yakir Gagnon 2 hours ago
I vote for the latter, I mean, I only care if the resulting hist shows an unrealistic distribution. So while the resulting hist is not a predefined distribution (Normal etc), I would like to have a convolution+rules.

@piever
Member Author

piever commented May 5, 2021

New example (included in the docs)

using GLMakie
N = 1000
xs = rand(["a", "b", "c"], N)
dodge = rand(1:2, N)
side = rand([:left, :right], N)
colors = map(side) do s
    return s == :left ? "orange" : "teal"
end
ys = map(side) do s
    return s == :left ? randn() : rand()
end
violin(xs, ys, dodge = dodge, side = side, color = colors, datalimits = extrema)

[Image: violin]

return (rgba.r, rgba.g, rgba.b, rgba.alpha)
end

sa = StructArray((x = x̂, side = sides, color = colors))
Member Author

Not sure if one should also group by color here, it is a bit of an odd interface to have to pass a vector of colors that is the same length as a vector of sides. Maybe there could be attributes leftcolor and rightcolor instead?

@piever
Member Author

piever commented May 7, 2021

This is mergeable already, apart from an API decision. That is, given that it is better to pass both sides of the violin at the same time (for proper normalization), how do we pass colors for the two sides?

It seems to me that this can be done in two possible ways:

  1. color can be a vector with the same length as the data, specifying for each data point its color (issue: the user must make sure that the same value of side always corresponds to the same value of color)
  2. the user explicitly passes leftcolor and rightcolor

Currently this implements 1., but it should be straightforward to port it to 2. (which would be my current preference, but I'm not 100% sure).

@SimonDanisch
Member

Maybe a tuple of colors for the sides? I'd like to restrict color passing to the color attribute whenever possible...

@piever
Member Author

piever commented May 7, 2021

> Maybe a tuple of colors for the sides? I'd like to restrict color passing to the color attribute whenever possible...

Thing is, one could pass either a single color (the same for both sides) or separate colors, so tuples do create some ambiguity (we already use them for transparency). I would be OK with a NamedTuple though, e.g. color = (left = "red", right = "green"). Would that be a reasonable API?

@SimonDanisch
Member

Yeah... and then:

color = (left = [:red, ...] , right = [:green, ...])

?

@piever
Member Author

piever commented May 7, 2021

Ah, if you want each of the different left violins to be of a different color? Yes, that should work automatically, because the value of color.left would be passed as is to poly!. I just need to make sure that left and right violins are on two separate poly! calls (so that the length makes sense).
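
As a rough sketch of the named-tuple proposal under discussion at this point, the dispatch below picks the value to forward for one side. `side_color` is a hypothetical helper, not an AbstractPlotting function; it only shows how a scalar color and a per-side NamedTuple could be told apart without ambiguity.

```julia
# Hypothetical helper, not AbstractPlotting API.
# A NamedTuple selects a value per side; anything else applies to both sides.
side_color(color::NamedTuple, side::Symbol) = getproperty(color, side)
side_color(color, side::Symbol) = color

colors = (left = :orange, right = :teal)

left_color  = side_color(colors, :left)
right_color = side_color(colors, :right)
shared      = side_color(:red, :right)

# A vector per side is returned untouched, so it could be forwarded as is
# to one drawing call per side.
per_violin = side_color((left = [:red, :blue], right = [:green, :black]), :left)
```

Because the per-side value is returned unchanged, a vector of colors on one side would only make sense if left and right violins are drawn in separate calls, as noted above.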

@piever piever force-pushed the pv/violin branch 2 times, most recently from 9905137 to 217c5d2 Compare May 7, 2021 17:03
xs = rand(1:3, N)
dodge = rand(1:2, N)
side = rand([:left, :right], N)
color = Observable((left = :orange, right = :teal))
Member Author

Passing the named tuple directly, i.e. color = (left = :orange, right = :teal), triggers a bug in the theme merging mechanism, that is:

julia> violin(xs, ys, side = side, color = (left=:orange, right=:teal))
ERROR: Type missmatch while merging plot attributes with theme for key: color.
Found Observable{Any} with 0 listeners. Value:
:black in theme, while attributes contains: Attributes with 2 entries:
  left => orange
  right => teal

I imagine that should be fixed independently in AbstractPlotting?

@jkrumbiegel
Member

Can you check how many points the violins have? I remember dialing down density because it was a very high number, leading to CairoMakie slowness. But maybe there's also something slow in the CairoMakie code

@piever
Member Author

piever commented May 7, 2021

Should be good to merge if CI passes. The one remaining concern is https://github.com/JuliaPlots/AbstractPlotting.jl/pull/730/files#r628403782 (color does not accept named tuples at the moment), but I imagine that needs to be fixed separately.

@piever
Member Author

piever commented May 7, 2021

> Can you check how many points the violins have? I remember dialing down density because it was a very high number, leading to CairoMakie slowness. But maybe there's also something slow in the CairoMakie code

I also brought violin down to 200 points (same as density). There seems to be a big difference (about two orders of magnitude) depending on whether one passes a scalar color to poly! or a vector of colors (one per "closed polygon"; I'm passing a Vector{Vector{Point2f0}} to poly), e.g.

julia> @time display(violin(rand(1:2, 100), rand(100), color = :red));
  0.170620 seconds (289.41 k allocations: 14.615 MiB, 63.42% compilation time)

julia> @time display(violin(rand(1:2, 100), rand(100), color = [:red, :blue]));
  4.379037 seconds (423.27 k allocations: 18.984 MiB, 2.51% compilation time)

In some cases I get a difference of over two orders of magnitude; I suspect it is a performance bug somewhere.

@jkrumbiegel
Member

then it's probably hitting a mesh path, where it should hit a specialized poly path. i'll check it out

@jkrumbiegel
Member

ok yeah it's definitely hitting mesh, but it seems the time is spent in Cairo.finish, and png is still relatively fast. so an svg thing at its core, although we should avoid the meshes anyway

@jkrumbiegel
Member

jkrumbiegel commented May 7, 2021

oof that was actually a pretty bad bug I think, it probably drew multiple meshes for each poly (which is why the edges were jagged, because of overdrawing antialiasing.. I thought this was just a mesh vs poly artifact)

So anyway, with this PR JuliaPlots/CairoMakie.jl#159 the time for this snippet:

using CairoMakie  # assumed import: the snippet exercises the CairoMakie backend

polys = [decompose(Point2f0,
    Circle(Point2f0(i, 0), 0.3)) for i in 1:10]

f, ax, p = poly(polys, color = rand(RGBf0, 10))

@time display(f)

goes from 7.622945 seconds (476.93 k allocations: 14.165 MiB) to 0.004314 seconds (8.71 k allocations: 358.250 KiB), so just a cool 1700x speedup haha

@jkrumbiegel
Member

jkrumbiegel commented May 8, 2021

About the colors, I think I'd prefer the vector of sides, vector of colors approach. That way you can color each density however you want and violin is not the only recipe with a weird named tuple exception for attributes. It could be complicated to treat named tuples in a special way for attribute conversion, as we pass attributes to Axis and Figure that way right now, and there are similar cases all over I think.

In the simplest case the user passes only one side and one color anyway, because most users don't touch array attributes in the beginning. And once they understand the array attributes, vector of sides, vector of colors would not be that difficult.

Maybe we could also just call side flip or invert instead. Then it would work with a boolean array which is a bit simpler, and it would work for horizontal and vertical. The default would be the same side as for density (up or right). I'm not sure if violin supports horizontal mode right now?

@piever piever marked this pull request as draft May 8, 2021 20:30
@piever
Member Author

piever commented May 8, 2021

Marked as draft because it seems like we still need to decide the API.

> Maybe we could also just call side flip or invert instead. Then it would work with a boolean array which is a bit simpler, and it would work for horizontal and vertical. The default would be the same side as for density (up or right). I'm not sure if violin supports horizontal mode right now?

Violin does not yet support horizontal mode, but it definitely should (in which case one no longer really needs density... We should figure out how to unify this). As for renaming side to a boolean attribute, there is the problem that one also has the option :both to do a double-sided violin which would be hard to express, but maybe there are ways around this (you could just manually plot both sides for example).

> About the colors, I think I'd prefer the vector of sides, vector of colors approach. That way you can color each density however you want and violin is not the only recipe with a weird named tuple exception for attributes. It could be complicated to treat named tuples in a special way for attribute conversion, as we pass attributes to Axis and Figure that way right now, and there are similar cases all over I think.

As for how to pass the colors, I have the following doubt. If one has to pass a vector of colors of the same length as the vector of sides, what happens if they don't "match"? That is, if the user passes:

violin([1, 1, 1, 1], rand(4), side = [:left, :left, :right, :right], color = [:red, :blue, :red, :blue])

Should one group by the three variables (x, side and color), so that you get 4 different violins that overlap?

Or should the length of the color vector match the number of violins instead (in this case 2)? So basically one in general just passes color = repeat([colorleft, colorright], outer=length(x)) to give one color to left violins and another to right violins? This would actually have the cleaner implementation on the Makie side, as the color attribute is just forwarded to poly.

@jkrumbiegel
Member

I think it's completely fine for them not to match. Why should a user not be able to plot violins of arbitrary sides and colors? They wouldn't in the typical case but that doesn't mean it should be impossible. For example, someone might want to highlight one particular violin.

So I think colors and sides should be scalar or the same length as x/y.

We can also easily support all three sides types at once and you're right, that kind of makes density obsolete or redundant. We can think of just rerouting density to violin then maybe? Although the default for violin is double sided and vertical for me, while density is one sided and horizontal.

@piever
Member Author

piever commented May 9, 2021

> So I think colors and sides should be scalar or the same length as x/y.

You are probably right, in that in practice when you're doing a violin, your data will be in a table and it's easy to create these vectors with map from one or more columns.

> I think it's completely fine for them not to match. Why should a user not be able to plot violins of arbitrary sides and colors? They wouldn't in the typical case but that doesn't mean it should be impossible. For example, someone might want to highlight one particular violin.

I also think that's perfectly fine, what I meant is that, if one goes for "colors and sides should be scalar or the same length as x/y", there is the option that the user may give different colors for the same value of x, side, and dodge (or even give more colors than there are violins), whereas IMO the color should be a function color(x, side, dodge).

The only ways to do something sensible IMO if the user passes a vector of colors that is not a function of x, side, and dodge are:

  1. Do a further subdivision, so points are grouped together in the same density if they have same x, same side, same dodge, and same color.
  2. Error and require that color is a function of x, side, and dodge.

Option 1. was the initial implementation, so if we are happy with that I can just revert the last commit, otherwise I can revert the last commit and add an extra check.
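
Option 2 can be sketched as a consistency check in plain Julia. This is a hypothetical illustration, not the code merged in the PR: `group_colors` is a made-up name, and it assumes x, side, dodge, and color are vectors of the same length.

```julia
# Hypothetical check, not the code merged in the PR.
# Returns a Dict mapping each (x, side, dodge) group to its color, and throws
# if two points in the same group carry different colors.
function group_colors(x, side, dodge, color)
    seen = Dict{Any, Any}()
    for (key, c) in zip(zip(x, side, dodge), color)
        previous = get!(seen, key, c)   # stores `c` the first time a group is seen
        previous == c || throw(ArgumentError(
            "color must be a function of x, side, and dodge; " *
            "group $key has both $previous and $c"))
    end
    return seen
end

x     = [1, 1, 1, 1]
side  = [:left, :left, :right, :right]
dodge = [1, 1, 1, 1]

# One color per (x, side, dodge) group: accepted.
ok = group_colors(x, side, dodge, [:red, :red, :blue, :blue])
```

Under this check, the mismatched call discussed earlier in the thread (side = [:left, :left, :right, :right] with color = [:red, :blue, :red, :blue]) would raise an error instead of silently splitting the data into four overlapping violins.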

> We can think of just rerouting density to violin then maybe? Although the default for violin is double sided and vertical for me, while density is one sided and horizontal.

I feel we'll need to consolidate the API a bit. We have violin, density, AlgebraOfGraphics.density and we'll need something for Ridge plots as well (even though that is in practice implemented already as density with offset). I don't think it'd be absurd to have violin (and boxplot) default to horizontal and top, as in JuliaPlots/StatsMakie.jl#116. I would actually be happy to have violin generalize density because that would also fix MakieOrg/Makie.jl#925. We should discuss what is a good name for a recipe that generalizes simultaneously violin, density and Ridge plots.

That definitely belongs in a separate PR though, as this one is mostly about fixing the rescaling.

@piever
Member Author

piever commented May 15, 2021

Went for the conservative API (option 2 above), I think it makes the most sense. I've also added an example where all violins have different colors, to show that it is possible.

Should be good to go if docs build correctly!

@piever piever marked this pull request as ready for review May 15, 2021 16:57
@piever
Member Author

piever commented May 15, 2021

Looks like docs built fine. Link to new violin page: https://makie.juliaplots.org/previews/PR730/plotting_functions/violin.html

@jkrumbiegel
Member

Nice, I think the conservative choice is good

@SimonDanisch
Member

Thanks a lot :)

@SimonDanisch SimonDanisch merged commit 185de01 into master May 16, 2021
@SimonDanisch SimonDanisch deleted the pv/violin branch May 16, 2021 10:52