Weird output #20

Rosemeis · 2023-07-19T08:43:30Z

Hi,

I have tested Neural ADMIXTURE on the 1000 Genomes Project data but I'm seeing some weird results. It is a simple dataset of the 2504 phase 3 individuals with 500,000 SNPs randomly sampled (MAF > 0.05) across all chromosomes. I see no such issues with standard ADMIXTURE or SCOPE, for example. The PCA plot output by Neural ADMIXTURE also looks fine. I have uploaded the admixture plots for K = 5, 6, 7, each with 10 runs using different seeds. Neural ADMIXTURE was run with all default parameter settings.

Environment

conda create -n neural python=3.9
conda activate neural
pip install neural-admixture

Command example

neural-admixture train --k 5 --data_path gp.merged.downsampled.bed --save_dir ./ --init_file test_s1_k5 --name tgp.downsampled.neural.s1 --seed 1 > tgp.downsampled.neural.s1.log
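For anyone trying to reproduce the 10 seeds per K described above, the following is a minimal sketch that only builds the command lines; it does not run them. The flags are the ones from the command example above. The output naming scheme that encodes K as well as the seed is a hypothetical choice, not something taken from the original runs.

```python
import shlex


def build_commands(bed_path, ks=(5, 6, 7), seeds=range(1, 11), save_dir="./"):
    """Return (command, log_file) pairs, one per (K, seed) combination."""
    commands = []
    for k in ks:
        for seed in seeds:
            # Hypothetical naming scheme: encode both K and the seed.
            name = f"tgp.downsampled.neural.k{k}.s{seed}"
            args = [
                "neural-admixture", "train",
                "--k", str(k),
                "--data_path", bed_path,
                "--save_dir", save_dir,
                "--init_file", f"test_s{seed}_k{k}",
                "--name", name,
                "--seed", str(seed),
            ]
            commands.append((shlex.join(args), f"{name}.log"))
    return commands


if __name__ == "__main__":
    # Print shell lines that redirect each run's output to its own log.
    for cmd, log in build_commands("gp.merged.downsampled.bed"):
        print(f"{cmd} > {log}")
```

The printed lines can be pasted into a shell or a job script; keeping one log per run makes it easier to compare seeds afterwards.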


dmasmont · 2023-07-19T17:25:43Z

Hi Jonas Meisner,

Thanks for reaching out! The results do look strange indeed. Could we arrange a way to get the actual data you are using, and maybe the results that original ADMIXTURE is providing? Please, send me an email at [email protected]

Regards,
Daniel Mas

Rosemeis · 2023-07-20T07:15:05Z

Yes! The data is the newest version of 1KGP:
http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/

But that also includes the related individuals, so I have attached the names of the original 2504 individuals, the 500K sampled sites as well as a run from ADMIXTURE with K=5.
tgp.downsampled.zip

Best,
Jonas

haneenih7 · 2023-07-26T16:45:59Z

Hey,
thanks for neural-admixture, it's really fast. I actually had the exact same problem, i.e., the replacement of clusters. Do you know why that happens?
Additionally, I do not see an option for bootstrapping as in the original ADMIXTURE:
-B[X] : do bootstrapping [with X replicates]
Am I missing the parameter, or is it true that there is no bootstrapping option?

Thanks!

AlbertDominguez · 2023-07-31T19:44:07Z

Hi @Rosemeis and @haneenih7,

Thanks to both for your interest and testing the software!

@Rosemeis, thanks for pointing the issue out and sending us some results with data! Having checked it out, it looks like the initialization of the Q values by the network was very different, and sometimes too far off to be recoverable by the network. We didn't see this at all while developing the method, so we suspect it might be some change introduced in a dependency. We have also seen that the PCK-Means initialization (Supp. Algorithm 1 in the paper) appears to be more stable, so we have set the default to PCK-Means instead of the previous default, PCArchetypal.

Nevertheless, in order to stabilize the results, we have added a "Warmup training" step for initialization, where we train the encoder in a supervised manner to estimate Q values, using as labels a function of the distance to the initial values of the P matrix in PCA space. This way, we have a sensible initialization not only for the P matrix, but also for the encoder which computes Q. In practice, there's no change in how the algorithm is called from the CLI! Convergence checks are also performed only from epoch 15 onwards, to avoid stopping too early.
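To make the warmup idea above concrete, here is a rough sketch and not the actual neural-admixture code. The comment above only says the labels are "a function of the distance" to the initial P values in PCA space; the softmax over negative squared distances, the `temperature` parameter, the loss-difference stopping rule and the `tol` value are all assumptions chosen for illustration.

```python
import numpy as np


def warmup_targets(z, centroids, temperature=1.0):
    """Soft Q targets for warmup (assumed form, see lead-in).

    z:         (N, D) samples projected into PCA space
    centroids: (K, D) initial P values projected into the same space
    """
    # Squared Euclidean distance of every sample to every centroid: (N, K).
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)  # each row sums to 1


def should_stop(epoch, losses, min_epoch=15, tol=1e-6):
    """Only check convergence once `min_epoch` has been reached."""
    if epoch < min_epoch or len(losses) < 2:
        return False
    # Hypothetical criterion: the loss barely changed since the last epoch.
    return abs(losses[-2] - losses[-1]) < tol
```

The encoder would then be fit to these targets for a few epochs before the main training starts, so that both P and the encoder producing Q begin from a consistent state.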

This is what the results look like for the data you provided, using the same default parameters you were using (for K=5):

Let me know in case of any followup, happy to discuss the issue a bit more if necessary! To install the upgrades, simply run pip install -U neural-admixture.

@haneenih7, regarding bootstrapping, there is currently no such option for Neural ADMIXTURE. I will open a new issue for the feature and we will try to publish it in the next release, along with cross-validation!

Rosemeis · 2023-08-13T15:52:36Z

Hi @AlbertDominguez

Thanks for looking into the issue!
After the update, the results are looking a bit more consistent.

However, I still see a lot of issues. Here are the neural-admixture runs for K=5:

Here are the ADMIXTURE runs, all getting the same solution:

The results resemble yours, but taking the admixed AFR populations as an example, you can see that there is no European component, that a lot of individuals are fully assigned to an improper ancestry, and that the results are generally not very consistent across runs at a finer scale. In my opinion, these would be signs of convergence problems if I were using ADMIXTURE.
Also, if I compute the log-likelihoods of the Q and F matrices, they are far off the results of ADMIXTURE as well as of SCOPE, which optimizes a least-squares objective.

neural-admixture (10 runs, K=5)
-1107004412.7
-1107158223.9
-1107499553.3
-1107478393.7
-1107716015.0
-1107295943.5
-1107270413.1
-1107290867.5
-1107093124.2
-1107849482.0

ADMIXTURE (10 runs, K=5)
-1092429529.5
-1092429519.0
-1092429537.5
-1092429534.9
-1092429535.0
-1092429531.4
-1092429520.6
-1092429521.8
-1092429530.0
-1092429521.5

SCOPE (10 runs, K=5)
-1092788577.1
-1092788572.3
-1092789272.9
-1092788574.4
-1092788569.2
-1092788570.0
-1092788569.5
-1092788572.9
-1092789298.4
-1092788573.3

The log-likelihood also doesn't seem to be reported by neural-admixture, thus making it a bit harder to debug. It would also be nice to have a feature to control the number of threads!
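Since the log-likelihood is not reported by the tool, a comparison like the one above can be computed from the output matrices. The sketch below assumes the standard binomial admixture likelihood without the constant binomial coefficient, with genotypes coded as allele counts 0/1/2, and with F holding frequencies of the same allele that G counts. Marking missing genotypes with a negative value and processing SNPs in chunks are illustrative conventions, not necessarily how the numbers above were obtained.

```python
import numpy as np


def admixture_loglik(G, Q, F, chunk=4096, eps=1e-10):
    """Binomial log-likelihood of genotypes under the admixture model.

    G: (N, M) allele counts in {0, 1, 2}; negative values mean missing
       (hypothetical convention)
    Q: (N, K) ancestry proportions
    F: (M, K) ancestral allele frequencies
    """
    total = 0.0
    # Walk over SNPs in chunks so the (N, chunk) matrix fits in memory.
    for start in range(0, G.shape[1], chunk):
        g = G[:, start:start + chunk].astype(np.float64)
        # Individual-specific allele frequencies, kept away from 0 and 1.
        h = np.clip(Q @ F[start:start + chunk].T, eps, 1.0 - eps)
        ll = g * np.log(h) + (2.0 - g) * np.log1p(-h)
        total += ll[g >= 0].sum()  # skip missing genotypes
    return float(total)
```

When comparing tools, it matters that all Q and F matrices are evaluated against the same genotype matrix and the same allele coding; otherwise the values are not comparable.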

Best,
Jonas

AlbertDominguez mentioned this issue Jul 31, 2023

Bootstrapping #21

Open
