
add postgres prebuilds #445

Open
jacksund opened this issue May 13, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@jacksund (Owner)

Describe the desired feature

Just like sqlite3 has prebuilds, we can do the same with postgres dump files:
https://www.postgresql.org/docs/8.1/backup.html
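As a rough sketch of the dump/restore workflow, assuming the standard `pg_dump`/`pg_restore` tools (the database name and file names below are placeholders, not settings from this project):

```python
# Sketch: build the pg_dump / pg_restore commands for a prebuild file.
# "simmate" and the file names are placeholders for illustration.

def build_dump_command(database: str, outfile: str) -> list[str]:
    # --format=custom produces a compressed archive that pg_restore understands
    return ["pg_dump", "--format=custom", f"--dbname={database}", f"--file={outfile}"]

def build_restore_command(database: str, dumpfile: str) -> list[str]:
    return ["pg_restore", f"--dbname={database}", dumpfile]

# To actually run these (requires a local postgres server), pass the list
# to subprocess.run(..., check=True).
print(" ".join(build_dump_command("simmate", "simmate_prebuild.dump")))
```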

Additional context

No response

To-do items

No response

@jacksund jacksund added the enhancement New feature or request label May 13, 2023
@scott-materials (Contributor)

I've used this in the past for exporting & importing databases, and I run an automated dump every night for backup.

Anything you're planning to implement here?

@jacksund (Owner, Author)

The plan is to...

  • create a dump file with all of the third party data loaded (matproj, jarvis, cod, etc.)
  • upload this dump file to the simmate cdn (where the sqlite prebuilds are also stored)
  • have the simmate database reset command offer to download and load the dump file

If you're interested in sharing the lab's calculation results with others, we can eventually add some warren lab data to the prebuilds too.
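The steps above could be sketched roughly as follows; the CDN URL, file names, and function name are all hypothetical placeholders, not the eventual implementation:

```python
from pathlib import Path

# Hypothetical CDN location -- the real prebuild URL is not decided in this issue.
PREBUILD_URL = "https://example-cdn.invalid/simmate/postgres-prebuilds/latest.dump"

def reset_with_prebuild(use_prebuild: bool) -> list[str]:
    """Sketch of what `simmate database reset` could do when offering a prebuild."""
    if not use_prebuild:
        # fall back to building an empty database from scratch
        return []
    dump_path = Path("latest.dump")
    # step 1: download the dump from the CDN, e.g. with
    #   urllib.request.urlretrieve(PREBUILD_URL, dump_path)
    # step 2: load it into the local postgres database
    return ["pg_restore", "--dbname=simmate", str(dump_path)]

print(reset_with_prebuild(use_prebuild=True))
```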

@jacksund (Owner, Author)

I'm not sure whether this will be faster or slower than the load-remote-archives command, so there's a chance I scrap this feature too.

@scott-materials (Contributor)

It takes 4-12 hours to grab all the data for postgres when loading all archives, in my experience. I presume the bottleneck is that the cdn is rate limited.

I believe our database with all matproj, jarvis, cod, + oqmd is 4 GB for reference. Not sure how it compares to the original data that is stored at the cdn.

@jacksund (Owner, Author)

> I presume the bottleneck is that the cdn is rate limited.

Downloading from the CDN is actually really quick and only takes a few minutes with UNC's crazy internet speeds. The slow part is taking that CSV data and saving it to your postgres database. Right now, the bottleneck is recalculating the MatProj hull energies for all systems, so I need to cache these. Once cached, I bet the load-remote-archives command will only take ~1-2 hrs.
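As a generic illustration of the caching idea (the hull-energy function below is a stand-in, not Simmate's actual implementation, which would more likely cache results in the database rather than in memory):

```python
from functools import lru_cache

CALL_COUNT = {"n": 0}  # track how often the expensive work actually runs

@lru_cache(maxsize=None)
def hull_energy(chemical_system: str) -> float:
    """Stand-in for an expensive hull-energy calculation."""
    CALL_COUNT["n"] += 1
    # ... real code would build the convex hull for this chemical system ...
    return 0.0  # placeholder result

hull_energy("Y-C-F")  # computed on the first call
hull_energy("Y-C-F")  # served from the cache; the body does not run again
print(CALL_COUNT["n"])  # 1
```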

> Not sure how it compares to the original data that is stored at the cdn.

You can look at the files in ~/simmate/sqlite-prebuilds/ to see what's stored in the cdn. These are really just CSV files compressed into a ZIP, and I think they're normally ~1-2 GB.
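That zipped-CSV layout is easy to reproduce and inspect in a few lines (the table and column names below are made up for illustration):

```python
import csv, io, zipfile

# Build a small archive shaped like the described prebuild files:
# a ZIP containing plain CSV tables.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("structures.csv", "id,formula\n1,NaCl\n2,YCF\n")

# Read it back without extracting anything to disk.
with zipfile.ZipFile(buffer) as zf:
    with zf.open("structures.csv") as f:
        rows = list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8")))

print(rows[0]["formula"])  # NaCl
```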
