Migrate to dgraph #26
base: master
Conversation
Decided to add the dgraph binaries directly into the papergraph Dockerfile to make data inserts easier. The remaining open question now is how to handle dgraph xidmap files in production. We need to store these somewhere...
Trying to insert data, it seems like we need something like ~300GB (ideally SSD) with the text indices - it's huge compared to postgres. Live inserts also get really slow over time. We'd need to use the bulk inserter, but that one is a pain to use because we need to move data around and start/stop services. While I think it would be nice to move to dgraph, I feel like it's not worth it for now. Postgres is so simple and fast...
Let data loading run overnight. Out of memory. Seems like we either need significantly more RAM or have to run a fully distributed setup and shard the data. Costs would probably be ~$300/mo or so for a reliable setup, much more expensive than the simple postgres instance. Not doing this for now.
A workaround I found in my latest implementation of the full corpus: dump everything to one huge CSV, then use the `\COPY` command to load everything (this is the fastest way), then add indices. Also, avoid the abstracts; you could just fetch them on demand from semanticscholar in the browser. That gets you most of the functionality (the only issue is that a given user might get rate limited?). This is in postgres, that is. Will let you know if I get everything working in a UI.
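For reference, a minimal sketch of that loading path; the table layout and file path here are made up for illustration, not the actual schema:

```sql
-- Hypothetical sketch of the "dump to CSV, \COPY, then index" path described above.
-- Table layout and file path are assumptions, not the actual papergraph schema.
CREATE TABLE papers (
    id        text,
    title     text,
    year      int,
    citations text[]    -- ids of papers this paper cites
);

-- In psql, \copy streams the CSV through the client connection and is far faster
-- than individual INSERTs:
\copy papers FROM '/data/papers.csv' WITH (FORMAT csv, HEADER true)

-- Create indices only after the load so the copy isn't slowed by index maintenance.
ALTER TABLE papers ADD PRIMARY KEY (id);
CREATE INDEX papers_citations_gin ON papers USING gin (citations);
CREATE INDEX papers_title_fts ON papers USING gin (to_tsvector('english', title));
```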
Oh yeah, I have no doubt that you can load the full corpus into postgres. But you won't be able to make complex graph queries over multiple hops that complete in a reasonable amount of time with joins. Joins are already slow for large 2-3 hop queries with many citations, even with the proper indices. I think the only long-term solution would be a proper graph database? Unless you don't care about real-time queries, in which case postgres may be fine.
What are some tests that I can try (to see how long my postgres takes)?
Not exactly sure, but off the top of my head: for a paper, find all citing (not cited) papers over 3 hops, and then filter the resulting subgraph to the 1000 most popular ones, or something like that? If the original paper is popular with tens of thousands of out-citations, you're looking at intermediate results that can be quite large before the projection.
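As a rough sketch (not papergraph's actual schema), assuming a citations(from_paper, to_paper) edge table and a papers(id, n_citations) table, that test could be written as a recursive CTE:

```sql
-- Rough sketch of the suggested test. The citations(from_paper, to_paper) edge
-- table and papers(id, n_citations) table are hypothetical names.
WITH RECURSIVE citing AS (
    -- papers that directly cite the seed paper (depth 1)
    SELECT c.from_paper AS id, 1 AS depth
    FROM citations c
    WHERE c.to_paper = 'SEED_PAPER_ID'
  UNION
    -- papers citing anything already in the set, up to 3 hops
    SELECT c.from_paper, citing.depth + 1
    FROM citations c
    JOIN citing ON c.to_paper = citing.id
    WHERE citing.depth < 3
)
SELECT DISTINCT p.id, p.n_citations
FROM citing
JOIN papers p ON p.id = citing.id
ORDER BY p.n_citations DESC   -- project down to the 1000 most popular papers
LIMIT 1000;
```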
So as one test: for the 'hallmarks of aging' paper (~5k citations), the first- and second-order citations can be retrieved in 1.3 seconds. This is 27k papers. At this point the problem is rendering them, not fetching them. And this includes 0.5 seconds of planning; a stored procedure would be quicker. Though the way I intend my graph thing to be used is expanding nodes one by one: get all citations for X, then you can go through each citation and expand it. (In my schema, I have the citations in an array; everything is a single table.) This one also runs fast enough.
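For what it's worth, node-by-node expansion against that single-table/array layout could look roughly like this (column names are guesses based on the description above, not the actual schema):

```sql
-- One-hop expansion with the single-table/array schema described above.
-- Column names are guesses, not the actual schema.

-- Outgoing edges: papers referenced in the seed paper's citations array.
SELECT cited.*
FROM papers p
CROSS JOIN LATERAL unnest(p.citations) AS c(cited_id)
JOIN papers cited ON cited.id = c.cited_id
WHERE p.id = 'SEED_PAPER_ID';

-- Incoming edges: papers whose citations array contains the seed paper
-- (a GIN index on the array column makes this containment check fast).
SELECT citing.*
FROM papers citing
WHERE citing.citations @> ARRAY['SEED_PAPER_ID'];
```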
27k total papers seems quite small. There are many papers that have more out-citations than this, without any kind of join. Try, for example, querying the ResNet paper over 3 hops.
I see, for that use case postgres is probably totally fine. I wouldn't worry about a graph database in that case; you're not doing any graph queries, after all. Though in that case, why do you even need a database? Semantic Scholar already lists all citations; can't you just query it directly, or just use their interface?
Because I'd get rate-limited! Also, I can't do text search through their API, or things like 'find all the meta-analyses or reviews that cited this, or that cited any of its relevant citations', etc. For the ResNet paper, three hops with the top 1000 papers takes 18 seconds. I've been trying to turn on parallel queries, without success yet...
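In case it helps, these are the settings that usually control whether postgres chooses a parallel plan; whether they apply here is a guess, since the plan and server config aren't shown:

```sql
-- Hypothetical knobs to coax the planner into a parallel plan; the values are
-- examples, not recommendations for this particular setup.
SET max_parallel_workers_per_gather = 4;  -- workers allowed per Gather node
SET max_parallel_workers = 8;             -- total parallel workers in the system
SET parallel_setup_cost = 100;            -- lower the assumed cost of starting workers
SET parallel_tuple_cost = 0.01;           -- lower the assumed per-tuple transfer cost
SET min_parallel_table_scan_size = '1MB'; -- allow parallel scans on smaller tables

-- Then check whether the plan actually contains Gather / Parallel nodes:
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM papers;
```

One caveat: as far as I know, postgres doesn't parallelize the recursive part of a WITH RECURSIVE query, so for multi-hop queries only the plain scans and the final join/sort would benefit.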
Ref #24
Looked into dgraph. This PR adds a `make-triples` command that exports data to RDF format, which can be imported directly into dgraph. Works fine on my local machine. The export takes ~15 sec and importing using the live loader takes ~1 min, so around ~2h total for importing all data. Probably a lot less if we use the batch loader, which I haven't tried yet.

Overall this seems like a better fit than postgres; the main concern is around resource usage. It's heavier than postgres and would probably need at least a dedicated server with 8-16GB of memory or so. I'll just leave this branch here for now.
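To give a sense of the shape of that export (the predicate name, the blank-node ids, and the citations table below are assumptions, not necessarily what `make-triples` emits), the citation edges essentially become RDF lines that the live loader can ingest; from a postgres edge table you could generate the same thing directly:

```sql
-- Illustration only: the rough shape of an RDF export for citation edges.
-- The citations(from_paper, to_paper) table and the <cites> predicate are
-- assumptions. The _:blank-node ids are what the live loader resolves to
-- dgraph uids via the xidmap mentioned above.
\copy (SELECT format('_:%s <cites> _:%s .', from_paper, to_paper) FROM citations) TO '/tmp/citations.rdf'
```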
For prod, could probably put this onto a preemptible gke node pool for testing.