Migrate to dgraph #26

Open · wants to merge 8 commits into master

Conversation

@dennybritz (Owner) commented May 23, 2020

Ref #24

Looked into dgraph. This PR adds a make-triples command that exports data to RDF format, which can be imported directly into dgraph. Works fine on my local machine. Export takes ~15 sec and importing with the live loader takes ~1 min, so around ~2h total for importing all data. Probably a lot less if we use the batch loader, which I haven't tried yet.

Overall this seems like a better fit than postgres; the main concern is resource usage. It's heavier than postgres and would probably need at least a dedicated server with 8-16GB of memory or so. I'll just leave this branch here for now.

For prod, we could probably put this onto a preemptible GKE node pool for testing.

@dennybritz changed the title from "dgraph support" to "Migrate to dgraph" on May 26, 2020
@dennybritz (Owner, Author)

Decided to add the dgraph binaries directly into the papergraph Dockerfile to make data inserts easier.

The remaining open question now is how to handle dgraph xidmap files in production. We need to store these somewhere...

@dennybritz (Owner, Author)

Trying to insert data, it seems like we need something like ~300GB (ideally SSD) with the text indices - it's huge compared to postgres. Live inserts also get really slow over time. Need to use the bulk inserter but that one is a pain to use because we need to move data around and start/stop services.

While I think it would be nice to move to dgraph, I feel like it's not worth it for now. Postgres is so simple and fast...

@dennybritz (Owner, Author) commented May 27, 2020

I let data loading run overnight and it ran out of memory. It seems like we either need significantly more RAM or a fully distributed setup that shards the data. Costs would probably be ~$300/mo or so for a reliable setup, which is much more expensive than the simple postgres instance. Not doing this for now.

@jlricon commented Jun 1, 2020

A workaround I found in my latest implementation of the full corpus: dump everything to one huge CSV, then use the \COPY command to load everything (this is the fastest way), and only then add indices. Also, avoid the abstracts; you could just fetch them on demand from semanticscholar in the browser. That gets you most of the functionality (the only issue is that a given user might get rate limited?). This is in postgres; I'll let you know if I get everything working in a UI.
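A minimal sketch of that load path, assuming a hypothetical single-table layout (table and column names here are illustrative, not the actual schema):

-- hypothetical single-table layout; column names are illustrative
create table papers (
    id          text,
    title       text,
    citation_n  integer,
    incitations text[]
);

-- \COPY is a psql client-side command that streams the whole CSV in one pass
\COPY papers FROM 'papers.csv' WITH (FORMAT csv, HEADER true)

-- add constraints/indices only after the bulk load so the copy stays fast
alter table papers add primary key (id);
create index papers_citation_n_idx on papers (citation_n desc nulls last);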

@dennybritz (Owner, Author) commented Jun 1, 2020

Oh yeah, I have no doubt that you can load the full corpus into postgres. But you won't be able to make complex graph queries over multiple hops that complete in a reasonable amount of time with joins. Joins are already slow for large 2-3 hop queries with many citations, even with the proper indices. I think the only long-term solution would be a proper graph database?

Unless you don't care about real-time queries, then postgres may be fine.

@jlricon commented Jun 1, 2020

What are some tests that I can try, to see how long my postgres takes? (i.e. a query that I can run)

@dennybritz (Owner, Author)

Not exactly sure, but off the top of my head: for a paper, find all citing (not cited) papers over 3 hops, and then filter the resulting subgraph to the 1000 most popular ones, or something like that? If the original paper is popular, with tens of thousands of out-citations, you're looking at intermediate results that can be quite large before the projection.
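A rough sketch of that test, assuming a hypothetical normalized citations(citing_id, cited_id) edge table rather than the actual papergraph schema:

-- hypothetical edge table: citations(citing_id, cited_id)
-- collect papers citing the seed over 3 hops, keep the 1000 most cited
-- ('<seed-paper-id>' is a placeholder)
with hop1 as (
    select citing_id from citations where cited_id = '<seed-paper-id>'
), hop2 as (
    select c.citing_id from citations c join hop1 on c.cited_id = hop1.citing_id
), hop3 as (
    select c.citing_id from citations c join hop2 on c.cited_id = hop2.citing_id
)
select p.*
from papers p
where p.id in (select citing_id from hop1
               union select citing_id from hop2
               union select citing_id from hop3)
order by p.citation_n desc nulls last
limit 1000;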

@jlricon commented Jun 1, 2020

So as one test, for the 'hallmarks of aging' paper (~5k citations), the first + second order citations can be retrieved in 1.3 seconds. This is 27k papers. At this point the problem is rendering them, not fetching them. And this includes 0.5 seconds of planning; a stored procedure would be quicker. Though the way I intend my graph thing to be used is expanding nodes one by one: get all citations for X, then you can go through each citation and expand it.

select * from papers where id in (
    select unnest(incitations) from papers where id in (
        select unnest(incitations) from papers where id = '7261469291ba8a9fecf4c1f4f577a555fe01a708'
    )
    union
    select unnest(incitations) from papers where id = '7261469291ba8a9fecf4c1f4f577a555fe01a708'
);

(In my schema, I have the citations in an array; everything is a single table)

This one also runs fast enough

explain analyze
select * from papers where id in (
    select unnest(incitations) from papers where id in (
        select unnest(incitations) from papers where id = '7261469291ba8a9fecf4c1f4f577a555fe01a708'
    )
    union
    select unnest(incitations) from papers where id = '7261469291ba8a9fecf4c1f4f577a555fe01a708'
)
order by citation_n desc nulls last
limit 1000;

@dennybritz (Owner, Author) commented Jun 1, 2020

27k total papers seems quite small. There are many papers that have more out-citations than this without any kind of join. Try for example to query the ResNet paper over 3 hops.

> Though the way I intend my graph thing to be used is expanding nodes one by one: Get all citations for X, then you can go through each citation and expand it

I see, for that use case postgres is probably totally fine. I wouldn't worry about a graph database in that case, you're not doing any graph queries after all. Though in that case, why do you even need a database? Semantic Scholar already lists all citations, can't you just query it directly, or just use their interface?

@jlricon commented Jun 2, 2020

> Though in that case, why do you even need a database? Semantic Scholar already lists all citations, can't you just query it directly, or just use their interface?

Because I'd get rate-limited! Also, I can't do text search through their API, or things like 'find all the meta-analyses or reviews that cited this, or that cited any of its relevant citations', etc.
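One hedged way to express such a filter in postgres, assuming a title column and the single-table array schema described above (in practice you'd want an expression GIN index on the tsvector):

-- hypothetical: keep only first-order citing papers whose title looks like a
-- review or meta-analysis ('<seed-paper-id>' is a placeholder)
select *
from papers
where id in (select unnest(incitations)
             from papers
             where id = '<seed-paper-id>')
  and (to_tsvector('english', title) @@ plainto_tsquery('english', 'review')
       or to_tsvector('english', title) @@ plainto_tsquery('english', 'meta-analysis'));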

For the ResNet paper, three hops with the top 1000 papers takes 18 seconds; the query is:

with first_selec as (
    select unnest(incitations) as inc from papers where id = '2c03df8b48bf3fa39054345bafabfeff15bfd11d'
), second_selec as (
    select unnest(incitations) as inc from papers where id in (select inc from first_selec)
), third_selec as (
    select unnest(incitations) as inc from papers where id in (select inc from second_selec)
), all_ids as (
    select inc from first_selec
    union all select * from second_selec
    union all select * from third_selec
)
select * from papers where id in (select inc from all_ids)
order by citation_n desc nulls last
limit 1000;

I've been trying to turn on parallel queries, so far without success...
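For reference, a sketch of the session settings that usually decide whether the planner parallelizes a query; the values are illustrative and whether they help depends on the actual plan:

-- session-level settings that influence parallel plan choice (illustrative values)
set max_parallel_workers_per_gather = 4;
set parallel_setup_cost = 100;
set parallel_tuple_cost = 0.01;
set min_parallel_table_scan_size = '8MB';

-- then look for Gather / Parallel Seq Scan nodes in the plan
explain analyze
select * from papers order by citation_n desc nulls last limit 1000;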
