Migrate to dgraph #26

Open · wants to merge 8 commits into master

Conversation

@dennybritz (Owner) commented May 23, 2020

Ref #24

Looked into dgraph. This PR adds a make-triples command that exports data to RDF format, which can be imported directly into dgraph. Works fine on my local machine. Export takes ~15 sec and importing with the live loader takes ~1 min, so around ~2h total for importing all data. Probably a lot less if we use the batch loader, which I haven't tried yet.

Overall this seems like a better fit than postgres; the main concern is resource usage. It's heavier than postgres and would probably need at least a dedicated server with 8-16GB of memory or so. I'll just leave this branch here for now.

For prod, we could probably put this onto a preemptible GKE node pool for testing.

@dennybritz changed the title from "dgraph support" to "Migrate to dgraph" on May 26, 2020
@dennybritz (Owner, Author)

Decided to add the dgraph binaries directly into the papergraph Dockerfile to make data inserts easier.

The remaining open question now is how to handle dgraph xidmap files in production. We need to store these somewhere...

@dennybritz (Owner, Author)

Trying to insert data, it seems like we need something like ~300GB (ideally SSD) with the text indices - it's huge compared to postgres. Live inserts also get really slow over time. Need to use the bulk inserter but that one is a pain to use because we need to move data around and start/stop services.

While I think it would be nice to move to dgraph, I feel like it's not worth it for now. Postgres is so simple and fast...

@dennybritz (Owner, Author) commented May 27, 2020

I let data loading run overnight and it ran out of memory. It seems like we either need significantly more RAM or a fully distributed setup that shards the data. Costs would probably be ~$300/mo or so for a reliable setup, which is much more expensive than the simple postgres instance. Not doing this for now.

@jlricon commented Jun 1, 2020

A workaround I found in my latest implementation of the full corpus: dump everything to one huge CSV, then use the \COPY command to load everything (this is the fastest way), and only then add indices. Also, avoid the abstracts; you could just fetch them on demand from semanticscholar in the browser. That gets you most of the functionality (the only issue is that a given user might get rate limited?). This is in postgres; I'll let you know if I get everything working in a UI.
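A minimal sketch of that load path, assuming a hypothetical single-table layout (table and column names here are illustrative, not the actual schema):

-- hypothetical single-table layout; column names are illustrative
create table papers (
    id          text,
    title       text,
    citation_n  integer,
    incitations text[]
);

-- \COPY is a psql client-side command that streams the whole CSV in one pass
\COPY papers FROM 'papers.csv' WITH (FORMAT csv, HEADER true)

-- add constraints/indices only after the bulk load so the copy stays fast
alter table papers add primary key (id);
create index papers_citation_n_idx on papers (citation_n desc nulls last);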

@dennybritz (Owner, Author) commented Jun 1, 2020

Oh yeah, I have no doubt that you can load the full corpus into postgres. But you won't be able to make complex graph queries over multiple hops that complete in a reasonable amount of time with joins. Joins are already slow for large 2-3 hop queries with many citations, even with the proper indices. I think the only long-term solution would be a proper graph database?

Unless you don't care about real-time queries, then postgres may be fine.

@jlricon commented Jun 1, 2020

What are some tests that I can try, to see how long my postgres takes? (i.e. a query that I can run)

@dennybritz (Owner, Author)

Not exactly sure, but off the top of my head: for a paper, find all citing (not cited) papers over 3 hops, and then filter the resulting subgraph to the 1000 most popular ones, or something like that? If the original paper is popular, with tens of thousands of out-citations, you're looking at intermediate results that can be quite large before the projection.
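A rough sketch of that test, assuming a hypothetical normalized citations(citing_id, cited_id) edge table rather than the actual papergraph schema:

-- hypothetical edge table: citations(citing_id, cited_id)
-- collect papers citing the seed over 3 hops, keep the 1000 most cited
-- ('<seed-paper-id>' is a placeholder)
with hop1 as (
    select citing_id from citations where cited_id = '<seed-paper-id>'
), hop2 as (
    select c.citing_id from citations c join hop1 on c.cited_id = hop1.citing_id
), hop3 as (
    select c.citing_id from citations c join hop2 on c.cited_id = hop2.citing_id
)
select p.*
from papers p
where p.id in (select citing_id from hop1
               union select citing_id from hop2
               union select citing_id from hop3)
order by p.citation_n desc nulls last
limit 1000;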

@jlricon commented Jun 1, 2020

So as one test, for the 'hallmarks of aging' paper (~5k citations), the first + second order citations can be retrieved in 1.3 seconds. This is 27k papers. At this point the problem is rendering them, not fetching them. And this includes 0.5 seconds of planning; a stored procedure would be quicker. Though the way I intend my graph thing to be used is expanding nodes one by one: get all citations for X, then you can go through each citation and expand it.

select * from papers where id in (
    select unnest(incitations) from papers where id in (
        select unnest(incitations) from papers where id = '7261469291ba8a9fecf4c1f4f577a555fe01a708'
    )
    union
    select unnest(incitations) from papers where id = '7261469291ba8a9fecf4c1f4f577a555fe01a708'
);

(In my schema, I have the citations in an array; everything is a single table)

This one also runs fast enough

explain analyze
select * from papers where id in (
    select unnest(incitations) from papers where id in (
        select unnest(incitations) from papers where id = '7261469291ba8a9fecf4c1f4f577a555fe01a708'
    )
    union
    select unnest(incitations) from papers where id = '7261469291ba8a9fecf4c1f4f577a555fe01a708'
)
order by citation_n desc nulls last
limit 1000;

@dennybritz (Owner, Author) commented Jun 1, 2020

27k total papers seems quite small. There are many papers that have more out-citations than this without any kind of join. Try for example to query the ResNet paper over 3 hops.

> Though the way I intend my graph thing to be used is expanding nodes one by one: Get all citations for X, then you can go through each citation and expand it

I see, for that use case postgres is probably totally fine. I wouldn't worry about a graph database in that case, you're not doing any graph queries after all. Though in that case, why do you even need a database? Semantic Scholar already lists all citations, can't you just query it directly, or just use their interface?

@jlricon commented Jun 2, 2020

> Though in that case, why do you even need a database? Semantic Scholar already lists all citations, can't you just query it directly, or just use their interface?

Because I'd get rate-limited! Also, I can't do text search through their API, or things like 'find all the meta-analyses or reviews that cited this, or that cited any of its relevant citations', etc.
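One hedged way to express such a filter in postgres, assuming a title column and the single-table array schema described above (in practice you'd want an expression GIN index on the tsvector):

-- hypothetical: keep only first-order citing papers whose title looks like a
-- review or meta-analysis ('<seed-paper-id>' is a placeholder)
select *
from papers
where id in (select unnest(incitations)
             from papers
             where id = '<seed-paper-id>')
  and (to_tsvector('english', title) @@ plainto_tsquery('english', 'review')
       or to_tsvector('english', title) @@ plainto_tsquery('english', 'meta-analysis'));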

For the ResNet paper, three hops with the top 1000 papers takes 18 seconds; the query is:

with first_selec as (
    select unnest(incitations) as inc from papers where id = '2c03df8b48bf3fa39054345bafabfeff15bfd11d'
), second_selec as (
    select unnest(incitations) as inc from papers where id in (select inc from first_selec)
), third_selec as (
    select unnest(incitations) as inc from papers where id in (select inc from second_selec)
), all_ids as (
    select inc from first_selec
    union all select * from second_selec
    union all select * from third_selec
)
select * from papers where id in (select inc from all_ids)
order by citation_n desc nulls last
limit 1000;

I've been trying to turn on parallel queries, so far without success...
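For reference, a sketch of the session settings that usually decide whether the planner parallelizes a query; the values are illustrative and whether they help depends on the actual plan:

-- session-level settings that influence parallel plan choice (illustrative values)
set max_parallel_workers_per_gather = 4;
set parallel_setup_cost = 100;
set parallel_tuple_cost = 0.01;
set min_parallel_table_scan_size = '8MB';

-- then look for Gather / Parallel Seq Scan nodes in the plan
explain analyze
select * from papers order by citation_n desc nulls last limit 1000;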
