Local SPARQL endpoint (Fuseki / Qlever) issues #98

Open
coret opened this issue Jul 1, 2024 · 5 comments
@coret

coret commented Jul 1, 2024

I have loaded all NA photo collection N-Triples (including the 3 GB test file "7") into my local (production) GraphDB, and LD Workbench works great with it.

The same iterator/generator doesn't work when I use the endpoint https://service.archief.nl/sparql. It fails with: The Generator did not run successfully, it could not get the results from : Invalid SPARQL endpoint response from https://service.archief.nl/sparql (HTTP status 400)).

I thought this would also be a good moment to test out QLever. However, LD Workbench generates a SPARQL query that QLever can't handle (yet), and I don't see how to change that behaviour in LD Workbench.

2024-06-27 12:00:06.842 - ERROR: Invalid SPARQL query: This parser currently doesn't support COUNT(*), please specify an explicit expression for the COUNT
2024-06-27 12:00:06.842 - ERROR: SELECT (COUNT(*) AS ?count) WHERE {
  SELECT ?this WHERE { ?this <http://www.openarchives.org/ore/terms/isAggregatedBy> <https://archief.nl/doc/2.10.62ntfoto>. }
  LIMIT 10
}
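
QLever's error message asks for an explicit expression in the COUNT. As a sketch only: for this particular query, counting the projected variable gives the same number as COUNT(*), because ?this is bound in every row of the subquery. The helper name buildCountQuery is hypothetical and is not part of LD Workbench, Comunica or QLever; it only illustrates the shape of a query that avoids COUNT(*).

```typescript
// Hypothetical helper (an assumption for illustration, not an LD Workbench or Comunica API).
// Wraps a SELECT subquery in a count that names an explicit variable instead of COUNT(*).
function buildCountQuery(subSelect: string, variable: string): string {
  // Accept plain variable names only, so the generated SPARQL stays well-formed.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(variable)) {
    throw new Error(`Not a valid SPARQL variable name: ${variable}`);
  }
  return `SELECT (COUNT(?${variable}) AS ?count) WHERE {\n  ${subSelect}\n}`;
}

// The subquery from the QLever error log above.
const subSelect =
  'SELECT ?this WHERE { ?this <http://www.openarchives.org/ore/terms/isAggregatedBy> <https://archief.nl/doc/2.10.62ntfoto>. } LIMIT 10';

console.log(buildCountQuery(subSelect, 'this'));
```

Whether the component that generates the count query can be configured to emit this form is a separate question that this sketch does not answer.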

I've also tested Fuseki (v5) with the same generator/iterator as in the cases above, but that run ends with an out-of-memory error in LD Workbench. Adding a batchSize doesn't help.

$ ./apache-jena-fuseki-5.0.0/fuseki-server --tdb2 --loc data/NA /na-fotocollectie
 
17:59:37 INFO  Server          :: Running in read-only mode for /na-fotocollectie
17:59:37 INFO  Server          :: Apache Jena Fuseki 5.0.0
17:59:37 WARN  ServletContextHandler :: BaseResource file:///home/http/fuseki.coret.org/./apache-jena-fuseki-5.0.0/webapp/ is aliased to file:///home/http/fuseki.coret.org/apache-jena-fuseki-5.0.0/webapp/ in oeje10w.WebAppContext@45394b31{org.apache.jena.fuseki.Servlet,/,b=file:///home/http/fuseki.coret.org/./apache-jena-fuseki-5.0.0/webapp/,a=STOPPED,h=oeje10s.SessionHandler@1ec7d8b3{STOPPED}}. May not be supported in future releases.
17:59:37 WARN  ContextHandler  :: Base Resource should not be an alias
17:59:37 INFO  Config          :: FUSEKI_HOME=/home/http/fuseki.coret.org/./apache-jena-fuseki-5.0.0
17:59:37 INFO  Config          :: FUSEKI_BASE=/home/http/fuseki.coret.org/run
17:59:37 INFO  Config          :: Shiro file: file:///home/http/fuseki.coret.org/run/shiro.ini
17:59:37 INFO  Config          :: Template file: templates/config-tdb2-dir-readonly
17:59:38 INFO  Server          :: Database: TDB2 dataset: location=data/NA
17:59:38 INFO  Server          :: Path = /na-fotocollectie
17:59:38 INFO  Server          ::   Memory: 4,0 GiB
17:59:38 INFO  Server          ::   Java:   17.0.11
17:59:38 INFO  Server          ::   OS:     Linux 6.1.0-13-amd64 amd64
17:59:38 INFO  Server          ::   PID:    2022356
17:59:38 INFO  Server          :: Started 2024/07/01 17:59:38 CEST on port 3030

$ npx @netwerk-digitaal-erfgoed/ld-workbench@latest -p "NA fotocollectie (via SPARQL)" -s 7-000
Welcome to LD Workbench version 2.4.2
▶ Starting pipeline “NA fotocollectie (via SPARQL)”
✔ Validating pipeline
⠼ Loading results from iterator
<--- Last few GCs --->

[2024610:0x62539b0]    62164 ms: Mark-sweep 4049.4 (4139.6) -> 4039.6 (4142.6) MB, 1293.3 / 0.0 ms  (average mu = 0.095, current mu = 0.012) task scavenge might not succeed
[2024610:0x62539b0]    64113 ms: Mark-sweep 4052.4 (4142.6) -> 4042.7 (4145.8) MB, 1933.1 / 0.0 ms  (average mu = 0.045, current mu = 0.008) task scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb09980 node::Abort() [node]
 2: 0xa1c235 node::FatalError(char const*, char const*) [node]
 3: 0xcf784e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xcf7bc7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xeaf465  [node]
 6: 0xebf12d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 7: 0xf222f4 v8::internal::ScavengeJob::Task::RunInternal() [node]
 8: 0xdb59db non-virtual thunk to v8::internal::CancelableTask::Run() [node]
 9: 0xb77524 node::PerIsolatePlatformData::RunForegroundTask(std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) [node]
10: 0xb79389 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [node]
11: 0x15633c6  [node]
12: 0x1575af4  [node]
13: 0x1563d18 uv_run [node]
14: 0xa43dd5 node::SpinEventLoop(node::Environment*) [node]
15: 0xb4bab6 node::NodeMainInstance::Run(node::EnvSerializeInfo const*) [node]
16: 0xacd3f2 node::Start(int, char**) [node]
17: 0x7fb7fe84624a  [/lib/x86_64-linux-gnu/libc.so.6]
18: 0x7fb7fe846305 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
19: 0xa4076c  [node]
Aborted

Part of my LD Workbench configuration:

  - name: "7-000"
    iterator:
      query: "SELECT * WHERE { ?this <http://www.openarchives.org/ore/terms/isAggregatedBy> <https://archief.nl/doc/7.000spaondntfoto> }"
      #endpoint: https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie
      endpoint: https://fuseki.coret.org/na-fotocollectie/
      #endpoint: https://service.archief.nl/sparql
      batchSize: 50
    generator:
      - query: file://generator.rq
        batchSize: 50
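
For context on batchSize: the LIMIT 10 in the QLever log above suggests that the iterator query is run in pages. A minimal sketch of that kind of paging follows, under the assumption that batchSize maps to LIMIT and that pages are advanced with OFFSET. The function pagedQueries is hypothetical and is not LD Workbench code; it only illustrates the technique and does not by itself explain the out-of-memory error.

```typescript
// Hypothetical sketch of LIMIT/OFFSET paging; the real LD Workbench implementation may differ.
function pagedQueries(baseQuery: string, batchSize: number, totalRows: number): string[] {
  if (batchSize <= 0 || Math.floor(batchSize) !== batchSize) {
    throw new Error('batchSize must be a positive integer');
  }
  const queries: string[] = [];
  // One query per page, each covering rows [offset, offset + batchSize).
  for (let offset = 0; offset < totalRows; offset += batchSize) {
    queries.push(`${baseQuery} LIMIT ${batchSize} OFFSET ${offset}`);
  }
  return queries;
}

// The iterator query from the configuration above.
const iteratorQuery =
  'SELECT * WHERE { ?this <http://www.openarchives.org/ore/terms/isAggregatedBy> <https://archief.nl/doc/7.000spaondntfoto> }';

// The total of 120 rows is an assumed value, used only to show the paging.
for (const query of pagedQueries(iteratorQuery, 50, 120)) {
  console.log(query);
}
```

Without an ORDER BY, SPARQL does not guarantee a stable row order between pages, so paging like this relies on the endpoint returning rows in a consistent order.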
@ddeboer
Member

ddeboer commented Jul 4, 2024

These are different issues combined.

> - query: file://generator.rq

Can you share your generator query? Even better, push your config to the configurations repository and link to the dump file that you’re using.

> But the LD Workbench generates a SPARQL query which Qlever can't handle (yet) and I don't see how to change the LD Workbench behaviour.

LD Workbench does not generate this query itself, so it probably comes from Comunica. Setting up QLever looks like too much work, so I’m not going to reproduce this locally. Compare that with how simple it is to start Oxigraph.

@coret
Author

coret commented Jul 4, 2024

> Can you share your generator query? Even better, push your config to the configurations repository and link to the dump file that you’re using.

See https://www.github.com/netwerk-digitaal-erfgoed/ld-workbench-configuration/tree/main/nafotos-sparql-endpoint for config and https://nde-europeana.ams3.cdn.digitaloceanspaces.com/7-000spaondntfoto.2.zip for a big part of the NA photocollection (for the 7-000 stage).

> Compare this to a simple oxigraph start.

Have not tried Oxigraph yet, will do!

@coret
Author

coret commented Jul 4, 2024

> Have not tried Oxigraph yet, will do!

I have not tried your code in PR 99, but I did try to start Oxigraph and import the 2.8 GB N-Triples file directly:

$ docker run --rm -v ./data:/data -p 7878:7878 oxigraph/oxigraph --location /data serve --bind 0.0.0.0:7878
$ curl -f -X POST http://localhost:7878/store?default -H 'Content-Type:application/n-triples' --data-binary "@7-000spaondntfoto.3.nt"
curl: option --data-binary: out of memory
curl: try 'curl --help' or 'curl --manual' for more information

I hope your src/import.ts won't be bothered by the big file size.

@ddeboer
Member

ddeboer commented Jul 4, 2024

> Have not tried Oxigraph yet, will do!
>
> I have not tried your code in PR 99, but I did try to start Oxigraph and import the 2.8 GB N-Triples file directly:
>
> $ docker run --rm -v ./data:/data -p 7878:7878 oxigraph/oxigraph --location /data serve --bind 0.0.0.0:7878
> $ curl -f -X POST http://localhost:7878/store?default -H 'Content-Type:application/n-triples' --data-binary "@7-000spaondntfoto.3.nt"
> curl: option --data-binary: out of memory
> curl: try 'curl --help' or 'curl --manual' for more information
>
> I hope your src/import.ts won't be bothered by the big file size.

Use curl -T instead: it streams the file rather than loading the whole file into memory. The new import feature is streaming as well. I'm just not sure yet about the best YAML config conventions for it.
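
The difference between loading and streaming can be sketched in TypeScript, the language of the src/import.ts mentioned above. This is an illustration of streaming in general, not the code from PR 99: streamFileTo is a hypothetical name, and the URL in the usage comment assumes the Oxigraph command shown earlier in this thread. It uses only Node.js built-ins.

```typescript
import { createReadStream } from 'node:fs';
import { stat } from 'node:fs/promises';
import { request } from 'node:http';
import { pipeline } from 'node:stream/promises';

// Hypothetical helper (not part of LD Workbench or Oxigraph): POSTs a file to an
// HTTP endpoint by streaming it in chunks, so the whole file is never held in memory.
async function streamFileTo(url: string, filePath: string, contentType: string): Promise<number> {
  const { size } = await stat(filePath);
  const req = request(url, {
    method: 'POST',
    headers: { 'Content-Type': contentType, 'Content-Length': size },
  });
  // Resolves with the HTTP status code once the response has been fully read.
  const status = new Promise<number>((resolve, reject) => {
    req.on('response', (res) => {
      res.resume(); // discard the response body
      res.on('end', () => resolve(res.statusCode ?? 0));
      res.on('error', reject);
    });
    req.on('error', reject);
  });
  // pipeline() copies the file into the request chunk by chunk and then ends the request.
  const [, code] = await Promise.all([pipeline(createReadStream(filePath), req), status]);
  return code;
}

// Example call (assumes an Oxigraph server listening on localhost:7878):
// streamFileTo('http://localhost:7878/store?default', '7-000spaondntfoto.3.nt', 'application/n-triples')
//   .then((code) => console.log('HTTP status', code));
```

The read stream hands over one chunk at a time and waits when the socket is busy, so memory use stays roughly constant regardless of the file size.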

@ddeboer
Member

ddeboer commented Jul 5, 2024

Your query returns no results, so I cannot test your pipeline. Please provide a ready-to-go reproducer.
