
how to use with sparse/query-only data sources? #53

Open
derhuerst opened this issue Oct 3, 2019 · 22 comments

@derhuerst

I want to build Linked Connections endpoints wrapping sparse data sources, which I need to query for connections on demand. This means that:

  • I need to be able to decide how to fetch the sparse data.
  • There is nothing stored as files.
  • The Linked Connections server should only make me fetch whatever is needed to answer the client's query.

How would this work with linked-connections-server?

@julianrojas87
Collaborator

julianrojas87 commented Oct 4, 2019

Hi @derhuerst, could you please elaborate (maybe through an example) on what kind of queries you would like to support?

In order to increase scalability, the Linked Connections (LC) server interface has been designed as a simplistic API that only provides documents containing a set of connections ordered by departureTime. For this, the only query parameter supported by the server is departureTime, which the server uses to respond with the document that covers the provided date-time:

https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:25:00.000Z

On top of that, the server also adds to each LC document some metadata for clients to discover more documents, namely the previous and next documents. For this, it uses hydra (a hypermedia vocabulary):

...
"hydra:next": "https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:43:00.000Z",
"hydra:previous": "https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:06:00.000Z",
...
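
To make the paging concrete, here is a minimal client sketch that follows the hydra:next links. This is my own illustration, not part of the server: it assumes Node.js 18+ (for the global fetch), and that a page's connections sit under @graph, as in the spec's examples.

// a minimal sketch: page through LC documents by following hydra:next links
const followConnectionPages = async (startUrl, maxPages = 3) => {
  let url = startUrl
  for (let i = 0; i < maxPages && url; i++) {
    const res = await fetch(url, {headers: {accept: 'application/ld+json'}})
    if (!res.ok) throw new Error(url + ': HTTP ' + res.status)
    const page = await res.json()
    // assumption: the page's connections are in @graph, as in the spec's examples
    for (const conn of page['@graph'] || []) console.log(conn['@id'])
    url = page['hydra:next'] // undefined on the last page
  }
}

followConnectionPages('https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:25:00.000Z')
  .catch(console.error)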

The LC server was designed mainly for route planning purposes, with the Connection Scan Algorithm in mind. We follow the idea behind Linked Data Fragments, where a compromise on the workload between servers and clients may lead to more scalable servers and more flexible data access. However, our main interest is to investigate the trade-offs of different Web APIs, so I am certainly interested in the use case you want to support.

@pietercolpaert
Member

@derhuerst What do you mean by sparse data exactly? A concrete example would help. This repository is mainly intended to host a Linked Connections server from GTFS and GTFS-RT files. You can, however, also host a Linked Connections-compliant API built in a totally different way.

@derhuerst
Author

derhuerst commented Oct 4, 2019

Thanks for your explanations.

I want to build a Linked Connections endpoint on top of a sparse data source, e.g. an API where you can fetch departures/connections. This allows me to have experimental support for public transportation networks that don't publish GTFS and/or GTFS-RT feeds.

I am aware that this is terribly inefficient (as the API response time will usually be an order of magnitude higher than file/DB access) and wasteful (as one would often need to fetch a whole lot more information than just the connections and throw most of it away), but as an experiment, I'm interested nonetheless.

What I essentially ask for is splitting the Linked Connections server from the data retrieval logic. This has several benefits in addition to support for sparse data sources:

  • I can decide which storage layer to use, e.g. AWS S3 or IndexedDB in the browser. This makes the library work in a lot more environments and use cases.
  • The smaller the individual repos/projects become, the easier it is to innovate on them: ports to other languages, alternative implementations, etc. The Protocol Labs projects are a good example.

@pietercolpaert
Member

I want to build a Linked Connections endpoint on top of a sparse data source, e.g. an API where you can fetch departures/connections. This allows me to have experimental support for public transportation networks that don't publish GTFS and/or GTFS-RT feeds.

I’ve been wondering for a long time whether that would be possible with e.g., HAFAS API responses, but each time I would bump into too many HTTP requests behind a Linked Connections page, as you need to do the matching between the departure and the arrival at the next station. As an experiment it might indeed be interesting nevertheless.
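
For illustration, a rough sketch of that matching step, assuming a hafas-client-style API (the departures() and trip() signatures vary between versions, and the field names are simplified here). Note the one extra trip() request per departure, which is where the request count explodes:

// derive lc:Connection-like objects from a departure board
const connectionsFromDepartures = async (hafas, stopId, when) => {
  const deps = await hafas.departures(stopId, {when, duration: 10})
  return Promise.all(deps.map(async (dep) => {
    // one additional HTTP request *per departure*, just to find
    // the arrival at the trip's next stop
    const trip = await hafas.trip(dep.tripId, dep.line && dep.line.name)
    const i = trip.stopovers.findIndex(st => st.stop.id === stopId)
    const next = i >= 0 ? trip.stopovers[i + 1] : null
    if (!next) return null // last stop of the trip, no onward connection
    return {
      departureStop: stopId,
      departure: dep.when,
      arrivalStop: next.stop.id,
      arrival: next.arrival,
    }
  }))
}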

What I essentially ask for is splitting the Linked Connections server from the data retrieval logic. This has several benefits in addition to support for sparse data sources:

The HTTP view code is quite small though, as @julianrojas87 pointed out above. While we agree that in the future we might support other data sources (most promising: real back-ends from PTOs), we currently have not been able to identify another data source that could be workable.

Would mirroring our HTTP output work for you at this moment? The spec is pretty small: https://linkedconnections.org/specification/1-0

@derhuerst
Author

derhuerst commented Oct 6, 2019

I've written a HAFAS-based prototype at https://github.com/derhuerst/hafas-linked-connections-server .

Could one of you have a look at whether the initial direction makes sense? I tried to run the lc-client CLI against it, but it seems to be stuck in a loop. If you have any requests or comments, please create an issue over there.

@pietercolpaert
Member

@derhuerst That’s really cool, and it does not run too slowly either! I'm really enthusiastic about this.

lc-client hasn’t been further developed for a while (it was the initial prototype). I’ve updated the repo to reflect this. We are however heavily developing Planner.js.

@julianrojas87 @hdelva can we set up, on a test server, a browser build where you can type in your LC server (defaults to localhost:3000/connections) and it automatically calculates a route from stop A to stop B? I’d say: no prefetching, and only transfers based on the same stop ID (no downloading of routable tiles).

@derhuerst Something lacking is the list of stops and their geo coordinates (indeed not part of the spec, but necessary if we want to visualize it). I’ll open some issues with ideas on your repo!

@derhuerst
Author

can we set up [...] a browser build [...] where it automatically calculates a route from stop A to stop B?

Also keep in mind that I need to be able to pick arbitrary locations by myself in order to test this out with my HAFAS-based implementation.

@pietercolpaert
Member

Of course! That’s the reason I opened derhuerst/hafas-linked-connections-server#1

@julianrojas87
Collaborator

Sorry for the inactivity on this issue. Lots of work travel combined with some holidays now, but I will come back in a couple of weeks to complete the implementations.

@derhuerst
Author

Since the posts above, I have built gtfs-via-postgres, yet another tool to import GTFS data into a database. It also adds a connections view, which AFAIK is semantically very close to a list of lc:Connections; it allows keeping all connections stored in the GTFS-style compacted form (not "time-expanded") with reasonably fast queries.
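
For illustration, a sketch of how a LC server could read one page from such a view using node-postgres. The view and column names (connections, t_departure, etc.) are assumptions for the sake of the example, not necessarily gtfs-via-postgres' actual schema:

import pg from 'pg'

const db = new pg.Pool() // connection settings come from the PG* environment variables

// hypothetical view & column names; check gtfs-via-postgres' docs for the actual schema
const getConnectionsPage = async (departedAfter, limit = 100) => {
  const {rows} = await db.query(`
    SELECT trip_id, from_stop_id, t_departure, to_stop_id, t_arrival
    FROM connections
    WHERE t_departure >= $1
    ORDER BY t_departure ASC
    LIMIT $2
  `, [departedAfter, limit])
  return rows
}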

I now want to build a LC server that uses gtfs-via-postgres underneath, which is why I'm coming back to this thread: I think it would be worth it to isolate the HTTP server & linked data logic (paths, headers, content negotiation, geographic area, …) from the data storage logic, and to expose it or at least make it re-usable.

In my case, I don't need much of the complexity and dependencies in linked-connections-server, because I have already downloaded, unzipped and parsed the GTFS, and don't consume a GTFS-RT feed (yet!).

What do you think?

@julianrojas87
Collaborator

What you propose totally makes sense. The only reason it is all bundled together is the convenience of having one command that does everything, and the fact that we were not too aware of Docker back then.

I guess we would need to define a common interface to read the lc:Connection pages in the same way from gtfs-via-postgres and from disk.

@derhuerst
Author

I guess we would need to define a common interface to read the lc:Connection pages in the same way from gtfs-via-postgres and from disk.

Yeah, something like abstract-blob-store (a bit less sophisticated maybe) for Linked Connections!
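
To sketch what I mean (all names here are made up for illustration): any backend, whether files on disk, PostgreSQL, S3 or a HAFAS API, would implement the same tiny read interface, and the HTTP layer would only ever talk to that. A trivial in-memory variant:

// a hypothetical "abstract LC store": every backend implements getPage(),
// so the HTTP layer never knows where the data comes from
const createInMemoryLcStore = (pages) => ({
  // `pages`: array of {id, connections, nextId, previousId},
  // sorted by the departure time their `id` encodes
  async getPage (departureTime) {
    // find the page covering the given departure time
    return pages.find((page, i) => {
      const next = pages[i + 1]
      return page.id <= departureTime && (!next || next.id > departureTime)
    }) || null
  },
})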

@derhuerst
Author

I'll go ahead and try to come up with such an API in a separate repo, and submit a PR once I've reached something I'm happy with.

@julianrojas87
Collaborator

Yeah, something like abstract-blob-store (a bit less sophisticated maybe) for Linked Connections!

Yes indeed, I was thinking the same. I had in mind something like abstract-leveldown.

I'll go ahead and try to come up with such an API in a separate repo, and submit a PR once I've reached something I'm happy with.

That sounds great! Thanks for taking it up. I will try to find some time to also start splitting the server into two different modules. But I guess I'll wait for your proposal on the abstract interface before wrapping the data storage half in it.

@derhuerst
Author

derhuerst commented Apr 6, 2021

I'll go ahead and try to come up with such an API in a separate repo, and submit a PR once I've reached something I'm happy with.

That sounds great! Thanks for taking it up. I will try to find some time to also start splitting the server into two different modules. But I guess I'll wait for your proposal on the abstract interface before wrapping the data storage half in it.

Yeah, most of the work on my proof-of-concept implementation will be transforming the HTTP/server logic to be data-source-agnostic, so working in parallel would mean a lot of duplicated work. If you're fine with that, I'll propose both an API and an express-based implementation.

@julianrojas87
Collaborator

Sounds good to me. Please go ahead and I'll jump in once we have your proposal to avoid duplicated work.

@derhuerst
Author

derhuerst commented Jul 10, 2022

Looks like I never gave an update, so I'll do that now, even though I didn't work on the Linked Connections side of things.

Since the posts above, I have built gtfs-via-postgres, yet another tool to import GTFS data into a database. It also adds a connections view, which AFAIK is semantically very close to a list of lc:Connections; it allows keeping all connections stored in the GTFS-style compacted form (not "time-expanded") with reasonably fast queries.

I have tweaked gtfs-via-postgres and use it for several performance-sensitive use cases where I access the GTFS data in a similar fashion (focusing on arrivals/departures instead of connections, but they're very similar storage-wise). It allows me to keep the GTFS in a relatively compact shape (roughly 4x the CSV size, e.g. 12 GB for the 2.8 GB Germany-wide GTFS feed) while allowing fast data access & analysis (see gtfs-via-postgres' benchmarks).

gtfs-via-postgres's connections view is quite fast if you filter by stop, station or route, but it currently is not optimised for returning connections by date+time across all stops/routes (~7s for each access).

I now want to build a LC server that uses gtfs-via-postgres underneath, which is why I'm coming back to this thread: I think it would be worth it to isolate the HTTP server & linked data logic (paths, headers, content negotiation, geographic area, …) from the data storage logic, and to expose it or at least make it re-usable.

About a year ago, I built this as gtfs-linked-connections-server. I have not extracted the HTTP Linked Connections layer from it into a separate lib, but I have created derhuerst/gtfs-linked-connections-server#1 as a tracking issue.

@derhuerst
Author

gtfs-via-postgres's connections view is quite fast if you filter by stop, station or route, but it currently is not optimised for returning connections by date+time across all stops/routes (~7s for each access).

The connections view is still not optimised: Upon querying /connections?lc:departureTime, PostgreSQL will compute all connections in the dataset after the specified departure time, order them, and then return the specified number of connections (same with lc:arrivalTime in the other direction). Not sure how to optimise this while retaining the correct DST behaviour.

About a year ago, I built this as gtfs-linked-connections-server. I have not extracted the HTTP Linked Connections layer from it into a separate lib, but I have created derhuerst/gtfs-linked-connections-server#1 as a tracking issue.

gtfs-linked-connections-server now supports /connections?{lc:departureTime,lc:arrivalTime}, /connections/:id, /stops?{before,after} & /stops/:id.

I'm not sure if I got the TREE stuff right, and I haven't tried consuming it with a linked-data-aware client yet. I still think this should be handled by a generic TREE server lib, where you would pass in metadata as well as data-retrieval functions.
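
To sketch what such a generic lib's API could look like (everything here, including createTreeHandler and the exact shape of the tree: keys, is made up for illustration): you would pass in collection metadata and a data-retrieval function, and get an express-compatible handler back.

import express from 'express'

// hypothetical API of such a lib: metadata + data-retrieval function in,
// express handler out
const createTreeHandler = ({collectionId, getPage}) => async (req, res) => {
  const page = await getPage(req.query) // the backend decides how to fetch data
  res.type('application/ld+json').json({
    '@id': collectionId,
    // e.g. a tree:GreaterThanRelation pointing at the next page
    'tree:relation': page.relations,
    '@graph': page.members,
  })
}

const app = express()
app.get('/connections', createTreeHandler({
  collectionId: 'https://example.org/connections',
  // plug in any data source here, e.g. the gtfs-via-postgres query above
  getPage: async (query) => ({members: [], relations: []}),
}))
app.listen(3000)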

A random, only somewhat related thought: I don't know Rust very well, but it seems like this generic TREE server lib would fit Rust's trait model very well, given that any other code from any unrelated domain could still easily adopt the TREE HTTP semantics.

@pietercolpaert
Member

@derhuerst Do you want us to validate it somehow and test it with an RDF library?

@derhuerst
Author

Do you want us to validate it somehow and test it with an RDF library?

That would be a great contribution, yes!

We could also build the aforementioned generic TREE HTTP server; I think it would make both linked-connections-server and gtfs-linked-connections-server more focused.

@pietercolpaert
Member

Can you link me up with either an HTTP server that’s publicly reachable, or with set-up instructions so I can run such an HTTP server locally?

@derhuerst
Author

mkdir gtfs-lc-test
cd gtfs-lc-test

# download GTFS
wget --compression auto -r --no-parent --no-directories -R .csv.gz -P vbb-gtfs -N 'https://vbb-gtfs.jannisr.de/2022-09-09/'
rm vbb-gtfs/shapes.csv

# import GTFS
env PGDATABASE=postgres psql -c 'create database vbb_2022_09_09'
export PGDATABASE=vbb_2022_09_09
# sponge (from moreutils) buffers the whole SQL dump before piping it into psql
npx --package=gtfs-via-postgres@4 -- gtfs-to-sql --require-dependencies --trips-without-shape-id --stops-location-index -- vbb-gtfs/*.csv | sponge | psql -b

# serve LC server
npx derhuerst/gtfs-linked-connections-server#1.2.1
