
how to use with sparse/query-only data sources? #53

Open
derhuerst opened this issue Oct 3, 2019 · 22 comments

@derhuerst

I want to build Linked Connections endpoints wrapping sparse data sources, which I need to query for connections on demand. This means that:

  • I need to be able to decide how to fetch the sparse data.
  • There is nothing stored as files.
  • The Linked Connections server should only make me fetch whatever is needed to answer the client's query.

How would this work with linked-connections-server?

@julianrojas87
Collaborator

julianrojas87 commented Oct 4, 2019

Hi @derhuerst, could you please elaborate (maybe through an example) on what kind of queries you would like to support?

In order to increase scalability, the Linked Connections (LC) server interface has been designed as a simplistic API that only provides documents containing a set of connections ordered by departureTime. For this, the only query parameter supported by the server is departureTime, which the server uses to respond with the document that covers the provided date-time:

https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:25:00.000Z

On top of that, the server also adds to each LC document some metadata for clients to discover more documents, namely the previous and next documents. For this, it uses hydra (a hypermedia vocabulary):

...
"hydra:next": "https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:43:00.000Z",
"hydra:previous": "https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:06:00.000Z",
...
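
To make the paging concrete, here is a minimal client sketch that follows the hydra:next links. This is my own illustration, not part of the server: it assumes Node.js 18+ (for the global fetch), and that a page's connections sit under @graph, as in the spec's examples.

// a minimal sketch: page through LC documents by following hydra:next links
const followConnectionPages = async (startUrl, maxPages = 3) => {
  let url = startUrl
  for (let i = 0; i < maxPages && url; i++) {
    const res = await fetch(url, {headers: {accept: 'application/ld+json'}})
    if (!res.ok) throw new Error(url + ': HTTP ' + res.status)
    const page = await res.json()
    // assumption: the page's connections are in @graph, as in the spec's examples
    for (const conn of page['@graph'] || []) console.log(conn['@id'])
    url = page['hydra:next'] // undefined on the last page
  }
}

followConnectionPages('https://graph.irail.be/sncb/connections?departureTime=2019-10-04T09:25:00.000Z')
  .catch(console.error)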

The LC server was designed mainly for route planning purposes, with the Connection Scan Algorithm in mind. We follow the idea behind Linked Data Fragments, where a compromise on the workload between servers and clients may lead to more scalable servers and more flexible data access. However, our main interest is to investigate the trade-offs of different Web APIs, so I am certainly interested in the use case you want to support.

@pietercolpaert
Member

@derhuerst What do you mean by sparse data exactly? A concrete example would help. This repository is mainly intended to host a Linked Connections server from GTFS and GTFS-RT files. You can, however, also host a Linked Connections-compliant API built in a totally different way.

@derhuerst
Author

derhuerst commented Oct 4, 2019

Thanks for your explanations.

I want to build a Linked Connections endpoint on top of a sparse data source, e.g. an API where you can fetch departures/connections. This allows me to have experimental support for public transportation networks that don't publish GTFS and/or GTFS-RT feeds.

I am aware that this is terribly inefficient (as the API response time will usually be an order of magnitude higher than file/DB access) and wasteful (as one would often need to fetch a whole lot more information than just the connections and throw most of it away), but as an experiment, I'm interested nonetheless.

What I essentially ask for is splitting the Linked Connections server from the data retrieval logic. This has several benefits in addition to support for sparse data sources:

  • I can decide which storage layer to use, e.g. AWS S3 or IndexedDB in the browser. This makes the library work in a lot more environments and use cases.
  • The smaller the individual repos/projects become, the easier it is to innovate on them: ports to other languages, alternative implementations, etc. The Protocol Labs projects are a good example.

@pietercolpaert
Member

I want to build a Linked Connections endpoint on top of a sparse data source, e.g. an API where you can fetch departures/connections. This allows me to have experimental support for public transportation networks that don't publish GTFS and/or GTFS-RT feeds.

I’ve been wondering for a long time whether that would be possible with e.g., HAFAS API responses, but each time I would bump into too many HTTP requests behind a Linked Connections page, as you need to do the matching between the departure and the arrival at the next station. As an experiment it might indeed be interesting nevertheless.
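
For illustration, a rough sketch of that matching step, assuming a hafas-client-style API (the departures() and trip() signatures vary between versions, and the field names are simplified here). Note the one extra trip() request per departure, which is where the request count explodes:

// derive lc:Connection-like objects from a departure board
const connectionsFromDepartures = async (hafas, stopId, when) => {
  const deps = await hafas.departures(stopId, {when, duration: 10})
  return Promise.all(deps.map(async (dep) => {
    // one additional HTTP request *per departure*, just to find
    // the arrival at the trip's next stop
    const trip = await hafas.trip(dep.tripId, dep.line && dep.line.name)
    const i = trip.stopovers.findIndex(st => st.stop.id === stopId)
    const next = i >= 0 ? trip.stopovers[i + 1] : null
    if (!next) return null // last stop of the trip, no onward connection
    return {
      departureStop: stopId,
      departure: dep.when,
      arrivalStop: next.stop.id,
      arrival: next.arrival,
    }
  }))
}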

What I essentially ask for is splitting the Linked Connections server from the data retrieval logic. This has several benefits in addition to support for sparse data sources:

The HTTP view code is quite small though, as @julianrojas87 pointed out above. While we agree that in the future we might support other data sources (most promising: real back-ends from PTOs), we currently have not been able to identify another data source that could be workable.

Would mirroring our HTTP output work for you at this moment? The spec is pretty small: https://linkedconnections.org/specification/1-0

@derhuerst
Author

derhuerst commented Oct 6, 2019

I've written a HAFAS-based prototype at https://github.com/derhuerst/hafas-linked-connections-server .

Could one of you have a look at whether the initial direction makes sense? I tried to run the lc-client CLI against it, but it seems to be stuck in a loop. If you have any requests or comments, please create an issue over there.

@pietercolpaert
Member

@derhuerst That’s really cool, and it does not run too slowly either! I'm really enthusiastic about this.

lc-client hasn’t been further developed for a while (it was the initial prototype). I’ve updated the repo to reflect this. We are however heavily developing Planner.js.

@julianrojas87 @hdelva can we set up, on a test server, a browser build where you can type in your LC server (defaults to localhost:3000/connections) and it automatically calculates a route from stop A to stop B? I’d say: no prefetching, and only transfers based on the same stop ID (no downloading of routable tiles).

@derhuerst Something lacking is the list of stops and their geo coordinates (indeed not part of the spec, but necessary if we want to visualize it). I’ll open some issues with ideas on your repo!

@derhuerst
Author

can we set up [...] a browser build [...] where it automatically calculates a route from stop A to stop B?

Also keep in mind that I need to be able to pick arbitrary locations by myself in order to test this out with my HAFAS-based implementation.

@pietercolpaert
Member

Of course! That’s the reason I opened derhuerst/hafas-linked-connections-server#1

@julianrojas87
Collaborator

Sorry for the inactivity on this issue. Lots of work travel combined with some holidays now, but I will come back in a couple of weeks to complete the implementations.

@derhuerst
Author

Since the posts above, I have built gtfs-via-postgres, yet another tool to import GTFS data into a database. It also adds a connections view, which AFAIK is semantically very close to a list of lc:Connections; it allows keeping all connections stored in the GTFS-style compacted form (not "time-expanded") with reasonably fast queries.
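
For illustration, a sketch of how a LC server could read one page from such a view using node-postgres. The view and column names (connections, t_departure, etc.) are assumptions for the sake of the example, not necessarily gtfs-via-postgres' actual schema:

import pg from 'pg'

const db = new pg.Pool() // connection settings come from the PG* environment variables

// hypothetical view & column names; check gtfs-via-postgres' docs for the actual schema
const getConnectionsPage = async (departedAfter, limit = 100) => {
  const {rows} = await db.query(`
    SELECT trip_id, from_stop_id, t_departure, to_stop_id, t_arrival
    FROM connections
    WHERE t_departure >= $1
    ORDER BY t_departure ASC
    LIMIT $2
  `, [departedAfter, limit])
  return rows
}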

I now want to build a LC server that uses gtfs-via-postgres underneath, which is why I'm coming back to this thread: I think it would be worth it to isolate the HTTP server & linked data logic (paths, headers, content negotiation, geographic area, …) from the data storage logic, and to expose it or at least make it re-usable.

In my case, I don't need much of the complexity and dependencies in linked-connections-server, because I have already downloaded, unzipped and parsed the GTFS, and don't consume a GTFS-RT feed (yet!).

What do you think?

@julianrojas87
Collaborator

What you propose totally makes sense. The only reason it is all bundled together is the convenience of having one command that does everything, and the fact that we were not too aware of Docker back then.

I guess we would need to define a common interface to read the lc:Connection pages in the same way from gtfs-via-postgres and from disk.

@derhuerst
Author

I guess we would need to define a common interface to read the lc:Connection pages in the same way from gtfs-via-postgres and from disk.

Yeah, something like abstract-blob-store (a bit less sophisticated maybe) for Linked Connections!
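
To sketch what I mean (all names here are made up for illustration): any backend, whether files on disk, PostgreSQL, S3 or a HAFAS API, would implement the same tiny read interface, and the HTTP layer would only ever talk to that. A trivial in-memory variant:

// a hypothetical "abstract LC store": every backend implements getPage(),
// so the HTTP layer never knows where the data comes from
const createInMemoryLcStore = (pages) => ({
  // `pages`: array of {id, connections, nextId, previousId},
  // sorted by the departure time their `id` encodes
  async getPage (departureTime) {
    // find the page covering the given departure time
    return pages.find((page, i) => {
      const next = pages[i + 1]
      return page.id <= departureTime && (!next || next.id > departureTime)
    }) || null
  },
})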

@derhuerst
Author

I'll go ahead and try to come up with such an API in a separate repo, and submit a PR once I've reached something I'm happy with.

@julianrojas87
Collaborator

Yeah, something like abstract-blob-store (a bit less sophisticated maybe) for Linked Connections!

Yes indeed, I was thinking the same. I had in mind something like abstract-leveldown.

I'll go ahead and try to come up with such an API in a separate repo, and submit a PR once I've reached something I'm happy with.

That sounds great! Thanks for taking it up. I will try to find some time to also start splitting the server into two different modules. But I guess I'll wait for your proposal on the abstract interface before wrapping the data storage half in it.

@derhuerst
Author

derhuerst commented Apr 6, 2021

I'll go ahead and try to come up with such an API in a separate repo, and submit a PR once I've reached something I'm happy with.

That sounds great! Thanks for taking it up. I will try to find some time to also start splitting the server into two different modules. But I guess I'll wait for your proposal on the abstract interface before wrapping the data storage half in it.

Yeah, most of the work on my proof-of-concept implementation will be transforming the HTTP/server logic to be data-source-agnostic, so working in parallel would mean a lot of duplicated work. If you're fine with that, I'll propose both an API and an express-based implementation.

@julianrojas87
Collaborator

Sounds good to me. Please go ahead and I'll jump in once we have your proposal to avoid duplicated work.

@derhuerst
Author

derhuerst commented Jul 10, 2022

Looks like I never gave an update, so I'll do that now, even though I didn't work on the Linked Connections side of things.

Since the posts above, I have built gtfs-via-postgres, yet another tool to import GTFS data into a database. It also adds a connections view, which AFAIK is semantically very close to a list of lc:Connections; it allows keeping all connections stored in the GTFS-style compacted form (not "time-expanded") with reasonably fast queries.

I have tweaked gtfs-via-postgres and use it for several performance-sensitive use cases where I access the GTFS data in a similar fashion (focusing on arrivals/departures instead of connections, but they're very similar storage-wise). It allows me to keep the GTFS in a relatively compact shape (roughly 4x the CSV size, e.g. 12 GB for the 2.8 GB Germany-wide GTFS feed) while allowing fast data access & analysis (see gtfs-via-postgres' benchmarks).

gtfs-via-postgres's connections view is quite fast if you filter by stop, station or route, but it currently is not optimised for returning connections by date+time across all stops/routes (~7s for each access).

I now want to build a LC server that uses gtfs-via-postgres underneath, which is why I'm coming back to this thread: I think it would be worth it to isolate the HTTP server & linked data logic (paths, headers, content negotiation, geographic area, …) from the data storage logic, and to expose it or at least make it re-usable.

About a year ago, I built this as gtfs-linked-connections-server. I have not extracted the HTTP Linked Connections layer from it into a separate lib, but I have created derhuerst/gtfs-linked-connections-server#1 as a tracking issue.

@derhuerst
Author

gtfs-via-postgres's connections view is quite fast if you filter by stop, station or route, but it currently is not optimised for returning connections by date+time across all stops/routes (~7s for each access).

The connections view is still not optimised: Upon querying /connections?lc:departureTime, PostgreSQL will compute all connections in the dataset after the specified departure time, order them, and then return the specified number of connections (same with lc:arrivalTime in the other direction). Not sure how to optimise this while retaining the correct DST behaviour.

About a year ago, I built this as gtfs-linked-connections-server. I have not extracted the HTTP Linked Connections layer from it into a separate lib, but I have created derhuerst/gtfs-linked-connections-server#1 as a tracking issue.

gtfs-linked-connections-server now supports /connections?{lc:departureTime,lc:arrivalTime}, /connections/:id, /stops?{before,after} & /stops/:id.

I'm not sure if I got the TREE stuff right, and I haven't tried consuming it with a linked-data-aware client yet. I still think this should be handled by a generic TREE server lib, where you would pass in metadata as well as data-retrieval functions.
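
To sketch what such a generic lib's API could look like (everything here, including createTreeHandler and the exact shape of the tree: keys, is made up for illustration): you would pass in collection metadata and a data-retrieval function, and get an express-compatible handler back.

import express from 'express'

// hypothetical API of such a lib: metadata + data-retrieval function in,
// express handler out
const createTreeHandler = ({collectionId, getPage}) => async (req, res) => {
  const page = await getPage(req.query) // the backend decides how to fetch data
  res.type('application/ld+json').json({
    '@id': collectionId,
    // e.g. a tree:GreaterThanRelation pointing at the next page
    'tree:relation': page.relations,
    '@graph': page.members,
  })
}

const app = express()
app.get('/connections', createTreeHandler({
  collectionId: 'https://example.org/connections',
  // plug in any data source here, e.g. the gtfs-via-postgres query above
  getPage: async (query) => ({members: [], relations: []}),
}))
app.listen(3000)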

A random, only somewhat related thought: I don't know Rust very well, but it seems like this generic TREE server lib would fit Rust's trait model very well, given that any other code from any unrelated domain could still easily adopt the TREE HTTP semantics.

@pietercolpaert
Member

@derhuerst Do you want us to validate it somehow and test it with an RDF library?

@derhuerst
Author

Do you want us to validate it somehow and test it with an RDF library?

That would be a great contribution, yes!

We could also build the aforementioned generic TREE HTTP server; I think it would make both linked-connections-server and gtfs-linked-connections-server more focused.

@pietercolpaert
Member

Can you link me up with either an HTTP server that’s publicly reachable, or with set-up instructions so I can run such an HTTP server locally?

@derhuerst
Author

mkdir gtfs-lc-test
cd gtfs-lc-test

# download GTFS
wget --compression auto -r --no-parent --no-directories -R .csv.gz -P vbb-gtfs -N 'https://vbb-gtfs.jannisr.de/2022-09-09/'
rm vbb-gtfs/shapes.csv

# import GTFS
env PGDATABASE=postgres psql -c 'create database vbb_2022_09_09'
export PGDATABASE=vbb_2022_09_09
# sponge (from moreutils) buffers the whole SQL dump before piping it into psql
npx --package=gtfs-via-postgres@4 -- gtfs-to-sql --require-dependencies --trips-without-shape-id --stops-location-index -- vbb-gtfs/*.csv | sponge | psql -b

# serve LC server
npx derhuerst/gtfs-linked-connections-server#1.2.1
