Cloud Data Storage #110

grayjones · 2023-02-09T22:03:51Z

I have started looking at options for storing the sensor data in the cloud. From what I've seen so far, it looks like we are currently storing the data with InfluxDB on their free cloud plan. That plan only retains data for 30 days, so we do need to act quickly to avoid losing data. To my mind there are two potential quick actions we could take:

- Go ahead and upgrade to the 'Usage-Based Plan' now to avoid losing data and also get a feel for costs
- Do a backup or data dump of the data in InfluxDB so we can stitch it back together later

Taking a longer-term view, my initial thought is that we need to balance two competing factors when making choices. We need to be cost-aware so that we can pay the bills as we grow, but we also don't want to create a situation where we need a lot of monitoring and system maintenance. We are all volunteers and I doubt anybody wants to take that on (I don't). Maybe as we get more volunteers we will get folks with devops experience who do want to take it on?

I am currently poking around platform.sh trying to get a feel for what services are available to us, as well as trying to understand how the data is currently flowing. Here are my initial questions:

- How do the devices currently send data?
- I believe they are sending data every 5 minutes to a GCP function. What does that data look like?
- Are they authenticating themselves?
- Do we have control over the URL they submit to?

My first take was that this is a streaming data application. On the backend this implies the use of something like Apache Kafka or AWS Kinesis (I don't know the GCP equivalent). The data could then be written to a database and exposed via an API. The stream could have multiple subscribers if we wanted to do any aggregation or monitoring in the future.
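
To illustrate the "multiple subscribers" idea, here is a minimal in-memory sketch. It is not Kafka or Kinesis; the `InMemoryStream` class, the topic name, and the 1000 ppm threshold are all made up for illustration. It only shows how one published reading can fan out to a storage consumer and a monitoring consumer independently.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Reading = Dict[str, float]
Handler = Callable[[Reading], None]


class InMemoryStream:
    """Toy stand-in for a managed stream; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, reading: Reading) -> None:
        # Every subscriber on the topic receives its own copy of the reading.
        for handler in self._subscribers[topic]:
            handler(dict(reading))


stored: List[Reading] = []    # stands in for the database writer
high_co2: List[Reading] = []  # stands in for a monitoring consumer


def store(reading: Reading) -> None:
    stored.append(reading)


def monitor(reading: Reading) -> None:
    # Assumed threshold, chosen only to make the example do something.
    if reading["co2"] > 1000.0:
        high_co2.append(reading)


stream = InMemoryStream()
stream.subscribe("sensor-readings", store)
stream.subscribe("sensor-readings", monitor)
stream.publish("sensor-readings", {"co2": 420.0})
stream.publish("sensor-readings", {"co2": 1500.0})
```

With a managed service the `publish` side would be the device gateway and each handler would be a separate consumer, so adding aggregation later would not require touching the write path.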

Below are two AWS products that might be useful. I don't have experience with either. My experience has been with the underlying tools/technologies that these services most likely use. I would imagine they could get expensive

https://aws.amazon.com/free/iot/
https://aws.amazon.com/timestream/

keenanjohnson · 2023-02-10T01:09:12Z

Maintainer

Thanks for putting some thought into this @grayjones!

Short Term
I agree that it would be better to prevent data loss, but everyone who builds a sensor right now more or less accepts that the data is only stored for 30 days, so it's not that urgent. That said, I think it could be worth enabling the paid plan to get a sense of the real costs we might incur with Influx given our small fleet today.

Answers to Questions

Q: How do the devices currently send data?
- A: V4 devices send data over either an Ethernet, WiFi, or cellular network connection. The data chain is as follows and described poorly in this diagram: Sensor -> Golioth via the CoAP protocol -> Google Cloud Function via a webhook -> InfluxDB via the InfluxDB Python client. Note that V1-V3 sensors ran Linux and sent data to Influx directly. I don't think we should focus on V3 devices for now; we should think about the future, which is V4.
- V4: Device Code
- V3: Device Code
Q: I believe they are sending data every 5 minutes to a GCP function. What does that data look like?
- A: The data is sent to the GCP function as JSON via a webhook. I have attached the GCP function source below.
Q: Are they authenticating themselves?
- A: The latest version of the sensor (V4) is authenticated to golioth.io's services via a pre-shared key which is unique to each device. Note, we need to build out a proper sensor on-boarding flow that creates these keys automatically: github issue.
Q: Do we have control over the URL they submit to?
- A: I think the URL of the Google function is configurable. Stack Overflow reference

import os
from datetime import datetime, timezone

import functions_framework
from influxdb_client import InfluxDBClient, Point, WritePrecision  # type: ignore
from influxdb_client.client.write_api import SYNCHRONOUS  # type: ignore

# Assumption: the original post omitted these definitions. Reading them from
# environment variables (hypothetical names) is one common way to supply them.
INFLUX_TOKEN = os.environ.get("INFLUX_TOKEN", "")
ORG = os.environ.get("INFLUX_ORG", "")
BUCKET = os.environ.get("INFLUX_BUCKET", "")

@functions_framework.http
def send_data_2_influx(request):
    # Golioth delivers each datapoint as the JSON body of the webhook request
    golioth_msg = request.get_json(silent=True)
    ribbit_msg = golioth_msg['data']['ribbitnetwork.datapoint']
    gps_fix = ribbit_msg['gps']['has_fix']

    # Check if the gps fix is valid, only send a datapoint if so
    if gps_fix:
        client = InfluxDBClient(url="https://us-west-2-1.aws.cloud2.influxdata.com/", token=INFLUX_TOKEN, org=ORG)
        write_api = client.write_api(write_options=SYNCHRONOUS)

        # Get metadata
        device_id = golioth_msg['device_id'] + '_golioth_esp32s3'
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        timestamp = datetime.now(timezone.utc)
        
        # Parse Data
        scd30_co2_ppm = ribbit_msg['scd30']['co2']
        scd30_humidity = ribbit_msg['scd30']['humidity']
        scd30_temp = ribbit_msg['scd30']['temperature']

        gps_lat = ribbit_msg['gps']['latitude']
        gps_lon = ribbit_msg['gps']['longitude']
        gps_alt = ribbit_msg['gps']['altitude']
        
        dps310_pressure = ribbit_msg['dps310']['pressure']
        dps310_temp = ribbit_msg['dps310']['temperature']

        point = (
            Point("ghg_point")
            .tag("host", device_id)
            .field("co2", scd30_co2_ppm)
            .time(timestamp, WritePrecision.NS)
            .field("temperature", scd30_temp)
            .field("humidity", scd30_humidity)
            .field("lat", gps_lat)
            .field("lon", gps_lon)
            .field("alt", gps_alt)
            .field("baro_pressure", dps310_pressure)
            .field("baro_temperature", dps310_temp)
        )
        write_api.write(BUCKET, ORG, point)

    # Must send an HTTP return code
    return 'OK'
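
To make the "what does that data look like?" answer more concrete, here is a hypothetical payload containing only the keys the function above reads. The device ID and all values are invented, and the real Golioth message may carry additional fields. The `extract_fields` helper is not part of the deployed function; it just shows how the nested JSON maps onto the InfluxDB field names.

```python
import json

# Hypothetical webhook body: key names are taken from the function above,
# values are made up for illustration.
sample_body = json.loads("""
{
  "device_id": "abc123",
  "data": {
    "ribbitnetwork.datapoint": {
      "gps": {"has_fix": true, "latitude": 47.6, "longitude": -122.3, "altitude": 50.0},
      "scd30": {"co2": 421.5, "humidity": 40.2, "temperature": 21.3},
      "dps310": {"pressure": 101325.0, "temperature": 21.1}
    }
  }
}
""")


def extract_fields(golioth_msg: dict) -> dict:
    """Flatten the nested message into the field names written to InfluxDB."""
    msg = golioth_msg["data"]["ribbitnetwork.datapoint"]
    return {
        "co2": msg["scd30"]["co2"],
        "temperature": msg["scd30"]["temperature"],
        "humidity": msg["scd30"]["humidity"],
        "lat": msg["gps"]["latitude"],
        "lon": msg["gps"]["longitude"],
        "alt": msg["gps"]["altitude"],
        "baro_pressure": msg["dps310"]["pressure"],
        "baro_temperature": msg["dps310"]["temperature"],
    }


fields = extract_fields(sample_body)
```

Keeping the parsing in a small pure function like this would also make it testable without an InfluxDB connection.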

0 replies

beriberikix · 2023-02-13T14:38:57Z

It sounds like the project wants to have a single source of truth for the data, which totally makes sense. Since InfluxDB was used for V1-V3, keeping that the system of record also generally makes sense. The question I haven't seen asked is, "what's the intended use of the data?"

Is it:

- To power the app (ex. real-time data)
- Part of a data platform for research and data scientists
- A simple backup
- Something else?

Golioth will happily store sensor readings reliably and can be used for, say, real-time app data. However, it doesn't replace a full-fledged DB, and most of our users stream that data somewhere else, like a data platform.

On a different note: you could probably simplify the integration by using our native Google Cloud Pub/Sub integration plus InfluxDB's native Pub/Sub Telegraf plugin. Both would be fully managed and potentially more reliable/performant than a custom cloud function. But YMMV, and I've never used the Telegraf integration. :)

5 replies

grayjones Feb 13, 2023
Author

I think the intended use of the data is for the top two items you mentioned. We need to be able to populate the front page map with real-time locations of sensors (and maybe their status). We also need to provide the historical data for an individual sensor upon drill-down. We will also be exposing an API that would allow folks to get historical data for a given sensor.

If we could use Golioth to populate the front page map with real-time data, then that might be a nice option, as that query into InfluxDB is fairly expensive.

I love the potential Pub/Sub Telegraf approach. That would make for an extremely low-maintenance architecture. I need to do some more research on Google Pub/Sub and Telegraf, which are all new to me coming from an AWS world.

keenanjohnson Feb 13, 2023
Maintainer

FYI, I briefly explored Telegraf, but I couldn't find a managed Telegraf instance offered by InfluxDB Cloud or any other service. Setting up a Telegraf instance on a regular host didn't seem like a great choice (and was beyond my skills to do quickly), given the burden on the open-source team of keeping that service highly available.

grayjones Feb 13, 2023
Author

OK, yeah, that's good input. If InfluxDB Cloud is not hosting the Telegraf server process, then I agree that the current custom cloud function is the way to go. Do we have an idea of the costs for running that in GCP? That would need to be added to the InfluxDB costs to get a feel for what the solution costs on a per-sensor basis.

keenanjohnson Feb 14, 2023
Maintainer

So far I have been below any billing thresholds. Do you know if there is a way to estimate it?

grayjones Feb 15, 2023
Author

Here's the link to Google Cloud Functions pricing. It looks like we will be under the free-tier limits for a while yet.

https://cloud.google.com/functions/pricing
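
As a rough way to estimate how long we stay under the free tier, here is a small sketch that counts invocations only. The 5-minute reporting interval comes from this thread; the figure of 2,000,000 free invocations per month is an assumption that should be checked against the pricing page, and the sketch ignores the compute-time and network limits entirely.

```python
# Assumption: free-tier invocation allowance; verify against the pricing page.
FREE_INVOCATIONS_PER_MONTH = 2_000_000
REPORT_INTERVAL_MINUTES = 5  # from the thread: devices report every 5 minutes
DAYS_PER_MONTH = 30          # rounded for back-of-napkin purposes


def invocations_per_device_per_month(
    interval_minutes: int = REPORT_INTERVAL_MINUTES,
    days: int = DAYS_PER_MONTH,
) -> int:
    """One function invocation per report, around the clock."""
    reports_per_hour = 60 // interval_minutes
    return reports_per_hour * 24 * days


def devices_within_free_tier(
    free_invocations: int = FREE_INVOCATIONS_PER_MONTH,
) -> int:
    """Largest fleet whose invocation count still fits the assumed allowance."""
    return free_invocations // invocations_per_device_per_month()


per_device = invocations_per_device_per_month()
max_devices = devices_within_free_tier()
print(f"{per_device} invocations per device per month")
print(f"about {max_devices} devices fit within the assumed free tier")
```

Under these assumptions each device accounts for 8,640 invocations a month, so a fleet in the low hundreds would still fit on invocation count alone.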

grayjones · 2023-02-13T16:03:54Z

Author

I just had a good conversation with Erin, an account manager at InfluxDB. We discussed what pricing would look like if we were to upgrade from the free cloud tier to the paid cloud tier. Doing so would allow us to retain the sensor data for more than 30 days.

This is back-of-napkin math:
If we were on the paid plan, our cost for the past month would have been about $1.00. If we had 10 active units, that would equate to about 10 cents per unit per month.

Some factors would affect costs. One is how long we want to retain the data: if it's indefinite, then the storage costs would increase over time. The number of queries would also increase the costs. Currently it looks like we are paying about 2 cents per query. If the web page were to become very popular, then the number of queries would increase and so would costs.

So for now I am using the 10 cents per month number to help get a feel for the storage costs in a managed cloud environment.
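
The napkin math above can be written down as a tiny sketch. The $1.00 monthly total, the 10 units, and the roughly 2 cents per query come from this comment; the count of 500 extra queries is purely hypothetical, picked to show how query volume could dominate the bill. Amounts are kept in integer cents to avoid floating-point noise.

```python
def cost_per_unit_cents(monthly_total_cents: int, active_units: int) -> float:
    """Spread the monthly bill evenly across the active sensors."""
    return monthly_total_cents / active_units


def query_cost_cents(num_queries: int, cents_per_query: int = 2) -> int:
    """Cost of queries alone, at the approximate rate quoted above."""
    return num_queries * cents_per_query


# Figures from the comment: about $1.00 last month, 10 active units.
per_unit = cost_per_unit_cents(monthly_total_cents=100, active_units=10)

# Hypothetical traffic: 500 additional queries in a month.
extra_queries = query_cost_cents(500)

print(f"{per_unit:.0f} cents per unit per month")
print(f"${extra_queries / 100:.2f} for 500 extra queries")
```

Even this crude model suggests that caching or limiting the expensive front-page query may matter more for cost than the number of sensors does.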

4 replies

keenanjohnson Feb 14, 2023
Maintainer

Ok perfect. If that's what seems like it makes the most sense, we can definitely do that!

grayjones Feb 15, 2023
Author

I do think this makes sense for us right now. This solution won't require much management on our end. We can focus on creating an API to expose the data.

One thing to note is that cloud pricing is always dynamic. It's always a possibility that we get a spike in usage that results in higher fees. So just make sure whatever credit cards are attached to Google and InfluxDB can handle a potentially larger fee than normal for any given month.

keenanjohnson Feb 15, 2023
Maintainer

Good point @grayjones! I know that all of the cloud services can provide billing caps. Can you look into what would be reasonable for us to set up on the Google functions?

grayjones Feb 20, 2023
Author

I did some research on billing caps for Google Cloud Functions. From what I can tell, Google provides billing 'alerts' that can notify us but won't actually suspend the service. I think we can start with a pretty low threshold, maybe $10. We can raise it later if we need to.

I also looked into InfluxDB, and it doesn't appear that they provide a hard cap either. They do have the concept of alerts based on activity, so we could set one up in InfluxDB that would fire if there is an unusual spike.
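
To show what an alert-only budget means in practice, here is a minimal sketch of the threshold logic. It does not call any Google or InfluxDB API; the `crossed_thresholds` function and the 50% / 90% / 100% fractions are assumptions for illustration, and only the $10 budget comes from the comment above.

```python
from typing import List, Sequence


def crossed_thresholds(
    spend_usd: float,
    budget_usd: float = 10.0,  # the starting threshold suggested above
    fractions: Sequence[float] = (0.5, 0.9, 1.0),  # assumed alert points
) -> List[float]:
    """Return the alert fractions that the current spend has reached.

    This only reports; nothing here stops the service, which mirrors the
    'alert but not cap' behavior described above.
    """
    return [f for f in fractions if spend_usd >= budget_usd * f]


quiet_month = crossed_thresholds(4.0)
busy_month = crossed_thresholds(9.5)
over_budget = crossed_thresholds(12.0)
```

Because an alert does not suspend anything, someone still has to act on the notification, so it is worth deciding in advance who receives it.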
