Example datasets for training (human and machine) - for CF roadmap #355
Replies: 12 comments 18 replies
-
Could this 'collection' of example datasets be searchable?
-
@agstephens it was great to discuss this with you at the workshop last week. It would be good to hear your thoughts on how we proceed for the purpose of the roadmap publication.
-
What are the desirable characteristics of these example datasets? Are we talking idealized examples where everything is perfect (but perhaps without much real data)? Or real-world datasets that may have imperfections but are widely used? Something else?
-
I would say "everything is perfect" examples, that's kind of the point :-) Many of us learn much more by example, so if someone can find an example that's similar to their use case, it will be much easier to figure out what to do.
Yes, I think the actual data isn't the point, so ideally they'd be trimmed down and small...
-
I think it depends on what we are using these datasets for. For human training, everything should be perfect. It could be that synthetic examples (no real data) could be used for this. Large language models like ChatGPT could help us make these more quickly than we could have a few years ago, though this will still take a lot of manual work. Related to this are profiles of CF, e.g. the CF-WMO profiles. These profiles outline explicitly, with fewer degrees of freedom, how to encode certain types of data. Therefore I think there is some overlap with https://github.com/orgs/cf-convention/discussions/358 here.
For machine training, I don't think everything needs to be perfect if we have enough datasets. It could actually be beneficial to include labelled imperfections, so the machine could learn what not to do. Though this hypothesis requires a lot of testing...
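To make the "labelled imperfections" idea a bit more concrete, here is a rough sketch of how one valid set of variable attributes could be turned into a small labelled training set. Everything in it is an assumption for illustration: the helper names, the defect labels, and the choice of defects are invented, and plain dictionaries stand in for real netCDF metadata.

```python
import copy

# A typical CF-style set of variable attributes, used as the "perfect" example.
VALID = {
    "standard_name": "air_temperature",
    "units": "K",
    "long_name": "Air temperature",
}


def drop_units(attrs):
    """Hypothetical defect: remove the units attribute."""
    bad = copy.deepcopy(attrs)
    bad.pop("units", None)
    return bad


def break_standard_name(attrs):
    """Hypothetical defect: replace underscores with spaces in standard_name."""
    bad = copy.deepcopy(attrs)
    bad["standard_name"] = bad["standard_name"].replace("_", " ")
    return bad


# Label -> function that injects the corresponding imperfection.
DEFECTS = {
    "missing_units": drop_units,
    "invalid_standard_name": break_standard_name,
}


def labelled_examples(attrs):
    """Return the valid example plus one labelled variant per defect."""
    examples = [{"attrs": copy.deepcopy(attrs), "label": "valid"}]
    for label, make_defect in DEFECTS.items():
        examples.append({"attrs": make_defect(attrs), "label": label})
    return examples


examples = labelled_examples(VALID)
for example in examples:
    print(example["label"], example["attrs"])
```

Each record pairs the metadata with a label saying what, if anything, is wrong with it, which is the kind of input a "what not to do" experiment would need.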
-
For machine training, in order to gather a large enough data collection (1000+ ???), we may have to use real datasets. We discussed at the workshop that multiple examples from the same collection (e.g. datasets for different days) could be used.
-
If the data itself isn't particularly important, only the metadata (and I think this is generally true), then how about we keep our examples in header-only CDL notation, so that everything is as small as possible? That is, no data arrays would be included (they are the part that can make files large and therefore difficult to store and transfer), and no metadata information would be lost. This can be generated from the 'real' datasets quite easily, using
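As a rough illustration of what "header-only" means, the sketch below strips the data section from a piece of CDL text in plain Python. It is only a sketch under stated assumptions: the helper name `header_only_cdl` and the example CDL are made up, and it assumes a flat file without groups, where the data section begins at a line containing only `data:`. Where the netCDF command-line utilities are installed, `ncdump -h` is one well-known way to print just the header of a real file.

```python
def header_only_cdl(cdl_text: str) -> str:
    """Return CDL text with the data section removed (hypothetical helper)."""
    kept = []
    for line in cdl_text.splitlines():
        # Assumption: the data section starts at a line containing only "data:".
        if line.strip() == "data:":
            break
        kept.append(line)
    # Drop trailing blank lines left over before the data section.
    while kept and not kept[-1].strip():
        kept.pop()
    # Re-close the top-level "netcdf <name> { ... }" block if we cut it open.
    if not kept or kept[-1].strip() != "}":
        kept.append("}")
    return "\n".join(kept) + "\n"


# Made-up example file in CDL notation, including a small data section.
EXAMPLE_CDL = """netcdf example {
dimensions:
    time = 2 ;
variables:
    double time(time) ;
        time:standard_name = "time" ;
        time:units = "days since 2000-01-01" ;

// global attributes:
        :Conventions = "CF-1.11" ;
data:

 time = 0, 1 ;
}
"""

header = header_only_cdl(EXAMPLE_CDL)
print(header)
```

All dimensions, variables and attributes survive, while the values themselves are dropped.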
-
Hi everyone, apologies for the delay in getting to this thread. I originally proposed this idea at the CF Workshop (2024) so let me share some high-level thoughts based on the discussions we had in the brief hackathon, reading your comments above, and some further thinking. I'll try to break it down as follows:
1. What we tried/learnt at the CF Workshop 2024
1.1. Large Language Models and CF
We talked about training datasets for a Large Language Model (LLM) such as ChatGPT, and some of the key points in our discussion were:
1.2. Creating example datasets
Regarding the creation of a high-quality dataset:
2. Some thoughts about how we might progress this work
Having experimented very briefly with prompting an LLM (ChatGPT), it is clear that there is some promise. There is also a danger that we could put a lot of effort into something that becomes obsolete within months or years. I can think of 6 different areas/approaches that we could take forward. It would be great to hear which combination people think is best, and in what order we could tackle them:
2.1. Define what we want from a CF-Agent
A real danger is that effort is put into playing with a "CF-Agent" but there is no clear goal, leading to a vague outcome. We might benefit from defining a set of use cases and/or requirements, such as:
2.2. Create performance metrics and tests
Any work done with ML models requires us to be able to recognise when the AI is doing a good job. We would need to define performance metrics and repeatable tests that could assess the quality of the model/application. The tests would also enable regression testing when updates are made to parts of the system outside of our control.
2.3. Small file set to support the CF documentation
This has been mentioned above in section 1.2. Ideally there would be a set of files to support:
2.4. Large file set to support ML applications
This has been mentioned above in section 1.2. Some of the key considerations would be:
2.5. Try a range of ML approaches
We should consider a hierarchy of approaches, from the quickest/cheapest first:
2.6. Constrain the problem
Many of the above options appear to involve a lot of work. How can we constrain the problem, so that we can try something out quickly? Here are some ideas, please add more:
If we find a good solution to these cut-down problems, we can scale them up. (Apologies this is so long - thanks for staying with it - I look forward to hearing your thoughts and working on it with you :-) )
-
+1 on this -- I was recently looking at the ragged array format for trajectories, and it was really hard to wrap my head around it without an example with data in it.
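For anyone in the same position, here is a minimal sketch of the idea behind the contiguous ragged array representation, using made-up values: the samples of all trajectories are stored back to back in one flat array, and a count variable records how many samples belong to each trajectory. The names `lon`, `row_size` and `split_ragged` are illustrative only, and plain Python lists stand in for netCDF variables.

```python
from itertools import accumulate


def split_ragged(flat_values, row_size):
    """Split a flat sample array into one list per trajectory."""
    if sum(row_size) != len(flat_values):
        raise ValueError("row sizes must add up to the number of samples")
    ends = list(accumulate(row_size))  # exclusive end index of each trajectory
    starts = [0] + ends[:-1]           # start index of each trajectory
    return [flat_values[s:e] for s, e in zip(starts, ends)]


# Three made-up trajectories with 3, 2 and 4 samples, stored back to back.
lon = [10.0, 10.5, 11.0, -3.0, -2.5, 45.0, 45.2, 45.4, 45.6]
row_size = [3, 2, 4]  # count variable: number of samples per trajectory

trajectories = split_ragged(lon, row_size)
for i, values in enumerate(trajectories):
    print(f"trajectory {i}: {values}")
```

Seeing the flat array next to its count variable makes it clearer why an example file with actual data in it helps here.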
-
There seems to be some overlap with this old issue: cf-convention/cf-conventions#348
-
It has been a while since there has been any activity on this. Does anyone have anything they want to add?
-
Hi @lhmarsden @ethanrd @ChrisBarker-NOAA @sethmcg @sadielbartholomew, you are welcome to join a chat about this CF discussion topic on Tuesday 14th January at 13:00 UK time. Please join here if you are interested:
-
Topic for discussion
Example datasets would be useful as training datasets for human users. These training datasets could also potentially be used for training a GPT (for example) to 'understand' CF in order to help human users with questions.
This discussion relates to this theme in the CF roadmap publication that we are collectively working on: the 5-year plan.