Agenda
- short catch up since last meeting(s)
- Data ingest / crawler discussion - existing approaches:
  - ESGF (CEDA ingest tool, and ESGF publisher)
    - stac-generator
      - crawl --> extract --> elastic_search ingest
    - esg-publisher
      - (crawled) --> extract --> publish (on kafka) --> remote stac ingest
    - (both: kerchunk/sidecar file generation .. tool??)
      - stac-generator
  - Warmworld
    - static / ad hoc tool generate --> ?static catalog gen? / ?test stac ingest?
  - Expect (cross catalog and non-ESGF catalogs)
    - static cross catalog overlay ??
  - EERIE
    - see cloudify
  - NextGems etc: intake
    - (intake --> stac) !?
  - (Freva: tbd. in February meeting)
    - crawler --> direct solr ingest --> (stac gen?)
- status test servers, prototyping setup etc.
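
A minimal sketch of the crawl --> extract --> ingest chain from the agenda, only to make the stages concrete for discussion. It is not stac-generator, esg-publisher, or the freva indexer: `crawl`, `extract`, `ingest`, the `.nc` suffix filter, and the newline-delimited JSON sink are all hypothetical stand-ins, and the extracted record carries file system facts only, in a STAC-Item-like shape.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def crawl(root, suffix=".nc"):
    """Crawl stage: yield every file below root whose name ends with suffix."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield Path(dirpath) / name


def extract(path, collection):
    """Extract stage: build a STAC-Item-like dict from file system facts only."""
    path = Path(path).resolve()
    # Placeholder: the file's mtime, not the time axis of the data itself.
    modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": path.stem,
        "collection": collection,
        "geometry": None,  # a real extractor would read the grid from the file
        "properties": {"datetime": modified.isoformat()},
        "links": [],
        "assets": {"data": {"href": path.as_uri()}},
    }


def ingest(items, sink):
    """Ingest stage (stand-in): append one JSON document per line to sink."""
    count = 0
    with open(sink, "a", encoding="utf-8") as handle:
        for item in items:
            handle.write(json.dumps(item) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Assumption: a throwaway directory with empty files replaces a real archive.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "tas_day.nc").touch()
        (Path(tmp) / "README.txt").touch()
        sink = Path(tmp) / "items.ndjson"
        n = ingest((extract(p, "demo") for p in crawl(tmp)), sink)
        print(f"ingested {n} item(s)")
```

Swapping the `ingest` stand-in for an Elasticsearch, Solr, or Kafka writer would leave the first two stages untouched, which is the main point of keeping the stages separate in this sketch.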
Discussion
- DKRZ metadata crawler / indexer approach: build new vs. reuse existing approaches etc.
  - ceda indexer: developer left; very generic tool with a relatively good code base, but more generic than what we need; generation of kerchunk / aggregation is done separately - not clear whether building on it has major advantages ...
  - esgf-pub: crawling / gridmapfile generation is done separately anyway; not a good code base; quite CMIP/ESGF specific; unclear future development roadmap (current funding problems, freeze situation) ..
  - eerie approach: zarr / kerchunking approach is central, yet we probably cannot borrow much from other approaches (esgf etc.)
  - freva indexer: the disadvantage is that we would need to build our higher aggregation levels on the freva base level; these developments would then be quite dependent on the freva/solr base layer, which is old and DKRZ-specific. The major advantage would be that we could rely on a shared, stable production DKRZ indexing solution
- separate discussion about catalog of catalog approach and specific DKRZ catalog solution
action items:
- continue discussion with freva people
- @carsten: look once again into the ceda crawling approach, to see whether there is a major advantage in reusing/building upon their code base
- carsten/fabi: discuss kerchunking approach (VirtualiZarr etc.), follow new Zarr v3 related approaches
- continue discussion as part of our regular thursday meetings ..
figures for guiding discussions:
status vs. ideal
metacat perspective
Catalogs to play with / link to in test env:
- DestinE, e.g. https://hda.data.destination-earth.eu/ui/catalog
- CEDA CMIP6 STAC, via stacbrowser
- EERIE cloud
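from diagrams import Diagram, Cluster
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

# "metacat perspective": one STAC meta-catalog fanning out to the member catalogs.
# The AWS icons (RDS, ELB) are only used as generic shapes here.
with Diagram("Web Service", show=False) as diag:
    with Cluster("STAC Catalogs"):
        cat1 = RDS("ESGF EAST STAC Catalog")
        cat2 = RDS("DKRZ proj Catalog")
        cat3 = RDS("EERIE Catalog")
        cat4 = RDS("WW Catalog")
        cat5 = RDS("DKRZ NN Catalog, e.g. DestinE")
    # The meta-catalog node sits outside the cluster and links to every member.
    ELB("STAC MetaCat") >> [cat1, cat2, cat3, cat4, cat5]

diag  # renders inline when run in a notebook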
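
In the simplest static case, the fan-out in the diagram could be a STAC root catalog whose `child` links point at the member catalogs. A minimal sketch under that assumption follows: `build_meta_catalog` is a hypothetical helper, the `example.org` URLs are placeholders rather than real endpoints, and a dynamic STAC API in front of the members would look different.

```python
import json


def build_meta_catalog(catalog_id, description, members):
    """Return a STAC Catalog dict with one 'child' link per member catalog."""
    links = [{"rel": "root", "href": "./catalog.json", "type": "application/json"}]
    for title, href in members.items():
        links.append(
            {"rel": "child", "href": href, "type": "application/json", "title": title}
        )
    return {
        "type": "Catalog",
        "stac_version": "1.0.0",
        "id": catalog_id,
        "description": description,
        "links": links,
    }


# Titles are taken from the diagram; every URL is a placeholder (assumption).
MEMBERS = {
    "ESGF EAST STAC Catalog": "https://example.org/esgf-east/catalog.json",
    "DKRZ proj Catalog": "https://example.org/dkrz-proj/catalog.json",
    "EERIE Catalog": "https://example.org/eerie/catalog.json",
    "WW Catalog": "https://example.org/ww/catalog.json",
    "DKRZ NN Catalog, e.g. DestinE": "https://example.org/dkrz-nn/catalog.json",
}

if __name__ == "__main__":
    catalog = build_meta_catalog(
        "stac-metacat", "Catalog of catalogs (sketch)", MEMBERS
    )
    print(json.dumps(catalog, indent=2))
```

A static overlay like this needs no server-side search, so it could serve as a first test in the prototyping setup before deciding on a specific DKRZ catalog solution.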