Pecan: Adding new data sources/datasets with PEcAn

Created on 23 Mar 2020  ·  9 Comments  ·  Source: PecanProject/pecan

Hi everyone, I see many New Dataset labels covering various data sources that could be integrated with PEcAn. I was planning to work on a few of them as part of my GEE-PEcAn GSoC proposal. Are there any datasets that you would find especially helpful to have added to PEcAn?

Stale

All 9 comments

If there are datasets that we've already tagged as of interest that are also already on GEE, then definitely add them! If you post a list of the overlap between New Dataset and GEE, we'd be happy to help prioritize.
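One way to produce that overlap list is a simple set intersection over dataset names. A minimal sketch in Python, assuming the names have been collected by hand; the function names and both example lists are hypothetical placeholders, not the actual contents of the New Dataset label or the GEE catalog:

```python
def normalize(name):
    """Lowercase and drop non-alphanumerics so 'Landsat 8' matches 'landsat-8'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def find_overlap(wanted, available):
    """Return the names in `wanted` that also appear in `available`, in `wanted` order."""
    available_keys = {normalize(name) for name in available}
    return [name for name in wanted if normalize(name) in available_keys]


# Hypothetical inputs: replace with the real issue titles and catalog entries.
new_dataset_issues = ["Landsat 8", "SMAP soil moisture", "GEDI lidar"]
gee_catalog = ["landsat-8", "Sentinel-2", "SMAP Soil Moisture"]

overlap = find_overlap(new_dataset_issues, gee_catalog)
print(overlap)
```

Matching on normalized names is crude; real catalog entries often use product IDs rather than human-readable titles, so a hand-checked mapping may be needed.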

That said, many of the datasets of interest don't live on GEE and are better handled either by different/additional automated workflows (for high-volume, standardized data) or by the data ingest app (for pulling in 'long tail' data via DOI, drag-and-drop, or the APIs for generalized data repositories [e.g. DataOne]).

We're definitely interested in advancing all these tools too.

Yeah, my inclination would be to use GEE as a fallback if reasonably easy-to-use APIs for the original data aren't available elsewhere. As I recall, GEE does a lot of reprojecting/resampling under the hood to make all the remote sensing imagery line up, which ultimately blurs the line between what's real data and what's just resampled or interpolated. That works really well for many of their end-users, but the kind of model-data fusion work that we do with PEcAn may demand a higher level of care. That's not to say we shouldn't use GEE, just that it's usually worth spending a bit of time looking for alternative places to get any given dataset. In some cases, it will be easier and better to retrieve the data from GEE, in which case that's what we should do. But in other cases, it may be easier and better to get the data from a different source.

In particular, in addition to Mike's suggestions above, we should also keep an eye on the capabilities of the DAACs, which not only store the data but are also actively developing tools to make the data easier to retrieve and work with. For example:

  • LPDAAC manages a lot of gridded land surface remote sensing data (including Landsat).
  • ORNL DAAC was originally more focused on field data, but currently has some gridded products and airborne observations as well.
  • NSIDC DAAC is primarily for snow and ice, but also has some soil moisture data.

Thanks, I'll try to find out the overlapping datasets. I understand in some cases it's better to directly use sources like DAACs instead of the GEE.

I also wrote down that I want to include more data sources in my data ingest app proposal. I found an R package, nasapower, that can download NASA POWER data in R, and an R interface, nasadata, for accessing some of NASA's APIs. I am not sure which way is better to go; it probably needs more exploration.

@chilampoon I'm a bit wary of those two suggestions. Both contain a lot of dead links, which isn't a good sign that either is being maintained. NASA POWER appears to be a derived product focused on energy resources, not something that's high on our priority list for ingest. Taking a quick look at the nasadata package's vignette, it really reads like something written by someone who doesn't understand remote sensing (e.g. refers to Landsat 8 as 'low quality imagery'). It also appears that the access that it does provide to Landsat 8 is actually via Google Earth Engine.

This may not be a universal consensus on the PEcAn team, but I think that if you need the raw data from any specific satellite, you're now talking about sufficiently high-volume data that it makes sense to write code specific to that API (which is what I think @ashiklom was suggesting earlier). But there, the list of what data you want is pretty important! If, on the other hand, you know you want to do a lot of preprocessing steps on the cloud to reduce the data volume of your download (and avoid doing that processing on your own machines) and are OK with any sort of reprojection/interpolation that occurs on GEE, then that service makes sense. GEE also does have the advantage of providing a single interface that is pretty darn fast. Finally, while NASA is awesome, it's not the only space agency out there producing remote sensing data that we need (indeed, I think @istfer wanted the GEE interface to be able to pull Sentinel data).
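The trade-off discussed in this thread can be written down as a rough decision rule. This is only a sketch of the reasoning above, not PEcAn code; the function name, its parameters, and the returned labels are all hypothetical:

```python
def choose_source(needs_raw_data, wants_cloud_preprocessing,
                  accepts_gee_resampling, has_usable_native_api):
    """Suggest where to fetch a dataset from, following the thread's reasoning.

    Returns "native_api" (the original provider, e.g. a DAAC) or "gee".
    """
    # GEE as a fallback when the original provider has no easy-to-use API.
    if not has_usable_native_api:
        return "gee"
    # Raw data from a specific satellite: write code against that API.
    if needs_raw_data:
        return "native_api"
    # Cloud-side preprocessing is worth it only if GEE's
    # reprojection/interpolation is acceptable for the analysis.
    if wants_cloud_preprocessing and accepts_gee_resampling:
        return "gee"
    # Otherwise prefer the original source.
    return "native_api"


print(choose_source(needs_raw_data=True, wants_cloud_preprocessing=False,
                    accepts_gee_resampling=False, has_usable_native_api=True))
```

In practice the answer also depends on which datasets are actually wanted, which is why the list matters.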

p.s. In addition to @istfer's need for multispectral data (Sentinel, Landsat, etc.), my personal wishlist of remote sensing data I'd love to see getting into PEcAn is: SMAP soil moisture and vegetation optical depth, GEDI lidar data (which has a new R package: https://github.com/carlos-alberto-silva/rGEDI), SIF from OCO2, OCO3, etc., and ECOSTRESS thermal data. I would also love for someone to tackle improving our pipeline for ingesting NOAA GOES: https://doi.org/10.3390/rs11212507

Thanks for sharing, @mdietze. I've already prepared to integrate most of these sources. I'll try to find a way to handle GOES as well.

@mdietze Thanks for the reminder! Would you like to import only the NOAA GOES data, or both GOES and the estimated NDVI? Also, has the model proposed in that paper already been added to PEcAn?

The GOES diurnal model in the Wheeler paper currently lives in its own repo outside of PEcAn. Getting that full pipeline implemented and optimized, on top of all the other remote sensing listed, is beyond the scope of GSoC, but getting the initial download more automated would definitely be a helpful and important first step.

This issue is stale because it has been open 365 days with no activity.
