Opened 7 years ago

Closed 6 years ago

#17 closed enhancement (fixed)

Remove ambiguity in cell_methods, especially means over subgrid areas

Reported by: jonathan
Owned by: cf-conventions@…
Priority: medium
Milestone:
Component: cf-conventions
Version:
Keywords:
Cc:

Description

1. Title

Remove ambiguity about statistics described by cell_methods, especially
means over subgrid areas

2. Moderator

Karl Taylor

3. Requirement

To indicate the portion of a cell over which a statistic has been calculated,
in situations where there is a need to distinguish between statistics
calculated for the same quantity over different portions of the cell, or where
the quantity might be considered to be undefined over some portion of the
cell. The statistic is usually a mean or a sum. The situation most often
arises with cells in the horizontal, where the "portions" are different types
of surface that don't have defined geographical boundaries. Some examples:

  1. sea_ice_thickness averaged over the area of sea ice, the area of sea, or
     the entire area of the cell including land. These can all be written as A/B,
     where A is the total volume of sea ice in the cell, and B is the area of sea
     ice, the area of sea or the area of the cell.

  2. surface_upward_sensible_heat_flux averaged over different surface types
     within the cell e.g. land, sea, land ice, forest. These might similarly be
     written as A/B, A (in W) being the area-integral of the flux applying to the
     given surface type, and B (in m2) either the area occupied by that type, or
     the cell area. Put another way, the flux (W m-2) is expressed either
     per unit area of the particular surface type, or per unit area of the grid
     cell. When the values for different types are given per unit area of the cell,
     the sum of these values over all types is the mean for the cell as a whole.

  3. surface_temperature averaged over different surface types within the
     cell. This is only likely to be given as an average value for each surface
     type, i.e. formally where B is the area of that type, e.g. the temperature is
     300 K over the land portion of the cell and 310 K over the forest portion.
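
As a rough numerical illustration of the A/B notation in example 1, the sketch below divides the same sea-ice volume by three different reference areas. All numbers and the helper name mean_thickness are invented for illustration and do not come from the proposal.

```python
# Hypothetical values for a single grid cell (assumptions, not from the ticket).
cell_area = 1000.0      # m2, whole cell including land
sea_area = 600.0        # m2, sea portion of the cell
sea_ice_area = 150.0    # m2, part of the sea covered by ice
sea_ice_volume = 300.0  # m3, the numerator A in the A/B notation


def mean_thickness(volume, area):
    """Return A/B: the total volume divided by the chosen reference area."""
    return volume / area


# Same A, three choices of B, three different "mean thicknesses".
thickness_over_ice = mean_thickness(sea_ice_volume, sea_ice_area)
thickness_over_sea = mean_thickness(sea_ice_volume, sea_area)
thickness_over_cell = mean_thickness(sea_ice_volume, cell_area)
```

With these made-up numbers the three results differ by almost a factor of seven, which is the ambiguity the requirement is meant to remove.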

4. Initial Statement of Technical Proposal

In the standard_name guidelines, this issue is partly addressed by using
where-phrases. However, this approach is unclear and inadequate. It can't
indicate, for instance, whether the sensible heat flux applying to the land
portion of the box is expressed per unit area of land or per unit area of the
cell. The present proposal follows the one made on the email list in December
2006 http://www.cgd.ucar.edu/pipermail/cf-metadata/2006/001449.html. Following
our discussion in Paris in June 2007, the proposal extends the use of
cell_methods and coordinates to indicate subgrid variation more precisely,
and eliminates where-phrases from standard names.

  1. If there is no cell_methods specified, the default interpretation for an
     intensive quantity is "point", which means a local value in area or an
     instantaneous value in time, and "sum" for an extensive quantity, meaning the
     sum over area or time in the cell. No change is proposed to this: it is
     unproblematic, because point values and integrals do not involve dividing by
     anything. It is undefined what value should be given if the quantity does not
     exist e.g. for sea_ice_thickness where there is no sea ice, the value could be
     zero or missing, as either would make sense for a point value.

  2. Delete the existing standard names with where-phrases, making them aliases
     of names without the where-phrases. There are only nine of them at present:

precipitation_flux_onto_canopy_where_land
surface_net_downward_radiative_flux_where_land
surface_snow_thickness_where_sea_ice
surface_temperature_where_land
surface_temperature_where_open_sea
surface_temperature_where_snow
surface_upward_sensible_heat_flux_where_sea
water_evaporation_flux_from_canopy_where_land
water_evaporation_flux_where_sea_ice
  3. Define a new standard_name of area_type, whose values could be any of
     the surface_cover types as well as any distinctions of horizontal area which
     are not surface types, such as "cloud". It is not proposed to standardise the
     values of area_type at present, but they could be standardised later.

  4. To provide for greater use of string-valued auxiliary coordinate variables,
     especially string-valued scalar coordinate variables:

  • To the end of the first paragraph of 6.1, append: Other purposes for string identifiers are also described in Section 6.1.1, "Geographic Regions", and Section 7.3.3, "Statistics applying to portions of cells".
  • To the end of the second paragraph of 6.1, append: If a character variable has only one dimension (the length of the string), it is regarded as a string-valued scalar coordinate variable, analogous to a numeric scalar coordinate variable (Section 5.7).
  • Modify the section on 6.1 in the conformance document to read: A variable of character type that is named by a coordinates attribute is a label variable. This variable must have one or two dimensions. The trailing (CDL order) or sole dimension is for the maximum string length. If there are two dimensions, the leading dimension (CDL order) must match one of those of the data variable.
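
The dimension rule for label variables in the last bullet can be sketched as a small check. This is only an illustration under stated assumptions: the function name check_label_variable and its tuple-based inputs are hypothetical and are not taken from the conformance document or any CF checker.

```python
def check_label_variable(label_dims, data_dims):
    """Return a list of problems found with a character-typed label variable.

    label_dims: dimension names of the char variable, in CDL order.
    data_dims: dimension names of the data variable whose coordinates
               attribute names the label variable.
    """
    problems = []
    if len(label_dims) not in (1, 2):
        problems.append("label variable must have one or two dimensions")
        return problems
    # The trailing (or sole) dimension is the maximum string length,
    # so only a leading dimension needs to match the data variable.
    if len(label_dims) == 2 and label_dims[0] not in data_dims:
        problems.append(
            f"leading dimension {label_dims[0]!r} matches no data variable dimension"
        )
    return problems
```

A one-dimensional char variable such as ("maxlen",) passes, corresponding to the string-valued scalar coordinate variable described in the second bullet.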
  5. A cell_methods entry is generically of the form "name: [name: ...] method"
     (see CF 7.3), where names are the names of dimensions,
     scalar coordinate variables, or standard_names. Horizontal area-means are
     indicated by "lat: lon: mean", if lat and lon are the
     latitude and longitude dimensions. I propose to introduce a special name
     of area to indicate horizontal area, so an area-mean can be written "area:
     mean". This is more obvious and convenient.

To do this, modify the paragraph in 7.3 beginning "If a data value is
representative of variation over a combination of axes" by changing "a
longitude-latitude gridbox would have" to "... could have", and appending the
following:

To indicate variation over horizontal area, a special name of area is permitted as an alternative to specifying a combination of dimensions. The common case of an area-mean in longitude-latitude gridboxes can thus be shown by cell_methods="area: mean". If they are not longitude and latitude, the horizontal coordinate variables can be identified with axis attributes of X and Y (see Chapter 4, Coordinate Types).

  6. Since CF 7.3 on Cell methods is quite long, I propose:
  • To rename 7.3 as Statistical variation within cells. This title would explain more of what it is about and parallels the title of 7.4 on Climatological statistics.
  • To insert subsection headings of 7.3.1 Statistics for more than one axis starting with the paragraph "If more than one ...", 7.3.2 Recording the spacing of the original data and other information starting "To indicate more precisely" and 7.3.4 Use of standard names in cell methods starting "The convention of specifying". (A new subsection on portions of cells will be inserted as 7.3.3 between the second and third existing subsections.)
  7. Insert a new subsection in CF 7.3 entitled Statistics applying to
     portions of cells before the paragraph beginning "The convention of
     specifying", as follows:

By default, the statistical method indicated by cell_methods is assumed to have been evaluated over the entire cell. Sometimes it is necessary to evaluate different values of a quantity for different portions of a cell. To indicate this, one of two conventions may be used.

The first convention is the more general. In this convention, a string-valued coordinate variable or string-valued scalar coordinate variable (see Section 6.1, "Labels") indicates the portion of the cell. Variables with standard_names of land_cover, surface_cover or area_type are suitable. With this approach, a coordinate variable with dimension greater than one would allow values of a quantity to be given for various area types in one data variable, as is often needed in land surface models for example, since they deal with many types within each surface gridbox. In this convention, the cell_methods entry is of the form "name: method" as usual, where name could be area, but the statistical method applies to the selected portion of the cell only e.g. a mean over the sea-ice area.

The second convention is a shorthand for the commonest cases. In this convention, a cell_methods entry may be given of the form "name: method where type", in which type may be land, sea, sea_ice, or open_sea (sea area not occupied by sea ice). The phrase "where type" should be interpreted as exactly equivalent to supplying a scalar or size-one coordinate variable of area_type with value type.

Example. Means over land and sea.

dimensions:
  lat=73;
  lon=96;
  maxlen=20;
  lc2=2;
variables:
  float surface_temperature(lat,lon);
    surface_temperature:cell_methods="area: mean where land";
  float surface_upward_sensible_heat_flux(lc2,lat,lon);
    surface_upward_sensible_heat_flux:coordinates="land_cover2";
    surface_upward_sensible_heat_flux:cell_methods="area: mean";
  char land_cover2(lc2,maxlen);
data:
  land_cover2="land","sea";

In any case, other coordinate variables may also implicitly restrict the portion of the cell considered by the statistical method. For example, the horizontal area of the ocean decreases with increasing depth. An area-mean as a function of depth in the ocean is therefore formed over different areas at different depths. This is not indicated explicitly in cell_methods. As described in Section 7.3.4 "Use of standard names in cell methods", a labeled axis of region may restrict the portion of a latitude-longitude gridbox to be considered.

If the method is mean, the cell_methods entry may be further supplemented by the phrase "over type", where type can be land, sea or all, and all means the entire area of the cell. A cell_methods entry of the form "mean where type1 over type2" indicates the mean is calculated by summing over the type1 portion of the cell and dividing by the area of the type2 portion. A cell_methods entry of the form "mean over type" indicates the mean is calculated by summing over the entire cell and dividing by the area of the type portion.
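
The "where type1 over type2" rule can be sketched numerically as follows. The function name mean_where_over, the dictionary layout and all values are assumptions made for this illustration. In particular, passing "all" for the where portion is a device of the sketch standing for an omitted "where" phrase; it is not part of the proposed syntax at this point.

```python
def mean_where_over(integrals, areas, where, over):
    """Sum over the `where` portion and divide by the area of the `over` portion.

    integrals: area-integral of the quantity for each portion (e.g. in W).
    areas: area of each portion (e.g. in m2). The portions are assumed not to
           overlap and to cover the cell, so "all" means the sum of every entry.
    """
    numerator = sum(integrals.values()) if where == "all" else integrals[where]
    denominator = sum(areas.values()) if over == "all" else areas[over]
    return numerator / denominator


# Invented values for a cell that is 40% land and 60% sea.
areas = {"land": 400.0, "sea": 600.0}          # m2
integrals = {"land": 20000.0, "sea": 30000.0}  # W

per_land_area = mean_where_over(integrals, areas, "land", "land")
land_per_cell = mean_where_over(integrals, areas, "land", "all")
sea_per_cell = mean_where_over(integrals, areas, "sea", "all")
cell_mean = mean_where_over(integrals, areas, "all", "all")
```

In this sketch the two per-cell values add up to the whole-cell mean, matching the remark in the Requirement section about values given per unit area of the cell.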

Example. Thickness of sea-ice and snow on sea-ice averaged over sea area.

variables:
  float snow_thickness(lat,lon);
    snow_thickness:cell_methods="area: mean where sea_ice over sea";
    snow_thickness:standard_name="lwe_thickness_of_surface_snow_amount";
    snow_thickness:units="m";
  float sea_ice_thickness(lat,lon);
    sea_ice_thickness:cell_methods="area: mean over sea";
    sea_ice_thickness:standard_name="sea_ice_thickness";
    sea_ice_thickness:units="m";

In the case of sea-ice thickness, it makes no difference to include "where sea_ice", since the sum over all sea area of sea-ice thickness is obviously the same as the sum over sea-ice area only. In the case of snow thickness, the "where" phrase does make a difference; it excludes snow on land from the average. Omitting the "over" phrase would mean that both quantities would be averages over the entire cell, not just the sea area.

  8. Modify the first bullet of the section on 7.3 in the conformance document
     as follows:

The type of the cell_methods attribute is a string whose value is one or more blank separated word lists, each with the form

dim1: [dim2: [dim3: ...] ] method [where type1] [over type2] [within|over days|years] [(comment)]

where brackets indicate optional words. The valid values for dim1 [dim2 [dim3 ...] ] are dimension names of the associated variable, valid standard names, or the word area. The valid values of method are contained in Appendix D. The valid values for type1 are land, sea, sea_ice, or open_sea. The valid values for type2 are land, sea and all. When the method refers to a climatological time axis, the suffixes for within and over may be appended.
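
For illustration only, the word-list form above could be recognised with a regular expression along these lines. This is a minimal sketch, not the CF checker's implementation: the names ENTRY and parse_entry are hypothetical, it handles a single word list at a time, and it does not validate the method or the names before the colons.

```python
import re

# One word list: "dim1: [dim2: ...] method [where type1] [over type2]
#                 [within|over days|years] [(comment)]"
ENTRY = re.compile(
    r"^(?P<names>(?:\w+:\s+)+)"
    r"(?P<method>\w+)"
    r"(?:\s+where\s+(?P<where>land|sea|sea_ice|open_sea))?"
    r"(?:\s+over\s+(?P<over>land|sea|all))?"
    r"(?:\s+(?P<clim>within|over)\s+(?P<period>days|years))?"
    r"(?:\s+\((?P<comment>[^)]*)\))?$"
)


def parse_entry(entry):
    """Return the parts of one cell_methods word list, or None if it does not match."""
    match = ENTRY.match(entry.strip())
    if match is None:
        return None
    return {
        "names": [name.rstrip(":") for name in match.group("names").split()],
        "method": match.group("method"),
        "where": match.group("where"),
        "over": match.group("over"),
    }
```

For example, parse_entry("area: mean where sea_ice over sea") yields the names ["area"], the method "mean", and the two area types, while an entry using a type outside the listed values, such as "area: mean where cloud", returns None.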

5. Benefits

These clarifications will particularly benefit those providing or using data
from models, when it is important to be clear exactly how area-means have been
calculated. The current standard is unclear.

6. Status Quo

The necessary clarification could be recorded as a comment in () in the
cell_methods. This is not usually done, and even if it were done, generic
applications could not use it to distinguish the possibilities as it is not
standardised. For the CMIP3 database, PCMDI described how means should be
calculated e.g. whether sea ice thickness is calculated as the mean over sea
ice area or some other area; the prescription was not recorded in the netCDF
data produced, which is therefore not self-describing.

Attachments (1)

cell_methods_june19.pdf (52.0 KB) - added by taylor13 6 years ago.
#17 cell_methods text changes (word format)


Change History (37)

comment:1 Changed 7 years ago by jonathan

  • Summary changed from Remove ambiguity in by cell_methods, especially means over subgrid areas to Remove ambiguity in cell_methods, especially means over subgrid areas

comment:2 follow-up: Changed 7 years ago by russ

I believe "The valid values of method are contained in Appendix D." should be changed to reference Appendix E instead.

I'm not sure what change is being described in "8. Modify the first bullet of the section on 7.3 in the conformance document as follows:". The first bullet I see in section 7.3 is

If the cell coordinate range cannot be precisely defined. For example, the Levitus ocean climatology uses any data that exists. It is a time mean but the time range is not well defined, so cannot be stated.

Is it really intended to delete this bullet and instead substitute the three paragraphs starting "The type of the cell_methods attribute is a string whose value ..."? If so, the substitute text does not seem to fit with the text preceding the bullet, which says "There are two reasons for doing this."

If the problems above are fixed, I approve these suggested changes. They seem complex, but clarify the intended use of cell_methods.

comment:3 in reply to: ↑ 2 Changed 7 years ago by jonathan

Dear Russ

Thanks for looking at this proposal.

I believe "The valid values of method are contained in Appendix D." should be changed to reference Appendix E instead.

You are right. The conformance document currently says D, but it should be E.

I'm not sure what change is being described in "8. Modify the first bullet of the section on 7.3 in the conformance document as follows:".

This part of the proposal (item 8) is dealing with changes to the conformance document http://cf-pcmdi.llnl.gov/conformance/requirements_recommendations/, not the CF standard document. Item 8 is changes to the conformance document to make it consistent with the modified standard document. I hope that answers your concern.

If the problems above are fixed, I approve these suggested changes.

Thanks.

They seem complex, but clarify the intended use of cell_methods.

They are complex to describe as a change, but I suspect they will not look so complicated when in place in the documents.

Cheers

Jonathan

comment:4 Changed 7 years ago by russ

Yes that addresses my concerns, thanks.

I'm embarrassed to admit, I wasn't even aware of the existence of the
"CF Conformance Requirements and Recommendations" document. Now that
I've looked at it, it seems useful. We should consider including a
link to it from our Conventions page in addition to the link to the CF
Conventions document.

--Russ

comment:5 follow-up: Changed 7 years ago by taylor13

I generally support this change to the convention. There are a few items that I hope others will provide input on before we move forward:

1) With regard to 4.3 in the original ticket posting: I'm a little confused. It says "area_type, whose values could be any of ...." Do you mean "area_type *which identifies variables* whose values could be any ..."?

2) The proposed new title for section 7.3 "Statistical variation within cells" is more descriptive than "Cell methods", but it seems inappropriate to characterize a "mean", which may be the most common cell method, as a "variation within cells". Would one of the following be more accurate? a) "Statistics characterizing cell quantity", or b) "Characterization of quantity within cells", or c) "Characterization of cell quantity".

3) In 4.7 of the original ticket posting: Concerning the second convention, I think "cloud" should be one of the allowable "types" along with land, sea, sea_ice, and open_sea. I've seen, for example, model output with cloud water averaged over the entire grid cell and just over the cloud covered portion of the grid cell. This is common enough to warrant not having to define a scalar dimension (following the first convention in this section).

4) I think it would be helpful in the Example to expand on the description: Mean surface temperature over land and sensible heat flux over land and sea. In any case, perhaps the variable name "land_cover2" should be renamed "land_sea" and land_sea should have a standard name attribute with the value "surface_cover" (not "land_cover").

5) In the paragraph following the first example, it would be helpful to mention that although "This is not indicated explicitly in cell_methods", the information can be stored by the use of bounds attached to scalar coordinates for longitude and latitude.

6) In the paragraph following the first example, I am confused by the statement beginning "As described in Section 7.3.4 ...." I also think we need to work on section 7.3 itself because it took me a long time to figure out what the point was, even though I knew this at one time (a few years ago).

7) I think we should add to the "types" allowed in the "over type" construction the following: open_sea, sea_ice, and cloud. Also, we should note that if the "over type" suffix is omitted, it is assumed to be over the area specified in the "where type" specification.

8) How would one specify "over cloud-free land"?

Thanks to you all for your patience. I hope to soon contribute to improving some of the parts I found confusing, rather than just saying they were confusing, but this is all for now.

Karl

comment:6 in reply to: ↑ 5 Changed 7 years ago by jonathan

Dear Karl

Thanks for your comments and support.

1) With regard to 4.3 in the original ticket posting: I'm a little confused. It says "area_type, whose values could be any of ...." Do you mean "area_type *which identifies variables* whose values could be any ..."?

This is the bit that says, "Define a new standard_name of area_type, whose values could be any of the surface_cover types." What I meant is that a variable (probably a coordinate variable) whose standard name is area_type may have any of the values that a variable whose standard name is surface_type could have, such as sea, sea_ice, land, vegetation, forest, broadleaf_tree. The possible values aren't standardised at present, as I commented. area_type is proposed to be more general than surface_type, because it could also be used for distinctions of horizontal area which are not surface types, such as cloud.

2) The proposed new title for section 7.3 "Statistical variation within cells" is more descriptive than "Cell methods", but it seems inappropriate to characterize a "mean", which may be the most common cell method, as a "variation within cells". Would one of the following be more accurate? a) "Statistics characterizing cell quantity", or b) "Characterization of quantity within cells", or c) "Characterization of cell quantity".

The whole of CF 7 is entitled "Data representative of cells", so maybe we could call 7.3 "Statistical methods" (a bit clearer than the current "Cell methods"). "Mean" is certainly a statistic.

3) In 4.7 of the original ticket posting: Concerning the second convention, I think "cloud" should be one of the allowable "types" along with land, sea, sea_ice, and open_sea. I've seen, for example, model output with cloud water averaged over the entire grid cell and just over the cloud covered portion of the grid cell. This is common enough to warrant not having to define a scalar dimension (following the first convention in this section).

I sympathise, but I'm not convinced. I think we should try to limit the proliferation of area types that could receive this favoured treatment. Actually I had considered not allowing the shorthands (of the kind "area: mean over land") at all, but I thought that the land/sea/sea_ice/open_sea ones are particularly commonly used because there are submodels for land, ocean and sea ice. For these submodels, all output quantities might have to specify the area type, so it seems a worthwhile convenience to allow it to be done in cell_methods. In other cases, such as cloud, it is not necessary to define a dimension of size one. It can be done with a scalar coordinate variable, not requiring any dimension (except the length of the string).

4) I think it would be helpful in the Example to expand on the description: Mean surface temperature over land and sensible heat flux over land and sea. In any case, perhaps the variable name "land_cover2" should be renamed "land_sea" and land_sea should have a standard name attribute with the value "surface_cover" (not "land_cover").

Thanks - good suggestions.

Example. Mean surface temperature over land and sensible heat flux meaned separately over land and sea.

dimensions:
  lat=73;
  lon=96;
  maxlen=20;
  ls=2;
variables:
  float surface_temperature(lat,lon);
    surface_temperature:cell_methods="area: mean where land";
  float surface_upward_sensible_heat_flux(ls,lat,lon);
    surface_upward_sensible_heat_flux:coordinates="land_sea";
    surface_upward_sensible_heat_flux:cell_methods="area: mean";
  char land_sea(ls,maxlen);
    land_sea:standard_name="area_type";
data:
  land_sea="land","sea";

5) In the paragraph following the first example, it would be helpful to mention that although "This is not indicated explicitly in cell_methods", the information can be stored by the use of bounds attached to scalar coordinates for longitude and latitude.

That's not what I meant. The horizontal coordinates and their bounds are independent of level, but the actual area (of non-missing data) of the ocean is a function of level. Hence the "area: mean over sea" is implicitly a mean over an area which varies as a function of level.

6) In the paragraph following the first example, I am confused by the statement beginning "As described in Section 7.3.4 ...." I also think we need to work on section 7.3 itself because it took me a long time to figure out what the point was, even though I knew this at one time (a few years ago).

Perhaps this could be addressed by expanding the examples. I would propose the following text:

The convention of specifying a cell method for a standard name, rather than for the dimension of a coordinate variable, is to allow one to provide an indication that a particular cell method is relevant to the data without having to provide a precise description of the corresponding cell. There are two reasons for doing this.

  • If the cell coordinate range cannot be precisely defined. For example, the Levitus ocean climatology uses any data that exists. It is a time mean but the time range is not well defined, so cannot be stated. This could be indicated by a cell_methods entry of time: mean, where time appears as a standard_name, and there is no dimension or coordinate variable with this name.
  • For convenience, if the cell extends over all valid coordinates. This is permitted only for the standard names longitude and latitude. Methods specified for these standard names are assumed to apply to the complete range of longitude and latitude respectively. If in addition the data variable has a dimension with a corresponding labeled axis that specifies a geographic region (Section 6.1.1, "Geographic Regions"), the implied range of longitude and latitude is the valid range for each specified region. For example, there could be a cell_methods entry of longitude: mean, where longitude appears as standard_name and is not the name of a dimension or coordinate variable. That would indicate a mean over all longitudes. But if in addition the data variable had a scalar coordinate variable with a standard_name of region and a value of atlantic_ocean, it would indicate a mean over longitudes that lie within the Atlantic Ocean, not all longitudes.

7) I think we should add to the "types" allowed in the "over type" construction the following: open_sea, sea_ice, and cloud. Also, we should note that if the "over type" suffix is omitted, it is assumed to be over the area specified in the "where type" specification.

I'd be happy with open_sea and sea_ice for consistency with the set of types allowed after where. I don't think we should allow cloud, as in your point 3.

Your second point is what I mean by this text: A cell_methods entry of the form "mean over type" indicates the mean is calculated by summing over the entire cell and dividing by the area of the type portion.

8) How would one specify "over cloud-free land"?

At present, values of area_type are not standardised, so you could have a scalar coordinate variable with a standard_name of area_type and a value of cloud_free_land. If/when we standardise the area types, we'll have to decide a method for combining them with and/or, I suppose, or introduce further distinctions in standard names between surface types and non-surface types. For the moment, I think the unstandardised mechanism is sufficient to provide useful metadata, though I'm sure particular projects might propose standardisations for their own purposes.

Cheers

Jonathan

comment:7 follow-up: Changed 7 years ago by bnl

Just a few points (so far). I'm broadly in agreement, but finding it hard to digest in detail.

  1. I think cell_methods as a section title is more useful than any of the proposed alternatives. I'm not convinced this needs changing.
  2. But I agree that the subsections would add value.
  3. I think the introduction of area_type is a difficult thing to follow. That said, I think it's analogous to a cell_method itself, and we should control the vocabulary from the start. Would it be simpler to have the standard_name indicate what it's for straight away: rather than area_type, have cell_area_type?
    • And immediately add to the standard_names, the concept of allowed values (a controlled enumeration). I think the enumerated values themselves need definition (which is to some extent what we had when we had where clauses in the standard names).

Then

dimensions:
  lat=73;
  lon=96;
  maxlen=20;
  ls=2;
variables:
  float surface_temperature(lat,lon);
    surface_temperature:cell_methods="area: mean where land";
  float surface_upward_sensible_heat_flux(ls,lat,lon);
    surface_upward_sensible_heat_flux:coordinates="land_sea";
    surface_upward_sensible_heat_flux:cell_methods="area: mean";
  char land_sea(ls,maxlen);
    land_sea:standard_name="cell_area_type";
data:
  land_sea="land","sea";

and both "land" and "sea" exist and are defined in a controlled enumeration.

comment:8 in reply to: ↑ 7 Changed 7 years ago by jonathan

Dear Bryan

Thanks for your comments (so far).

I'm broadly in agreement

Good

but finding it hard to digest in detail.

Yes, it is not trivial!

I think cell_methods as a section title is more useful than any of the proposed alternatives. I'm not convinced this needs changing.

OK. I don't feel strongly.

I think the introduction of area_type is a difficult thing to follow. Would it be simpler to have the standard_name indicate what it's for straight away: rather than area_type, have cell_area_type?

I don't think that would be quite correct. The area_type is a property of part of the cell (that's the point of it when used as a coordinate) not the whole cell.

We should control the vocabulary from the start. Immediately add to the standard_names, the concept of allowed values (a controlled enumeration). I think the enumerated values themselves need definition.

OK. We could provide an initial list of possible values and accept proposals for new values via the email list, like standard names. For the initial list, I would propose land, sea, sea_ice, open_sea, cloud, clear_sky, vegetation, bare_ground, land_ice.

At the moment the standard doesn't say anything about controlled values for standard names, so I propose the following addition to Section 3.3, at the end of the paragraph which begins "The standard name table is located at":

Some standard names (e.g. region and area_type) are used to indicate quantities which are permitted to take only certain standard values. This is indicated in the definition of the quantity in the standard name table, accompanied by a list or a link to a list of the permitted values.

Cheers

Jonathan

comment:9 follow-up: Changed 7 years ago by jonathan

I wonder whether it would be a good idea to recommend that cell_methods always be included, even when the default (point for intensive, sum for extensive) applies? The CF-checker would then produce a warning for any data variable which did not have cell_methods for any of its dimensions. Would that be useful or irritating?

The advantages of making such a recommendation could include:

  1. The default might not be obvious, because it takes a bit of thought and knowledge to work out whether a quantity is intensive or extensive.
  2. Many quantities are actually area-means, so cell_methods should be included, but often is not.
  3. This is particularly important for quantities applying to only the land, or only the sea, portion of a grid-cell. Such means should routinely include a cell_methods entry of area: mean over land or area: mean over sea to make the intention clear.

Jonathan

comment:10 in reply to: ↑ 9 Changed 7 years ago by jonathan

Replying to jonathan:

  3. This is particularly important for quantities applying to only the land, or only the sea, portion of a grid-cell. Such means should routinely include a cell_methods entry of area: mean over land or area: mean over sea to make the intention clear.

Sorry, that should have been area: mean where land or area: mean where sea i.e. integrate over the land or sea area and then divide by that area. That is the correct description, for example, for the mean surface temperature in the sea portion of a grid-cell. The fact that I made this mistake indicates there's an ambiguity about where and over, so I'd like to suggest this slightly changed version of the proposed syntax for means:

mean [where type1 [over type2]]

i.e. over can appear only if where appears. type1 and type2 can be land, sea, sea_ice, open_sea or all. A cell_methods entry of the form "mean where type" indicates the mean is calculated by summing over the type portion of the cell and dividing by the area of this portion. A cell_methods entry of the form "mean where type1 over type2" indicates the mean is calculated by summing over the type1 portion of the cell and dividing by the area of the type2 portion. A cell_methods entry of the form "mean where all over type" indicates the mean is calculated by summing over the entire cell and dividing by the area of the type portion.

I hope that is clearer. I believe that all three forms are needed, though the last is less common.
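
To make the three forms concrete, here is a small sketch that maps a clause of the revised form onto the portion that is summed and the portion whose area divides the sum. The function name portions and the constant TYPES are hypothetical. Treating a bare "mean" as applying to the entire cell is an assumption taken from the default stated in the proposed subsection text.

```python
TYPES = {"land", "sea", "sea_ice", "open_sea", "all"}


def portions(clause):
    """Map 'mean [where type1 [over type2]]' to (summed portion, dividing portion)."""
    words = clause.split()
    if not words or words[0] != "mean":
        raise ValueError("clause must start with 'mean'")
    if len(words) == 1:
        # No qualifier: assumed here to mean the entire cell.
        return ("all", "all")
    if len(words) == 3 and words[1] == "where" and words[2] in TYPES:
        # "mean where type": sum over the type portion, divide by its own area.
        return (words[2], words[2])
    if (len(words) == 5 and words[1] == "where" and words[3] == "over"
            and words[2] in TYPES and words[4] in TYPES):
        # "mean where type1 over type2": sum over type1, divide by area of type2.
        return (words[2], words[4])
    raise ValueError(f"not of the form 'mean [where type1 [over type2]]': {clause!r}")
```

Under this reading "mean over land" is rejected, because over may appear only after where; the intended meaning would be written either "mean where land" or "mean where all over land".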

comment:11 follow-up: Changed 7 years ago by Heinke

Dear Jonathan,
thank you for this proposal.
This looks very good, but I would like to add some
valid values for type1:
snow (e.g. surface_temperature where snow)
glaciers (e.g. surface albedo where glaciers)
canopy (e.g. snow_depth where canopy or C-Pool for rapidly respired soil organic material where canopy)
lake (or is sea=open_sea+lakes?)

Best wishes
Heinke

comment:12 follow-up: Changed 7 years ago by apamment

I support this proposal. The current convention of using where-phrases in standard names is ambiguous and a more precise method of describing subgrid statistical processing is needed.

The numbering in the following comments refers to the numbering in the original proposal.

4.2 I think the standard name aliases can be introduced without problem provided the explanations of the names without where-phrases are expanded to cover the use of the where-phrase in any existing data. In the case of precipitation_flux_onto_canopy_where_land a new name, precipitation_flux_onto_canopy, would first need to be introduced so that the alias can be created. That can easily be done.

4.3 There will be a number of standard names - surface_cover, land_cover and the proposed area_type - all essentially performing the same function, i.e. to identify a character or flag variable describing the nature of a portion of a grid_cell. Surface_cover is already a generalisation of land_cover and it seems to me that area_type is a further generalisation of the same concept. I suggest deleting surface_cover and land_cover and making them both aliases of area_type. Having one standard name for this type of information seems less confusing.

4.5 I agree with the suggestion to introduce the name of 'area' into cell methods to indicate horizontal area. I think it is clearer than the existing convention of specifying the dimensions separately and that it should become the recommended way of specifying the information. Thus I suggest changing the wording of the proposal from "To indicate variation over a horizontal area a special name of 'area' is permitted as an alternative ..." to "To indicate variation over a horizontal area it is recommended that a special name of 'area' is used ...".

4.6 I agree with Bryan's comment that it would be better to retain the section heading of "Cell methods" and that the introduction of sub-headings is useful.

4.7 In the second convention the use of 'where' and 'over' to specify which portions of the cell are being summed over and divided by is very clear and useful. I confess I don't really understand the need to restrict this convention to only the commonest cases of land, sea, sea_ice and open_sea - why not allow it for all area_types, especially if we follow the suggestion of standardising the allowed values of area_type?

I don't think that the general (first) convention is as clear as the second. The area_type is specified in a string-valued coordinate variable or string-valued scalar coordinate variable and I can see that this would be useful if many surface types exist in a single grid cell. However, is the area_type given in the coordinate variable the area that is being summed over or divided by (or both)? Unless I have misunderstood, it seems to me that this approach suffers from the same ambiguities as the existing standard names.

I agree with your proposal to recommend the use of cell_methods. My experience is that people don't find the distinction between intensive and extensive variables to be particularly obvious.

Alison

comment:13 in reply to: ↑ 11 Changed 6 years ago by jonathan

Dear Heinke

I would like to add some valid values for type1

I would like to limit the number of "special" types that are allowed to appear in "where type" as I think this is just a shorthand for the commonest types. The shorthand is provided in order to avoid the need for a coordinate variable in those cases. Instead, I would suggest adding snow to the list of possible values of an area_type variable. I have already suggested land_ice in that list, which should be OK for glaciers (or do you want to distinguish glaciers from ice-sheets?).

I am not sure about "canopy". I think that for lying snow, we would distinguish between "surface snow" and "snow on canopy". You can have both in the same spot, so arguably these are not distinguished by area type. In your example of soil respiration, do you mean that the respiration rate may be distinguished between areas of ground beneath the canopy, and areas that can see the sky? In that case, I agree, we would need canopy as an area_type.

Rivers and lakes raise another difficulty which has come up before. We do not have standard names for lake water temperature, lake water velocity, lake ice thickness etc. and the same for rivers. In principle all the same quantities are needed for rivers and lakes as for sea, and it would be ugly to copy them all, I think, especially as the distinction may not always be clear and could be inconvenient to draw among river, lake and sea. Can we understand "sea" to include river and lake? I can't think of a word which means "sea, river or lake"!

Best wishes

Jonathan

comment:14 in reply to: ↑ 12 Changed 6 years ago by jonathan

Dear Alison

4.2 I think the standard name aliases can be introduced without problem provided the explanations of the names without where-phrases are expanded to cover the use of the where-phrase in any existing data. In the case of precipitation_flux_onto_canopy_where_land a new name, precipitation_flux_onto_canopy, would first need to be introduced so that the alias can be created.

Thanks for noting that.

4.3 There will be a number of standard names - surface_cover, land_cover and the proposed area_type - all essentially performing the same function, i.e. to identify a character or flag variable describing the nature of a portion of a grid_cell. Surface_cover is already a generalisation of land_cover and it seems to me that area_type is a further generalisation of the same concept. I suggest deleting surface_cover and land_cover and making them both aliases of area_type. Having one standard name for this type of information seems less confusing.

OK, good idea.

4.5 I agree with the suggestion to introduce the name of 'area' into cell methods to indicate horizontal area. I think it is clearer than the existing convention of specifying the dimensions separately and that it should become the recommended way of specifying the information. Thus I suggest changing the wording of the proposal from "To indicate variation over a horizontal area a special name of 'area' is permitted as an alternative ..." to "To indicate variation over a horizontal area it is recommended that a special name of 'area' is used ...".

OK.

4.6 I agree with Bryan's comment that it would be better to retain the section heading of "Cell methods" and that the introduction of sub-headings is useful.

OK. I don't mind.

4.7 In the second convention the use of 'where' and 'over' to specify which portions of the cell are being summed over and divided by is very clear and useful. I confess I don't really understand the need to restrict this convention to only the commonest cases of land, sea, sea_ice and open_sea - why not allow it for all area_types, especially if we follow the suggestion of standardising the allowed values of area_type?
I don't think that the general (first) convention is as clear as the second. The area_type is specified in a string-valued coordinate variable or string-valued scalar coordinate variable and I can see that this would be useful if many surface types exist in a single grid cell. However, is the area_type given in the coordinate variable the area that is being summed over or divided by (or both)? Unless I have misunderstood, it seems to me that this approach suffers from the same ambiguities as the existing standard names.

The reason I propose to limit the use of the shorthand is that it involves listing the possible values of keywords that can appear in cell_methods, and we should keep the number of possibilities to a minimum, I feel. All the keywords in the CF standard have a possible set of values which are listed explicitly in the standard. Of course, we could allow a reference to an external list, but the coordinate variable is a more general approach which already has that kind of machinery via the standard name table.

However, you are right that the coordinate variable approach doesn't have the same generality for means as the shorthand approach, so I now propose to generalise it thus: In the syntax "mean [where type1 [over type2]]", type1 and type2 can be one of the permitted values or the name of a coordinate or auxiliary coordinate variable with standard_name of area_type. Then all the same kinds of means can be evaluated, and since coordinate variables can be multivalued, it permits all sorts of combinations of A and B in evaluating the mean as A/B. Is that OK?

I agree with your proposal to recommend the use of cell_methods. My experience is that people don't find the distinction between intensive and extensive variables to be particularly obvious.

OK, good.

Cheers

Jonathan

comment:15 follow-up: Changed 6 years ago by jonathan

Many useful suggestions have been made, so here I restate the current form of the proposed changes to the documents following the discussion in this ticket. This supersedes section 4 (technical statement of the proposal) of the first entry of this ticket. In responding to Alison's suggestion, I have changed the proposal so that there is always a where-phrase in the cell_methods when a method applies to area-types.

  1. Delete the existing standard names with where-phrases, making them aliases of names without the where-phrases:
precipitation_flux_onto_canopy_where_land
surface_net_downward_radiative_flux_where_land
surface_snow_thickness_where_sea_ice
surface_temperature_where_land
surface_temperature_where_open_sea
surface_temperature_where_snow
surface_upward_sensible_heat_flux_where_sea
water_evaporation_flux_from_canopy_where_land
water_evaporation_flux_where_sea_ice

The standard name precipitation_flux_onto_canopy is an addition to the table. The others exist already.

  2. In Section 3.3, add the following to the end of the paragraph which begins "The standard name table is located at":

Some standard names (e.g. region and area_type) are used to indicate quantities which are permitted to take only certain standard values. This is indicated in the definition of the quantity in the standard name table, accompanied by a list or a link to a list of the permitted values.

  3. Define a new standard_name of area_type, to indicate distinctions of horizontal area within cells. Make the existing standard names of land_cover and surface_type into aliases of area_type. Variables with standard_name of area_type must have one of the values listed in a special table. The initial list of values in that table is: land sea sea_ice open_sea cloud clear_sky vegetation bare_ground land_ice snow. Additions to this table will be requested on the CF email list.
  4. To provide for greater use of string-valued auxiliary coordinate variables, especially string-valued scalar coordinate variables:
  • To the end of the first paragraph of 6.1, append: Other purposes for string identifiers are also described in Section 6.1.1, "Geographic Regions", and Section 7.3.3, "Statistics applying to portions of cells".
  • To the end of the second paragraph of 6.1, append: If a character variable has only one dimension (the length of the string), it is regarded as a string-valued scalar coordinate variable, analogous to a numeric scalar coordinate variable (Section 5.7).
  • Modify the section on 6.1 in the conformance document to read: A variable of character type that is named by a coordinates attribute is a label variable. This variable must have one or two dimensions. The trailing (CDL order) or sole dimension is for the maximum string length. If there are two dimensions, the leading dimension (CDL order) must match one of those of the data variable.
  5. Append to the second paragraph of 7.3, which begins "The default interpretation ...", as follows: "Usually cell values for intensive quantities are means, and this should be indicated explicitly in cell_methods. Because of this, and because the default interpretation, depending on the distinction between intensive and extensive, may not be obvious to the user of the data, it is recommended that every data variable should have a cell_methods attribute containing an entry for each of its dimensions and scalar coordinate variables for which it is meaningful. It is especially recommended for spatiotemporal dimensions and scalar coordinate variables."
  6. Modify the paragraph in 7.3 beginning "If a data value is representative of variation over a combination of axes" by changing "a longitude-latitude gridbox would have" to "... could have", and appending the following:

To indicate variation over horizontal area, it is recommended to use the special name of area instead of specifying a combination of dimensions. The common case of an area-mean in longitude-latitude gridboxes can thus be shown by cell_methods="area: mean". If they are not longitude and latitude, the horizontal coordinate variables can be identified with axis attributes of X and Y (see Chapter 4, Coordinate Types).
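As an illustration of what an application might do when cell_methods says only "area", here is a rough sketch of finding the horizontal coordinate variables from their attributes. The function name and the dictionary layout (variable name mapped to its attributes) are assumptions for the example, and it assumes at most one candidate variable per direction.

```python
def horizontal_coordinates(coords):
    """Return (x_name, y_name) for the horizontal coordinate variables.

    coords maps each coordinate variable name to a dict of its attributes.
    A variable counts as horizontal if its standard_name is longitude or
    latitude, or failing that if it carries an axis attribute of X or Y.
    """
    x_name = y_name = None
    for name, attrs in coords.items():
        if attrs.get("standard_name") == "longitude" or attrs.get("axis") == "X":
            x_name = name
        elif attrs.get("standard_name") == "latitude" or attrs.get("axis") == "Y":
            y_name = name
    return x_name, y_name


# A longitude-latitude grid, identified by standard_name.
lonlat = {
    "time": {"standard_name": "time"},
    "lat": {"standard_name": "latitude"},
    "lon": {"standard_name": "longitude"},
}

# A grid whose horizontal coordinates are not longitude and latitude,
# identified by axis attributes instead.
projected = {
    "y": {"standard_name": "projection_y_coordinate", "axis": "Y"},
    "x": {"standard_name": "projection_x_coordinate", "axis": "X"},
}

print(horizontal_coordinates(lonlat))     # ('lon', 'lat')
print(horizontal_coordinates(projected))  # ('x', 'y')
```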

  7. Insert subsection headings of 7.3.1 Statistics for more than one axis starting with the paragraph "If more than one ...", 7.3.2 Recording the spacing of the original data and other information starting "To indicate more precisely" and 7.3.4 Cell methods specified when there are no coordinates starting "The convention of specifying". (A new subsection on portions of cells will be inserted as 7.3.3 between the second and third existing subsections.)
  8. Insert a new subsection in CF 7.3 entitled Statistics applying to portions of cells before the paragraph beginning "The convention of specifying", as follows:

By default, the statistical method indicated by cell_methods is assumed to have been evaluated over the entire cell. Sometimes it is necessary to evaluate different values of a quantity for different portions of a cell. To indicate this, one of two conventions may be used.

The first convention is the more general. In this convention, a string-valued auxiliary coordinate variable or string-valued scalar coordinate variable (see Section 6.1, "Labels") with a standard_name of area_type indicates the portion of the cell. With this approach, a coordinate variable with dimension greater than one would allow values of a quantity to be given for various area types in one data variable, as is often needed in land surface models for example, since they deal with many types within each surface gridbox. In this convention, the cell_methods entry is of the form "name: method where typevar", where name could be area and typevar is the name of the area_type variable. The statistical method applies to the selected portion of the cell only e.g. a mean over the sea-ice area.

The second convention is a shorthand for the commonest cases. In this convention, a cell_methods entry may be given of the form "name: method where type", in which type may be land, sea, sea_ice, or open_sea (sea area not occupied by sea ice). The phrase "where type" should be interpreted as exactly equivalent to supplying a scalar or size-one coordinate variable of area_type with value type. If type is also the name of a variable in the netCDF file, it is nonetheless interpreted as a keyword, i.e. the second convention takes precedence over the first convention.

Example. Mean surface temperature over land and sensible heat flux meaned separately over land and sea.

dimensions:
  lat=73;
  lon=96;
  maxlen=20;
  ls=2;
variables:
  float surface_temperature(lat,lon);
    surface_temperature:cell_methods="area: mean where land";
  float surface_upward_sensible_heat_flux(ls,lat,lon);
    surface_upward_sensible_heat_flux:coordinates="land_sea";
    surface_upward_sensible_heat_flux:cell_methods="area: mean where land_sea";
  char land_sea(ls,maxlen);
    land_sea:standard_name="area_type";
data:
  land_sea="land","sea";
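A reader of such a file has to decide whether the word after "where" is a shorthand keyword or the name of an area_type variable. The following is only a sketch of that lookup under the precedence rule stated above; the function name, the return format, and the in-memory layout of the variables are assumptions made for the example.

```python
# Keywords of the shorthand (second) convention as listed in this proposal.
SHORTHAND_TYPES = {"land", "sea", "sea_ice", "open_sea"}


def resolve_where(name, variables):
    """Classify the word following 'where' in a cell_methods entry.

    Returns ("keyword", [name]) for a shorthand type, or
    ("variable", labels) when name is a variable with standard_name area_type.
    """
    if name in SHORTHAND_TYPES:
        # The second convention takes precedence, even if a variable
        # of the same name exists in the file.
        return ("keyword", [name])
    var = variables.get(name)
    if var is not None and var["attrs"].get("standard_name") == "area_type":
        return ("variable", list(var["values"]))
    raise ValueError(f"{name!r} is neither a shorthand type nor an area_type variable")


variables = {
    "land_sea": {"attrs": {"standard_name": "area_type"}, "values": ["land", "sea"]},
    # Hypothetical variable that happens to share its name with a keyword.
    "land": {"attrs": {"standard_name": "area_type"}, "values": ["sea"]},
}

print(resolve_where("land", variables))      # ('keyword', ['land'])
print(resolve_where("land_sea", variables))  # ('variable', ['land', 'sea'])
```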

In any case, other coordinate variables may also implicitly restrict the portion of the cell considered by the statistical method. For example, the horizontal area of the ocean decreases with increasing depth. An area-mean as a function of depth in the ocean is therefore formed over different areas at different depths. This is not indicated explicitly in cell_methods. As described in Section 7.3.4 "Cell methods specified when there are no coordinates", a labeled axis of region may restrict the portion of a latitude-longitude gridbox to be considered.

If the method is mean, the cell_methods entry may take the form "mean where type1 [over type2]", in which each of type1 and type2 can be either land, sea, sea_ice, open_sea or all, where all means the entire area of the cell, or the name of a string-valued auxiliary coordinate variable or string-valued scalar coordinate variable with standard_name of area_type. A cell_methods entry of the form "mean where type1" indicates the mean is calculated by summing over the type1 portion of the cell and dividing by the area of this portion. A cell_methods entry of the form "mean where type1 over type2" indicates the mean is calculated by summing over the type1 portion of the cell and dividing by the area of the type2 portion. A cell_methods entry of the form "mean where all over type2" indicates the mean is calculated by summing over the entire cell and dividing by the area of the type2 portion.

Example. Thickness of sea-ice and snow on sea-ice averaged over sea area.

variables:
  float sea_ice_thickness(lat,lon);
    sea_ice_thickness:cell_methods="area: mean where sea_ice over sea";
    sea_ice_thickness:standard_name="sea_ice_thickness";
    sea_ice_thickness:units="m";
  float snow_thickness(lat,lon);
    snow_thickness:cell_methods="area: mean where sea_ice over sea";
    snow_thickness:standard_name="lwe_thickness_of_surface_snow_amount";
    snow_thickness:units="m";

In the case of sea-ice thickness, "where all" would have the same effect as "where sea_ice", since the integral over all area of sea-ice thickness is obviously the same as the integral over sea-ice area only. In the case of snow thickness, they differ because "where sea_ice" excludes snow on land from the average. Omitting the "over" phrase would mean that both quantities would be averages over the entire cell, not just the sea area.
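The difference between "where sea_ice" and "where all" for the two quantities can be checked with a small worked example. The numbers and the helper `mean` are invented for illustration; every mean here is taken "over sea".

```python
# Areas (m2) of the disjoint portions of one hypothetical grid cell.
AREA = {"land": 50.0, "sea_ice": 10.0, "open_sea": 40.0}

# Area-integrals (m3) of each quantity over each portion.
ICE_VOLUME = {"land": 0.0, "sea_ice": 20.0, "open_sea": 0.0}
SNOW_VOLUME = {"land": 5.0, "sea_ice": 1.0, "open_sea": 0.0}

SEA = ("sea_ice", "open_sea")
ALL = tuple(AREA)


def mean(integral, where, over):
    """Integrate over the 'where' portions and divide by the 'over' area."""
    return sum(integral[p] for p in where) / sum(AREA[p] for p in over)


# Sea-ice thickness: there is no sea ice outside the sea-ice area,
# so both choices of "where" give the same value.
print(mean(ICE_VOLUME, ("sea_ice",), SEA))   # where sea_ice over sea -> 0.4
print(mean(ICE_VOLUME, ALL, SEA))            # where all over sea     -> 0.4

# Snow thickness: "where all" also picks up the snow lying on land.
print(mean(SNOW_VOLUME, ("sea_ice",), SEA))  # where sea_ice over sea -> 0.02
print(mean(SNOW_VOLUME, ALL, SEA))           # where all over sea     -> 0.12
```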

  9. Replace the two bullets near the end of the current 7.3, in the newly numbered subsection 7.3.4, introduced by the paragraph beginning "The convention of specifying ...", with the following expanded text:
  • If the cell coordinate range cannot be precisely defined. For example, the Levitus ocean climatology uses any data that exists. It is a time mean but the time range is not well defined, so cannot be stated. This could be indicated by a cell_methods entry of time: mean, where time appears as a standard_name, and there is no dimension or coordinate variable with this name.
  • For convenience, if the cell extends over all valid coordinates. This is permitted only for the standard names longitude and latitude, and for the word area, standing for the combination of longitude and latitude i.e. the horizontal coordinates. Methods specified for longitude and latitude are assumed to apply to the complete range of longitude and latitude respectively. Methods specified for area apply to the whole world. Note that these uses of cell_methods are allowed only if there are no coordinate or auxiliary coordinate variables of longitude, latitude or horizontal dimensions, respectively. If in addition the data variable has a dimension with a corresponding labeled axis that specifies a geographic region (Section 6.1.1, "Geographic Regions"), the implied range of longitude and latitude is the valid range for each specified region. For example, there could be a cell_methods entry of longitude: mean, where longitude appears as standard_name and is not the name of a dimension or coordinate variable. That would indicate a mean over all longitudes. But if in addition the data variable had a scalar coordinate variable with a standard_name of region and a value of atlantic_ocean, it would indicate a mean over longitudes that lie within the Atlantic Ocean, not all longitudes.
  10. Modify the first bullet of the section on 7.3 in the conformance document as follows:

The type of the cell_methods attribute is a string whose value is one or more blank separated word lists, each with the form

dim1: [dim2: [dim3: ...] ] method [where type1 [over type2]] [within|over days|years] [(comment)]

where brackets indicate optional words. The valid values for dim1 [dim2 [dim3 ...] ] are the names of dimensions of the data variable, names of scalar coordinate variables of the data variable, valid standard names, or the word area. The valid values of method are contained in Appendix E. The valid values for type1 and type2 are land, sea, sea_ice, open_sea, all or the name of a string-valued auxiliary or scalar coordinate variable with a standard_name of area_type. When the method refers to a climatological time axis, the suffixes for within and over may be appended.
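For anyone wanting to check entries against this form, here is a rough sketch of a parser for a single word list. It is not a conformance checker: the regular expression and the function name are my own, it handles one entry at a time, it does not validate the method against Appendix E, and it assumes no area_type variable is called days or years, since "over" is also used in the climatological suffix.

```python
import re

# One word list of the form:
# dim1: [dim2: ...] method [where type1 [over type2]] [within|over days|years] [(comment)]
ENTRY = re.compile(
    r"^(?P<names>(?:\w+:\s+)+)"
    r"(?P<method>\w+)"
    r"(?:\s+where\s+(?P<type1>\w+)"
    r"(?:\s+over\s+(?P<type2>(?!days\b|years\b)\w+))?)?"
    r"(?:\s+(?P<clim>(?:within|over)\s+(?:days|years)))?"
    r"(?:\s+\((?P<comment>[^)]*)\))?\s*$"
)


def parse_entry(entry):
    """Split one cell_methods word list into its parts, or raise ValueError."""
    match = ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"not a recognised cell_methods entry: {entry!r}")
    return {
        "names": [name.rstrip(":") for name in match.group("names").split()],
        "method": match.group("method"),
        "where": match.group("type1"),
        "over": match.group("type2"),
        "climatology": match.group("clim"),
        "comment": match.group("comment"),
    }


print(parse_entry("area: mean where sea_ice over sea"))
print(parse_entry("time: mean over years"))
print(parse_entry("lat: lon: mean (interval: 1 hour)"))
```

Because "over type2" is only tried after a where-phrase has matched, "time: mean over years" is read as a climatological suffix rather than as an area type.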

  11. Add a recommendation for section 7.3 in the conformance document as follows: "If a data variable has any dimensions or scalar coordinate variables referring to horizontal, vertical or time dimensions, it should have a cell_methods attribute with an entry for each of these spatiotemporal dimensions or scalar coordinate variables. (The horizontal dimensions may be covered by an area entry.)"

comment:16 Changed 6 years ago by lowry

Jonathan,

I have been using the term 'water column' in our vocabularies to identify any body of salt or fresh water. The term 'hydrosphere' has been used for the same purpose elsewhere.

I would welcome CF standard names embracing the concept of 'water column' under any label as it would firm up the mappings between the BODC vocabularies being used by SeaDataNet and CF (although it would weaken mappings between CF and GCMD who clearly differentiate between salt and fresh water). However, although using the word 'sea' for this concept would overcome legacy issues I worry about the potential for confusion.

comment:17 follow-up: Changed 6 years ago by Heinke

Dear Jonathan,

I would like to limit the number of "special" types

I agree totally.

(or do you want to distinguish glaciers from ice-sheets?).

Yes, because we need it for the albedo.

In your example of soil respiration, do you mean that the respiration rate
may be distinguished between areas of ground beneath the canopy, and areas
that can see the sky? In that case, I agree, we would need canopy as an
area_type.

This is correct, they distinguished ground and canopy because the canopy
changes in time and space.

In principle all the same
quantities are needed for rivers and lakes as for sea, and it would be ugly
to copy them all, I think, especially as the distinction may not always be
clear and could be inconvenient to draw among river, lake and sea. Can we
understand "sea" to include river and lake? I can't think of a word which
means "sea, river or lake"!

I think we are only beginning to model river and lake temperature.
I followed the discussion about hydrosphere. I would like to discuss it with some people
in my institute. Maybe we will have an answer.

Best wishes

Heinke

comment:18 in reply to: ↑ 17 Changed 6 years ago by jonathan

Dear Heinke

(or do you want to distinguish glaciers from ice-sheets?).

Yes, because we need it for the albedo.

OK: we can have land_ice, glacier, ice_sheet.

In your example of soil respiration, do you mean that the respiration rate
may be distinguished between areas of ground beneath the canopy, and areas
that can see the sky? In that case, I agree, we would need canopy as an
area_type.

This is correct, they distinguished ground and canopy because the canopy
changes in time and space.

OK, we can include canopy. I suppose that "canopy" means that leaf area index is not zero - is that right? There is no canopy if there is grass alone. If grass is included too, there is no distinction between areas with vegetation and areas with canopy, and in that case we don't need both.

Cheers

Jonathan

comment:19 Changed 6 years ago by jonathan

I would like to modify the proposal to use the phrase ice_free_sea instead of open_sea as the "opposite" of sea_ice, as I think it would be clearer.

We have discussed the issue of how to name "river", "lake" and "sea" together on the email list. This is not resolved. It's not a new issue and it does not affect the logic of the present proposal, so I would suggest that we stick to sea for this proposal, and deal with that problem subsequently.

Jonathan

comment:20 follow-up: Changed 6 years ago by Heinke

Dear Jonathan,

OK, we can include canopy. I suppose that "canopy" means that leaf area
index is not zero - is that right? There is no canopy if there is grass
alone. If grass is included too, there is no distinction between areas
with vegetation and areas with canopy, and in that case we don't need
both.

The model has different types of vegetation. One type is grass.
But for the model output they don't distinguish between grass and the other
types directly. Processes in the model need the different types.
Grass has leaves, which are part of the leaf area index.
So, they can live with the term vegetation alone.

Cheers
Heinke

comment:21 in reply to: ↑ 20 Changed 6 years ago by jonathan

Dear Heinke

Grass has leaves, which are part of the leaf area index.
So, they can live with the term vegetation alone.

OK, thanks for that clarification. However, we can revisit this if necessary. The values of area_type are a matter for discussion on the email list as part of a "controlled vocabulary", and don't affect the convention itself.

Best wishes

Jonathan

comment:22 Changed 6 years ago by taylor13

Dear all,

As moderator of this ticket (#17) and with the intention of closing this ticket, I have finally studied all of the discussion that went on. The official period for discussion expired long ago, and no strong objections were expressed concerning the final form of the proposal, summarized in Jonathan’s posting of 2/2/08. I would like Jonathan to consider, however, responding to a few points before we actually adopt the proposal. Also, I have taken the liberty of editing the text modifications suggested by Jonathan to improve clarity and readability (I hope). At http://cf-pcmdi.llnl.gov/trac/wiki you can find the word document with the marked up text of section 7.3, first with Jonathan’s changes according to his 2/2/08 posting, and then in a different color, my editing. [The editing was done with “track changes” turned on. To download click on cell_methods.doc, and then click on "download the file".] The word document addresses points 4-9 of Jonathan’s posting. If you have editorial suggestions, please send them to me (or post them).

Jonathan nicely incorporated most of the suggestions that came out of the CF discussion of this ticket. I think the final proposal is therefore in very good shape. I only think it would be good to reconsider the merits of Alison’s suggestion to allow use of all the standard values of area_type in the “second convention” discussed in the new section 7.3.3 (instead of restricting these to land, sea, sea_ice, or ice_free_sea). Jonathan argued that he didn’t want to have to point to an external list from the CF standard document. We expect the list to grow as new needs are identified, so I have some sympathy. But I think having different lists apply for the first convention and the second convention complicates things and will more likely lead to user error in recording the information. I therefore suggest that the current list of permissible values be extended to include: land, sea, sea_ice, ice_free_sea, cloud, clear_sky, vegetation, bare_ground, land_ice, and all.

There was discussion of adding at least three other area_type values: canopy, glacier, and ice_sheet. If we later reach a consensus concerning the need for these, they could be added.

Concerning the other modifications summarized in Jonathan’s 2/2/08 posting, there seems to be no objection.

Best wishes,
Karl

comment:23 Changed 6 years ago by jonathan

Dear Karl

Thanks for reviewing the ticket and studying the proposal. For completeness I note that points 1-4 and 10-11 of my earlier posting should also be included in the changes to be made if this proposal is accepted. Point 2 modifies section 3.3 of the CF standard, point 4 modifies section 6.1, points 10 and 11 modify the conformance document, and points 1 and 3 modify the standard name table. Your word doc covers the other points 5-9, which all modify section 7.3. I have uploaded a modified version of your word doc to http://cf-pcmdi.llnl.gov/trac/wiki.

I am generally happy with your edits. I have reinstated a couple of bits you deleted, with comments; perhaps you could revisit these. In other places I have tried a different wording to meet the difficulties you pointed out.

I included the sentence about identification of horizontal coordinate variables in 7.3.1 because an application may wish to find out which ones they are. Normally cell_methods names the dimensions explicitly, but if cell_methods says "area", the application has to use other rules for identifying horizontal coordinate variables. I have reinstated the sentence you removed, but modified the text a bit more in an attempt to make the purpose of it more obvious. Is it any better?

I am still not happy about allowing a longer list for possible area-types in the shorthand convention. If we did that, it would lead to future debates about what should be added to that list. Such debates could be time-consuming and difficult to settle with no clear criterion, and also such a list requires modification of the convention document, which is a harder procedure. The list I proposed of land, sea, sea_ice and ice_free_sea reflects the phrases that are currently used in where-clauses in existing standard names which will be removed under the proposal. However, I see your point about possible confusion between the two conventions. My alternative proposal would be to allow *any* permitted value of the area_type standard name in the shorthand convention. It would not then be just for the commonest area-types, but could be used for any case where a single area-type was required. I am not entirely happy about this alternative because it amounts to including coordinate value information in cell_methods, where I don't think it is really appropriate, but I wonder what you think of it.

Best wishes

Jonathan

Changed 6 years ago by taylor13

#17 cell_methods text changes (word format)

comment:24 Changed 6 years ago by taylor13

Only Jonathan has responded to my last posting for ticket #17, but I have been working to improve the wording in several places (with considerable help from Jonathan). As this rewording hasn’t affected the substance of the proposed changes, I have not posted my correspondence with Jonathan. The result of this work is available in the form of a marked-up word file (also turned into a pdf file) at http://cf-pcmdi.llnl.gov/trac/wiki. Changes relative to the current CF standard are shown. Those highlighted in yellow are needed to implement this ticket. The other changes are an attempt to improve wording or in a few cases correct obvious mistakes in the original text. If there are no objections posted by July 10, the proposed changes will be accepted and this ticket will be closed.

There is one substantive change from the previous posted version -- namely that for the common case of data for a single area_type (e.g., surface temperature for land only), the proposal is to allow any of the valid area types be specified in the construction “name: method where type”, rather than limiting type to “land”, “sea”, “sea_ice”, or “ice_free_sea”, which was in the most recent version of the proposal. Alison originally proposed this extension, and Jonathan and I are now convinced that this will make it less confusing for the data writers.

Another somewhat important change is that I suggest that “all_area_types”, rather than “all”, should be used to indicate that all types of area have been included in the analysis. The reason for replacing “all” with a more specific descriptor (i.e., “all_area_types”) is to anticipate that we might in the future want to extend the cell_methods convention to describe other types of partial cells [perhaps, e.g., geographic regions (like “where atlantic ocean”), or certain “flag” types (like “where quality_good”)]. In these cases, we might also like to indicate “where all”, but this would be confusing if “all” could refer also to area_type. To distinguish among potentially several different “where” constructs, I propose that we use “all_area_types” now (which will allow us to distinguish it unambiguously from “all_regions” or “all_flags”, if we want to add options like these in the future). This will make any future extensions less problematic. [Jonathan’s views may differ on this.]

Finally, I would prefer that we carefully consider the initial list of valid values for the official types of area (like “land”, “sea”, “vegetation”, etc.). These are not standard_names but a restricted list of strings that can be assigned to a variable identified by the standard_name “area_type”. Since new valid strings of this sort can be added through the standard name procedure (which isn’t too onerous), I suggest we be conservative and initially approve only the following list: “land”, “sea”, “sea_ice”, “ice_free_sea”, “land_ice”, “snow”, “cloud”, “clear_sky”, “vegetation”, “bare_ground”, and “all_area_types”. This list omits “glacier”, “ice_sheet”, and “canopy”. The distinction between glacier, ice_sheet, and land_ice seems pretty subtle. This is just a conservative suggestion, which I don’t feel that strongly about. In any case, it shouldn’t hold up the implementation of this ticket if there is disagreement.

For completeness, I note that points 1-4 and 10-11 of Jonathan’s earlier posting should also be included in the changes to be made if this proposal is accepted. Point 2 modifies section 3.3 of the CF standard, point 4 modifies section 6.1, points 10 and 11 modify the conformance document, and points 1 and 3 modify the standard name table.

comment:25 in reply to: ↑ 15 Changed 6 years ago by jonathan

Dear all

I am grateful to Karl for working on this proposal. I have posted a slightly altered version at http://cf-pcmdi.llnl.gov/trac/wiki, in which I have

  • corrected a typo.
  • changed 7.3.4 so that it is not highlighted, except for the parts about the new area syntax, because that is the only substantive change to the subsection; the rest is a rewriting of the existing convention.

In the pdf version (both Karl's and mine) some underscores in yellow parts are not evident, but they are there in the doc version.

I propose that if this change is adopted, only the highlighted parts should appear as provisional in the new edition of the CF conventions document. As Karl explains, the rest are not substantive changes. They could have been proposed in a separate ticket as a "defect" rather than an "enhancement", in which case they would take immediate effect, as they don't change the standard in effect.

Karl has proposed to omit glacier and ice_sheet from the initial list of area_type values. I included them at Heinke's request, and Heinke might comment. But I agree with Karl that we shouldn't hold up the implementation whether or not we include them, because they are a vocabulary change, not a change to the convention.

In the original proposal I had all, which Karl has changed to all_area_types. I still prefer all on grounds of simplicity. I think that if we decided to broaden the use of where later, we could rename all if necessary. However, I am more concerned to have an agreement as soon as we can than to hold out over this detail!

Mostly for completeness and convenience, I repeat the rest of the changes below. The last two points below, on the changes to the conformance document, are different from the earlier versions, because of the changes to the proposal.

Cheers

Jonathan

  1. Delete the existing standard names with where-phrases, making them aliases of names without the where-phrases:
precipitation_flux_onto_canopy_where_land
surface_net_downward_radiative_flux_where_land
surface_snow_thickness_where_sea_ice
surface_temperature_where_land
surface_temperature_where_open_sea
surface_temperature_where_snow
surface_upward_sensible_heat_flux_where_sea
water_evaporation_flux_from_canopy_where_land
water_evaporation_flux_where_sea_ice

The standard name precipitation_flux_onto_canopy is an addition to the table. The others exist already.

  1. In Section 3.3, add the following to the end of the paragraph which begins "The standard name table is located at":

Some standard names (e.g. region and area_type) are used to indicate quantities which are permitted to take only certain standard values. This is indicated in the definition of the quantity in the standard name table, accompanied by a list or a link to a list of the permitted values.

  1. Define a new standard_name of area_type, to indicate distinctions of horizontal area within cells. Make the existing standard names of land_cover and surface_type into aliases of area_type. Variables with standard_name of area_type must have one of the values listed in a special table. The initial list of values in that table is: land sea sea_ice ice_free_sea land_ice snow cloud clear_sky vegetation bare_ground all_area_types. Additions to this table will be requested on the CF email list.
  1. To provide for greater use of string-valued auxiliary coordinate variables, especially string-valued scalar coordinate variables:
  • To the end of the first paragraph of 6.1, append: Other purposes for string identifiers are also described in Section 6.1.1, "Geographic Regions", and Section 7.3.3, "Statistics applying to portions of cells".
  • To the end of the second paragraph of 6.1, append: If a character variable has only one dimension (the length of the string), it is regarded as a string-valued scalar coordinate variable, analogous to a numeric scalar coordinate variable (Section 5.7).
  • Modify the section on 6.1 in the conformance document to read: A variable of character type that is named by a coordinates attribute is a label variable. This variable must have one or two dimensions. The trailing (CDL order) or sole dimension is for the maximum string length. If there are two dimensions, the leading dimension (CDL order) must match one of those of the data variable.
  1. Modify the first bullet of the section on 7.3 in the conformance document as follows:

The type of the cell_methods attribute is a string whose value is one or more blank separated word lists, each with the form

dim1: [dim2: [dim3: ...] ] method [where type1 [over type2]] [within|over days|years] [(comment)]

where brackets indicate optional words. The valid values for dim1 [dim2 [dim3 ...] ] are the names of dimensions of the data variable, names of scalar coordinate variables of the data variable, valid standard names, or the word area. The valid values of method are contained in Appendix E. The valid values for type1 are the name of a string-valued auxiliary or scalar coordinate variable with a standard_name of area_type, or any string value allowed for a variable of standard_name of area_type. If type2 is a string-valued auxiliary coordinate variable, it is not allowed to have a leading dimension (the number of strings) of more than one. When the method refers to a climatological time axis, the suffixes for within and over may be appended.

  1. Add recommendations for section 7.3 in the conformance document as follows

If a data variable has any dimensions or scalar coordinate variables referring to horizontal, vertical or time dimensions, it should have a cell_methods attribute with an entry for each of these spatiotemporal dimensions or scalar coordinate variables. (The horizontal dimensions may be covered by an area entry.)

Except for entries whose cell method is point, all numeric coordinate variables and scalar coordinate variables named by cell_methods should have bounds or climatology attributes.
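
As an illustration of the entry form quoted above for the conformance document, here is a minimal Python sketch that splits a single cell_methods entry into its parts. The function name parse_entry and the regular expression are assumptions made for this example; they are not part of the CF documents, and the sketch handles one entry at a time rather than a full blank-separated list of entries.

```python
import re

# Pattern for ONE cell_methods entry, following the quoted form:
#   dim1: [dim2: ...] method [where type1 [over type2]] [within|over days|years] [(comment)]
# This regular expression is an illustrative assumption, not an official grammar.
ENTRY = re.compile(
    r"^\s*(?P<names>(?:\w+:\s+)+)"                              # one or more "name:" tokens
    r"(?P<method>\w+)"                                          # method, e.g. mean
    r"(?:\s+where\s+(?P<type1>\w+)"                             # optional "where type1"
    r"(?:\s+over\s+(?P<type2>(?!days\b|years\b)\w+))?)?"        # optional "over type2"
    r"(?:\s+(?P<clim>within|over)\s+(?P<period>days|years))?"   # climatological suffix
    r"(?:\s+\((?P<comment>[^)]*)\))?\s*$"                       # optional "(comment)"
)


def parse_entry(entry):
    """Split a single cell_methods entry into a dict of its parts."""
    match = ENTRY.match(entry)
    if match is None:
        raise ValueError(f"not a recognised cell_methods entry: {entry!r}")
    parts = match.groupdict()
    # "lat: lon: " becomes ["lat", "lon"]
    parts["names"] = [name.rstrip(":") for name in parts["names"].split()]
    return parts


example = parse_entry("area: mean where sea_ice over sea")
```

Here example holds names ['area'], method 'mean', type1 'sea_ice' and type2 'sea'. The sketch does not check that the method is listed in Appendix E or that type1 is a permitted area_type value.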

comment:26 Changed 6 years ago by taylor13

Thanks to Jonathan for proposing how to modify the "conformance" document to conform with the proposal.

Concerning the keyword "all" as opposed to "all_area_types". Jonathan stated: "I think that if we decided t[o] broaden the use of "where" later, we could rename "all" if necessary." I think this could only be done if we didn't want to use "all" for a more general purpose in the future. For example, if we were to generalize "where" to support individual regions as well as area_types, then we might want to support both "all_regions" and "all_area_types", and then allow logical constructions like "all_area_types and all_regions". It would be confusing to see "all and all_regions", and in fact one might want to designate "all = all_area_types and all_regions". This is why I prefer the more specific designation "all_area_types".

comment:27 follow-up: Changed 6 years ago by Heinke

Dear Jonathan,


Karl has proposed to omit glacier and ice_sheet from the initial list
of area_type values. I included them at Heinke's request, and Heinke
might comment. But I agree with Karl that we shouldn't hold up the
implementation whether or not we include them, because they are a
vocabulary change, not a change to the convention.

I agree with you. Let's discuss this later.

remarks to 7.3.3

should we use subheadings

name: method where type

name: method where type_var

area: mean where type1 [over type2]

Do we like to comment the situation e.g the portion of sea-ice is 0%.
The mean should not be '0'. We need a special value for this situation.

What do you think ?

Best regards
Heinke

comment:28 in reply to: ↑ 27 Changed 6 years ago by jonathan

Dear Heinke

should we use subheadings [in 7.3.3]

I find it clear enough without subheadings myself. I don't think that we use fourth level headings elsewhere.

Do we like to comment the situation e.g the portion of sea-ice is 0%.
The mean should not be '0'. We need a special value for this situation.

I agree the mean might not be meaningful in such a situation, but I would suggest that this is not necessarily a problem. You can imagine situations where this could happen. Suppose, for instance, that a land-surface model has a variable (area_type,lat,lon) which specifies some property of the surface e.g. albedo, or a vegetation parameter, for different area types as a function of geographical position, in a model with interactive vegetation. In such a model at any given time some of the possible area_types might actually have zero area in a given gridbox, but it is still meaningful to define what properties they would have if present.

Best wishes

Jonathan

comment:29 Changed 6 years ago by taylor13

I agree with Jonathan that sometimes even though some quantity might never be associated with a certain surface type, it still might be appropriate to set the quantity to 0.0 over that surface.

For example, for the CMIP3 archive, we advised the modeling groups that:
soil_moisture_content might be either set to 0.0 or flagged as "unavailable" (i.e., missing) over grid cells with ocean. Either of these would accurately reflect the situation. Similarly, we suggested that surface_snow_thickness should be set to 0.0 in the absence of any snow, regardless of surface type (even though snow depth could never be non-zero over ice-free ocean).

In contrast, one might require that soil_temperature be flagged as "unavailable" (i.e., missing) over oceans (rather than setting it to 0.0), since soil_temperature cannot be defined in a region where there is no soil.

comment:30 Changed 6 years ago by Heinke

Sorry, I think that my sea-ice example was not so clear.
How can we distinguish between a 'real' mean of '0', for example for
eastward_sea_water_velocity (values are negative and positive), and
a value that is unavailable and set to '0'?
This is not a problem for exclusively positive values (e.g. surface_snow_thickness).
For further statistics (global mean ...) this must be distinguishable.
Our modellers are aware of this problem, but should we address
this in our description?
Maybe this is too mathematical for the documentation. I only want to
discuss this problem.

Best regards
Heinke

comment:31 Changed 6 years ago by taylor13

Yes, this is potentially confusing. eastward_sea_water_velocity is somewhat similar to my example of soil_temperature. Sea_water velocity cannot be defined for cells without any sea, so I would recommend it be flagged as "missing". Nevertheless, if one had the land-sea mask available for the model, then the mask could be used to eliminate land values no matter what value was assigned to them before computing, say, a global mean.

I'm not sure we should legislate it, but I would hope that where a variable is undefined (because the area type it refers to is missing), then it should be identified as "missing" (rather than storing a 0.)
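
To make the difference between "missing" and a stored 0 concrete, here is a small Python sketch. The helper name area_weighted_mean, the four-cell grid and the velocity values are all invented for illustration; missing data is represented by None rather than by any particular netCDF fill value.

```python
def area_weighted_mean(values, areas):
    """Area-weighted mean that skips cells whose value is None (missing)."""
    total = 0.0
    weight = 0.0
    for value, area in zip(values, areas):
        if value is None:
            continue  # undefined here, so it contributes neither value nor area
        total += value * area
        weight += area
    if weight == 0.0:
        return None  # no defined cells at all
    return total / weight


# Hypothetical eastward_sea_water_velocity (m s-1) on four equal-area cells;
# the last two cells contain no sea.
areas = [1.0, 1.0, 1.0, 1.0]
flagged = [0.2, -0.1, None, None]   # undefined cells flagged as missing
zeroed = [0.2, -0.1, 0.0, 0.0]      # undefined cells stored as 0

mean_flagged = area_weighted_mean(flagged, areas)
mean_zeroed = area_weighted_mean(zeroed, areas)

# With a land-sea mask the zeroed field can be repaired before averaging.
sea_mask = [True, True, False, False]
masked = [v if is_sea else None for v, is_sea in zip(zeroed, sea_mask)]
mean_masked = area_weighted_mean(masked, areas)
```

In this example the zeroed field gives a mean half as large as the flagged one, because the two cells without sea are counted as real zeros. Applying the mask first recovers the flagged result, which is the point made above about having the land-sea mask available.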

comment:32 Changed 6 years ago by apamment

Dear Jonathan, Karl and Heinke

I have finally caught up with the discussion of this ticket and I'd like to add a few further comments. I think the expanded text of section 7.3 goes a long way towards improving the clarity and understandability of this part of the conventions. Also, I welcome the proposal to introduce a standardized list of area_types. This is something that has been raised in standard names discussions in the past and I think there is a definite need for it. I tend to agree with Karl's proposal to use the "all_area_types" string rather than simply "all". I think that changing the meaning of "all" at a later date may give rise to a lot of confusion.

I have some specific comments on Jonathan's latest version of the document at http://cf-pcmdi.llnl.gov/trac/wiki.

1) Example 7.4. It is proposed to add a new paragraph immediately before example 7.4 which recommends that cell_methods be specified explicitly, even where the default interpretation applies. However, in the example the p and precip variables use the default cell_methods without stating them. Do you think that the example should be updated to reflect the new recommendation?

2) Subsection 7.3.2, paragraph 2. In the new subsection 7.3.1 it is proposed that the new name of "area" be introduced as the recommended way of indicating variation over a combination of horizontal axes. Indeed it was my suggestion to make it a recommendation, but I am now wondering if this is the best thing to do. In 7.3.2 para 2 the text goes on to say that original intervals, say in lon and lat, can be matched to the relevant axes by position. However, I think that's no longer possible using the "area" name. I suggest that this needs some clarification, for example, perhaps "area" should be the recommended name except when intervals need to be specified. Alternatively, we could drop the recommendation to use "area" and go back to Jonathan's original proposal which introduced it as a convenient, optional shorthand.

3) Subsection 7.3.2, paragraph 3. The first sentence says "... the non-standardized follows the standardized information and the keyword comment:" but the examples in the next sentence seem to imply that "comment:" only needs to be included when standardized information is present. Perhaps the text should point this out a bit more explicitly.

4) Subsection 7.3.3, paragraphs 2, 3 and 4. As already mentioned, I like the proposal to standardize the area_type strings.

Paragraph 2, sentence 2, says that "type" may be any of the strings permitted in a variable with standard name of area_type. Does this imply that "type" may also take other values and, if so, what are they?

I can't help wondering if there is really a need to state the main thrust of this proposal as two separate conventions. I think they could be drawn together into a single convention, albeit with a special case for area means. The cell_methods string for all methods could be written as: "name: method where type" and type would be either (a) a standardized area_type string or (b) an auxiliary coordinate variable or scalar coordinate variable with the standard_name of area_type. If and only if the method is "mean" and if and only if the mean is calculated using a different portion of the grid cell in the numerator and denominator the cell_methods syntax would be: "mean where type1 over type2" where, again, type1 and type2 are either a standardized string or a variable with standard_name of area_type. I realize that to state the convention in this way might require some re-ordering of the text and examples, but to me it seems to be a clearer and more concise description of the proposal.

Best wishes
Alison

comment:33 Changed 6 years ago by taylor13

Dear Alison,

Thanks for your careful reading and comments concerning the proposed document. I respond to your enumerated points below:

  1. I like this suggestion (to modify the example to reflect the recommendation).
  1. I think a user will not be unduly influenced by our recommendation if he really wanted to specify the interval. So I'm not sure it is necessary to explicitly include the exception you point out.
  1. Following your suggestion, I would propose that this paragraph be modified to read:

"Non-standardized information can also be included within parentheses. For instance, an area-weighted mean over latitude could be indicated as lat: mean (area weighted). If there is both standardized and non-standardized information, the non-standardized information is immediately preceded by the keyword comment:, which should follow the standardized information. For instance, to indicate both the sampling interval and the weighting method, a mean over latitude is indicated as lat: mean (interval: 1 degree_north comment: area-weighted)."

  1. The proposal is that "type" must be one of the strings permitted for area_type. To make this clear, I propose to revise the sentence to read:

"Here name could, for example, be area, and type must be one of the strings permitted in a variable with a standard_name of area_type."

Concerning the possibility of combining the two conventions into one, why don't you wait until we hear from at least Jonathan, but I think there is some merit in your suggestion. Of course the actual text would have to be written before deciding whether or not it simplified things.

As an alternative toward possible simplification, I have recently thought about whether or not we shouldn't treat "area_type" more like "region". That is, currently a cell method for area is assumed to apply to the entire cell unless an auxiliary coordinate variable (or a scalar coordinate variable) with standard_name "region" has been defined, in which case the cell method applies to only that specified region. Note that in this case the cell_method does not explicitly indicate the "region" subdomain. Without much complication, we could generalize the present convention to apply to area_types. Thus, an auxiliary coordinate variable (or a scalar coordinate variable) with standard_name of "area_type" could also limit the domain to which the cell_method is applied (without noting that fact in the cell_method). This would completely duplicate the purpose of the "second convention" (which we would therefore not bother with). Concerning the "first convention", we could also choose to eliminate it as an option, again for consistency with the treatment of "region". A difficulty with this idea is that the information about the cell method might be less obvious to a human. But I don't think "area_type" and "region" are fundamentally different concepts, so why should they be treated so differently under the conventions?

Even if we unified the treatment of "region" and "area_type" in this way, we would still have to find a way to treat the case of "mean where type1 over type2". One idea would be simplifying the construction by omitting "where type1", since this could be specified through the coordinates attribute, as described in the previous paragraph. We could require that type2 be any one of the strings allowed for area_type. [There would be no advantage to allowing type2 to be alternatively specified by an auxiliary coordinate variable or a scalar coordinate because it cannot be dimensioned greater than one.] The specification of "mean over type2" (for a horizontal area) would indicate that the mean is calculated by summing the quantity (weighted by area) over the portion of the cell determined by the cell_bounds and possibly also by coordinates attributes (that specify a "region" and/or an "area_type"), and then dividing by the area of the type2 portion of the cell.
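
The arithmetic behind "mean where type1 over type2" can be sketched in a few lines of Python, using the sea-ice thickness case from the ticket description. The function name mean_where_over and all of the numbers are invented for this illustration.

```python
def mean_where_over(values, areas, area_of_type2):
    """Area-integral of a quantity over the type1 patches, divided by the area of type2."""
    integral = sum(value * area for value, area in zip(values, areas))
    return integral / area_of_type2


# Hypothetical grid cell: areas in m2, sea-ice thickness in m.
cell_area = 100.0
sea_area = 80.0
ice_areas = [10.0, 10.0]        # two patches of sea ice inside the cell
ice_thickness = [1.0, 3.0]      # thickness in each patch

# The same ice volume, divided by three different areas:
over_sea_ice = mean_where_over(ice_thickness, ice_areas, sum(ice_areas))  # "mean where sea_ice"
over_sea = mean_where_over(ice_thickness, ice_areas, sea_area)            # "mean where sea_ice over sea"
over_cell = mean_where_over(ice_thickness, ice_areas, cell_area)          # "mean where sea_ice over all_area_types"
```

The numerator is identical in all three cases; only the denominator changes, which is why the choice of type2 has to be recorded somewhere for the values to be interpreted correctly.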

Best regards,
Karl


comment:34 Changed 6 years ago by apamment

Dear Karl

I hadn't noticed the parallel between region and area_type until you pointed it out, but I can see the logic of your argument. However, I also agree with your comment that the information may be less obvious to a human reader if it is moved away from the cell_methods attribute. In particular, I am worried that splitting the 'mean where type1 over type2' information between cell_methods and a coordinate variable makes it less clear what is going on.

Having said that, I think that both your suggestion of using coordinate variables and Jonathan's original proposal to use cell_methods would work, so I could live with either.

Best wishes

Alison

comment:35 Changed 6 years ago by jonathan

Dear Alison and Karl

  1. I have modified the example as suggested.
  1. I would prefer to follow Karl's recommendation and not change this part. It is problematic as it stands. For instance, what if you had a combination of axes but wanted to specify the original intervals of only some of them? The existing convention is unsatisfactory and if someone has time they could propose how to tidy it up, but these problems are not caused by the present proposal, so I think that should be done in a separate ticket.
  1. Following Alison's suggestion, I have explicitly indicated that comment: should be omitted (that is what we say in the conformance document) if there is no standardised information. I note that this point too is a clarification, but not related to the purpose of the present proposal.
  1. I agree that the two conventions could be described at once, as different versions of the same thing for different purposes and with different restrictions. However, I don't think that would be clearer for the reader. I tend to feel that it is clearer to spell out the two distinct situations. They are only one paragraph each, so it is easy to see them together on the screen, and notice that they are analogous.
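
The rule for the comment: keyword discussed in point 3 above can be written as a short Python sketch. The function name format_extra and its parameters are hypothetical, and interval: is the only kind of standardized information the sketch considers.

```python
def format_extra(interval=None, comment=None):
    """Build the parenthesised part of a cell_methods entry.

    The keyword "comment:" is written only when standardized information
    (here, an interval) is also present.
    """
    if interval is None and comment is None:
        return ""
    if interval is None:
        return f"({comment})"                 # non-standardized text on its own
    if comment is None:
        return f"(interval: {interval})"      # standardized information only
    return f"(interval: {interval} comment: {comment})"


only_comment = "lat: mean " + format_extra(comment="area weighted")
both = "lat: mean " + format_extra("1 degree_north", "area-weighted")
```

With these inputs, only_comment is "lat: mean (area weighted)" and both is "lat: mean (interval: 1 degree_north comment: area-weighted)", matching the two examples in the revised paragraph.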

I have uploaded a new version to http://cf-pcmdi.llnl.gov/trac/wiki.

I agree that we could describe dependence on area_type purely by means of coordinate variables. In some ways it is analogous to the restriction of the domain imposed by any other coordinate information. However I think that the use of where in cell_methods is a good idea because

  • It draws the user's attention to this very important information.
  • The first convention (naming the area type in cell_methods) is a significant convenience (not needing a coordinate variable), and the analogy between the first and second conventions is attractive (as discussed above).
  • It puts where and over together, which I think is valuable, because there is much scope for confusion about the definition of means.

Hence I would like to leave it as it is. Changing now to removing where would be a substantially different proposal from the one we have been discussing all these months. I agree there is some analogy with region. If they are to be made more similar, I would prefer to put region in cell_methods too, as Karl has suggested. It would be useful to draw the user's attention to that as well. They are not entirely analogous, however, because

  • region can be used without any other horizontal coordinate information, while area_type specifies subsets of horizontal domains that must be indicated in some other way.
  • region can be larger or smaller than horizontal gridcells, whereas area_type always refers to a variation within horizontal gridcells.

This proposal is not about region, however, so please could we leave that to another ticket.

Best wishes

Jonathan

comment:36 Changed 6 years ago by taylor13

  • Resolution set to fixed
  • Status changed from new to closed

Dear all,

There has been no further discussion on this ticket for several months, so I declare it closed. Velimir, please could you make the changes required? There are changes in the standard document (to be marked as provisional) and the conformance document; the version number should be incremented. The changes to section 7.3 of the standards document can be copied from the latest document (July 2) accessible from:

http://cf-pcmdi.llnl.gov/trac/wiki.

Download the .doc (not the .pdf) because some of the highlighting got lost in translation.

Only the highlighted parts should appear as provisional in the new edition of the CF conventions document; the rest are corrections of defects.

Other changes that need to be made are [note: items 10 & 11 are changes that need to be made to the conformance document]:

  1. Delete the existing standard names with where-phrases, making them aliases of names without the where-phrases:

precipitation_flux_onto_canopy_where_land
surface_net_downward_radiative_flux_where_land
surface_snow_thickness_where_sea_ice
surface_temperature_where_land
surface_temperature_where_open_sea
surface_temperature_where_snow
surface_upward_sensible_heat_flux_where_sea
water_evaporation_flux_from_canopy_where_land
water_evaporation_flux_where_sea_ice

The standard name precipitation_flux_onto_canopy is an addition to the table. The others exist already.

  1. In Section 3.3, add the following to the end of the paragraph which begins "The standard name table is located at":

Some standard names (e.g. region and area_type) are used to indicate quantities which are permitted to take only certain standard values. This is indicated in the definition of the quantity in the standard name table, accompanied by a list or a link to a list of the permitted values.

  1. Define a new standard_name of area_type, to indicate distinctions of horizontal area within cells. Make the existing standard names of land_cover and surface_type into aliases of area_type. Variables with standard_name of area_type must have one of the values listed in a special table. The initial list of values in that table is: land sea sea_ice ice_free_sea land_ice snow cloud clear_sky vegetation bare_ground all_area_types. Additions to this table will be requested on the CF email list.
  1. To provide for greater use of string-valued auxiliary coordinate variables, especially string-valued scalar coordinate variables:
  • To the end of the first paragraph of 6.1, append: Other purposes for string identifiers are also described in Section 6.1.1, "Geographic Regions", and Section 7.3.3, "Statistics applying to portions of cells".
  • To the end of the second paragraph of 6.1, append: If a character variable has only one dimension (the length of the string), it is regarded as a string-valued scalar coordinate variable, analogous to a numeric scalar coordinate variable (Section 5.7).
  • Modify the section on 6.1 in the conformance document to read: A variable of character type that is named by a coordinates attribute is a label variable. This variable must have one or two dimensions. The trailing (CDL order) or sole dimension is for the maximum string length. If there are two dimensions, the leading dimension (CDL order) must match one of those of the data variable.
  1. Modify the first bullet of the section on 7.3 in the conformance document as follows:

The type of the cell_methods attribute is a string whose value is one or more blank separated word lists, each with the form

dim1: [dim2: [dim3: ...] ] method [where type1 [over type2]] [within|over days|years] [(comment)]

where brackets indicate optional words. The valid values for dim1 [dim2 [dim3 ...] ] are the names of dimensions of the data variable, names of scalar coordinate variables of the data variable, valid standard names, or the word area. The valid values of method are contained in Appendix E. The valid values for type1 are the name of a string-valued auxiliary or scalar coordinate variable with a standard_name of area_type, or any string value allowed for a variable of standard_name of area_type. If type2 is a string-valued auxiliary coordinate variable, it is not allowed to have a leading dimension (the number of strings) of more than one. When the method refers to a climatological time axis, the suffixes for within and over may be appended.

  1. Add recommendations for section 7.3 in the conformance document as follows

If a data variable has any dimensions or scalar coordinate variables referring to horizontal, vertical or time dimensions, it should have a cell_methods attribute with an entry for each of these spatiotemporal dimensions or scalar coordinate variables. (The horizontal dimensions may be covered by an area entry.)

Except for entries whose cell method is point, all numeric coordinate variables and scalar coordinate variables named by cell_methods should have bounds or climatology attributes.
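
For anyone implementing a checker, the rule for valid type1 values can be sketched in Python as follows. The names INITIAL_AREA_TYPES and classify_where_type are invented for this example, and checking variable names before the permitted strings is an assumption of the sketch; the text above does not say which takes precedence.

```python
# Initial list of permitted area_type strings, copied from the item above.
INITIAL_AREA_TYPES = frozenset(
    "land sea sea_ice ice_free_sea land_ice snow cloud clear_sky "
    "vegetation bare_ground all_area_types".split()
)


def classify_where_type(type1, area_type_variables=()):
    """Say how the word after "where" would be interpreted.

    area_type_variables: names of string-valued auxiliary or scalar coordinate
    variables that carry standard_name = "area_type" (supplied by the caller).
    Assumption: variable names are checked before the permitted strings.
    """
    if type1 in area_type_variables:
        return "variable"
    if type1 in INITIAL_AREA_TYPES:
        return "area_type string"
    raise ValueError(f"{type1!r} is neither an area_type variable nor a permitted string")


kind = classify_where_type("sea_ice")
```

Here kind is "area_type string". A value such as "glacier", which is not in the initial list, raises ValueError unless it is the name of a variable passed in area_type_variables.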

That's it!

Thanks to all concerned for your work on this one.

Karl
