$Date: 2003/03/07 08:42:43 $ TOYODA Eizi, NPD/JMA
NuSDaS (Numerical Prediction Standard Dataset System) is a data I/O library for meteorological gridded data developed at the Numerical Prediction Division, Japan Meteorological Agency.
This document describes the concept, data file structure, and definition file format. Consult the Reference Manual for usage of the C and Fortran interfaces.
'△' denotes a space character. Typewriter letters represent character strings that should be input to the computer literally; quotations from the computer, such as filenames, are also written in this style.
NuSDaS is designed to store gridded data in NWP (numerical weather prediction). It classifies various data and stores them in a structured directory. Since it is neither a multi-purpose I/O library (like netCDF or HDF) nor a database management system, the interface is specialized for use in NWP.
In NuSDaS, all data consist of records, each of which is a two-dimensional array of numbers. The numbers may be integers or floating-point numbers *1. In most cases, a record corresponds to grids on a horizontal plane. Records are identified by the following dimensions (see the description in the Reference Manual for the detailed computational expression):
data type: A 16-character *2 string that identifies the dataset.
base time: Analysis time, initial time of forecast, or map time of observation.
member: Member of an ensemble forecast.
valid time: Time to which the forecast or observation relates. Data for a time span (such as an average or accumulation) are identified using a pair of valid times.
plane: Location of the two-dimensional grid represented by the record. For example, "SURF△△" and "500△△△" denote the surface and the 500 hPa plane, respectively. In most cases, horizontal grids are identified by a vertical coordinate. Data for a layer are identified using a pair of planes.
element: Physical quantity name. For example, "T△△△△△" and "P△△△△△" denote temperature and pressure, respectively.
The elements of the record identifier are called `dimensions', following a metaphor that associates all accessible data with a single huge array. However, a storage design based on a simple array would be inefficient for these reasons:
Thus records are designed to be stored in a particular file in a particular directory determined by the data identifier. (The rule is described below.)
Extensions like the following are expected:
All data files are located in directories called NuSDaS root directories (hereafter NRD). An NRD has the following structure:
nusdas_def/
.def ... definition file
The NuSDaS interface searches for NRDs as directories named NUSDASnn, where nn is a number from 01 to 99. In operational suites of JMA, an NRD is by convention first named in lowercase letters ending with .nus (such as gf_fcst_p.nus), and a symbolic link to it (say NUSDAS30) is created.
An NRD may have many definition files. A definition file corresponds to a particular data type. It describes the subdirectory structure and can refer to many data files. Therefore there is an inclusion relation:
record ∈ data file ∈ definition file = data type ∈ NRD
Records are grouped into files by the following rules:
There are two formats for data files: the v1.0 File Format and the ES File Format. ES (Extended Storage) is a RAM filesystem in the HITACHI SR8000 system, and the ES File Format is used only on that machine.
A NuSDaS data file in v1.0 format is a sequence of records. Each record has the common structure shown in Table 1. All fields noted as 'integer' are written in big-endian byte order. Note that the format is similar to but DIFFERENT *3 from a sequential unformatted Fortran file.
Possible values of the 'kind of record' field are as follows:
NUSD: A record of this kind is the first mandatory record of a data file. It contains metadata for system administration purposes. The payload of the NUSD record is described in Table 2.
CNTL: A record of this kind is the second mandatory record of a data file. It contains metadata commonly defined for all NuSDaS datasets, such as identification (data type and base time), internal structure (lists of members, valid times, planes, and elements), and grid/geometry structure. The payload of the CNTL record is described in Table 3.
INDX: A record of this kind is the third mandatory record of a data file. The payload of the record is an array of 32-bit integers, which can be interpreted as byte positions of DATA records. The array is indexed by member, valid time, plane, and element: the i-th entry (for simplicity, with C-style indices starting at 0) gives the position of the DATA record for the e-th element, p-th plane, v-th valid time, and m-th member, where i = (e + E * (p + P * (v + V * m))), E is the total number of elements, P the total number of planes, and V the total number of valid times.
SUBC: A record of this kind is optional: there may be no SUBC record, or there may be many. SUBC records contain various metadata, including vertical grid information, time integration span, and radar operation status. The format of the record is determined by a 4-character field (called 'group name') at byte offset 16. See Record Format for details of SUBC records.
INFO: A record of this kind is optional: there may be no INFO record, or there may be many. INFO records contain user-defined metadata. The four bytes from byte offset 16 in the record are called 'group name' and are reserved for classification of INFO records. The rest of the record is not defined.
DATA: These records contain two-dimensional array data. The eight characters from byte offset 56 specify the encoding scheme. See Record Format for details of DATA records.
END: A record of this kind is the last mandatory record of a data file.
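The index arithmetic described for the INDX record can be sketched as follows. This is a minimal illustration of the formula i = (e + E * (p + P * (v + V * m))) only; the function name and the range checks are assumptions made for the example and are not part of the NuSDaS interface.

```python
def indx_position(m, v, p, e, V, P, E):
    """Return the 0-based INDX array index for member m, valid time v,
    plane p and element e (all 0-based), following
    i = e + E * (p + P * (v + V * m)).

    V, P and E are the total numbers of valid times, planes and elements.
    The range checks below are an addition for this sketch.
    """
    if m < 0:
        raise IndexError("member index must not be negative")
    for idx, total, name in ((v, V, "valid time"), (p, P, "plane"), (e, E, "element")):
        if not 0 <= idx < total:
            raise IndexError(f"{name} index {idx} out of range 0..{total - 1}")
    return e + E * (p + P * (v + V * m))
```

The element index varies fastest and the member index slowest, so all DATA positions belonging to one member occupy one contiguous run of the array.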
ES is a memory-based filesystem of the HITACHI SR8000. Since it has a special I/O interface, the NuSDaS interface had to support it separately. The file format for ES is largely different. It is a direct-access file consisting of fixed-length records, and has no NUSD, INDX, or END records. Exactly one SUBC and one INFO record must exist at the beginning of the file.
The author is not sure about further details of ES.
A NuSDaS definition file is a plain text file that describes the structure of a NuSDaS dataset. The definition file looks like free format. More precisely, the file is interpreted line by line. A line starting with a keyword (listed below, case insensitive) starts a statement. Following lines that do not begin with a keyword are continuation lines and are interpreted as one statement together with the starting line.
Statements can be omitted unless noted as 'mandatory'. There are restrictions on the order of the statements. Since they are not (and cannot easily be) documented, the author recommends writing statements in the order of the following description.
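The line-by-line interpretation described above can be sketched as follows. This is only an illustration, not the parser used by the library: the function name is hypothetical, the keyword set must be supplied by the caller, and the handling of whitespace (leading blanks ignored, words split on blanks, empty lines skipped) is an assumption that the text above does not confirm.

```python
def split_statements(text, keywords):
    """Group the lines of a definition file into statements.

    A line whose first word is a keyword (compared case-insensitively)
    starts a new statement; any other non-empty line is appended to the
    statement in progress. Each statement is returned as a list of words.
    """
    known = {k.lower() for k in keywords}
    statements = []
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue  # assumption: blank lines carry no meaning
        if words[0].lower() in known:
            statements.append(words)
        elif statements:
            statements[-1].extend(words)
    return statements
```

For example, with the keywords validtime1 and plane1, the text "validtime1 all_list 0 6" followed by a line "12 18" yields one statement of six words.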
Version of NuSDaS. If not omitted, it must be 1.0. In future versions of NuSDaS, there may be incompatible extensions to the definition file, and this statement will indicate which version of NuSDaS you are using.
Specifies the directory in which data files will be located. It is a path relative to the NRD. One of the following syntaxes is used for words.
The relative path is used as a template. See Pathname Expansion for special symbols. By default this style is assumed, with ``/_model/_attribute/_space/_time/_name'' as the template.
Equivalent to statements ``path relative_path /_3d_name'' and ``filename _validtime''.
Equivalent to statements ``path relative_path /_3d_name'' and ``filename _member''.
Equivalent to statements ``path relative_path /_3d_name/_member'' and ``filename _validtime''.
Equivalent to statements ``path relative_path /_3d_name/_basetime'' and ``filename _validtime''.
In this special case, internal file I/O will be done through ES interface, not by standard C library.
The name of the data file will be filename. See Pathname Expansion for special symbols. By default, _basename is assumed.
Specifies information on the creator of the data. It will be written in the NUSD record after prepending the user name and host name.
This statement cannot be omitted. Word _model is four name characters (alphabet, number, and underline) representing model name or creation process. Word _2d is two name characters representing horizontal grid name. Word _3d is two name characters representing vertical grid name. See Reference Manual for table of possible values.
This statement cannot be omitted. Word _attribute is two name characters representing data attribute. Word _time is two name characters representing time attribute. See Reference Manual for table of possible values.
This statement cannot be omitted. Word _name is four name characters. You can use an arbitrary name for this field; it affects neither the behavior of the library nor any conventional meaning. The name "STD1" is used for the most typical operational dataset.
Word n_dc is the number of members (1 is assumed by default). When inout is in, records for different members are stored in one file, and when inout is out, records for different members are stored in separate files.
Lists the members.
This statement is omitted in most cases. It specifies the base time. The format YYYYmmddHHMM is the same as %Y%m%d%H%M in UNIX date(1) or strftime(3).
This statement cannot be omitted. It specifies the number of valid times n_vt and unit, the unit of the numbers in the following validtime1 and validtime2 statements. Word unit should be one of min, hour, day, pen, mon, week, jun. When inout is in, records for different valid times are stored in one file, and when inout is out, records for different valid times are stored in separate files.
This statement cannot be omitted. Exactly one of the above two forms must appear. This statement specifies the list of the first part of the valid time, called valid1 in the Reference Manual. When the second word is arithmetic, valid1 is an arithmetic series with the specified initial value and step. When the second word is all_list, the following words are interpreted as the list of valid times. Usually the list is written in ascending order. All of the arguments initial, step, vt1, ... are in the unit declared in the preceding validtime statement.
If used, exactly one of the above two forms must appear. This statement specifies the list of the second part of the valid time, called valid2 in the Reference Manual. When the former form is used, the list of valid2 will be (vt1 + ft1), (vt2 + ft2), (vt3 + ft3), and so on. Usually the list is written in ascending order. When the latter form is used, the list of valid2 will be (vt1 + dt), (vt2 + dt), (vt3 + dt), and so on. All of the arguments dt, ft1, ft2, ... are in the unit declared in the preceding validtime statement. If this statement is omitted, the special value -1 is assumed as valid2.
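The rules for the valid1 and valid2 lists can be sketched as follows. The function names are hypothetical and chosen for this illustration; the arithmetic follows the description above, and applying the special value -1 to every entry when validtime2 is omitted is an assumption.

```python
def valid1_arithmetic(initial, step, n_vt):
    """valid1 as an arithmetic series of n_vt terms."""
    return [initial + i * step for i in range(n_vt)]


def valid2_from_offsets(valid1, offsets):
    """valid2 as (vt1 + ft1), (vt2 + ft2), ... for per-entry offsets."""
    if len(offsets) != len(valid1):
        raise ValueError("one offset is needed for each valid time")
    return [vt + ft for vt, ft in zip(valid1, offsets)]


def valid2_from_constant(valid1, dt):
    """valid2 as (vt1 + dt), (vt2 + dt), ... for a single offset dt."""
    return [vt + dt for vt in valid1]


def valid2_default(valid1):
    """valid2 when the validtime2 statement is omitted (assumed per entry)."""
    return [-1] * len(valid1)
```

With the unit hour, valid1_arithmetic(0, 6, 4) gives the valid times 0, 6, 12, and 18, and valid2_from_constant of that list with dt = 6 gives 6, 12, 18, and 24.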
This statement cannot be omitted. Specifies the number of planes.
This statement cannot be omitted. Specifies the list of first planes. The list should have n_lv items. Usually the list is written in ascending order of height. It looks like descending order if the pressure coordinate is used (e.g. SURF 1000 950 900 ...).
Specifies the list of second planes. The list should have n_lv items. If this statement is omitted, the same list as that in plane1 is assumed.
This statement cannot be omitted. Specifies the number of elements.
This statement cannot be omitted, and will appear n_el times. It describes where the element elemname is allowed to be written. See section Elementmap for details.
This statement cannot be omitted. It indicates that the number of grid points is nx in the X direction and ny in the Y direction. In most cases X is taken eastward and Y northward, although that depends on the coordinate system (_2d in the type1 statement) you use.
This statement indicates that the grid point numbered (ix, iy) is positioned at (lon, lat). Both ix and iy must be real numbers, lon must be a real number with 'E' or 'W' appended, and lat must be a real number with 'N' or 'S' appended. Note that this statement is used with the geographical meaning shown above even if the 2D grid is taken vertically. To describe vertical grid point locations, a SUBC record might be used.
Indicates the horizontal distance (in the X and Y directions) between adjacent grid points. The unit is degrees when the grid is a latitude-longitude grid, and meters when a map projection is applied. When the 2D grid is taken vertically, one of dx and dy is ignored. Note that the meridional grid distance dy is taken southward. It is positive in most JMA models: grid points with the smallest Y index are located at the northern end of the 2D grid. Conversely, if dy is negative, grid points with the smallest Y index are located at the southern end of the 2D grid.
Specifies the standard longitude/latitude. They are parameters of the map projection, and only some of them are used in some cases. Whether this statement is required depends on the horizontal grid style. See the following description of others.
Specifies the 3rd or 4th longitude/latitude. The meaning of the parameters depends on the projection. Whether this statement is required also depends on the horizontal grid style.
The Lambert conformal projection has 3 parameters; use "standard LoV Latin1 LoV Latin2", where LoV is the Y-axis longitude, and Latin1 and Latin2 are the first and second latitudes where the secant cone cuts the earth. In most cases at JMA, it looks like
standard 140.0E 30.0N 140.0E 60.0N
The polar stereographic projection has 2 parameters; use "standard LoV LaD 0E 0N", where LoV is Y-axis longitude, and LaD is the latitude where grid point distance is defined. In most cases of JMA, it looks like
standard 140.0E 30.0N 0E 0N
The Mercator projection has one parameter; use "standard 0E LaD 0E 0N", where LaD is the latitude where grid point distance is defined.
The Lambert conformal projection has 3 parameters; use "standard LoV Latin1 LoV Latin2" and "others LoP LaP RotAngE 0N", where LoV is the Y-axis longitude, (Latin1, Latin2) are the first and second latitudes where the secant cone cuts the earth, (LoP, LaP) is the longitude/latitude of the southern pole of the projection, and RotAng is the angle of rotation after projection. Unfortunately, practice at JMA has failed to write this parameter properly, and you may have data with zero-filled corresponding fields (as of 2003-03-07).
Since there are no projection parameters, the standard and others statements should not be written.
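For a latitude-longitude grid, the reference point and the grid distances are enough to locate any grid point. The following is a minimal sketch under stated assumptions: the function names are hypothetical, western longitudes and southern latitudes are represented as negative numbers (the text above does not fix a sign convention), and dy is counted southward as described for the grid distance.

```python
def parse_angle(word):
    """Convert a word such as '140.0E' or '30.0S' to signed degrees.

    Assumption for this sketch: 'W' and 'S' map to negative values.
    """
    suffix = word[-1].upper()
    if suffix not in ("E", "W", "N", "S"):
        raise ValueError(f"missing E/W/N/S suffix in {word!r}")
    value = float(word[:-1])
    return -value if suffix in ("W", "S") else value


def latlon_of_grid(ix, iy, base_ix, base_iy, base_lon, base_lat, dx, dy):
    """Longitude and latitude of grid point (ix, iy) on a lat-lon grid.

    (base_ix, base_iy) is the grid index of the reference point located at
    (base_lon, base_lat). dx is counted eastward and dy southward, so the
    latitude decreases with increasing iy when dy is positive.
    """
    lon = base_lon + (ix - base_ix) * dx
    lat = base_lat - (iy - base_iy) * dy
    return lon, lat
```

Because the formula is relative to the reference point, it does not depend on whether grid indices start at 0 or 1.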
Describes how gridded data represent the field. Word representation should be one of the following:
values at grid point. This is the default.
average over volume/area around grid point
representative value obtained with another method
Describes the encoding scheme to be used in DATA records. See the Reference Manual for a table of possible values. By default, 2PAC is assumed.
Describes how missing values are to be represented. Word miss_mode should be one of the following:
There is no method for missing value in this case. This is the default.
A certain value is the missing value, and grid points with that value should be regarded as missing.
Grid points with valid data are indicated with bitmap for each DATA record. See NUSDAS_MAKE_MASK() in Reference Manual for detail.
If the definition file has this statement, an INFO record will be written at the time of data file creation. It can be stated as many times as needed. The size and contents of the INFO record will be those of the file specified by the relative path filename. Word group should be a four-character name that identifies the INFO record.
If the definition file has this statement, SUBC records are allocated at the time of data file creation. Each SUBC record is specified with a pair of group (a four-character name that identifies the SUBC record) and size (the size of the SUBC record). Word num specifies the number of group-size pairs.
This statement is required if you use the ES interface. If the definition file has this statement, each record in the data file will have size bytes. Padding of (size - (payload size)) bytes is added after the record payload. An error occurs if a record exceeds the specified size. By default, records are laid out contiguously (without padding between the record payload and the 4-byte record trailer).
The pathname of a data file is determined by the path and filename statements in the definition file, after substituting the following keywords with values of the data identifier.
keyword | meaning |
_model | model name, first 4 characters of type1 |
_2d | 2D grid structure, 5th and 6th characters of type1 |
_3d | 2D grid positioning, 7th and 8th characters of type1 |
_attribute | first two characters of type2 |
_time | time attribute, last two characters of type2 |
_name | type3 |
_space | equivalent to '_2d_3d' |
_base | base time |
_valid | valid time |
_member | member |
Note that plane and element are not used in pathname expansion, since they cannot 'split' files. Similarly, using '_valid' or '_member' will cause malfunction if you declare 'valid ... in' or 'member ... in', respectively. On the other hand, if you declare 'valid ... out' or 'member ... out', you must use '_valid' or '_member', respectively, in the path or filename statements; otherwise data files for different valid times or members will collide (have the same names and may cause malfunction).
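The keyword substitution in the table above can be sketched as follows. This is an illustration only: the function name is hypothetical, the string forms of base time, valid time, and member are passed in by the caller because their formatting is not specified here, and matching longer keywords first is an assumption made to keep the substitution unambiguous.

```python
import re


def expand_template(template, data_type, base="", valid="", member=""):
    """Replace pathname keywords in template with parts of a data identifier.

    data_type is the 16-character type string: type1 (8 characters),
    type2 (4 characters) and type3 (4 characters) concatenated.
    """
    if len(data_type) != 16:
        raise ValueError("data type must have exactly 16 characters")
    type1, type2, type3 = data_type[:8], data_type[8:12], data_type[12:]
    values = {
        "_model": type1[:4],
        "_2d": type1[4:6],
        "_3d": type1[6:8],
        "_attribute": type2[:2],
        "_time": type2[2:],
        "_name": type3,
        "_space": type1[4:8],  # equivalent to '_2d_3d'
        "_base": base,
        "_valid": valid,
        "_member": member,
    }
    # Try longer keywords first so that no keyword shadows another.
    alternatives = sorted(values, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in alternatives))
    return pattern.sub(lambda match: values[match.group(0)], template)
```

With the default template ``/_model/_attribute/_space/_time/_name'' and the data type written as _GSMLLPP.FCSV.STD1 in period notation, the result is /_GSM/FC/LLPP/SV/STD1.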
Elementmap defines whether a certain element is allowed or not for a certain combination of member, valid time, and plane. To understand elementmap, first think of a bitmap of size M * V * P (or a Fortran logical array with DIMENSION(P, V, M)), where M, V, and P are the total numbers of members, valid times, and planes. For each bit, '1' declares that the element is allowed, and '0' that it is not. The elementmap written in the definition file is this bitmap in a kind of run-length-encoding (RLE) compression.
The syntax of elementmap is written in BNF as follows:
They are interpreted as follows:
The author admits that the rule above is far from easy for humans to understand. Indeed, the terms vtime_loop and member_loop are rarely used. If you are not sure, declare elements with contiguous_line. It will look like the following:
element 4
elementmap PSEA 0
elementmap T 0
elementmap U 0
elementmap V 0
Allowing too many data records does not increase data file size or slow down data access. Thus you can safely declare elements with 'no limitation' settings.
Records of a NuSDaS data file have a common beginning and ending (shown in Table 1). The following tables describe the PAYLOAD part.
Offset | Length | Type | Description |
byte | byte | ||
0 | 4 | integer | n: record size |
4 | 4 | character | kind of record |
8 | 4 | integer | m: payload size |
12 | 4 | integer | creation date and time in time_t value |
16 | m - 8 | --- | PAYLOAD of record; see Table 2--6 for detail |
8 + m | n - m - 8 | --- | padding; should be ignored |
n - 4 | 4 | integer | n: record size |
Note that the `Type' column is deliberately written in an unusual notation. The entries should NOT be directly interpreted as type names of a particular programming language such as C or Fortran.
character: Each byte value should be interpreted as a character code of ISO 646 IRV. The meaning of a byte whose MSB is set is currently undefined.
integer: A certain number (usually 4) of bytes represents a signed integer value. Negative values are represented in two's complement. Note that big-endian byte ordering is always used in NuSDaS data files.
unsigned integer: A certain number (usually 4) of bytes represents an unsigned integer value.
floating: Bits in 4 or 8 bytes compose an IEEE 754 floating-point value.
Some fields are arrays, which is indicated in C-like notation. For example, a field noted character [2][n_lv][6] is equivalent to the memory image of unsigned char [2][n_lv][6] in C or CHARACTER(LEN = 6), DIMENSION(N_LV, 2) in Fortran. However, the one-dimensional array notation '[size]' for scalar character fields is omitted for simplicity.
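The common record layout of Table 1 and the type notation above can be illustrated with a short reader. This is a sketch under stated assumptions, not the library's implementation: the function name is hypothetical, the whole file is assumed to be held in memory as a byte string, and the payload is taken as the m - 8 bytes starting at offset 16, exactly as Table 1 lists it. Padding is skipped by jumping to the next record using the record size n.

```python
import struct


def read_records(buf):
    """Yield (kind, creation_time, payload) for each record in buf.

    buf holds the bytes of a v1.0-format data file. All integers are
    read as big-endian signed 32-bit values.
    """
    pos = 0
    while pos < len(buf):
        # Offsets 0, 4, 8, 12: record size n, kind, payload size m, time.
        n, kind, m, ctime = struct.unpack_from(">i4sii", buf, pos)
        if n < 20 or pos + n > len(buf):
            raise ValueError(f"implausible record size {n} at byte {pos}")
        # The last 4 bytes of the record repeat the record size.
        (trailer,) = struct.unpack_from(">i", buf, pos + n - 4)
        if trailer != n:
            raise ValueError(f"record size mismatch at byte {pos}")
        payload = buf[pos + 16 : pos + 8 + m]
        yield kind.decode("ascii"), ctime, payload
        pos += n
```

Checking the trailing copy of the record size against the leading one is a cheap way to detect a truncated or misaligned file before interpreting any payload.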
Offset | Length | Type | Description |
byte | byte | ||
16 | 80 | character | creator host and user name. |
96 | 4 | integer | NuSDaS version: currently 1 |
100 | 4 | unsigned integer | total number of bytes in file |
104 | 4 | integer | number of records in file |
108 | 4 | integer | number of INFO records in file |
112 | 4 | integer | number of SUBC records in file |
Offset | Length | Type | Description |
byte | byte | ||
16 | 16 | character | data type |
32 | 12 | character | base time in format like "date +%Y%m%d%H%M" |
44 | 4 | integer | base time in sequential minutes from 1801-01-01T00:00Z |
48 | 4 | character | time unit for valid times |
52 | 4 | integer | n_dc: number of members |
56 | 4 | integer | n_vt: number of valid times |
60 | 4 | integer | n_lv: number of planes |
64 | 4 | integer | n_el: number of elements |
68 | 4 | character | map projection |
72 | 2 * 4 | integer [2] | number of grid points in X and Y directions |
80 | 2 * 4 | floating [2] | grid index of reference point |
88 | 2 * 4 | floating [2] | latitude/longitude of reference point |
96 | 2 * 4 | floating [2] | latitude/longitude distance between grid points |
104 | 2 * 4 | floating [2] | 1st STD latitude/longitude of map projection |
112 | 2 * 4 | floating [2] | 2nd STD latitude/longitude of map projection |
120 | 2 * 4 | floating [2] | 3rd STD latitude/longitude of map projection |
128 | 2 * 4 | floating [2] | 4th STD latitude/longitude of map projection |
136 | 4 | character | PVAL: representation method of grid |
140 | 2 * 4 | --- | reserved for future use of map projection |
148 | 6 * 4 | --- | reserved for future use |
172 | n_dc * 4 | character [n_dc][4] | list of member name |
(1) | n_vt * 8 | integer [2][n_vt] | list of valid time pair |
(2) | n_lv * 12 | character [2][n_lv][6] | list of plane pair |
(3) | n_el * 6 | character [n_el][6] | list of element name |
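The offsets marked (1), (2), and (3) in the table above follow from the lengths of the preceding variable-size lists, and the integer base time is a count of minutes from the start of 1801. Both can be sketched as below; the function names are hypothetical, and time values are handled as naive datetime objects taken to be in UTC.

```python
from datetime import datetime, timedelta

# Origin of the sequential-minute count used for the base time (UTC).
EPOCH = datetime(1801, 1, 1)


def minutes_to_datetime(minutes):
    """Convert sequential minutes from 1801-01-01 00:00 to a datetime."""
    return EPOCH + timedelta(minutes=minutes)


def datetime_to_minutes(when):
    """Convert a datetime to sequential minutes from 1801-01-01 00:00."""
    return (when - EPOCH) // timedelta(minutes=1)


def cntl_list_offsets(n_dc, n_vt, n_lv):
    """Byte offsets of the four variable-length lists in a CNTL record."""
    members = 172
    valid_times = members + 4 * n_dc      # offset (1)
    planes = valid_times + 8 * n_vt       # offset (2)
    elements = planes + 12 * n_lv         # offset (3)
    return {"members": members, "valid_times": valid_times,
            "planes": planes, "elements": elements}
```

For instance, with one member, two valid times, and three planes, the lists start at byte offsets 172, 176, 192, and 228.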
This kind of SUBC record is employed to describe vertical grid structure. You can get pressure by p[k] = b[k] * (p_surface - c) + a[k], where k is the index of vertical plane and p_surface the surface pressure.
Offset | Length | Type | Description |
byte | byte | ||
16 | 4 | character | "ETA△" or "SIGM" |
20 | 4 | integer | number of planes |
24 | (n_lv + 1) * 4 | float [n_lv + 1] | parameter a |
... | (n_lv + 1) * 4 | float [n_lv + 1] | parameter b |
... | 4 | float | parameter c |
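The pressure formula for the "ETA△" and "SIGM" records, p[k] = b[k] * (p_surface - c) + a[k], can be evaluated as in the following sketch. The function name is hypothetical, and the pressure unit is whatever unit the parameters a, c, and p_surface share, which is not stated here.

```python
def plane_pressures(a, b, c, p_surface):
    """Return p[k] = b[k] * (p_surface - c) + a[k] for every index k.

    a and b are the parameter arrays of the SUBC record (n_lv + 1 values
    each), c is the scalar parameter, and p_surface the surface pressure.
    """
    if len(a) != len(b):
        raise ValueError("parameters a and b must have the same length")
    return [bk * (p_surface - c) + ak for ak, bk in zip(a, b)]
```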
Offset | Length | Type | Description |
byte | byte | ||
16 | 4 | character | "Z*△△" |
20 | 2 * 4 | integer | nx and ny: number of grid points in X and Y directions |
28 | 4 | integer | number of planes |
32 | (n_lv + 1) * 4 | float [n_lv + 1] | z-star location for each plane |
... | 4 | float | height of model top |
... | (nx * ny) * 4 | float [nx * ny] | surface height |
This kind of SUBC record is employed for time integration/average product. The size of SUBC TDIF record depends on parameters n_dc (members) and n_vt described in CNTL record.
Offset | Length | Type | Description |
byte | byte | ||
16 | 4 | character | "TDIF" |
32 | n_dc * n_vt * 4 | integer [n_dc][n_vt] | difference between accurate valid time and nominal valid time |
... | n_dc * n_vt * 4 | float [n_dc][n_vt] | integration time in seconds |
This kind of SUBC record is used for datasets of radar observation. The size of SUBC RADR record depends on parameters n_dc (members), n_vt, n_lv, and n_el described in CNTL record.
Offset | Length | Type | Description |
byte | byte | ||
16 | 4 | character | "RADR" |
32 | n_dc * n_vt * n_lv * n_el * 4 | integer [n_dc][n_vt][n_lv][n_el] | flags |
The flag values have the following meanings:
ND.
Echo exists.
No echo exists.
No operation.
This kind of SUBC record is used for datasets of synthesized multiple radar observations. The size of SUBC ISPC record depends on parameters n_vt, n_lv, and n_el described in CNTL record.
Offset | Length | Type | Description |
byte | byte | ||
16 | 4 | character | "ISPC" |
32 | n_vt * n_lv * n_el * 512 | integer [n_vt][n_lv][n_el][128] | flags |
Offset | Length | Type | Description |
byte | byte | ||
16 | 4 | character | member name |
20 | 8 | integer [2] | valid times |
28 | 12 | character [2][6] | plane names |
40 | 6 | character | element name |
46 | 2 | --- | reserved |
48 | 2 * 4 | integer[2] | nx and ny: number of grid points in X and Y directions |
56 | 4 | character | packing scheme such as "2PAC" |
60 | 4 | character | "NONE" |
64 | ... | ... | PACKED DATA: see following description |
Offset | Length | Type | Description |
byte | byte | ||
16 | 4 | character | member name |
20 | 8 | integer [2] | valid times |
28 | 12 | character [2][6] | plane names |
40 | 6 | character | element name |
46 | 2 | --- | reserved |
48 | 2 * 4 | integer[2] | nx and ny: number of grid points in X and Y directions |
56 | 4 | character | packing scheme such as "2PAC" |
60 | 4 | character | "UDFV" |
64 | (various) | integer/floating | missing value |
... | ... | ... | PACKED DATA: see following description |
Offset | Length | Type | Description |
byte | byte | ||
16 | 4 | character | member name |
20 | 8 | integer [2] | valid times |
28 | 12 | character [2][6] | plane names |
40 | 6 | character | element name |
46 | 2 | --- | reserved |
48 | 2 * 4 | integer[2] | nx and ny: number of grid points in X and Y directions |
56 | 4 | character | packing scheme such as "2PAC" |
60 | 4 | character | "MASK" |
64 | 4 | integer | n_ms: number of bytes used for mask bitmap |
68 | n_ms | bitmap | mask bitmap |
... | ... | ... | PACKED DATA: see following description |
When the packing scheme is 1PAC, 2PAC, or 2UPC, two 4-byte floating-point fields, base and amp, are followed by an array of the packed type. See the Reference Manual about the packed type. Unpacking multiplies each packed value by amp and then adds base.
When the packing scheme is 4PAC, it is similar to 2PAC, but base and amp are 8-byte floating-point values.
When the packing scheme is RLEN, three 4-byte integer fields, nbit, maxv, and num, are followed by an octet stream containing a compressed bit stream.
When the packing scheme is GRIB, the GRIB octet stream itself will be the packed data, although this feature is not implemented yet.
Otherwise, the packed data are an array of the packed type. Note that if the packing scheme is 'N1I2', the packed value is 10 times the unpacked value.
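The unpacking rules for the scaled schemes and for N1I2 can be sketched as follows. The function names are hypothetical, and the packed values are assumed to have been decoded into Python integers already; the width and signedness of the packed type for each scheme are given in the Reference Manual and are not reproduced here.

```python
def unpack_scaled(packed, base, amp):
    """Unpack values stored with base and amp (1PAC, 2PAC, 2UPC, 4PAC).

    Each packed integer q becomes base + amp * q.
    """
    return [base + amp * q for q in packed]


def unpack_n1i2(packed):
    """Unpack N1I2 values: the packed value is 10 times the real value."""
    return [q / 10 for q in packed]
```

For example, unpack_scaled([0, 1, 2], 10.0, 0.5) returns 10.0, 10.5, and 11.0.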
Offset | Length | Type | Description |
byte | byte | ||
16 | 4 | unsigned integer | total number of bytes in file |
20 | 4 | integer | number of records in file |
*1 See the User Data Array Types table at the bottom of the Reference Manual for available data types.
*2 The Pandora data server and some tools use a notation with periods (such as _GSMLLPP.FCSV.STD1) for readability.
*3 In NuSDaS, the size of the record size fields (4 + 4 = 8 bytes) is INCLUDED in the record size itself, while it is NOT INCLUDED in Fortran files. This may be changed in future versions. Also note that the Fortran file format is usually written in the native byte order of the computer that creates the file.