Data and Information Submission at the Virginia Coast LTER

Introduction

Data and information are a critical part of the Virginia Coast LTER. However, data alone is not enough. There needs to be sufficient metadata (documentation, or data about data) so that someone knowledgeable in the field 20 years from now can use and understand your data. In archiving data we are fighting entropy to keep our data from becoming unusable or disappearing entirely, as has been the rule in the past.

Preparing Data and Information

There is no one "right way" to prepare data for submission to the VCR/LTER. Provided that your data is in some regular or described form that can be used by others, how you choose to prepare your data is up to you. We support a number of options:

Textual or graphical data
Typical examples are theses, papers or abstracts that are largely self-documenting. For these we prefer to receive three versions of the document. The first is a printed version, which serves as a backup for the other versions. Acid-free paper is recommended. The second is a Hypertext Markup Language (HTML) version. HTML can be produced automatically by many modern word processors. If yours cannot, an RTF (Rich Text Format) or even an ASCII (plain text) file is a suitable stand-in. Finally, we would like to have a copy of the document in the original format of your word processor. This allows us to generate the HTML if you are unable to provide an HTML file. Graphics should be in GIF (Graphics Interchange Format), JPEG or Encapsulated PostScript (EPS) format. They can also be provided as a raw graphics file as used by your graphics software, provided that you tell us what that software is.
Numerical or Coded Data includes:
- Delimited Data
  Data are arranged in a consistent form with one observation per line with individual values separated by commas or some other "delimiter." An example of this type of file is the "CSV" (comma separated values) files created by most spreadsheet programs or "SDF" (standard delimited files) produced by many types of database software. Except where information is actually missing, each line or observation should have all the values filled in.
  For example:
  Station, Month, Year, Day, Temp, Precip
  HOGI,10,1996,1,12,0
  HOGI,10,1996,2,14,3.3
  HOGI,10,1996,3,19,0
  Where data items consisting of text include the delimiter as part of the data, that data item should be included between "quote" characters (typically ").
  For example:
  Station, Month, Year, Day, Temp, Precip
  "HOG ISLAND, VA",10,1996,1,12,0
  "REDBANK, VA",10,1996,2,14,3.3
  Note that HOG ISLAND, VA is considered to be the value of the Station variable. Without the quotes, the Station would be set to HOG ISLAND and the month would be set to VA. Most spreadsheet software does this automatically when producing CSV files.
  
- Column Formatted Data
  Data are arranged in specified columns of the data file, which may or may not be separated by spaces. As with delimited files, there is typically one observation per line; however, multi-line data structures can also be used when required. Except where information is actually missing, each line or observation should have all the values filled in.
  For example:
  Station Month Year Day Temp
  HOGI    10    1996 01  12
  HOGI    10    1996 02  14
  HOGI    11    1996 01  21
  HOGI    11    1996 02  19
  
- Free Text or Labeled Data
  Data are not arranged in any particular structure, but each value is labeled as to its identity. (Although we can take this kind of data, we prefer column or delimited structures.)
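
The quoting rule for delimited data described above can be checked with a short script. This is a minimal sketch using Python's standard `csv` module; the function names `read_delimited` and `write_row` are illustrative, not part of any VCR/LTER tool.

```python
import csv
import io

# Sample taken from the quoted-delimiter example in the text.
RAW = '''Station, Month, Year, Day, Temp, Precip
"HOG ISLAND, VA",10,1996,1,12,0
"REDBANK, VA",10,1996,2,14,3.3
'''


def read_delimited(text):
    """Return a list of dicts, one per observation (all values stay strings)."""
    # skipinitialspace drops the blank that follows each comma in the header.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return list(reader)


def write_row(values):
    """Write one observation as a CSV line, quoting only where needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(values)
    return buffer.getvalue()


rows = read_delimited(RAW)
print(rows[0]["Station"])  # the comma inside the quotes stays part of the value
print(rows[0]["Month"])
print(write_row(["HOG ISLAND, VA", 10, 1996, 1, 12, 0]), end="")
```

The writer adds the quote characters on its own, which mirrors what most spreadsheet software does when exporting CSV files.
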
Binary Data or Specialized Data Structures
Data are arranged in a proprietary binary or specialized export form. Examples of this type of data include ARC/INFO Export files, USGS DLG files, ERDAS .LAN, .GIS and .IMG files. The form of the data needs to be well described so that future users will be able to understand and interpret the data.

Variables

Names for Variables

Despite Shakespeare's dictum that "A rose by any other name would smell as sweet," there is an advantage to having a good name for each of your variables. A good name is short (some software limits the length of the variable names it can handle), unique (at least within a given dataset) and descriptive of the variable contents.

Different types of software have different tolerances for long variable names. For example, the Statistical Analysis System (SAS) limits variables to 8 characters in length whereas the C or Pascal languages allow names of at least 32 characters. Among the most restrictive is the FORTRAN language which limits names to 7 characters.

For this reason, we recommend that you select variable names that are unique in their first 7 characters (even if they are longer). For example, DEPTH1WELL and DEPTH2WELL are preferred over DEPTHWELL1 and DEPTHWELL2. If you look only at the first 7 characters, the first pair becomes DEPTH1W and DEPTH2W, whereas the second pair both become DEPTHWE and thus no longer meet the uniqueness requirement.
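
A quick way to test a list of names against this recommendation is sketched below. The function name `truncation_clashes` is hypothetical, and comparing names without regard to case is an assumption made here because some older systems ignore case.

```python
from collections import defaultdict


def truncation_clashes(names, limit=7):
    """Group names that become identical when cut to `limit` characters."""
    groups = defaultdict(list)
    for name in names:
        # Upper-casing is an assumption: it treats Depth1 and DEPTH1 as the same.
        groups[name[:limit].upper()].append(name)
    # Keep only the truncated forms shared by more than one name.
    return {short: full for short, full in groups.items() if len(full) > 1}


good = truncation_clashes(["DEPTH1WELL", "DEPTH2WELL"])
bad = truncation_clashes(["DEPTHWELL1", "DEPTHWELL2"])
print(good)  # {}
print(bad)   # {'DEPTHWE': ['DEPTHWELL1', 'DEPTHWELL2']}
```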

Variable names should always start with a letter, but may include numbers and even some special characters such as an underscore (_). However, you need to avoid special characters that could be misinterpreted (especially -,$,%,#,@,*,^,& and +). Here are some sample "good" and "bad" variable names for a reading taken from 10 cm in depth in a well:

Bad Names	Why bad	Better Choice
10DEPTH	Should start with letter	DEPTH10
DEPTH-10	Don't use - in name. (It looks like DEPTH minus 10 to some systems).	DEPTH_10
DEPTH 10	No spaces in a name!	DEPTH10
DEPTH_CM_10	First 7 characters unlikely to be unique	DEPTH10_CM
VAR1	This is a legal name, but not a descriptive one	DEPTH10
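
The naming rules in the table can be approximated in code. The sketch below assumes a simple rule set (a leading letter, then only letters, digits, or underscores); `name_problems` is a hypothetical helper, and it cannot judge whether a name is descriptive.

```python
import re

# Assumed rule set: start with a letter; allow only letters, digits, underscore.
LEADING_LETTER = re.compile(r"[A-Za-z]")
DISALLOWED = re.compile(r"[^A-Za-z0-9_]")


def name_problems(name):
    """Return a list of rule violations for a variable name (empty if none)."""
    problems = []
    if not LEADING_LETTER.match(name):
        problems.append("should start with a letter")
    bad = sorted(set(DISALLOWED.findall(name)))
    if bad:
        problems.append("contains disallowed characters: " + repr("".join(bad)))
    return problems


for candidate in ["10DEPTH", "DEPTH-10", "DEPTH 10", "DEPTH10", "DEPTH_10"]:
    print(candidate, name_problems(candidate))
```

A name such as VAR1 passes these checks even though the table rejects it, so a review by eye is still needed.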

Variable Types

Variables fall into three major categories based on the types of information they contain:

Character or alphanumeric variables contain letters, punctuation and numbers. Examples of character values are: "PHCK1", "A test string of characters". You can create character values which include only numbers (e.g., "100"), but you cannot then use those numbers in computations.
Integer variables contain numbers with no decimal places. Valid examples of integers are: 1, 2, 3, +3, -3
Real or fixed variables contain numbers that may, or may not, include decimal points. They may also include exponents. Valid examples of real numbers are: 1, 2., 3.0, +3.0E01, -3.9999, 30.2.
Typically real numbers have the decimal point included in the number. However, in some cases the decimal point is left out and the number of decimal places specified explicitly. If you specify that two decimal places will be used, the number 289 is interpreted as 2.89. Note that not all software supports this feature.
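
The implied-decimal convention can be illustrated with a small function. This is a sketch only: `apply_implied_decimals` is a hypothetical name, and letting an explicit decimal point override the implied one is an assumption, since software differs on this.

```python
from decimal import Decimal


def apply_implied_decimals(digits, places):
    """Interpret a digit string with no decimal point as having `places` decimals."""
    text = digits.strip()
    if "." in text:
        # Assumption: an explicit decimal point takes precedence.
        return Decimal(text)
    # Shift the decimal point `places` positions to the left.
    return Decimal(text).scaleb(-places)


print(apply_implied_decimals("289", 2))  # 2.89
```

`Decimal` is used instead of `float` so that the stored digits are reproduced exactly.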

Data Formats

Format information about a variable is especially important for column-formatted data. Typical information in a format includes the type of variable (see above), the column where the variable starts, and either the column where the variable ends or the number of digits or characters. For real or fixed variables, the number of decimal places to be used when no decimal point is explicitly included can also be included in format information. Finally, because some datasets use observations that span multiple lines, you may need to specify the line (within an observation) on which the value of a variable is found.

Here is a simple example with two variables, DAY and TEMPERATURE.

Day      Temperature
 1           10.1
10           -3.5

The format information would be:

Variable	Type	Start Col	End Col	Width	Decimals
DAY	Integer	1	2	2	N/A
TEMPERATURE	Real	14	17	4	1

No line number needs to be specified because each observation takes only one line.
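
As a sketch of how format information like this might be used, the following reads the two sample lines using the start and end columns from the table. The `FORMAT` list and `parse_line` function are illustrative names, not part of the VCR/LTER system.

```python
from decimal import Decimal

# Format information from the table: (name, type, start col, end col).
# Columns are 1-based and inclusive, as in the table.
FORMAT = [
    ("DAY", "Integer", 1, 2),
    ("TEMPERATURE", "Real", 14, 17),
]

CONVERTERS = {"Integer": int, "Real": Decimal, "Character": str}


def parse_line(line, fmt=FORMAT):
    """Cut one fixed-column line into typed values."""
    record = {}
    for name, kind, start, end in fmt:
        # Convert 1-based inclusive columns to a Python slice.
        field = line[start - 1:end].strip()
        record[name] = CONVERTERS[kind](field)
    return record


# The sample data; 11 blank columns (3 to 13) separate the two fields.
DATA = [
    " 1" + " " * 11 + "10.1",
    "10" + " " * 11 + "-3.5",
]
records = [parse_line(line) for line in DATA]
print(records)
```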

Here is the same data in a more complex example. Each observation takes two lines and the decimal point is not explicitly given:

 1
101
10
-35

The format information would be:

Variable	Type	Line	Start Col	End Col	Width	Decimals
DAY	Integer	1	1	2	2	N/A
TEMPERATURE	Real	2	1	3	3	1
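
A multi-line format with implied decimals could be read along the following lines. This is a sketch under stated assumptions: the `SAMPLE` lines are an encoding of the two earlier observations inferred from the format table, and `parse_observations` is a hypothetical helper.

```python
from decimal import Decimal

# (name, type, line within observation, start col, end col, implied decimals)
FORMAT = [
    ("DAY", "Integer", 1, 1, 2, None),
    ("TEMPERATURE", "Real", 2, 1, 3, 1),
]
LINES_PER_OBSERVATION = 2


def parse_observations(lines, fmt=FORMAT, per_obs=LINES_PER_OBSERVATION):
    """Group lines into observations and cut each value from its line and columns."""
    if len(lines) % per_obs != 0:
        raise ValueError("incomplete observation at end of data")
    records = []
    for i in range(0, len(lines), per_obs):
        chunk = lines[i:i + per_obs]
        record = {}
        for name, kind, line_no, start, end, decimals in fmt:
            field = chunk[line_no - 1][start - 1:end].strip()
            if kind == "Integer":
                record[name] = int(field)
            elif "." in field or decimals is None:
                record[name] = Decimal(field)
            else:
                # No explicit point: shift by the implied number of decimals.
                record[name] = Decimal(field).scaleb(-decimals)
        records.append(record)
    return records


# Assumed encoding of the two observations (day 1, 10.1 and day 10, -3.5).
SAMPLE = [" 1", "101", "10", "-35"]
print(parse_observations(SAMPLE))
```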

Data Elements

Individual data values (here called "data elements") can take many different forms, depending on the needs of the researcher. Data codes, missing values and numerical formats vary widely across datasets.

Coded Data
Some data elements do not contain an actual data value, but rather a code. For example, in coding the sex of captured animals you might designate that 1=male and 2=female. It is important to note that such values are codes. It would be inappropriate to compute an average based on such codes (just what would "average sex equals 1.23" mean?). Similarly, without the list of codes, it is not possible to interpret the data. If I say that I used "Method 1" it is meaningless unless I have defined what "Method 1" was.
Our system allows you to designate a variable as coded. If so designated, you will be given the option to input values for the codes.
Some codes are used as shorthand descriptions for locations, rather than specifying a latitude or longitude in each instance. For example, the code PHK1 is used in some datasets to refer to "Phillips Creek - Mouth." An advantage is that once one researcher has defined a "named" site, other researchers can use that code without having to reenter the coordinates.
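
To make the point about codes concrete, the sketch below translates coded values through a code list and counts them rather than averaging them. The code list comes from the sex example above; the function names and the observed values are hypothetical.

```python
from collections import Counter

# Code list from the example in the text.
SEX_CODES = {1: "male", 2: "female"}


def decode(values, codebook):
    """Translate codes to labels; an undefined code raises KeyError."""
    return [codebook[value] for value in values]


def summarize(values, codebook):
    """Count occurrences of each label, the sensible summary for coded data."""
    return Counter(decode(values, codebook))


observed = [1, 2, 2, 1, 1]  # hypothetical field records
print(summarize(observed, SEX_CODES))
```

Raising an error on an undefined code is a design choice here: it surfaces data that cannot be interpreted without the code list.
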
Missing Values
Even the best designed of studies often has observations that are not complete. For example, observations on a plant may be cut short if that plant dies or weather data may lack temperatures if a probe malfunctions, even though data from other sensors may be fine.
For example, with the following meteorological data, the temperature probe failed in October.
```
Station Month Year     Day Temp Humidity
HOGI    09    1996     01   19  95
HOGI    10    1996     01       20
HOGI    11    1996     01   14  33
HOGI    12    1996     01   21  90
```
There are several options for dealing with a missing value.
- Leave the value blank: This has some problems in that some software does not differentiate a blank from a zero! This means that your temperature for October might be considered to be 0. Similarly, a suspicious user might wonder if you accidentally skipped a column and put the temperature in the humidity column by mistake.
- Put a period where the number would go: for example
  Station Month Year     Day Temp Humidity
  HOGI    09    1996     01   19  95
  HOGI    10    1996     01    .  20
  HOGI    11    1996     01   14  33
  HOGI    12    1996     01   21  90
  This makes it clear that a value should be there, although it says nothing about WHY the data is missing.
- Use different codes to indicate different reasons why the data is missing. For example, if you are looking at the average mass of seeds from plants in a plot, you might want to differentiate data that is missing because there were no seeds from data that is missing because all the plants in the plot were dead. In that case you might use a code 'N' or '88' (a value that would never really occur) to indicate "no seeds" and a code 'D' or '99' to indicate that no seeds were weighed because all the plants were dead.
Our system allows you to define up to 3 different "missing value" codes for each variable. Missing value codes can be either numeric or character.
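
One way to honor missing-value codes when reading data is sketched below. The code table, the reason strings, and the `parse_value` function are assumptions for illustration and do not describe how the VCR/LTER system stores codes internally.

```python
# Hypothetical missing-value codes for one variable (at most three).
MISSING_CODES = {
    ".": "missing, reason not recorded",
    "N": "no seeds",
    "D": "all plants dead",
}


def parse_value(text, missing_codes, convert=float):
    """Return (value, reason); value is None when the text is a missing code."""
    token = text.strip()
    if token == "":
        # A blank is ambiguous (missing or zero?), so refuse to guess.
        raise ValueError("blank field: use an explicit missing-value code")
    if token in missing_codes:
        return None, missing_codes[token]
    return convert(token), None


# Temperature column from the weather example, with "." for October.
temps = [parse_value(t, MISSING_CODES)[0] for t in ["19", ".", "14", "21"]]
present = [t for t in temps if t is not None]
mean_temp = sum(present) / len(present)
print(temps, mean_temp)
```

Keeping the missing observation as `None` prevents it from being counted as a zero when the mean is computed.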

Valid Values
For numerical data, our system allows you to designate the minimum and maximum values that should be considered valid. Thus, since water temperature on the shore would never go below -5 or above 30 degrees, we could designate a minimum valid value of -5 and a maximum valid value of 30. Nothing special is done with values that are outside the valid range, but the valid range is useful in detecting errors in the data.
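
A valid-range check of this kind might look like the sketch below. The function name `out_of_range` and the sample temperatures are hypothetical; the limits of -5 and 30 come from the water temperature example above.

```python
def out_of_range(values, minimum, maximum):
    """Return (index, value) pairs that fall outside [minimum, maximum]."""
    return [
        (index, value)
        for index, value in enumerate(values)
        # Missing observations (None) are skipped, not flagged.
        if value is not None and not (minimum <= value <= maximum)
    ]


water_temps = [12.5, 31.0, None, -6.2, 4.0]  # hypothetical readings
flags = out_of_range(water_temps, -5, 30)
print(flags)  # [(1, 31.0), (3, -6.2)]
```

The function only reports suspect values and leaves the data unchanged, in keeping with the statement that nothing special is done with out-of-range values.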