matrix2png: file format details
This page describes the data file format used by the software. The data file is a tab-delimited text file in which each row holds the dependent-variable measurements for one set of observations.
Quick tips to avoid common file format problems:
- Input files must be tab-delimited. Comma- or
space-delimited files will not work.
- Missing values are okay.
- Notice the 'corner' string in the example below:
every column, including the column of example names, has a heading. It does not
matter what you put in the corner, but it must not be blank. The
parser uses the header to figure out how many features you have, so if
you skip the corner string it will appear that you have extra data,
resulting in an error message.
- You can have only one column of descriptors; all other
data must be your numeric feature data. In other words, don't
include extra columns in your file that are not part of the data or
the example labels. Extra columns will either result in an error
(most likely) or invalid results (if your extra columns look like
data).
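As a rough illustration of the first and third tips, the header line can be checked before the file is submitted. This is a minimal sketch, not part of matrix2png; the function name `check_header` and the wording of its messages are hypothetical.

```python
def check_header(header_line):
    """Return a list of problems found in the header line (empty if none).

    Hypothetical helper for illustration; it is not part of matrix2png.
    """
    problems = []
    line = header_line.rstrip("\r\n")
    # A header without any tab is probably comma- or space-delimited.
    if "\t" not in line:
        problems.append("no tab characters found; is the file tab-delimited?")
        return problems
    fields = line.split("\t")
    # The first field is the 'corner' string and must not be blank.
    if fields[0].strip() == "":
        problems.append("corner string (first header field) is blank")
    return problems
```

An empty list means the header passed both checks; any other result lists what to fix.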
For microarray analysis, this means that each row represents the expression measurements for one gene. The columns then represent
the different arrays that were run. It helps in later analysis if the data columns are arranged by condition: for example,
put the "wild type" columns all together and the "mutant" columns all together after that. So the top of your data file might
look like this (when nicely formatted):
gene | mutant | mutant | mutant | wildtype | wildtype | wildtype
100001_at | -36.3 | 77.8 | 64.4 | 89.4 | 126.6 | 86.2
100002_at | 1504.2 | 1512 | 944.5 | 1157.9 | 1652 | 1358.9
100003_at | 845.9 | 966.5 | 1057.4 | 987.4 | 764.1 | 878.5
100004_at | 2304.4 | 1991.1 | 2783.7 | 1929.8 | 2236.8 | 2664.1
100005_at | 3826.5 | 2876.9 | 4514.1 | 3187.8 | 2454.3 | 3730.6
100006_at | 3635 | 2584.6 | 3554.9 | 2810.9 | 1629 | 2248.6
100007_at | 6328.4 | 6197.8 | 7236.4 | 6224.9 | 6950 | 6206.8
100009_r_at | 6580.6 | 8715.9 | 5280.3 | 6569.4 | 8513.4 | 7236
100010_at | 368.2 | 344.5 | -62.4 | 200 | 282.7 | 583.4
100011_at | 1949.7 | 2511.3 | 1937.8 | 2684.1 | 1722.5 | 2101.3
100012_at | 3145.6 | 2936.7 | 3358.4 | 4250.8 | 2706.4 | 2776
100013_at | -1098.4 | -720.8 | -1418.8 | -886.9 | -764.4 | -1247.6
100014_at | 1108 | 1197 | 985.4 | 1216.7 | 1328.1 | 1161.5
100015_at | 6005 | 1040.6 | 4434.1 | 1069.4 | 864.8 | 2617.4
100016_at | 4485.3 | 3236.2 | 4910.2 | 3474.6 | 3447.1 | 3493
100017_at | 497.5 | 399.3 | 964.2 | 347.7 | 524.5 | 561.3
100018_at | 540 | 1209.7 | 811.1 | 1880.8 | 317.9 | 587.8
100019_at | -303.5 | 46.4 | 0.9 | 53.4 | -252.6 | -346.9
100020_at | 1606.3 | 1570.4 | 1996.6 | 3319.7 | 1803.4 | 1811.7
100021_at | 1349.8 | 1193.5 | 764.7 | 331.5 | 1175 | 783.9
(etc, possibly for many more lines)
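To make the layout concrete, here is a minimal sketch of how a file in this shape could be read into column names, row names, and a matrix of values. It is an illustration only and does not reproduce the parser used by the software. The function name `read_matrix` is hypothetical, the sample uses two rows from the example above with one value blanked out to show a missing field, and representing a missing value as `None` is an assumption.

```python
import io

def read_matrix(handle):
    """Read a tab-delimited matrix: header row, then one labelled row per line.

    Hypothetical helper for illustration. Empty fields are treated as
    missing values and stored as None (an assumption of this sketch).
    """
    header = handle.readline().rstrip("\r\n").split("\t")
    column_names = header[1:]  # header[0] is the 'corner' string
    row_names = []
    values = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        row_names.append(fields[0])
        values.append([float(f) if f.strip() else None for f in fields[1:]])
    return column_names, row_names, values

# Two rows from the example, with one value removed to show a missing field.
sample = (
    "gene\tmutant\tmutant\twildtype\n"
    "100001_at\t-36.3\t77.8\t89.4\n"
    "100002_at\t1504.2\t\t1157.9\n"
)
columns, rows, matrix = read_matrix(io.StringIO(sample))
```

Here `columns` holds the three array labels, `rows` holds the two gene identifiers, and the second row of `matrix` contains `None` where the field was left empty.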
Be careful about the following when setting up your file:
- It must be plain text. In Excel, use the "file->save as..." menu selection and then choose the "text (tab delimited)" option. Excel
will warn you about keeping the file in this format; the warning is an annoyance and can be ignored. See "Saving excel files as text"
for more detailed instructions.
- Row and column names must not contain spaces, because our parsers may be confused by names containing spaces.
Every row and column must have a name, including the "upper left corner".
- Missing values are permitted, but each row must have the
same number of fields. Files with "ragged" rows are not
acceptable. Thus if your data file has missing values at the end of a
row, there should be extra tab characters to represent them. One way
to fix this is to open the file in a program like Microsoft Excel and
resave it as tab-delimited text. Excel adds the extra tab characters.
- The values in the file, other than the row and column labels, must be numbers (no letters or other characters). This means
that there can be no other columns of text.
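The rules in this list can be checked mechanically before a file is used. The following is a minimal sketch under stated assumptions, not the software's own validation: the function name `validate_lines` and its messages are hypothetical, an empty field is taken to mean a missing value, and Python's `float()` stands in for the numeric test (it is more permissive than a strict parser might be; for example, it accepts "nan").

```python
def validate_lines(lines):
    """Return a list of format problems (empty if none were found).

    Hypothetical helper for illustration; `lines` is any iterable of
    text lines, such as an open file.
    """
    rows = [ln.rstrip("\r\n").split("\t") for ln in lines if ln.strip("\r\n")]
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    # Every column name, including the corner, must be non-blank and space-free.
    for col, name in enumerate(header, start=1):
        if name == "" or " " in name:
            problems.append(f"header field {col}: blank or contains a space")
    for n, fields in enumerate(rows[1:], start=2):
        # Ragged rows are not acceptable: field counts must match the header.
        if len(fields) != len(header):
            problems.append(
                f"line {n}: expected {len(header)} fields, found {len(fields)}"
            )
            continue
        if fields[0] == "" or " " in fields[0]:
            problems.append(f"line {n}: row name is blank or contains a space")
        for value in fields[1:]:
            if value == "":
                continue  # empty field: treated as a missing value
            try:
                float(value)
            except ValueError:
                problems.append(f"line {n}: non-numeric value {value!r}")
    return problems
```

Line numbers in the messages count non-blank lines only, so they may differ from an editor's line numbers if the file contains blank lines.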