This module provides the Timeseries class, which represents a time series, and the helper TimeStep class, which represents a time step.
A time series record has two time stamps: the nominal timestamp and the actual timestamp. The one that is stored and displayed is the nominal timestamp; the one that is meant is the actual timestamp. For example, in a monthly time series, the nominal timestamp could be 2008-01-01 00:00, meaning January 2008 and probably displayed by application software as 2008-01; but this could mean “the time period that begins at 2008-01-01 08:00 and ends at 2008-02-01 08:00”. In that case, the actual timestamp would be 2008-02-01 08:00, because we make the convention that actual timestamps mark either a moment or the end of an interval.
A pair of integers indicating the number of minutes and months that must be added to a round timestamp to get to the nominal timestamp. For example, if an hourly time series has timestamps that end in :13, such as 01:13, 02:13, etc., then its nominal offset is 13 minutes, 0 months, i.e., (13, 0). Monthly time series normally have a nominal offset of (0, 0), the timestamps usually being of the form 2008-02-01 00:00, meaning “February 2008” and usually rendered by application software as 2008-02. Annual time series have a nominal offset which normally has 0 minutes, but may have nonzero months; for example, a common offset in Greece is 9 months, which means that an annual timestamp is of the form 2008-10-01 00:00, normally rendered by application software as 2008-2009, and denoting the hydrological year 2008-2009.
nominal_offset may be None, meaning that the timestamps can be irregular.
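The relationship between a timestamp and its nominal offset can be illustrated with a small sketch. The helper `nominal_offset` below is hypothetical and not part of this module; it assumes that "round" means midnight-aligned for minute-based steps that divide a day, and January-aligned for month-based steps.

```python
from datetime import datetime


def nominal_offset(timestamp, step_minutes=0, step_months=0):
    """Hypothetical helper: return the (minutes, months) offset of a timestamp.

    Assumes minute-based steps are counted from an arbitrary midnight and
    month-based steps are counted from January.
    """
    if step_months:
        months = (timestamp.year * 12 + timestamp.month - 1) % step_months
        start_of_month = timestamp.replace(
            day=1, hour=0, minute=0, second=0, microsecond=0
        )
        minutes = int((timestamp - start_of_month).total_seconds() // 60)
        return (minutes, months)
    reference = datetime(1900, 1, 1)  # arbitrary round origin (a midnight)
    elapsed = int((timestamp - reference).total_seconds() // 60)
    return (elapsed % step_minutes, 0)


hourly = nominal_offset(datetime(2008, 1, 1, 1, 13), step_minutes=60)
annual = nominal_offset(datetime(2008, 10, 1), step_months=12)
```

With these inputs, the hourly series with timestamps ending in :13 gives 13 minutes and 0 months, and the hydrological-year timestamp gives 0 minutes and 9 months, matching the examples in the text.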
A Timeseries object works like a dictionary. If t is a Timeseries object, t[date] is the value (may be float('nan') to denote a missing value), and t[date].flags is a set of strings. The dictionary keys are either datetime.datetime objects or ISO 8601 strings. You may set a value like this:
t[date] = number # keeps flags as they were, if record existed
t[date] = (number, flags)
The Timeseries class depends on the custom library ts_core, written in standard C, which handles memory and file storage operations for time series objects in order to improve performance and reduce memory consumption. Using the core library should not affect developers, who can use the Timeseries class like any Python dictionary object. The only difference is that the dictionary is always sorted by date: with every add or insert operation, new items are placed automatically in the right position to keep the dictionary sorted, so there is no need to call timeseries.keys().sort(). The ts_core library is required; for installation instructions, see the bundled text in the ts_core directory in the repository.
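The always-sorted behaviour can be pictured with a toy stand-in written in pure Python. The class `SortedRecords` below is hypothetical and is not how ts_core is implemented; it only shows the idea of placing each new key in its sorted position at insertion time.

```python
import bisect
from datetime import datetime


class SortedRecords:
    """Toy stand-in: a mapping whose keys stay in chronological order."""

    def __init__(self):
        self._dates = []   # sorted list of datetime keys
        self._values = {}  # date -> value

    def __setitem__(self, date, value):
        if date not in self._values:
            # Insert the new key at the position that keeps the list sorted.
            bisect.insort(self._dates, date)
        self._values[date] = value

    def __getitem__(self, date):
        return self._values[date]

    def keys(self):
        return list(self._dates)


records = SortedRecords()
records[datetime(2008, 1, 3)] = 3.0
records[datetime(2008, 1, 1)] = 1.0
records[datetime(2008, 1, 2)] = 2.0
ordered_keys = records.keys()  # chronological, regardless of insertion order
```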
Create a new Timeseries object. The arguments set initial values for the attributes described below.
- Timeseries.id
- The id of the time series in the database. This attribute is only used by read_from_db() and write_to_db(). When these methods are called, id specifies the id of the time series.
- Timeseries.driver
- The SQL driver used for some specific database operations such as blob field writing. Its value may be Timeseries.SQLDRIVER_PSYCOPG2 for PostgreSQL or Timeseries.SQLDRIVER_NONE for non-database applications.
- Timeseries.SQLDRIVER_PSYCOPG2
- A class member used to specify the database driver for PostgreSQL access. This is the default driver for Timeseries objects.
- Timeseries.SQLDRIVER_NONE
- A class member used to specify the database driver for non-database applications. Use this driver when you do not wish to load a database driver such as psycopg2 in your application.
- Timeseries.unit
- Timeseries.title
- Timeseries.timezone
- Timeseries.variable
- Timeseries.comment
- The above text attributes are informational and can hold anything at all; comment, in particular, may be multiline while the rest should not. They are set by read_file() and used by write_file(). Other than that, they are not used.
- Timeseries.precision
This integer attribute specifies the number of decimal digits to which the values are precise. It can also be zero or negative; if, for example, it is -2, values are precise to the hundred.
The attribute is set by read_file() and used by write_file(). It is currently not used anywhere else within the class, but a user interface that displays values to the user might use it in order to determine how many decimal digits to display. It can be None, meaning unknown or unset.
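As an illustration of how a user interface might apply the precision, here is a minimal sketch. The function `format_value` is hypothetical and not part of the class; it assumes that a negative precision should round to tens, hundreds, etc. and display no decimals.

```python
def format_value(value, precision=None):
    """Hypothetical display helper driven by a precision attribute."""
    if precision is None:
        # Precision unknown or unset: show the value as it is.
        return str(value)
    rounded = round(value, precision)
    # Negative precision rounds to tens, hundreds, ...; show no decimals then.
    return "%.*f" % (max(precision, 0), rounded)


one_decimal = format_value(18.234, 1)
to_the_hundred = format_value(1234.567, -2)
```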
- Timeseries.read(fp)
- Read time series from the filelike object fp, which must be in text format; preserve original contents (unless overwritten).
- Timeseries.write(fp[, start][, end])
Write time series to the filelike object fp, in text format. If the datetime.datetime objects start and end are specified, only write that range.
In accordance with the text format specification, time series are written using the CR-LF sequence to terminate lines. In order to produce fully compliant files, care should be taken that fp, or any subsequent operations on fp, do not perform text translation; otherwise, it may result in lines being terminated with CR-CR-LF. If fp is a file, it should have been opened in binary mode.
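The CR-CR-LF effect can be reproduced without the Timeseries class at all. The sketch below uses only the standard io module: it writes the same CR-LF terminated record once to a binary stream and once to a text stream configured to translate newlines the way text mode does on Windows.

```python
import io

record = "2006-12-23 18:34,18.2,RANGE\r\n"

# Binary stream: the bytes arrive unchanged.
binary_fp = io.BytesIO()
binary_fp.write(record.encode("ascii"))
binary_result = binary_fp.getvalue()

# Text stream with newline translation: every "\n" written becomes "\r\n",
# so the CR that was already there ends up doubled.
raw = io.BytesIO()
text_fp = io.TextIOWrapper(raw, encoding="ascii", newline="\r\n")
text_fp.write(record)
text_fp.flush()
translated_result = raw.getvalue()
```

Here binary_result ends in a single CR-LF, whereas translated_result ends in CR-CR-LF, which is why a file passed as fp should be opened in binary mode.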
Write plain values to the filelike object fp, in a CSV-like format that needs no commas because each line holds a single value: the value of the nth step of the time series. No timestamps or flags are written. Null values are represented by the nullstr sequence; the default is an empty string, which produces an empty line for each null record.
Write time series to the filelike object fp, in file format. If the datetime.datetime objects start and end are specified, only write that range.
See also write() for information on the handling of the line terminators.
Write time series to the database, entirely overwriting any existing time series with the same id. Note that only the data are written, and not any metadata such as time step information.
db is an object that has a cursor() method that returns a PEP 249 Cursor Object. For example, db can be a PEP 249 Connection Object or a django.db.connection object.
This method also needs to be able to commit and rollback, and therefore it needs an object that has methods commit() and rollback(). If transaction is None, it is assumed that db has these methods; otherwise, transaction is used. If db is a PEP 249 Connection Object, you can therefore leave transaction unspecified; but if db is, for example, a django.db.connection object, then you should set transaction to django.db.transaction.
If commit is False, then the time series are written to the database without being committed (in that case, you don’t need to specify transaction).
If you call this function from django, either put the @transaction.commit_manually decorator on the caller, or use commit=False and find another way to commit changes, such as @transaction.commit_on_success, transaction.commit_unless_managed(), and transaction.set_dirty(). Read Transactions and raw SQL in Performing raw SQL queries in the Django documentation for details.
Return the minimum, maximum, average, or sum of the time series. If start_date and/or end_date are specified, the result is computed only over the specified interval.
If the value cannot be computed (e.g. because the time series does not have any not-null values in the specified interval), these functions return float("NaN"), with the exception of sum(), which returns zero.
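The null-handling rule can be sketched on a plain list of values. The function `series_stats` below is a hypothetical illustration, not the class's own code; it assumes that missing values are represented by NaN and are simply skipped.

```python
import math


def series_stats(values):
    """Hypothetical sketch of statistics that ignore NaN (missing) values."""
    present = [v for v in values if not math.isnan(v)]
    if not present:
        # Nothing to compute: NaN for everything except the sum, which is zero.
        nan = float("nan")
        return {"min": nan, "max": nan, "average": nan, "sum": 0.0}
    total = sum(present)
    return {
        "min": min(present),
        "max": max(present),
        "average": total / len(present),
        "sum": total,
    }


with_gap = series_stats([1.0, float("nan"), 3.0])
all_missing = series_stats([float("nan"), float("nan")])
```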
Process the time series, produce two new time series, and return these new time series as a tuple. The first of these series is the aggregated series; the second one is the number of missing values in each time step (more on this below). Both produced time series have a time step of target_step, which must be a TimeStep object. The nominal_offset, actual_offset, and interval_type attributes of target_step are taken into account during aggregation; so if, for example, target_step is one day with nominal_offset=(480,0), actual_offset=(0,0), and an interval_type of IntervalType.SUM, then aggregation is performed so that, in the resulting time series, a record with timestamp 2008-01-17 08:00 contains the sum of the values of the source series from 2008-01-16 08:00 to 2008-01-17 08:00.
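The daily example above can be sketched as follows. The function `daily_sums` is hypothetical and much simpler than the real aggregation: it ignores missing values and flags, and it assumes that each interval excludes its start and includes its end, consistent with the convention that a timestamp marks the end of an interval.

```python
from datetime import datetime, timedelta


def daily_sums(records, offset_minutes=480):
    """Hypothetical sketch: sum values into days that end at 00:00 + offset.

    records maps datetime -> value. Each interval is assumed to be
    (end - 1 day, end], which is an assumption of this sketch.
    """
    offset = timedelta(minutes=offset_minutes)
    one_day = timedelta(days=1)
    sums = {}
    for timestamp, value in records.items():
        shifted = timestamp - offset
        midnight = datetime(shifted.year, shifted.month, shifted.day)
        if shifted == midnight:
            end = timestamp  # exactly on a boundary: closes this interval
        else:
            end = midnight + one_day + offset
        sums[end] = sums.get(end, 0.0) + value
    return sums


source = {
    datetime(2008, 1, 16, 8, 0): 100.0,  # closes the previous day
    datetime(2008, 1, 16, 9, 0): 1.0,
    datetime(2008, 1, 17, 0, 0): 2.0,
    datetime(2008, 1, 17, 8, 0): 3.0,
    datetime(2008, 1, 17, 9, 0): 50.0,   # belongs to the next day
}
result = daily_sums(source)
```

In this example the record stamped 2008-01-17 08:00 in result holds the sum of the three source values that fall after 2008-01-16 08:00 and up to 2008-01-17 08:00.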
If target_step.interval_type is IntervalType.VECTOR_AVERAGE, then the source records are considered to be directions in degrees (as in a wind direction time series); each produced record is the direction in degrees of the sum of the unit vectors whose direction is specified by the source records.
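A vector average of directions can be sketched in a few lines. The function name `vector_average` is an assumption made for this illustration; the code sums one unit vector per direction and returns the direction of the resulting vector, in degrees.

```python
import math


def vector_average(directions_deg):
    """Direction, in degrees, of the sum of unit vectors (illustrative)."""
    east = sum(math.sin(math.radians(d)) for d in directions_deg)
    north = sum(math.cos(math.radians(d)) for d in directions_deg)
    # atan2 returns an angle in (-180, 180]; map it to the 0-360 range.
    return math.degrees(math.atan2(east, north)) % 360.0


south_east = vector_average([90.0, 180.0])  # close to 135
northerly = vector_average([350.0, 10.0])   # close to 0 (or 360), not 180
```

A plain arithmetic mean of 350 and 10 would give 180, which is the opposite direction; that is the reason directions need this treatment. If the unit vectors cancel out exactly, the result is not meaningful, and this sketch does not handle that case.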
If target_step.interval_type is None, corresponding to instantaneous values, then for each record of the destination series, a record from the source time series is selected if it has the same nominal step. If no such record is found, the resulting record is set to null.
If some of the source records corresponding to a destination record are missing, missing_allowed specifies what will be done. If the ratio of missing values to existing values among those source records is greater than missing_allowed, the resulting destination record is null; otherwise, the destination record is derived even though some records are missing, and the flag specified by missing_flag is raised in the destination record. The second time series in the returned tuple contains, for each destination record, a record with the same date holding the number of missing source values for that destination record.
If last_incomplete is set to True, then the last record of the destination time series can be derived from an incomplete month, year, etc. If all_incomplete is set to True, then every destination record is aggregated only up to the same point as the last incomplete record. This is useful for finding, e.g., the rainfall up to the same day of each year, when that day is the last daily record to be aggregated.
Timeseries objects can load and save their records in plain text files or in a database. There are three formats: the text format is a generic text format, without metadata; the file format is like the text format, but additionally contains headers with metadata; and the database format is for storing to the database. These three formats are described below.
The text format for a time series is us-ascii, one line per record, like this:
2006-12-23 18:34,18.2,RANGE
The three fields are comma-separated and must always exist. In the date field, the time may be missing. The character that separates the date from the time may be either a space, or a lower case t, or a capital T (Timeseries objects produce text format using a space as date separator, but can read text format that uses t or T). The second field always uses a dot as the decimal separator and may be empty. The third field is usually empty but may contain a list of space-separated flags. The line separator should be the CR-LF sequence used in MS-DOS and Windows systems. Code that produces text format should always use CR-LF to end lines, but code that reads text format should be able to also read lines that end in LF only, as well as CR-CR-LF (for reasons explained in Timeseries.write()).
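A reader for a single text format line might look like the sketch below. The function `parse_record` is hypothetical and not the module's parser; it assumes timestamps have minute resolution, as in the example above, and represents an empty value as NaN.

```python
from datetime import datetime


def parse_record(line):
    """Hypothetical sketch: parse one text format line.

    Returns (timestamp, value, flags). Accepts LF, CR-LF and CR-CR-LF
    line endings and a space, t or T between date and time.
    """
    line = line.rstrip("\r\n")
    # All three fields must exist, so exactly two separating commas are needed.
    date_text, value_text, flags_text = line.split(",", 2)
    date_text = date_text.replace("t", " ").replace("T", " ")
    fmt = "%Y-%m-%d %H:%M" if " " in date_text else "%Y-%m-%d"
    timestamp = datetime.strptime(date_text, fmt)
    value = float(value_text) if value_text else float("nan")
    flags = set(flags_text.split())
    return timestamp, value, flags


full = parse_record("2006-12-23 18:34,18.2,RANGE\r\n")
empty = parse_record("2006-12-23T18:44,,\r\r\n")
```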
To improve performance in file writes, each time series record line is limited to a maximum of 255 characters. With a fixed date string of 16 characters, two commas, and a value string with a mean size of 10 characters, this leaves about 220 characters per line for flags. Assuming a mean size of 10 characters per flag, this leaves room for about 20 flags per record, which is more than sufficient. An attempt to write more than 255 characters raises an exception and stops the file write.
Flags should be encoded in the ASCII (7-bit) character set. Characters with a code greater than 127 cause encoding errors, which will probably make some file operations fail. Client software should prevent non-ASCII characters from being written in flags.
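Client software could enforce both constraints before writing, along the lines of the sketch below. The function `format_record` and the choice of ValueError are assumptions made for illustration; whether the 255 character limit counts the line terminator is also an assumption (here it does not).

```python
import math
from datetime import datetime

MAX_LINE_LENGTH = 255  # limit stated above; terminator assumed not counted


def format_record(timestamp, value, flags):
    """Hypothetical sketch: build one text format line, validating flags."""
    flags_text = " ".join(sorted(flags))
    try:
        flags_text.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError("flags must contain 7-bit ASCII characters only")
    value_text = "" if math.isnan(value) else repr(value)
    line = "%s,%s,%s" % (
        timestamp.strftime("%Y-%m-%d %H:%M"),
        value_text,
        flags_text,
    )
    if len(line) > MAX_LINE_LENGTH:
        raise ValueError("record line exceeds %d characters" % MAX_LINE_LENGTH)
    return line + "\r\n"


line = format_record(datetime(2006, 12, 23, 18, 34), 18.2, {"RANGE"})
```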
The file format is like this:
Version=2
Title=My timeseries
Unit=°C

2006-12-23 18:34,18.2,RANGE
2006-12-23 18:44,18.3,
In other words, the file format consists of a header that specifies parameters in the form Parameter=Value, followed by a blank line, followed by the timeseries in text format. The same conventions for line terminators apply here as for the text format. The encoding of the header section is UTF-8.
Client as well as server software should recognize UTF-8 files with or without a UTF-8 BOM (Byte Order Mark) at the beginning of the file. Written files may or may not include the BOM, depending on the operating system (Windows software usually puts the BOM at the beginning of the file).
If the header is omitted (that is, no Version=2 line is included), the read_file method will try to read the file as a raw data file, parsing dates, values and flags from the beginning. If a Version=2 line is included, the header is parsed as a metadata section, and a blank line is expected as the separator between header and data.
Parameter names are case insensitive. There may be white space on either side of the equal sign, which is ignored. Trailing white space on the line is also ignored. A second equal sign is considered to be part of the value. The value cannot contain a newline, but there is a way to have multi-lined parameters explained in the Comment parameter below. All parameters except Version are optional: either the value can be blank or the entire Parameter=Value can be missing; the only exception is the Comment parameter.
The parameters available are:
Comment
A multiline comment for the time series. Multiline comments are stored by specifying multiple adjacent Comment parameters, like this:
Comment=This timeseries is extremely important
Comment=because the comment that describes it
Comment=spans five lines.
Comment=
Comment=These five lines form two paragraphs.
The Comment parameter is the only parameter where a blank value is significant and indicates an empty line, as can be seen in the example above.
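The header rules described so far can be sketched as a small reader. The function `read_header` is hypothetical and not the module's own parser; it assumes fp yields decoded text lines and it stores parameter names in lower case.

```python
import io


def read_header(fp):
    """Hypothetical sketch: parse Parameter=Value lines up to a blank line."""
    header = {}
    comment_lines = []
    for line in fp:
        # Drop a BOM if present, plus trailing white space and terminators.
        line = line.lstrip("\ufeff").rstrip()
        if not line:
            break  # the blank line separates the header from the records
        # Split at the first equal sign only; later ones belong to the value.
        name, _, value = line.partition("=")
        name = name.strip().lower()  # parameter names are case insensitive
        value = value.strip()
        if name == "comment":
            comment_lines.append(value)  # a blank value is an empty line
        else:
            header[name] = value
    if comment_lines:
        header["comment"] = "\n".join(comment_lines)
    return header


text = (
    "\ufeffVersion=2\r\n"
    "Title = My timeseries\r\n"
    "UNIT=a=b\r\n"
    "Comment=first paragraph\r\n"
    "Comment=\r\n"
    "Comment=second paragraph\r\n"
    "\r\n"
    "2006-12-23 18:34,18.2,RANGE\r\n"
)
fp = io.StringIO(text)
header = read_header(fp)
first_record = fp.readline()  # reading continues at the first data line
```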
Time_step
Nominal_offset
Actual_offset
These three parameters specify the time step; each one is a pair of comma-separated integers, like this:
Time_step=1440,0
Nominal_offset=480,0
Actual_offset=0,0
The first number designates minutes and the second designates months. If nominal_offset is missing, it means that the time series records can have irregular timestamps. If time_step is present, actual_offset must also be present. If time_step is missing, it means that the time series is irregular. For more information on these three parameters, refer to the Timeseries documentation.
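To show how such a minutes and months pair can be applied, here is a minimal sketch. The helpers `parse_pair` and `add_step` are hypothetical; `add_step` assumes that the day of the month exists in the target month, which holds for the usual case of timestamps on the first of the month.

```python
from datetime import datetime, timedelta


def parse_pair(text):
    """Parse a 'minutes,months' header value into a pair of integers."""
    minutes, months = (int(part) for part in text.split(","))
    return minutes, months


def add_step(timestamp, minutes, months):
    """Hypothetical sketch: advance a timestamp by (minutes, months)."""
    month_index = timestamp.year * 12 + (timestamp.month - 1) + months
    year, month = divmod(month_index, 12)
    # replace() raises ValueError if the day does not exist in that month.
    shifted = timestamp.replace(year=year, month=month + 1)
    return shifted + timedelta(minutes=minutes)


next_day = add_step(datetime(2008, 1, 17, 8, 0), *parse_pair("1440,0"))
next_month = add_step(datetime(2008, 12, 1), *parse_pair("0,1"))
```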
Interval_type
Has one of the values sum, average, maximum, minimum, and vector_average. If absent, the time series values are instantaneous, that is, they do not refer to intervals. For more information on this parameter, refer to TimeStep.
Variable
A textual description of the variable, such as Temperature or Precipitation.
Precision
The precision of the time series values, in number of decimal digits after the decimal separator. It can be negative; for example, a precision of -2 indicates values accurate to the hundred, such as 100, 200, 300 etc.
The database format is an extension of the text format. The time series records are stored in a database table with three columns named top, middle and bottom. top and bottom are plain text (e.g. PostgreSQL TEXT or Oracle CLOB), whereas middle is a binary data field (e.g. PostgreSQL BYTEA or Oracle BLOB) that contains data compressed with the LZ77 algorithm. The concatenation of top, uncompressed middle, and bottom is the entire time series in text format. top is a non-nullable column, but may contain an empty string; middle is nullable; and bottom is non-nullable.
Note
middle contains only the compressed data, and no header, checksum, or anything else. As a result, programs such as gzip and pkzip cannot read it; instead, free libraries may be used when implementing this functionality, such as Python’s zlib, C’s zlib, Perl’s IO::Zlib, and Delphi’s TCompressionStream and TDecompressionStream.
top stores the first few lines of the time series text format, up to around 100. bottom stores the last few lines of the file, at least one. middle stores all the rest. bottom is non-nullable and may not be empty; if a time series is empty, there must be no row in the database table. If it contains only a few records, they must all be stored in bottom, the other two fields being empty. If it contains more records, a few must be stored in top, another few in bottom, and the rest in middle. Appending a record to the time series is usually accomplished by simply appending to bottom.
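Reassembling the text format from the three columns can be sketched with Python's zlib. The function `join_parts` is hypothetical, and the sketch assumes that middle holds a stream as produced by zlib.compress with default settings; if an implementation stores a bare stream without the zlib wrapper, zlib.decompress(middle, -15) would be needed instead.

```python
import zlib


def join_parts(top, middle, bottom):
    """Hypothetical sketch: rebuild the text format from the three columns.

    top and bottom are text, middle is compressed bytes or None.
    Assumes middle was produced by zlib.compress with default settings.
    """
    middle_text = zlib.decompress(middle).decode("ascii") if middle else ""
    return top + middle_text + bottom


lines = ["2006-12-23 18:%02d,18.2,\r\n" % minute for minute in range(10)]
top = "".join(lines[:2])
middle = zlib.compress("".join(lines[2:-1]).encode("ascii"))
bottom = "".join(lines[-1:])
whole = join_parts(top, middle, bottom)
```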
The details of the operation depend on the code that implements the database format. The operation of this module is detailed below, and you would normally not care about it unless you write another implementation. In that case, you should follow a similar algorithm when writing to the database, although there are only two requirements that cannot be violated:
Note
Why use this seemingly paradoxical system? The reason is that, by storing each time series as essentially one compressed unit, rather than, e.g., in a (id, date, value, flags) database table, we can retrieve it many times faster. Storing time series in a relational manner would not make much sense, because they are inherently not relational. About 20 times less disk space is used. In addition, large time series are uncompressed on the client, thus easing network and server load. Finally, if top and bottom are kept small, it is very fast to perform the frequently needed operations of retrieving the first and last records and appending a record. All other operations must in practice retrieve or update the entire time series, which experience has shown is what is done anyway.
The database table must be complemented with two database functions, timeseries_start_date and timeseries_end_date, which accept a single id argument and return the start or end date of the time series. For example:
hydrotest=> select timeseries_start_date(696), timeseries_end_date(696);
timeseries_start_date | timeseries_end_date
-----------------------+---------------------
1950-08-01 08:00:00 | 1997-03-31 08:00:00
(1 row)
The algorithm used by this module for storing timeseries is as follows: Let MAX_ALL_BOTTOM be the maximum number of records that a time series may have if it is to be entirely stored in bottom; ROWS_IN_TOP_BOTTOM the number of time series records in top and in bottom; MAX_BOTTOM the maximum number of records allowed in bottom; and MAX_BOTTOM_NOISE noise to be added or subtracted (more on this below). At the time of this writing, these constants have the values 40, 5, 100 and 10 respectively.
When a time series is to be entirely written to the database (i.e. not merely appending rows), it is written as follows:
When appending to the database, the operation is as follows:
This is done to prevent bottom from growing too much. Noise is used to avoid situations where 20 or so time series are repacked all at once. For example, consider a program that every 10 minutes appends data from an automatic meteorological station with 20 sensors that measure 20 time series. With MAX_BOTTOM=100 and ROWS_IN_TOP_BOTTOM=5, it is possible that every 95 updates all 20 time series would have to be repacked at the same time, which can be a heavy load. But if we add a random ±10 to the test, then every once in a while only one or two time series will be repacked.
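How the four constants might be applied can be sketched as follows. Both functions are hypothetical and rest on assumptions: `split_records` assumes that a full write puts everything in bottom up to MAX_ALL_BOTTOM records and otherwise keeps ROWS_IN_TOP_BOTTOM records at each end, and `needs_repack` assumes that the noise is a uniform random integer added to MAX_BOTTOM. Compression of middle is left out.

```python
import random

MAX_ALL_BOTTOM = 40
ROWS_IN_TOP_BOTTOM = 5
MAX_BOTTOM = 100
MAX_BOTTOM_NOISE = 10


def split_records(lines):
    """Hypothetical sketch: return (top, middle, bottom) lists of lines."""
    if len(lines) <= MAX_ALL_BOTTOM:
        return [], [], list(lines)
    return (
        lines[:ROWS_IN_TOP_BOTTOM],
        lines[ROWS_IN_TOP_BOTTOM:-ROWS_IN_TOP_BOTTOM],
        lines[-ROWS_IN_TOP_BOTTOM:],
    )


def needs_repack(bottom_count, rng=random):
    """Hypothetical sketch: decide whether bottom has grown too much.

    The random noise makes different time series cross the threshold at
    different moments instead of all at once.
    """
    noise = rng.randint(-MAX_BOTTOM_NOISE, MAX_BOTTOM_NOISE)
    return bottom_count > MAX_BOTTOM + noise


small = split_records(["r%d" % i for i in range(40)])
large = split_records(["r%d" % i for i in range(1000)])
```

Under these assumptions a bottom of more than 110 records is always repacked and one of fewer than 90 never is; in between, the outcome varies from call to call.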