Boulder.pm - A semantic-free data interchange format
Boulder IO is a simple TAG=VALUE data format designed for sharing data
between programs connected via a pipe. It is also simple enough to use
as a common data exchange format between databases, Web pages, and other
data representations.
The basic data format is very simple. It consists of a series of TAG=VALUE
pairs separated by newlines. It is record-oriented. The end of a record is
indicated by an empty delimiter alone on a line. The delimiter is "=" by
default, but can be adjusted by the user.
An example boulder stream looks like this:
NAME=AFM036YB2
SEQUENCE=GATCCGAGATACAGAGGATCCCACACACACACACAC....
LEFT_START=0
LEFT_LENGTH=14
RIGHT_START=140
RIGHT_LENGTH=20
=
NAME=MR1239
SEQUENCE=GGGATCTTTCTCCACCTCCAGCGCACAGAGCAGGGGG...
LEFT_START=2
LEFT_LENGTH=20
RIGHT_START=130
RIGHT_LENGTH=18
ALIAS=WI-1000
ALIAS=XB2023
=
Notes:
- There is no need for all tags to appear in all records, or
indeed for all the records to be homogeneous.
- Multiple values are allowed, as with the ALIAS tag in the
second record.
- Lines can be any length, as in a potential 40 K sequence entry.
- Tags can consist of any alphanumeric characters (upper or lower case) and may contain
embedded spaces. Conventionally we use the characters A-Z0-9_, because
they can be used without single quoting as keys in Perl associative arrays, but
this is merely stylistic. Values can contain any character at all except for the
reserved characters {}=% and newline. You can incorporate binary data into
the data stream by escaping these characters in the URL manner, using a % sign
followed by the (capitalized) hexadecimal code for the character. The
module makes this automatic.
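The flat format and its escaping rule can be sketched in a few lines of plain Perl. This is only an illustration of the format as described above, not the code used by Boulder::Stream; the names escape_value, unescape_value and parse_flat_records are hypothetical, and the default "=" delimiter is assumed.

```perl
use strict;
use warnings;

# Undo URL-style escaping: %XX (hex) becomes the corresponding character.
sub unescape_value {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Escape the reserved characters { } = % and newline as %XX (uppercase hex).
sub escape_value {
    my ($text) = @_;
    $text =~ s/([{}=%\n])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Parse flat boulder text into a list of records.
# Each record is a hash mapping a tag to an array of its values.
sub parse_flat_records {
    my ($text) = @_;
    my @records;
    my %current;
    for my $line (split /\n/, $text) {
        if ($line eq '=') {                 # delimiter alone: end of record
            push @records, { %current };
            %current = ();
            next;
        }
        my ($tag, $value) = split /=/, $line, 2;
        next unless defined $value;         # skip lines without a delimiter
        push @{ $current{$tag} }, unescape_value($value);
    }
    return @records;
}

my @records = parse_flat_records(join "\n",
    'NAME=MR1239', 'ALIAS=WI-1000', 'ALIAS=XB2023', 'NOTE=a%3Db', '=');
print "$records[0]{NAME}[0] has aliases @{ $records[0]{ALIAS} }\n";
```

Storing every tag as an array of values keeps single-valued and multivalued tags uniform, which mirrors the way repeated ALIAS lines accumulate in the example stream.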
The simple boulder format has been extended to accommodate nested relations and
other interesting structures. The new format allows nested records to be
created in this way:
NAME=AFM036YB2
SEQUENCE=GATCCGAGATACCCCACACACACACACAC....
STS={
LEFT_START=0
LEFT_LENGTH=14
RIGHT_START=140
RIGHT_LENGTH=20
}
=
NAME=MR1239
SEQUENCE=GGGATCTTTCTCCACCTTGGAGAGCAGGGGG...
STS={
LEFT_START=2
LEFT_LENGTH=20
RIGHT_START=130
RIGHT_LENGTH=18
}
ALIAS=WI-1000
ALIAS=XB2023
=
As in the original format, tags may be multivalued. For example, there might
be several STS records assigned to a sequence. Each subrecord may contain
further subrecords.
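Nested records lend themselves to a recursive reader. The following is a minimal sketch in plain Perl, assuming the default "=", "{" and "}" characters and ignoring escaping; parse_record is a hypothetical name and this is not how Boulder::Stream itself is implemented.

```perl
use strict;
use warnings;

# Recursively parse lines into a record: a hash of tag => [values],
# where a value is either a string or a nested record (hash reference).
# $lines is an array reference that is consumed as parsing proceeds.
sub parse_record {
    my ($lines) = @_;
    my %record;
    while (defined(my $line = shift @$lines)) {
        last if $line eq '=' || $line eq '}';    # end of record or subrecord
        my ($tag, $value) = split /=/, $line, 2;
        next unless defined $value;
        if ($value eq '{') {                      # TAG={ opens a subrecord
            push @{ $record{$tag} }, parse_record($lines);
        } else {
            push @{ $record{$tag} }, $value;
        }
    }
    return \%record;
}

my @lines = ('NAME=MR1239', 'STS={', 'LEFT_START=2', 'LEFT_LENGTH=20', '}',
             'ALIAS=WI-1000', 'ALIAS=XB2023', '=');
my $record = parse_record(\@lines);
print "LEFT_START of first STS: $record->{STS}[0]{LEFT_START}[0]\n";
```

Because each subrecord is parsed by the same routine, multivalued STS tags and deeper nesting need no extra code.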
A major attribute of the boulderio protocol is the convention that programs
reading boulderio streams should only remove from the stream those tags
that they're interested in. Any unrecognized tags are passed through to
other programs that might be interested. In fact, most programs will want
to put the tags back into the boulder stream once they're finished, potentially
adding their own. Of course some programs will want to behave differently.
For example, a database query program will generate but not read a boulderio
stream, while a report generator will read but not write the stream.
This convention allows the following type of pipe to be set up:
query_database | find_vector | find_dups | \
blast_sequence | pick_primer | mail_report
If all the programs in the pipe follow the conventions, then it will be
possible to interpose other programs, such as a repetitive element finder,
in the middle of the pipe without disturbing other components.
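The pass-through convention can be illustrated without the Boulder modules. In this sketch a hypothetical pipe stage, add_seq_length, adds one tag of its own (SEQ_LENGTH, an invented tag) to each record with a SEQUENCE tag and leaves every other line untouched, so that later stages still see tags this stage knows nothing about.

```perl
use strict;
use warnings;

# Hypothetical filter stage: adds a SEQ_LENGTH tag to every record that has
# a SEQUENCE tag, and passes all lines (recognized or not) through unchanged.
sub add_seq_length {
    my (@input_lines) = @_;
    my (@output_lines, @pending);
    for my $line (@input_lines) {
        if ($line eq '=') {                       # end of record
            push @output_lines, @pending, '=';    # new tags go before the delimiter
            @pending = ();
            next;
        }
        push @output_lines, $line;                # pass every tag through
        if ($line =~ /^SEQUENCE=(.*)$/) {
            push @pending, 'SEQ_LENGTH=' . length($1);
        }
    }
    return @output_lines;
}

print "$_\n" for add_seq_length('NAME=AFM036YB2', 'SEQUENCE=GATCCGAG',
                                'UNKNOWN_TAG=kept', '=');
```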
Here is a skeleton example.
#!/usr/local/bin/perl
# This is an example of pulling out a series of
# sequence records, picking primer pairs, and
# inserting the new STS into the stream.
use Boulder::Stream;
my $stream = new Boulder::Stream;
while (my $stone = $stream->read_record('SEQUENCE')) {
    my $primers = &pick_primers($stone->get('SEQUENCE'));
    if ($primers) {
        $stone->insert(PRIMER_OUTCOME=>1,
                       STS=>$primers);
    } else {
        $stone->insert(PRIMER_OUTCOME=>0);
    }
} continue {
    # Write every record back out, whether or not primers were found.
    $stream->write_record($stone);
}
sub pick_primers {
    my $sequence = shift;
    my ($lstart,$llen,$rstart,$rlen);
    return undef unless ($lstart,$llen,$rstart,$rlen) =
        &call_primer_picking_algorithm($sequence);
    # Create and return a new Stone to drop into the
    # stream.
    return new Stone(LEFT_START=>$lstart,
                     LEFT_LENGTH=>$llen,
                     RIGHT_START=>$rstart,
                     RIGHT_LENGTH=>$rlen);
}
As the example shows, the code starts by creating a Boulder::Stream
object to handle the I/O. It next enters a read/write loop in which
records containing the tags the program is interested in are returned one after another.
In the body of the loop we create a subrecord (a so-called Stone object)
containing a newly-chosen STS primer pair. This is now added to the
record with an STS tag, along with a PRIMER_OUTCOME tag indicating
whether primer picking was successful.
A full explanation of the objects and methods follows.
There are four classes defined in Boulder:
* Stone
This is a cute name for a data record. A Stone consists of a series
of tag/value pairs. Any given tag may appear once or more than once.
A value can be another Stone, allowing nested components. A big Stone
can be made up of a lot of little stones (pebbles?).
Stone was designed for subclassing. You should be able to create
subclasses which create or require particular tags and data formats.
However, in order for things to continue to work properly, all
subclasses should contain the Stone:: prefix, e.g.
Stone::Sequence, Stone::Image, Stone::RobotRun, etc.
* Boulder::Cursor
This class is an iterator over a Stone. Multiple
Boulder::Cursor objects may be active simultaneously. Each one will
maintain its position independently of the others. These cursors do
not, however, tolerate insertions into or deletions from their associated
Stone object, and will reset to the beginning of the data
structure if you try to modify the object while iterating over it.
* Boulder::Stream
This class handles I/O. It knows how to create Stone objects from
Boulderio input data streams, and how to write Stone objects out.
* Boulder::Store
This is a simple storage class for boulder. It uses the Berkeley dbm
interface to create files in which Stones can be accessed randomly.
It also supports quick indexed retrieval and queries.
Stone::new()
This is the constructor for the Stone class. It can be called without
any parameters, in which case it creates an empty Stone, or passed an
associative array (see Stone::insert) in order to initialize it with
a set of keys. Examples:
$myStone = new Stone;
$myStone = new Stone(name=>fred,age=>30);
Stone::insert(%hash)
This is the main method for adding tags to a Stone. This method expects
an associative array as an argument. The contents of the associative array
will be inserted into the Stone. If a particular tag is already present
in the Stone, the new value will be appended to the existing list of
values for that tag. Several types of values are legal:
* A scalar value
The value will be inserted into the Stone.
$stone->insert(name=>Fred,
age=>30,
sex=>M);
$stone->dump;
name[0]=Fred
age[0]=30
sex[0]=M
* An ARRAY reference
A multi-valued tag will be created:
$stone->insert(name=>Fred,
children=>[Tom,Mary,Angelique]);
$stone->dump;
name[0]=Fred
children[0]=Tom
children[1]=Mary
children[2]=Angelique
* A HASH reference
A subsidiary Stone object will be created and inserted into the
object as a nested structure.
$stone->insert(name=>Fred,
wife=>{name=>Agnes,age=>40});
$stone->dump;
name[0]=Fred
wife[0].name[0]=Agnes
wife[0].age[0]=40
* A Stone object or subclass
The Stone object will be inserted into the object as a nested
structure.
$wife = new Stone(name=>agnes,
age=>40);
$husband = new Stone;
$husband->insert(name=>fred,
wife=>$wife);
$husband->dump;
name[0]=fred
wife[0].name[0]=agnes
wife[0].age[0]=40
Stone::replace(%hash)
The replace() method behaves exactly like insert() with the
exception that if the indicated key already exists in the Stone,
its value will be replaced. Use replace() when you want to enforce
a single-valued tag/value relationship.
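The difference between appending and replacing can be shown on a plain hash of arrays. This is a simplified model, not the Stone implementation: insert_values and replace_values are hypothetical names, and hash references are stored as they are rather than being converted into nested objects.

```perl
use strict;
use warnings;

# Append semantics: each tag/value pair adds to the tag's list of values.
# An array reference contributes all of its elements.
sub insert_values {
    my ($record, @pairs) = @_;
    while (my ($tag, $value) = splice(@pairs, 0, 2)) {
        push @{ $record->{$tag} }, ref($value) eq 'ARRAY' ? @$value : $value;
    }
    return $record;
}

# Replace semantics: existing values for the given tags are discarded first.
sub replace_values {
    my ($record, %pairs) = @_;
    delete @{$record}{ keys %pairs };
    return insert_values($record, %pairs);
}

my $stone = {};
insert_values($stone, name => 'Fred', children => ['Tom', 'Mary']);
insert_values($stone, children => 'Angelique');    # children now has three values
replace_values($stone, name => 'Frederick');       # name has exactly one value
print "name: @{ $stone->{name} }; children: @{ $stone->{children} }\n";
```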
Stone::insert_list($key,@list), Stone::insert_hash($key,%hash),
Stone::replace_list($key,@list), Stone::replace_hash($key,%hash)
These are primitives used by the insert() and replace() methods.
Override them if you need to modify the default behavior.
Stone::delete($key)
This removes the indicated key from the Stone.
Stone::get($tag,$index)
This returns the value at the indicated tag and optional index.
What you get depends on whether it is called in a scalar or list
context. In a list context, you will receive all the values for
that tag. You may receive a list of scalar values or (for a nested
record) a list of Stone objects. If called in a scalar
context, you will either receive the first or the last member of
the list of values assigned to the tag. Which one you receive depends
on the value of the package variable $Stone::Fetchlast. If
undefined, you will receive the first member of the list. If nonzero,
you will receive the last member.
You may provide an optional index in order to force get() to return a
particular member of the list. Provide a 0 to return the first member
of the list, or '#' to obtain the last member.
Stone::index($indexstr)
You can access the contents of even deeply-nested Stone objects
with the index method. You provide a tag path, and receive
a value or list of values back.
Tag paths look like this:
tag1[index1].tag2[index2].tag3[index3]
Numbers in square brackets indicate which member of a multivalued tag
you're interested in getting. You can leave the square brackets out
in order to return just the first or the last tag of that name, in a scalar
context (depending on the setting of $Stone::Fetchlast). In an
array context, leaving the square brackets out will return all
multivalued members for each tag along the path.
You will get a scalar value in a scalar context and an array value in
an array context following the same rules as get(). You can
provide an index of '#' in order to get the last member of a list or
'?' to obtain a randomly chosen member of the list (this uses the rand() call,
so be sure to call srand() at the beginning of your program in order
to get different sequences of pseudorandom numbers). If
there is no tag by that name, you will receive undef or an empty list.
If the tag points to a subrecord, you will receive a Stone object.
Examples:
# Here's what the data structure looks like.
$s->insert(person=>{name=>Fred,
age=>30,
pets=>[Fido,Rex,Lassie],
children=>[Tom,Mary]},
person=>{name=>Harry,
age=>23,
pets=>[Rover,Spot]});
# Return all of Fred's children
@children = $s->index('person[0].children');
# Return Harry's last pet
$pet = $s->index('person[1].pets[#]');
# Return first person's first child
$child = $s->index('person.children');
# Return children of all persons
@children = $s->index('person.children');
# Return last person's last pet
$Stone::Fetchlast++;
$pet = $s->index('person.pets');
# Return any pet from any person
$pet = $s->index('person[?].pets[?]');
Note that index() may return a Stone object if the tag path
points to a subrecord.
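To make the tag path notation concrete, here is a small resolver over a nested hash of arrays. It is a sketch under simplifying assumptions: resolve_path is a hypothetical name, only scalar results are returned, a component without brackets always selects the first value, and the '?' index is not handled.

```perl
use strict;
use warnings;

# Resolve a tag path such as 'person[1].pets[#]' against a nested record
# (hash of tag => [values]; nested records are hash references).
# Simplification: a component without brackets selects the first value.
sub resolve_path {
    my ($record, $path) = @_;
    my $node = $record;
    for my $component (split /\./, $path) {
        my ($tag, $index) = $component =~ /^([^\[\]]+)(?:\[(\d+|#)\])?$/
            or return;
        return unless ref($node) eq 'HASH' && $node->{$tag};
        my $values = $node->{$tag};
        $index = 0         unless defined $index;
        $index = $#$values if $index eq '#';       # '#' means the last member
        $node = $values->[$index];
    }
    return $node;
}

my $people = { person => [
    { name => ['Fred'],  pets => ['Fido', 'Rex', 'Lassie'], children => ['Tom', 'Mary'] },
    { name => ['Harry'], pets => ['Rover', 'Spot'] },
] };
my $last_pet = resolve_path($people, 'person[1].pets[#]');   # Harry's last pet
print "$last_pet\n";
```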
Stone::tags()
Return all the tags in the Stone.
Stone::dump()
This is a debugging tool. Iterates through the Stone object and
prints out all the tags and values.
Example:
$s->dump;
person[0].children[0]=Tom
person[0].children[1]=Mary
person[0].name[0]=Fred
person[0].pets[0]=Fido
person[0].pets[1]=Rex
person[0].pets[2]=Lassie
person[0].age[0]=30
person[1].name[0]=Harry
person[1].pets[0]=Rover
person[1].pets[1]=Spot
person[1].age[0]=23
Stone::cursor()
Return an iterator over the Stone object. You can call this
several times in order to return independent iterators. See
Boulder::Cursor.
Boulder::Cursor::new($stone)
Return a new Boulder::Cursor over the specified Stone object.
This will return an error unless the object is a Stone or a
descendent.
Boulder::Cursor::each()
Iterate over the attached Stone. Each iteration will return a
two-valued list consisting of a tag path and a value. The tag path is
of a form that can be used with Stone::index() (in fact, a cursor
is used internally to implement the Stone::dump() method). When the
end of the Stone is reached, each() will return an empty list,
after which it will start over again from the beginning. If you
attempt to insert or delete from the stone while iterating over it,
all attached cursors will reset to the beginning.
For example:
$cursor = $s->cursor;
while (($key,$value) = $cursor->each) {
print "$value: BOW WOW!\n" if $key=~/pet/;
}
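The tag path and value pairs that a cursor yields can be pictured as a depth-first walk over the nested record. The following plain Perl sketch (flatten_record is a hypothetical name; it builds the whole list at once instead of iterating lazily, and visits tags in sorted order, which the real cursor need not do) produces paths in the same notation that dump() prints.

```perl
use strict;
use warnings;

# Flatten a nested record (hash of tag => [values]) into [path, value]
# pairs, depth first, using the tag[index].tag[index] path notation.
sub flatten_record {
    my ($record, $prefix) = @_;
    $prefix = '' unless defined $prefix;
    my @pairs;
    for my $tag (sort keys %$record) {
        my $values = $record->{$tag};
        for my $i (0 .. $#$values) {
            my $path  = $prefix . $tag . "[$i]";
            my $value = $values->[$i];
            if (ref($value) eq 'HASH') {
                push @pairs, flatten_record($value, $path . '.');
            } else {
                push @pairs, [$path, $value];
            }
        }
    }
    return @pairs;
}

my $owner = { person => [ { name => ['Harry'], pets => ['Rover', 'Spot'] } ] };
for my $pair (flatten_record($owner)) {
    my ($key, $value) = @$pair;
    print "$value: BOW WOW!\n" if $key =~ /pets/;
}
```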
Boulder::Cursor::reset()
This resets the cursor back to the beginning of the associated
Stone.
Boulder::Stream::new(IN,OUT)
The new() method creates a new Boulder::Stream object. You can
provide input and output filehandles. If you leave one or both
undefined, new() will default to standard input or standard output.
You are free to use files, pipes, sockets, and other types of file
handles.
Boulder::Stream::read_record(@taglist)
Every time read_record() is called, it will return a new Stone
object. The Stone will be created from the input stream, using
just the tags provided in the argument list. Pass no tags to receive
whatever tags are present in the input stream.
If none of the tags that you specify are in the current boulder
record, you will receive an empty Stone. At the end of the input
stream, you will receive undef.
If called in an array context, read_record() returns a list of
all stones from the input stream that contain one or more of the
specified tags.
Boulder::Stream::write_record($stone)
Write a Stone to the output filehandle.
Useful State Variables in a Boulder::Stream
Every Boulder::Stream has several state variables that you can
adjust. Set them in this fashion:
$a = new Boulder::Stream;
$a->{delim}=':';
$a->{record_start}='[';
$a->{record_end}=']';
$a->{passthru}=undef;
* delim
This is the delimiter character between tags and values, "=" by default.
* record_start
This is the start of nested record character, "{" by default.
* record_end
This is the end of nested record character, "}" by default.
* passthru
This determines whether unrecognized tags should be passed through
from the input stream to the output stream. This is 'true' by
default. Set it to undef to override this behavior.
Boulder::Store::new("database/path",writable)
The new() method creates a new Boulder::Store object and
associates it with the database file provided in the first
parameter (undef is a valid pathname, in which case all methods
work but the data isn't stored). The second parameter should
be a true value if you want to open the database for writing.
Otherwise it's opened read only.
Because the underlying storage implementation is not multi-user,
only one process can have the database open for writing at a time.
A fcntl()-based locking mechanism is used to give a process
that has the database opened for writing exclusive access to the
database. This also prevents the database from being opened
for reading while another process is writing to it (this is a
good thing). Multiple simultaneous processes can open the
database read only.
Physically the data is stored in a human-readable file
with the extension ".data".
Boulder::Store::read_record(@taglist)
The semantics of this call are exactly the same as in Boulder::Stream.
Stones are returned in sequential order, starting with the first record.
In addition to their built-in tags, each stone returned from this
call has an additional tag called "record_no". This is the zero-based
record number of the stone in the database. Use the reset() method
to begin iterating from the beginning of the database.
If called in an array context, read_record() returns a list of all
stones in the database that contain one or more of the provided tags.
Boulder::Store::write_record($stone)
This has the same semantics as Boulder::Stream. A stone is appended
to the end of the database. If successful, this call returns the
record number of the new entry.
Boulder::Store::get($record_no)
This is random access to the database. Provide a record number and
this call will return the stone stored at that position.
Boulder::Store::put($stone,$record_no)
This is a random write to the database. Provide a record number and
this call stores the stone at the indicated position, replacing whatever
was there before.
If no record number is provided, this call will look for the presence
of a 'record_no' tag in the stone itself and put it back in that position.
This allows you to pull a stone out of the database, modify it, and then
put it back in without worrying about its record number.
The record number of the inserted stone is returned from this call.
Boulder::Store::delete($stone)
Boulder::Store::delete($record_no)
These method calls delete a stone from the database. You can provide
either the record number or a stone containing the 'record_no' tag.
Warning: if the database is heavily indexed, deletes can be
time-consuming because the index must be brought back into
sync.
Boulder::Store::length()
This returns the length of the database, in records.
Boulder::Store::reset()
This resets the database, nullifying any queries in effect, and
causing read_record() to begin fetching stones from the first record.
Boulder::Store::query(%query_array)
This creates a query on the database used for selecting stones in
read_record(). The query is an associative array. Three types
of keys/value pairs are allowed:
* 'tag.path'=>value
This instructs Boulder::Store to look for stones containing the
specified tags in which the tag's value (determined by the Stone
index() method) exactly matches the provided
value. Example:
$db->query('STS.left_primer.length'=>30);
Only the non-bracketed forms of the index string are allowed (this
is probably a bug...)
If the tag path was declared to be an index, then this search
will be fast. Otherwise Boulder::Store must iterate over every
record in the database.
* EVAL=>'expression'
This instructs Boulder::Store to look for stones in which the
provided expression evaluates to true. When the expression
is evaluated, the variable $s will be set to the current
record's stone. As a shortcut, you can use "<index.string>"
as shorthand for "$s->index('index.string')".
* EVAL=>['expression1','expression2','expression3']
This lets you provide a whole bunch of expressions, and is exactly
equivalent to EVAL=>'(expression1) && (expression2) && (expression3)'.
You can mix query types in the parameter provided to query().
For example, here's how to look up all stones in which the sex is
male and the age is greater than 30:
$db->query('sex'=>'M',eval=>'<age> > 30');
When a query is in effect, read_record() returns only Stones
that satisfy the query. In an array context, read_record()
returns a list of all Stones that satisfy the query. When no
more satisfactory Stones are found, read_record() returns
undef until a new query is entered or reset() is called.
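The <index.string> shorthand amounts to a textual substitution performed before the expression is evaluated. The sketch below shows one way such an expansion could be written; expand_shorthand is a hypothetical name, and the pattern used here (word characters and dots between angle brackets) is an assumption rather than the rule Boulder::Store actually applies.

```perl
use strict;
use warnings;

# Expand every <index.string> into an explicit $s->index('index.string') call.
# Assumption: a shorthand consists only of word characters and dots.
sub expand_shorthand {
    my ($expression) = @_;
    $expression =~ s/<([\w.]+)>/\$s->index('${1}')/g;
    return $expression;
}

print expand_shorthand('<age> > 30'), "\n";
```

The expanded string could then be evaluated with $s bound to each candidate stone in turn, which is why an unindexed query has to visit every record.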
Boulder::Store::add_index(@indices)
Declare one or more tag paths to be a part of a fast index.
read_record() will take advantage of this index when processing
queries. For example:
$db->add_index('age','sex','person.pets');
You can add indexes any time you like, when the database is first
created or later. There is a trade off: write_record(),
put(), and other data-modifying calls will become slower as
more indexes are added.
The index is stored in an external file with the extension ".index".
An index file is created even if you haven't indexed any tags.
Boulder::Store::reindex_all()
Call this if the index gets screwed up (or lost). It rebuilds it
from scratch.
Because the delim, record_start and record_end characters in the
Boulder::Stream object are used in optimized (once-compiled)
pattern matching, you cannot change these values once read_record()
has been called. To change the defaults, you must create the
Boulder::Stream, set the characters, and only then begin reading
from the input stream. For the same reason, different
Boulder::Stream objects cannot use different delimiters.
See also: perl(1), perlbot(1)