NAME

Boulder.pm - A semantic-free data interchange format



DESCRIPTION

Boulder IO

Boulder IO is a simple TAG=VALUE data format designed for sharing data between programs connected via a pipe. It is also simple enough to use as a common data exchange format between databases, Web pages, and other data representations.

The basic data format is very simple. It consists of a series of TAG=VALUE pairs separated by newlines. It is record-oriented. The end of a record is indicated by an empty delimiter alone on a line. The delimiter is "=" by default, but can be adjusted by the user.

An example boulder stream looks like this:

    NAME=AFM036YB2
    SEQUENCE=GATCCGAGATACAGAGGATCCCACACACACACACAC....
    LEFT_START=0
    LEFT_LENGTH=14
    RIGHT_START=140
    RIGHT_LENGTH=20
    =
    NAME=MR1239
    SEQUENCE=GGGATCTTTCTCCACCTCCAGCGCACAGAGCAGGGGG...
    LEFT_START=2
    LEFT_LENGTH=20
    RIGHT_START=130
    RIGHT_LENGTH=18
    ALIAS=WI-1000
    ALIAS=XB2023
    =

Notes:

Extended Boulder Format

The simple boulder format has been extended to accommodate nested relations and other interesting structures. The new format allows nested records to be created in this way:

    NAME=AFM036YB2
    SEQUENCE=GATCCGAGATACCCCACACACACACACAC....
    STS={
      LEFT_START=0
      LEFT_LENGTH=14
      RIGHT_START=140
      RIGHT_LENGTH=20
    }
    =
    NAME=MR1239
    SEQUENCE=GGGATCTTTCTCCACCTTGGAGAGCAGGGGG...
    STS={
      LEFT_START=2
      LEFT_LENGTH=20
      RIGHT_START=130
      RIGHT_LENGTH=18
    }
    ALIAS=WI-1000
    ALIAS=XB2023
    =

As in the original format, tags may be multivalued. For example, there might be several STS records assigned to a sequence. Each subrecord may contain further subrecords.
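To make the record structure concrete, here is a minimal sketch of a reader for the extended format, written in plain Perl without the Boulder modules. It is not the parser used by Boulder::Stream. The function name parse_boulder and the hash-of-arrays representation are assumptions made for this illustration, and the sketch handles only the default "=", "{" and "}" characters.

```perl
use strict;
use warnings;

# Parse boulder records from a list of lines. Each record is a hash ref
# mapping a tag to an array ref of values; a value is either a string or
# a nested record (another hash ref). Illustrative only: parse_boulder is
# a hypothetical helper, not part of the Boulder distribution.
sub parse_boulder {
    my @lines = @_;
    my @records;
    my @stack = ({});    # $stack[-1] is the innermost open (sub)record

    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;                 # ignore indentation and newlines
        if ($line eq '=') {                      # end of a top-level record
            die "unclosed subrecord\n" if @stack > 1;
            push @records, $stack[0];
            @stack = ({});
        }
        elsif ($line eq '}') {                   # end of a subrecord
            die "unbalanced '}'\n" if @stack < 2;
            pop @stack;
        }
        elsif ($line =~ /^([^=]+)=\{$/) {        # TAG={ opens a subrecord
            my $child = {};
            push @{ $stack[-1]{$1} }, $child;
            push @stack, $child;
        }
        elsif ($line =~ /^([^=]+)=(.*)$/) {      # plain TAG=VALUE
            push @{ $stack[-1]{$1} }, $2;
        }
    }
    return @records;
}

my @records = parse_boulder(
    'NAME=MR1239',
    'STS={',
    '  LEFT_START=2',
    '  LEFT_LENGTH=20',
    '}',
    'ALIAS=WI-1000',
    'ALIAS=XB2023',
    '=',
);

print "name:  $records[0]{NAME}[0]\n";
print "left:  $records[0]{STS}[0]{LEFT_START}[0]\n";
print "alias: @{ $records[0]{ALIAS} }\n";
```

Every tag maps to a list, even when it has a single value, so multivalued tags such as ALIAS and repeated subrecords such as STS need no special casing.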

Pass through behavior

A major attribute of the boulderio protocol is the convention that programs reading boulderio streams should only remove from the stream those tags that they're interested in. Any unrecognized tags are passed through to other programs that might be interested. In fact, most programs will want to put the tags back into the boulder stream once they're finished, potentially adding their own. Of course some programs will want to behave differently. For example, a database query program will generate but not read a boulderio stream, while a report generator will read but not write the stream.

This convention allows the following type of pipe to be set up:

    query_database | find_vector | find_dups | \
        blast_sequence | pick_primer | mail_report

If all the programs in the pipe follow the conventions, then it will be possible to interpose other programs, such as a repetitive element finder, in the middle of the pipe without disturbing other components.
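The pass-through convention can be illustrated with a small filter written in plain Perl, without the Boulder modules. This is only a sketch for flat (non-nested) records: the function name add_seq_length and the SEQ_LENGTH tag are made up for the example. The filter looks only at the SEQUENCE tag, copies every input line to its output, and adds one tag of its own just before the end of each record.

```perl
use strict;
use warnings;

# Copy a flat boulder stream from $in to $out, adding a SEQ_LENGTH tag
# (a hypothetical tag invented for this example) to every record that
# has a SEQUENCE. All input lines pass through unchanged.
sub add_seq_length {
    my ($in, $out) = @_;
    my $length;                       # length of the current record's SEQUENCE
    while (my $line = <$in>) {
        chomp $line;
        if ($line eq '=') {           # end of record: emit our tag first
            print {$out} "SEQ_LENGTH=$length\n" if defined $length;
            undef $length;
        }
        elsif ($line =~ /^SEQUENCE=(.*)$/) {
            $length = length $1;
        }
        print {$out} "$line\n";       # pass every input line through
    }
}

# Demonstrate on an in-memory stream; a real filter would use STDIN/STDOUT.
my $input  = "NAME=MR1239\nSEQUENCE=GGGATC\nALIAS=WI-1000\n=\nNAME=X1\n=\n";
my $output = '';
open my $in,  '<', \$input  or die "cannot open input: $!";
open my $out, '>', \$output or die "cannot open output: $!";
add_seq_length($in, $out);
close $out;
print $output;
```

Because the NAME and ALIAS lines are never interpreted, a downstream program that cares about them still sees them exactly as they arrived.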


SKELETON BOULDER PROGRAM

Here is a skeleton example.

    #!/usr/local/bin/perl

    # This is an example of pulling out a series of
    # sequence records, picking primer pairs, and
    # inserting the new STS into the stream.

    use Boulder::Stream;
    $stream = new Boulder::Stream;
    while ($stone = $stream->read_record('SEQUENCE')) {
        $primers = &pick_primers($stone->get('SEQUENCE'));
        if ($primers) {
            $stone->insert(PRIMER_OUTCOME=>1,
                           STS=>$primers);
        } else {
            $stone->insert(PRIMER_OUTCOME=>0);
        }
    } continue {
        $stream->write_record($stone);
    }

    sub pick_primers {
        my $sequence = shift;
        return undef unless ($lstart,$llen,$rstart,$rlen) =
            &call_primer_picking_algorithm($sequence);

        # Create and return a new Stone to drop into the
        # stream.
        return new Stone(LEFT_START=>$lstart,
                         LEFT_LENGTH=>$llen,
                         RIGHT_START=>$rstart,
                         RIGHT_LENGTH=>$rlen);
    }

As the example shows, the code starts by creating a Boulder::Stream object to handle the I/O. It next enters a read/write loop in which the tags the program is interested in are returned one record at a time. In the body of the loop we create a subrecord (a so-called Stone object) containing a newly chosen STS primer pair. This is then added to the record with an STS tag, along with a PRIMER_OUTCOME tag indicating whether primer picking was successful.

A full explanation of the objects and methods follows.


CLASSES

There are four classes defined in Boulder:

* Stone
This is a cute name for a data record. A Stone consists of a series of tag/value pairs. Any given tag may appear once or more than once. A value can be another Stone, allowing nested components. A big Stone can be made up of a lot of little stones (pebbles?).

Stone was designed for subclassing. You should be able to create subclasses which create or require particular tags and data formats. However, in order for things to continue to work properly, all subclasses should contain the Stone:: prefix, e.g. Stone::Sequence, Stone::Image, Stone::RobotRun, etc.

* Boulder::Cursor
This class is an iterator over a Stone. Multiple Boulder::Cursor objects may be active simultaneously. Each one will maintain its position independently of the others. These cursors do not, however, tolerate insertions into or deletions from their associated Stone object, and will reset to the beginning of the data structure if you try to modify the object while iterating over it.

* Boulder::Stream
This class handles I/O. It knows how to create Stone objects from Boulderio input data streams, and how to write Stone objects out.

* Boulder::Store
This is a simple storage class for boulder. It uses the Berkeley dbm interface to create files in which Stones can be accessed randomly. It also supports quick indexed retrieval and queries.


METHODS

Stone Methods

Stone::new()
This is the constructor for the Stone class. It can be called without any parameters, in which case it creates an empty Stone, or passed an associative array (see Stone::insert) in order to initialize it with a set of keys. Examples:

    $myStone = new Stone;
    $myStone = new Stone(name=>'fred',age=>30);

Stone::insert(%hash)
This is the main method for adding tags to a Stone. This method expects an associative array as an argument. The contents of the associative array will be inserted into the Stone. If a particular tag is already present in the Stone, the new value will be appended to the list of values for that tag. Several types of values are legal:

* A scalar value
The value will be inserted into the Stone.

    $stone->insert(name=>'Fred',
                   age=>30,
                   sex=>'M');
    $stone->dump;

    name[0]=Fred
    age[0]=30
    sex[0]=M

* An ARRAY reference
A multi-valued tag will be created:

    $stone->insert(name=>'Fred',
                   children=>['Tom','Mary','Angelique']);
    $stone->dump;

    name[0]=Fred
    children[0]=Tom
    children[1]=Mary
    children[2]=Angelique

* A HASH reference
A subsidiary Stone object will be created and inserted into the object as a nested structure.

    $stone->insert(name=>'Fred',
                   wife=>{name=>'Agnes',age=>40});
    $stone->dump;

    name[0]=Fred
    wife[0].name[0]=Agnes
    wife[0].age[0]=40

* A Stone object or subclass
The Stone object will be inserted into the object as a nested structure.

    $wife = new Stone(name=>'agnes',
                      age=>40);
    $husband = new Stone;
    $husband->insert(name=>'fred',
                     wife=>$wife);
    $husband->dump;

    name[0]=fred
    wife[0].name[0]=agnes
    wife[0].age[0]=40


Stone::replace(%hash)
The replace() method behaves exactly like insert() with the exception that if the indicated key already exists in the Stone, its value will be replaced. Use replace() when you want to enforce a single-valued tag/value relationship.

Stone::insert_list($key,@list), Stone::insert_hash($key,%hash),
Stone::replace_list($key,@list), Stone::replace_hash($key,%hash)

These are primitives used by the insert() and replace() methods. Override them if you need to modify the default behavior.

Stone::delete($key)
This removes the indicated key from the Stone.

Stone::get($tag,$index)
This returns the value at the indicated tag and optional index. What you get depends on whether it is called in a scalar or list context. In a list context, you will receive all the values for that tag: a list of scalar values or, for a nested record, a list of Stone objects. In a scalar context, you will receive either the first or the last member of the list of values assigned to the tag. Which one you receive depends on the value of the package variable $Stone::Fetchlast. If it is undefined, you will receive the first member of the list. If it is nonzero, you will receive the last member.

You may provide an optional index in order to force get() to return a particular member of the list. Provide a 0 to return the first member of the list, or '#' to obtain the last member.

Stone::index($indexstr)
You can access the contents of even deeply-nested Stone objects with the index method. You provide a tag path, and receive a value or list of values back.

Tag paths look like this:

tag1[index1].tag2[index2].tag3[index3]

Numbers in square brackets indicate which member of a multivalued tag you're interested in getting. In a scalar context, you can leave the square brackets out in order to return just the first or the last tag of that name (depending on the setting of $Stone::Fetchlast). In an array context, leaving the square brackets out will return all multivalued members for each tag along the path.

You will get a scalar value in a scalar context and an array value in an array context, following the same rules as get(). You can provide an index of [#] to get the last member of a list, or [?] to obtain a randomly chosen member of the list. The latter uses the rand() call, so be sure to call srand() at the beginning of your program in order to get different sequences of pseudorandom numbers. If there is no tag by that name, you will receive undef or an empty list. If the tag points to a subrecord, you will receive a Stone object.

Examples:

    # Here's what the data structure looks like.
    $s->insert(person=>{name=>'Fred',
                        age=>30,
                        pets=>['Fido','Rex','Lassie'],
                        children=>['Tom','Mary']},
               person=>{name=>'Harry',
                        age=>23,
                        pets=>['Rover','Spot']});

    # Return all of Fred's children
    @children = $s->index('person[0].children');

    # Return Harry's last pet
    $pet = $s->index('person[1].pets[#]');

    # Return first person's first child
    $child = $s->index('person.children');

    # Return children of all persons
    @children = $s->index('person.children');

    # Return last person's last pet
    $Stone::Fetchlast++;
    $pet = $s->index('person.pets');

    # Return any pet from any person
    $pet = $s->index('person[?].pets[?]');

Note that index() may return a Stone object if the tag path points to a subrecord.
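The way a tag path is walked can be sketched in plain Perl over an ordinary nested data structure. This is an illustration only, not the code behind Stone::index(): the function name resolve_path and the hash-of-arrays layout are assumptions, and the sketch covers just scalar context with $Stone::Fetchlast unset, leaving out [?] and list context.

```perl
use strict;
use warnings;

# Resolve a tag path such as 'person[1].pets[#]' against a nested record
# stored as a hash of arrays. A component without brackets selects the
# first value, and [#] selects the last. resolve_path is a hypothetical
# helper for this example, not a Boulder function.
sub resolve_path {
    my ($record, $path) = @_;
    my $value = $record;
    for my $step (split /\./, $path) {
        my ($tag, $index) = $step =~ /^([^\[\]]+)(?:\[(\d+|#)\])?$/
            or die "bad path component '$step'\n";
        return unless ref $value eq 'HASH' && $value->{$tag};
        my $list = $value->{$tag};
        $index = 0       unless defined $index;   # no brackets: first value
        $index = $#$list if $index eq '#';        # [#]: last value
        $value = $list->[$index];
        return unless defined $value;
    }
    return $value;
}

my $record = {
    person => [
        { name     => ['Fred'],
          age      => [30],
          pets     => ['Fido', 'Rex', 'Lassie'],
          children => ['Tom', 'Mary'] },
        { name => ['Harry'],
          age  => [23],
          pets => ['Rover', 'Spot'] },
    ],
};

print resolve_path($record, 'person[1].pets[#]'), "\n";   # Harry's last pet
print resolve_path($record, 'person.children'), "\n";     # first person's first child
print resolve_path($record, 'person[0].name'), "\n";      # first person's name
```

A path that names a missing tag, such as person[1].children here, yields undef, in line with the behavior described for index() in a scalar context.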

Stone::tags()
Return all the tags in the Stone.

Stone::dump()
This is a debugging tool. Iterates through the Stone object and prints out all the tags and values.

Example:

    $s->dump;

    person[0].children[0]=Tom
    person[0].children[1]=Mary
    person[0].name[0]=Fred
    person[0].pets[0]=Fido
    person[0].pets[1]=Rex
    person[0].pets[2]=Lassie
    person[0].age[0]=30
    person[1].name[0]=Harry
    person[1].pets[0]=Rover
    person[1].pets[1]=Spot
    person[1].age[0]=23
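The dump layout pairs a full tag path with each leaf value. A rough plain-Perl sketch of such a walk over a hash-of-arrays structure follows. It is not the Stone::dump() implementation: the function name flatten and the data layout are assumptions for the example, and tags are sorted here only so that the output order is predictable.

```perl
use strict;
use warnings;

# Walk a nested record (a hash of arrays, where a value may itself be a
# record) and return "tag.path=value" lines in the style shown for dump().
# flatten is a hypothetical helper written for this illustration.
sub flatten {
    my ($record, $prefix) = @_;
    $prefix = '' unless defined $prefix;
    my @lines;
    for my $tag (sort keys %$record) {
        my $values = $record->{$tag};
        for my $i (0 .. $#$values) {
            my $path = $prefix . $tag . "[$i]";
            if (ref $values->[$i] eq 'HASH') {
                # Subrecord: recurse with the path so far as the prefix.
                push @lines, flatten($values->[$i], "$path.");
            }
            else {
                push @lines, "$path=$values->[$i]";
            }
        }
    }
    return @lines;
}

my $record = {
    person => [
        { name => ['Harry'], pets => ['Rover', 'Spot'] },
    ],
};
print "$_\n" for flatten($record);
```

Each path produced this way has the same form as the tag paths accepted by index().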

Stone::cursor()
Return an iterator over the Stone object. You can call this several times in order to return independent iterators. See Boulder::Cursor.

Boulder::Cursor Methods

Boulder::Cursor::new($stone)
Return a new Boulder::Cursor over the specified Stone object. This will return an error unless the object is a Stone or a descendent.

Boulder::Cursor::each()
Iterate over the attached Stone. Each iteration will return a two-valued list consisting of a tag path and a value. The tag path is of a form that can be used with Stone::index() (in fact, a cursor is used internally to implement the Stone::dump() method). When the end of the Stone is reached, each() will return an empty list, after which it will start over again from the beginning. If you attempt to insert into or delete from the stone while iterating over it, all attached cursors will reset to the beginning.

For example:

    $cursor = $s->cursor;
    while (($key,$value) = $cursor->each) {
        print "$value: BOW WOW!\n" if $key=~/pet/;
    }

Boulder::Cursor::reset()
This resets the cursor back to the beginning of the associated Stone.

Boulder::Stream Methods

Boulder::Stream::new(IN,OUT)
The new() method creates a new Boulder::Stream object. You can provide input and output filehandles. If you leave one or both undefined, new() will default to standard input or standard output. You are free to use files, pipes, sockets, and other types of filehandles.

Boulder::Stream::read_record(@taglist)
Every time read_record() is called, it will return a new Stone object. The Stone will be created from the input stream, using just the tags provided in the argument list. Pass no tags to receive whatever tags are present in the input stream.

If none of the tags that you specify are in the current boulder record, you will receive an empty Stone. At the end of the input stream, you will receive undef.

If called in an array context, read_record() returns a list of all stones from the input stream that contain one or more of the specified tags.

Boulder::Stream::write_record($stone)
Write a Stone to the output filehandle.

Useful State Variables in a Boulder::Stream
Every Boulder::Stream has several state variables that you can adjust. Set them in this fashion:

    $a = new Boulder::Stream;
    $a->{delim}=':';
    $a->{record_start}='[';
    $a->{record_end}=']';
    $a->{passthru}=undef;

* delim
This is the delimiter character between tags and values, "=" by default.

* record_start
This is the start of nested record character, "{" by default.

* record_end
This is the end of nested record character, "}" by default.

* passthru
This determines whether unrecognized tags should be passed through from the input stream to the output stream. This is 'true' by default. Set it to undef to override this behavior.


Boulder::Store Methods

Boulder::Store::new("database/path",writable)
The new() method creates a new Boulder::Store object and associates it with the database file provided in the first parameter (undef is a valid pathname, in which case all methods work but the data isn't stored). The second parameter should be a true value if you want to open the database for writing. Otherwise it's opened read only.

Because the underlying storage implementation is not multi-user, only one process can have the database open for writing at a time. A fcntl()-based locking mechanism is used to give a process that has the database opened for writing exclusive access to the database. This also prevents the database from being opened for reading while another process is writing to it (this is a good thing). Multiple simultaneous processes can open the database read only.

Physically the data is stored in a human-readable file with the extension ".data".

Boulder::Store::read_record(@taglist)
The semantics of this call are exactly the same as in Boulder::Stream. Stones are returned in sequential order, starting with the first record. In addition to their built-in tags, each stone returned from this call has an additional tag called "record_no". This is the zero-based record number of the stone in the database. Use the reset() method to begin iterating from the beginning of the database.

If called in an array context, read_record() returns a list of all stones in the database that contain one or more of the provided tags.

Boulder::Store::write_record($stone)
This has the same semantics as Boulder::Stream. A stone is appended to the end of the database. If successful, this call returns the record number of the new entry.

Boulder::Store::get($record_no)
This is random access to the database. Provide a record number and this call will return the stone stored at that position.

Boulder::Store::put($stone,$record_no)
This is a random write to the database. Provide a record number and this call stores the stone at the indicated position, replacing whatever was there before.

If no record number is provided, this call will look for the presence of a 'record_no' tag in the stone itself and put it back in that position. This allows you to pull a stone out of the database, modify it, and then put it back in without worrying about its record number.

The record number of the inserted stone is returned from this call.

Boulder::Store::delete($stone)
Boulder::Store::delete($record_no)
These method calls delete a stone from the database. You can provide either the record number or a stone containing the 'record_no' tag. Warning: if the database is heavily indexed, deletes can be time-consuming because each one requires the index to be brought back into sync.

Boulder::Store::length()
This returns the length of the database, in records.

Boulder::Store::reset()
This resets the database, nullifying any queries in effect, and causing read_record() to begin fetching stones from the first record.

Boulder::Store::query(%query_array)
This creates a query on the database used for selecting stones in read_record(). The query is an associative array. Three types of key/value pairs are allowed:

* A tag path and a value
This instructs Boulder::Store to look for stones containing the specified tags in which the tag's value (determined by the Stone index() method) exactly matches the provided value. Example:

    $db->query('STS.left_primer.length'=>30);

Only the non-bracketed forms of the index string are allowed (this is probably a bug...)

If the tag path was declared to be an index, then this search will be fast. Otherwise Boulder::Store must iterate over every record in the database.

* An EVAL expression
This instructs Boulder::Store to look for stones in which the provided expression evaluates to true. When the expression is evaluated, the variable $s will be set to the current record's stone. As a shortcut, you can use "<index.string>" as shorthand for "$s->index('index.string')".

* Several EVAL expressions
This lets you provide a whole bunch of expressions, and is exactly equivalent to EVAL=>'(expression1) && (expression2) && (expression3)'.

You can mix query types in the parameter provided to query(). For example, here's how to look up all stones in which the sex is male and the age is greater than 30:

$db->query('sex'=>'M',eval=>'<age> > 30');

When a query is in effect, read_record() returns only Stones that satisfy the query. In an array context, read_record() returns a list of all Stones that satisfy the query. When no more satisfactory Stones are found, read_record() returns undef until a new query is entered or reset() is called.
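The "<index.string>" shorthand amounts to a textual rewrite followed by an evaluation with $s bound to a stone. The plain-Perl sketch below shows one way that could work. It is not taken from Boulder::Store: the Demo::Stone package and the functions expand_shorthand and matches are stand-ins invented for this example, and Demo::Stone only looks up top-level tags.

```perl
use strict;
use warnings;

# Stand-in for a Stone with just enough of an index() method for this
# demonstration. Demo::Stone is hypothetical, not part of Boulder.
package Demo::Stone;
sub new   { my ($class, %tags) = @_; return bless { %tags }, $class }
sub index { my ($self, $path) = @_; return $self->{$path} }

package main;

# Rewrite each <tag.path> in an expression as $s->index('tag.path').
sub expand_shorthand {
    my ($expr) = @_;
    $expr =~ s/<([\w.]+)>/\$s->index('$1')/g;
    return $expr;
}

# Evaluate an expression with $s bound to the given stone. The string
# eval sees the lexical $s declared in this subroutine.
sub matches {
    my ($s, $expr) = @_;
    my $code   = expand_shorthand($expr);
    my $result = eval $code;
    die "bad query expression '$expr': $@" if $@;
    return $result ? 1 : 0;
}

print expand_shorthand('<age> > 30'), "\n";

my $stone = Demo::Stone->new(sex => 'M', age => 42);
print matches($stone, '<age> > 30'), "\n";
```

Since the expression is handed to eval, it runs as ordinary Perl code, so query expressions should only come from trusted sources.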

Boulder::Store::add_index(@indices)
Declare one or more tag paths to be a part of a fast index. read_record() will take advantage of this index when processing queries. For example:

$db->add_index('age','sex','person.pets');

You can add indexes any time you like, when the database is first created or later. There is a trade off: write_record(), put(), and other data-modifying calls will become slower as more indexes are added.

The index is stored in an external file with the extension ".index". An index file is created even if you haven't indexed any tags.

Boulder::Store::reindex_all()
Call this if the index gets screwed up (or lost). It rebuilds it from scratch.


BUGS

Because the delim, record_start and record_end characters in the Boulder::Stream object are used in optimized (once-compiled) pattern matching, you cannot change these values once read_record() has been called. To change the defaults, you must create the Boulder::Stream, set the characters, and only then begin reading from the input stream. For the same reason, different Boulder::Stream objects cannot use different delimiters.


SEE ALSO

the perl(1) manpage, the perlbot(1) manpage