The Picolisp database vs. ORMS.

This is an article adapted from a mailing list conversation on the subject. The original message was written by beneroth and I, mtsd, have only made slight adjustments.

The big fundamental difference between PicoLisp database architecture and ORMs is that in PicoLisp the application layer and the database layer is the same layer, it is not two very different conceptual systems (OOP vs. Relational) with a "translation system" in between.

This results in several benefits when developing applications using PicoLisp database:

Higher productivity

In PicoLisp, the database "table definition" and the application class is the same thing. In other stacks you have to do the "object definition" multiple times: in the database schema, an application class, usually you do additional a "business model" object (or even multiple). Every time it is basically the same "grouping of some properties/columns" with only small (but substantial) deviations.

Even when these frameworks generate you some or all of the extra code, it is in practice often necessary to look at this code and adjust it so it works as intended, especially when doing changes on an existing application.

Less code, less space for bugs

It is easier to analyze and debug what is happening with the data between input/storage and storage/output without having to track the values being copied from one intermediate (layer transition) object to the next.

Simpler mental model

As less screws and mechanics are present in the PicoLisp stack, it is easier to understand and imagine in your head what does happen, easier to contrive which potential effects you need to take care of.

In mainstream stacks, the conceptual and technical barrier between OOP application programming and relational SQL database programming is often very pronounced. In my experience, application programmers these days often have a very incomplete understanding of how a database works (e.g. they don't know about transactions), which results in the best case in insufficient usage of the DBMS (e.g. amateurish re-implementing of functionality in the application which would be available from the DBMS in a tested, optimized and reliable way), or widespread disregard of fundamental data integrity vulnerabilities in the worst case, which depending of the application might result in minor annoyances up to complete and utter failure of the whole thing (e.g. losing vital data, might bankrupt a company).

Queries and application cache

It is possible to write very involved and powerful queries on the application level in PicoLisp. These queries make use of the database data cached on the "application layer", meaning the database is only asked for records which were not already accessed in the same session.

Most (maybe all?) ORMs cannot take advantage of the records they have already cached, because the ORM has only a very limited implementation of SQL, so the query (e.g. written with LINQ on the application level) gets translated to SQL and sent to the database to be executed there, and then the results are sent back to the application - even data is sent forth and back which would already be cached on the application layer.

For example, Entity Framework only uses its own cache when directly fetching a record by primary key (id), all other queries are executed on the database as Entity Framework has no query engine.

This hurts performance even more when the database is located on a different server than the application, as it is often the case.

PicoLisp database is not relational

There is no "primary key" in PicoLisp

While you might often have a (rel id (+Key +Number)) to have an simple index to find all records of the same entity, such a property is completely optional.
A record only needs to be in at least one index or being linked to from another record, so it can be found and is not seen as garbage (as in garbage collecting object to which nothing points).

Every object has a global unique identifier

Global per database.
This is the name of the external symbol, which encodes the physical location in the database file where the record is stored.
This makes it possible to directly link a record to another (also many to many).
In relational databases this requires a foreign key (which is an index, roughly comparable to +Ref).
In relational databases, many-to-many relations require an interim table. This interim table is usually hidden by the ORM.

Lazy loading is possible by having a global unique identifier for every record

In relational databases, the "link" is storing the value of the primary key of another record.
Retrieving the complete record in a relational database means:
1. Find the corresponding entry in the specific primary key index.
2. From this the database retrieves the pointer to the physical location in the database file.
3. Read the record from the database file.
Retrieving the complete record in PicoLisp database means:
1. Read the record from the database file.

Scenario: you have a number of interlinked records (as natural in OOP), and you want to have them available on the application layer.

ORM:

Get a list of records from the database (or a single record)
Create for each of them an application object, copying all properties/fields from the database record to the application object.
When the property/field is a foreign key, retrieve the corresponding record from the database, create another application object and link it (by memory pointer) to the other related application object(s).
Repeat 2) and 3) for every such indirectly retrieved record because it was referred to (by foreign key) from the previous processed record.

This is eager loading, i.e. loading everything reachable from the "starting point" records returned by the query.

Look out for indirect circular relations - direct circular relations might be handled by the ORM, but indirect ones often aren't. Endless recursion until you get stack overflow within the ORM code may very well occur. (I just recently managed to do this by accident with an Entity Framework application.)

Obviously it often leads to "over"-loading:

Retrieving more records from the database than actually required/used for the current operation in the application.
Creating all those extra application objects and doing all this copying of property/field values.

ORMs do allow to customize this "automatic default loading", but:

This means extra work, extra source of bugs, ...
When looking at an application you haven't written yourself, you have to dive into this code to know what is loaded, when and how.

PicoLisp database:

Get a list of records from the database (or a single record).
When the property/field is a +Link (or +Joint), we do nothing special with it - the value just happens to be the name of an external symbol (or multiple symbols).
Only when the related record is accessed in the application is it retrieved from the database e.g. accessed via (get This address street name).

This is lazy loading, only load when we actually use it. Circular relations are no problem, doesn't matter how indirect/nested:

On first access the record is loaded.
On further access the already cached object is re-used.

PicoLisp database is a multi-paradigm database system

Direct support for OOP

Application objects can easily be persisted, without copying from/to another structure.
Inheritance and polymorphism can be applied to "tables" (as Entities are real OOP classes).

Each record can have it's own specific schema.

Same as "NoSQL" document databases (e.g. MongoDB).
Maintenance nightmare when used in typical business applications, but extremely useful for e.g. archive systems (old records may keep their old schema).

Records can be directly interlinked, and these links can be traversed easily and efficiently.

Same as in graph databases.

No limits on field (property) size or record size.

Well there are (high) limits eventually, but if hit them, you did most likely something very badly wrong in your design.
Text/strings can be of arbitrary lengths.
Numbers can be of arbitrary lengths.
So some rather typical issue of applications using SQL databases simply don't exist.

Example 1: my name/street/city/comment/whatever is 21 characters long but the database only allows 15.

Example 2: We have to change the primary key from an int to a long, because we have so many records we reached the size limit of the int.... (yep, this happens.)

This is really a even more advantageous feature than one might think at first. It really is very useful in practice, since typical applications require various changes to the application and the database whenever such an limit has to be increased. WHen this happens, all the ORM-generated code usually has to be adapted as well.

Apart from the ORM itself, there exists an entire class of software created to handle changes to the database, for example migrations in the Django framework, and Active Record migrations in Ruby on Rails. This adds both additional code and complexity. One part is added just to handle another part of the framework

Relational databases have limits on the maximum possible size of an individual record.

Practically, this means: the number of columns in a table is limited, as the combined maximal size of each column must not become higher than the record size limit.
No such limit in PicoLisp (as long the symbol fits into memory and you can still come up with names for all these properties...).

Support for various indexing strategies.

Unique index:
- +Key
Non-unique index:
- +Ref
Various text index strategies
- +Idx
- +IdxFold
- +Fold +Idx
- +Sn (soundex, find by phonetic similarity)
Index based on multiple properties (columns)
- +Aux
Spatial index
- +UB
- For geospatial or other data with multiple numeric dimensions.
And more..
- E.g. creating your own custom indexing mechanic is well possible in PicoLisp.
- Not so much in usual relational database systems.
- +Swap for lazy loading fields/properties:
  - In relational databases, you do this with a separate table and a 1-to-1 relation.
  - An ORM often doesn't know better than to do eager loading, so when using an ORM this method usually needs additional tweaking of the ORM.
  - 1-to-1 relations are often vehemently discouraged in SQL teachings(it violates the beuty of the relational model), but in practice often a meaningful method to improve performance by preventing "loading too much".
  - Fun fact: the founders of relational model (the theory applied in "relational" databases) argue that the current implementations do not adhere to the relational model, so these "relational" databases are not even really relational in pure theory.

Full ACID <https://en.wikipedia.org/wiki/ACID> support.

NoSQL databases usually lack it, basically shoveling the responsibility for data integrity to the application layer. Application programmers are often not aware of this and more than one company got bankrupt due to data loss.

Mainstream relational databases have full ACID support, but the default transaction guarantees are often rather weak, as they are optimized for performance benchmarks. Strong guarantees can be easily activated, if you know the right magic words, but the right magic words differ between different database management systems and the words themselves are often somewhat misleading. Support for transactions in ORMs is usually rather weak.

Block size configuration can be finely granulated.

See (dbs).

Most database systems store data in evenly sized slots in the database file. Filesystems (which are a kind of database system) work the same way (e.g. sectors on a harddrive). This allows quick jumps to a specific location within the database file, e.g. the Nth entry = N multiplied with the blocksize (same as pointer arithmetic).

This is in contrast to a text file where the lines have varying lengths separated by a linebreak character. So, to get to the Nth line, you have to go from the start through every character one by one, and count the linebreaks.

A database record typically occupies one or several such blocks but it can't occupy only half a block. Unused space in a block is overhead (e.g. many filesystems allow you to see the "real file size" vs. "size on disk", where "size on disk" is always equal or bigger = filesystem blocksize).

So you want the blocksize to fit the typical (median) size of your database records. Of course, this is specific to every application.

In most database systems you can have multiple database files (or "partitions"), each with a fixed blocksize. Tables get assigned to those files/partitions. PicoLisp databases work the same way.

In most database systems, this blocksize is configurable in the database software, by an expert for that database software, not by the typical programmer. Usually this is not adjusted to the use case, but the database software defaults are used

Most database systems start with blocksizes of 4kB or higher. Afaik this is usually the page-size of the filesystem and/or the OS caching of files

A table in a relational database is kinda one big array. In PicoLisp database the blocksize can be much smaller, as data is not stored in fixed arrays but in external symbols (random access).

In relational databases, a table usually has to be stored completely in one database file, as a consequence there is only one blocksize for the whole table

In PicoLisp, every record could be (created) in another database file. Moving a record later to another database file is not feasible, because this would change the name of its external symbol, so all references to that record would need to be changed. If a re-adjustment of the blocksize has to be done on a PicoLisp database, the usual way is to export and freshly import the data

If the blocksize is properly defined in *Dbs for every Entity, index and +Swap, then:

The PicoLisp database might have less storage overhead due to padding in ill-fitting blocks (in the database file) than a comparable relational database.
Automatic caching by the OS / filesystem might be more efficient (as more records fit into the cache).
This effect is highly depending on the size and usage patterns of the data (so all in all this point might be rather academic).

Support for Blobs (files, or other arbitrary binary data)

Picolisp does not store Blobs in the database files, but in separate files, referenced in the database. This is usually much more efficient than pressuring such Blobs into the database blocksize.

Such files can easily (and usually efficiently) be accessed but are not cached in the application/database cache. This can be an advantage or an disadvantage, depending on the use case.

Typically Blobs are used to store files which are not processed much by the application, just imported (or generated) and exported to the user.

In such cases, this mechanic for storing Blobs is clearly an advantage. In other scenarios such Blobs might be Bit-Tables used in the application, which should be cached. Though in PicoLisp such Bit-Tables could be stored as +Number, no need to use +Blob for this.

Disadvantages compared to relational databases

While PicoLisp database is well-suited for business applications, it is less ideal for storing huge amounts of strictly uniformly structured data (e.g. raw data for machine learning). Here it has two disadvantages compared to relational databases:

PicoLisp stores the property names for every non-NIL property on a record.

This is a small overhead per record and property, which might accumulate to unfavorable levels when storing huge amounts of similar data (similar to JSON, XML, ...)

Relational databases store data as arrays, so the field names are not stored on every record (similar to CSV) This is why Alex uses very short property names (e.g. "nm" instead of "name"). Readability of the code is a habit thing.

External symbols are good for random access, not so good for bulk read/export of data.

This point is also valid for bulk insert, to a degree. A relational database stores data similar to an array - in continuing blocks (per table).

Picolisp database stores data in external symbols external symbols stored next to each other might be completely unrelated (except same blocksize).

When a database query happens to request the data exactly in the order as it is physically stored in the relational database (usually by primary key, Microsoft calls this the "clustered" key), the relational database can very efficiently just dump the whole part of the database file to the client, without having to do any (or much) processing of that data.

Using an ORM probably ruins this advantage, as the ORM is processing each record individually and copying the data into application objects.

Conclusion

I hope I could give here a good insight how PicoLisp database compares to mainstream database stacks. In the end you have to decide what fits best your requirements and the nature of the application/data.

I've worked now several years in parallel on business applications implemented in C#/MVC .NET/MSSQL and in PicoLisp, and observe much higher productivity and maintainability (flexibility, ease of doing changes on an existing application) and better performance (for the application user) in the PicoLisp stack - but of course it also depends on the actual software architecture and database design used.

And of course PicoLisp programming requires capable developers who are interested to dig into the stuff, while the popularity and the "hand holding"-features/restrictions of those mainstream enterprise stacks make a kind of minimum productivity possible even when no capable developer is available.

https://picolisp.com/wiki/?pilvsorm

09dec19

mtsd