Inserts and queries – a new approach

Queries that are expressed in natural languages have nearly the same structure and use the same terminology as statements in natural languages. Thus the language variant that is used for statements and the variant that is used for questions differ only slightly. The differences, such as changes in word order and the use of question marks, can even be eliminated completely when the ‘Speech act’ theory of John Searle is applied. Searle demonstrated that the expressions can be made identical by explicitly mentioning the intention of an expression. For example:

Statement:     book B-1 has a price of 110 dollars
Question:      book B-1 has a price of 110 dollars
Confirmation:  book B-1 has a price of 110 dollars

Because the second expression shall be interpreted as a question, it is clear that it asks for a confirmation or a denial. Note that a confirmation is equivalent to the answer ‘yes’ or ‘indeed’. Queries on conventional database content are usually expressed in dedicated query languages, such as SQL and SPARQL. The ‘Speech act’ theory and other conventions open up the possibility of using the same formalized language for statements (in exchanged messages or data stores) as well as for insertion commands, for queries and for responses (answers), as is described below.
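The idea that one expression can serve as a statement, a question or a confirmation can be sketched in a few lines of Python. The class, field and function names are illustrative assumptions and are not part of Formal English; only the intentions and the example sentence come from the text above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Expression:
    intention: str  # 'statement', 'question', 'confirmation' or 'denial'
    content: str    # the expression itself, identical for all intentions


statement = Expression("statement", "book B-1 has a price of 110 dollars")
# The question and the confirmation reuse the content; only the intention differs.
question = replace(statement, intention="question")
confirmation = replace(statement, intention="confirmation")


def answer(query: Expression, known: set) -> Expression:
    """Answer a yes/no question: confirm it if its content is a known statement."""
    intention = "confirmation" if query.content in known else "denial"
    return replace(query, intention=intention)


known_facts = {statement.content}
print(answer(question, known_facts))
```

In this sketch the response to the question is the same expression again, with only its intention changed.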

SQL assumptions and limited semantics

Query languages such as SQL are independent of the structure of a database and of the language that is used for its content, as long as the structure is tabular. They do not put any requirements on the tables in the database, neither on their names nor on their columns and column names. This freedom is required because a lack of standardization means that different databases are composed of different tables, with different names and different numbers of columns with different names, even when they contain the same information about the same kinds of things. Thus INSERT statements as well as SELECT statements will be different for each database. This freedom also implies that these languages presuppose that authors of inserts and queries have knowledge about the internal structure (syntax) of the queried database as well as of the terminology that is used for table columns and table content. Thus, SQL and other query languages themselves do not deal with the meaning (semantics) of the columns of the tables and are also independent of the terminology that is used for the content of the tables. They are generic languages that have a minimum of semantics. For example, the following simple insertion into a database table called ‘Book’ is written in SQL (adapted from http://en.wikipedia.org/wiki/SQL):

INSERT INTO Book
 (title, price, type)
 VALUES
 ('B-1', 110, NULL),
 ('B-2', 120, 'paperback')
;

Apparently the author of this insert knows that a table called ‘Book’ already exists and that it has at least three columns, called title, price, and type. He also knows that a title is the title of a book, that a price on the same row is the net selling price (in dollars) of a single copy of a book with that title, and that a type denotes a subtype of book that classifies the book on that same row. He also introduces a new free term in the vocabulary of the content, unless the term ‘paperback’ is a predefined allowed value for ‘type’.

The same content could, however, be stored in databases that have different definitions, for example a database ‘Product’ with columns that have names such as ‘name’, ‘nett price’ and ‘product type’. Then the INSERT would have been different, although the content would be the same.

Data can be retrieved from such database tables with queries that are also expressed in such query languages. For example the following select from the ‘Book’ table is also expressed in SQL:

SELECT *
 FROM Book
 WHERE price > 100.00;

This simple query apparently assumes the same knowledge from the author as is required for the insertion.

The same query would be different if it were a query on the ‘Product’ database. Furthermore, the expressions for insertion of information are significantly different from the expressions of a query about the same information, and those expressions are again different from the expressions that present the results of a query.
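The dependency on table and column names can be made concrete with a small sketch that uses Python's built-in sqlite3 module. The ‘Product’ table definition and the column types are assumptions for illustration, and the type of B-1 is left empty here; the point is only that identical content requires differently worded statements for each schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical schemas that hold the same information.
conn.execute("CREATE TABLE Book (title TEXT, price REAL, type TEXT)")
conn.execute(
    'CREATE TABLE Product (name TEXT, "nett price" REAL, "product type" TEXT)'
)

rows = [("B-1", 110, None), ("B-2", 120, "paperback")]

# The same facts need a differently worded INSERT for each schema.
conn.executemany("INSERT INTO Book (title, price, type) VALUES (?, ?, ?)", rows)
conn.executemany(
    'INSERT INTO Product (name, "nett price", "product type") VALUES (?, ?, ?)',
    rows,
)

# Likewise, the same question needs a differently worded SELECT.
book_hits = conn.execute(
    "SELECT title FROM Book WHERE price > 100.00 ORDER BY title"
).fetchall()
product_hits = conn.execute(
    'SELECT name FROM Product WHERE "nett price" > 100.00 ORDER BY name'
).fetchall()

print(book_hits == product_hits)
```

Both queries return the same books, but neither statement can be reused on the other table without rewriting.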

Equivalent inserts and selects in Formal English

This is quite different when Formal English is adopted. One of the advantages of Formal English is that all expressions have the same structure and share the same vocabulary. This is achieved by the rule that all Formal English expressions have the same expression components (thus they can all be stored in one standard table) and that all Formal English expressions use the same (extensible) dictionary-taxonomy with predefined concepts, such as book, title, price and paperback. Furthermore, the expressions for insertion are similar to the expressions of queries and to stored and exchanged information.

Thus Formal English insertions and queries are determined only by the semantics and not by the many possible database structures, and they are independent of the variety of terminology that is used in current practice for names of entity types and names of attribute types.

This means that information that is expressed in Formal English can be inserted in any database table that is based on Formal English or that has import and export mappings to Formal English expressions (within access and requirements constraints). Note that Formal English expression tables are not just for books. Thus the system-independent expressions do not need to be rewritten for other databases. Table 1 presents an example of an insert command for information about the prices of two books, with expressions in Formal English that are database independent.

| Intention | UID of left hand object | Name of left hand object | Name of relation type | UID of right hand object | Name of right hand object | UoM |
|---|---|---|---|---|---|---|
| command | 195070 | insert | the following expressions into | 101 | ABC | |
| statement | 102 | B-1 | is classified as a | 490023 | book | |
| statement | 102 | B-1 | has as aspect | 103 | P-1 of B-1 | |
| statement | 103 | P-1 of B-1 | is classified as a | 550742 | price | |
| statement | 103 | P-1 of B-1 | has on scale a value equal to | 920366 | 110 | $ |
| statement | 104 | B-2 | is classified as a | 493755 | paperback | |
| statement | 104 | B-2 | has as aspect | 105 | P-1 of B-2 | |
| statement | 105 | P-1 of B-2 | is classified as a | 550742 | price | |
| statement | 105 | P-1 of B-2 | has on scale a value equal to | 920376 | 120 | $ |
| command | 193423 | terminate | the execution of | 195070 | insert | |

Table 1, Insert product data in table ABC

Note: Table 1 shows only a subset of the Gellish Expression Format. In a full Gellish Expression Format each line has more UIDs and contextual facts, such as the validity period, status, originator, etc. This makes it possible, for example, to add multiple prices in various currencies, each with its own validity period, if the cardinality constraints allow for that.

The body of Table 1, without the first and the last line, can be copied exactly into a database table, such as ABC, because the storage table has an identical table structure.
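As a minimal sketch of this copying step, the body of Table 1 can be loaded unchanged into a table with the same seven columns. The example uses Python's built-in sqlite3 module; the shortened column names are assumptions, and a full Gellish Expression Format table would have more columns.

```python
import sqlite3

# Shortened, hypothetical names for the seven columns of Table 1.
COLUMNS = ("intention", "lh_uid", "lh_name", "rel_type_name",
           "rh_uid", "rh_name", "uom")

# Body of Table 1; the first and the last (command) lines are left out.
TABLE_1_BODY = [
    ("statement", 102, "B-1", "is classified as a", 490023, "book", ""),
    ("statement", 102, "B-1", "has as aspect", 103, "P-1 of B-1", ""),
    ("statement", 103, "P-1 of B-1", "is classified as a", 550742, "price", ""),
    ("statement", 103, "P-1 of B-1", "has on scale a value equal to",
     920366, "110", "$"),
    ("statement", 104, "B-2", "is classified as a", 493755, "paperback", ""),
    ("statement", 104, "B-2", "has as aspect", 105, "P-1 of B-2", ""),
    ("statement", 105, "P-1 of B-2", "is classified as a", 550742, "price", ""),
    ("statement", 105, "P-1 of B-2", "has on scale a value equal to",
     920376, "120", "$"),
]

conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE ABC ({', '.join(COLUMNS)})")
# Every expression has the same components, so one generic INSERT suffices.
conn.executemany("INSERT INTO ABC VALUES (?, ?, ?, ?, ?, ?, ?)", TABLE_1_BODY)

stored = conn.execute("SELECT * FROM ABC ORDER BY rowid").fetchall()
print(stored == TABLE_1_BODY)
```

The single generic INSERT does not depend on whether the rows describe books, prices or anything else.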

The above query in SQL is expressed in Formal English as follows:

| Intention | UID of left hand object | Name of left hand object | Name of relation type | UID of right hand object | Name of right hand object | UoM |
|---|---|---|---|---|---|---|
| command | 193617 | select | the following expressions from | 101 | ABC | |
| question | 1 | ?Book-1 | is classified as a | 490023 | book | |
| question | 1 | ?Book-1 | has as aspect | 2 | ?Price-1 | |
| question | 2 | ?Price-1 | is classified as a | 550742 | price | |
| question | 2 | ?Price-1 | has on scale a value greater than | 920053 | 100 | $ |
| command | 193423 | terminate | the execution of | 193617 | select | |

Table 2, Query table ABC in the form of a product model

Comparison of Table 1 with Table 2 shows the similarity of the models, which demonstrates that the expression of information and the expression of queries can be done in the same language. Thus there is no need for a dedicated query language.
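Because questions and statements have the same components, a query can be evaluated by matching question rows against stored rows. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the rows are reduced to four components, subtypes are not taken into account, and a ‘greater than’ question is assumed to be answered by comparing it with stored ‘equal to’ values.

```python
# Rows: (left hand UID, relation type, right hand UID, right hand name).
FACTS = [  # statements from Table 1
    (102, "is classified as a", 490023, "book"),
    (102, "has as aspect", 103, "P-1 of B-1"),
    (103, "is classified as a", 550742, "price"),
    (103, "has on scale a value equal to", 920366, "110"),
    (104, "is classified as a", 493755, "paperback"),
    (104, "has as aspect", 105, "P-1 of B-2"),
    (105, "is classified as a", 550742, "price"),
    (105, "has on scale a value equal to", 920376, "120"),
]
QUERY = [  # questions from Table 2
    (1, "is classified as a", 490023, "book"),
    (1, "has as aspect", 2, "?Price-1"),
    (2, "is classified as a", 550742, "price"),
    (2, "has on scale a value greater than", 920053, "100"),
]


def is_unknown(uid):
    """Unknowns are represented by UIDs in the range 1 to 99."""
    return 1 <= uid <= 99


def unify(term, value, bindings):
    """Bind an unknown to a value, or check that a known UID equals it."""
    if is_unknown(term):
        if bindings.get(term, value) != value:
            return None
        return {**bindings, term: value}
    return bindings if term == value else None


def match_row(question, fact, bindings):
    """Return extended bindings if the fact answers the question, else None."""
    q_left, q_rel, q_right, q_name = question
    f_left, f_rel, f_right, f_name = fact
    bindings = unify(q_left, f_left, bindings)
    if bindings is None:
        return None
    if q_rel == "has on scale a value greater than":
        # Compare the numeric values rather than the UIDs of the numbers.
        ok = (f_rel == "has on scale a value equal to"
              and float(f_name) > float(q_name))
        return bindings if ok else None
    if q_rel != f_rel:
        return None
    return unify(q_right, f_right, bindings)


def solve(questions, facts, bindings=None):
    """Yield every set of bindings that satisfies all questions."""
    bindings = {} if bindings is None else bindings
    if not questions:
        yield bindings
        return
    for fact in facts:
        extended = match_row(questions[0], fact, bindings)
        if extended is not None:
            yield from solve(questions[1:], facts, extended)


results = list(solve(QUERY, FACTS))
print(results)
```

This literal matcher binds unknown 1 to B-1 (UID 102) and unknown 2 to its price (UID 103). B-2 is not found, because it is classified as a paperback and the sketch ignores the taxonomy.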

Note 1: Formally the names of the unknowns are free, although it is recommended to use terms such as ‘what’, ‘which’ and ‘who’ (possibly followed by a sequence number) or terms that start with a question mark (?), as is a SPARQL convention. The reason why the names are free is to enable search on string commonalities (see below). This freedom is enabled by the fact that, according to the Gellish Formal English convention, all unknowns shall be represented by UIDs that are numbers in the range 1 to 99. Thus each query can have a maximum of 99 variables. Remember that all terms denote concepts that are represented by UIDs, although not all of those UIDs are shown in Table 1 and Table 2, to limit the widths of the tables. In practice the expressions, the left and right hand objects and the kinds of relations all have unique identifiers (UIDs).

Note 2: A query may search for and select from more than one table at the same time (because the various tables have the same definition). Tables do not need to be JOINed, only the search results should then be presented to the user as a combined result.
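A minimal sketch of this point: because every table has the same definition, one selection function can be applied to each table in turn and the results simply concatenated. The second table ‘XYZ’, its content and the reduced three-component rows are hypothetical.

```python
from itertools import chain


def select(table, relation_type, right_name):
    """Return the rows of one table that match a relation type and right hand name."""
    return [row for row in table
            if row[1] == relation_type and row[2] == right_name]


# Two tables with the same definition: (left name, relation type, right name).
table_abc = [("B-1", "is classified as a", "book")]
table_xyz = [  # hypothetical second table
    ("B-7", "is classified as a", "book"),
    ("P-1 of B-7", "is classified as a", "price"),
]

# No JOIN is needed: the same query runs on each table and the hits are combined.
combined = list(chain.from_iterable(
    select(table, "is classified as a", "book")
    for table in (table_abc, table_xyz)
))
print(combined)
```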

The query that is expressed in Table 2 illustrates that software should take the taxonomy of concepts into account. For example, the dictionary-taxonomy specifies that the concept paperback is a subtype of book. If the software processes that information correctly, then a query on book will also find the paperbacks. This hierarchy makes it simple to modify the query to search, for example, for paperbacks only or for any other subtype.
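The subtype handling described here can be sketched as follows. The dictionary below is a tiny hypothetical fragment of a taxonomy, and for simplicity it assumes that each concept has at most one direct supertype.

```python
# Hypothetical taxonomy fragment: subtype -> direct supertype.
SUPERTYPE_OF = {"paperback": "book"}

# Classifications as stated in Table 1.
CLASSIFICATIONS = {"B-1": "book", "B-2": "paperback"}


def is_kind_of(concept, target):
    """True if concept equals target or is a (possibly indirect) subtype of it."""
    while concept is not None:
        if concept == target:
            return True
        concept = SUPERTYPE_OF.get(concept)
    return False


def find(kind):
    """Return the names of all things classified as the kind or as one of its subtypes."""
    return sorted(name for name, classifier in CLASSIFICATIONS.items()
                  if is_kind_of(classifier, kind))


print(find("book"))
print(find("paperback"))
```

A query on book then finds both B-1 and B-2, whereas a query on paperback finds only B-2.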

In SQL an asterisk (*) can be used to specify that ‘all’ attributes from a table should be reported. This assumes that the author knows which attributes are in the table, and when there is information about the books in other tables the query becomes more complicated. In the Gellish Formal English approach the kinds of relations that are queried can be specified more precisely. For example, the query in Table 2 uses the relation type <has as aspect> and thus asks only for aspects, whereas the following line specifies that only those aspects are required for which holds that the aspect <is classified as a> price. The query can easily be extended with additional requests for other information, such as:

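| Intention | Name of left hand object | Name of relation type | Name of right hand object | UoM |
|---|---|---|---|---|
| question | What-1 | is located in | Some location-1 | |
| question | Some location-1 | is classified as a | building | |

Or with the very generic question:

| Intention | Name of left hand object | Name of relation type | Name of right hand object | UoM |
|---|---|---|---|---|
| question | What-1 | is related to | A-1 | |
| question | A-1 | is classified as a | anything | |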

This last question asks for everything that is known about the books.

Further data manipulation constructs for operations on the search results, including operations on the contextual facts, are outside the scope of this paper.

Search string commonalities

One of the reasons to leave the names of the variables free is to make it possible to specify character strings that only partially match the names that are searched for. The name of an unknown can be specified as P-1, whereas the intention is to search for things that have a name that starts with P-1, so that, for example, P-101 and P-1201 are included in the search result. To specify to what extent a search string shall match target strings, two additional components are available in a Gellish Expression Format: a left hand and a right hand string commonality. For example, in the case of the above search on P-1 the string commonality will be ‘case insensitive front end identical’.
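The effect of the commonality ‘case insensitive front end identical’ can be sketched as a prefix comparison that ignores case. The function name and the list of candidate names are assumptions, and only this one commonality is covered; other values are deliberately rejected rather than guessed.

```python
def matches(search_string, target, commonality):
    """Check whether target matches search_string under the given string commonality."""
    if commonality == "case insensitive front end identical":
        # The target must start with the search string, ignoring case.
        return target.lower().startswith(search_string.lower())
    raise ValueError(f"commonality not covered by this sketch: {commonality}")


names = ["P-1", "P-101", "p-1201", "P-2", "XP-1"]
hits = [name for name in names
        if matches("P-1", name, "case insensitive front end identical")]
print(hits)
```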

Further details about these string commonalities are described in the book ‘Semantic Modeling in Formal English’.

SPARQL and RDF

SPARQL is a query language that is made especially for querying databases that conform to RDF, also called ‘triple stores’. As shown above, the semantics of questions and other expressions require more than just triples, such as units of measure and contextual facts. That is the reason why many implementations specify extensions of RDF to represent collections of triples, which are called ‘named graphs’, as is also applied in ISO 15926-11, which standardizes an RDF implementation of Gellish Formal English.

Extended RDF implementations of Formal English can use SPARQL directly. However, RDF itself defines a syntax and a minimum of semantics (it defines only a few concepts), just as SQL does. As a result, any relation type (‘predicate’ in RDF) and any left hand and right hand term (‘subject’ and ‘object’ in RDF) can be used in RDF expressions. Thus everybody can use his or her own ‘namespace’ and own ontology. This powerful flexibility is at the same time a weakness with respect to interoperability, because of the lack of standardization of the language in which the database, message and query contents can or shall be expressed.

The expression of inserts and queries can be made database system independent only when an extended RDF is combined with a semantically rich language, such as Formal English. Such a combination provides a language that includes semantics as well as syntax (format).

Another question is whether the SPARQL syntax is to be preferred over the tabular Gellish Expression Format syntax that is used in Table 1 and Table 2. The commonalities and differences between these two formats can be illustrated with the SPARQL example query for a ‘foaf’ (friend of a friend) database (see http://en.wikipedia.org/wiki/SPARQL):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person.
  ?person foaf:name ?name.
  ?person foaf:mbox ?email.
}

The above example shows that SPARQL also presupposes knowledge about the particular structure of the queried database. Although the structure of RDF expressions is database (data model) independent, this example demonstrates that the query depends on the structure of the foaf database and relies on an understanding of the content of the foaf ontology (http://xmlns.com/foaf/spec/), which includes a database structure (table definitions) with definitions of ‘classes’ (entity types) that have predefined ‘properties’ (attribute types). For example, the class foaf:Person is not the same as the generic concept ‘person’, because the foaf ontology defines a foaf:Person as a person that has a number of predefined ‘properties’ (attributes) with specific names. Thus a foaf:Person is defined as a particular collection of ‘properties’. For example, the foaf ontology predefines that a foaf:Person can have or has a surname, as well as, for example, publications and a currentProject, and inherits an mbox. Apparently the foaf ontology defines a very specific ‘language’ that cannot be merged with other ontologies or languages, and thus the query will only work on a foaf database and must be rewritten for any other database.

This demonstrates why the neutral form of Formal English expressions in Table 1 and 2 has advantages.