Semantic consistency verification

Databases and data exchange files should be verified for the semantic consistency of their content. Semantic consistency verification checks whether collections of expressions contain redundant statements or statements that conflict with other statements. The verification possibilities and processes for the content of conventional databases, such as relational (entity-attribute-relationship) data model instances, differ from those for databases whose content is expressed in a formalized natural language that is based on a defined taxonomy of concepts, such as Gellish Formal English. This article discusses the verification of several semantic consistency rules and constraints in formalized natural languages.

Prerequisite: known typology (mandatory classification and generalization)

Kinds of relations define the kinds of things that are allowed to be related by such a relation. Thus related things shall be of kinds that comply with the definitions of the kinds of relations in which they are involved. Semantic verification therefore has two prerequisites: classification of individual things and generalization of kinds of things.

Classification of individual things

The first prerequisite is that individual things shall be of kinds that are allowed by the relations in which they are involved. This implies that the correct use of relations of particular kinds can only be verified when the kind or kinds of every individual thing are known. In other words:
- Every individual thing (including also individual aspects, properties and activities and individual relations) shall be classified at least once by a kind.
In semantic databases and data exchange messages that apply a formalized language, the requirement for classification of individual things implies that explicit classification relations are required, for example a statement such as: P1 <is classified as a> pump. Such explicit classification relations allow the classifying kinds to be selected from any kind in the dictionary of the language, including any of the many subtypes. Thus classification is not restricted to instantiation in available attribute types. Furthermore, multiple explicit classification relations for the same individual thing are possible and simple in a formalized language, and all things can freely appear in many relations of various kinds, without the constraint that the related things shall be classified by the same kind. This means that classification in a formalized natural language is more flexible than classification in conventional databases and conventional data exchange files.
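
As a minimal sketch of this first prerequisite, the Python fragment below scans a set of statements for individual things that lack a classification relation. The triple layout, the function name `unclassified_individuals` and the sample data (other than the P1 statement) are assumptions made for illustration; they are not part of Gellish.

```python
# Hypothetical representation: each statement is a (left, relation, right) triple.
CLASSIFICATION = "is classified as a"


def unclassified_individuals(statements, individuals):
    """Return the individual things that are not classified by any kind."""
    classified = {left for left, relation, _ in statements
                  if relation == CLASSIFICATION}
    return sorted(set(individuals) - classified)


statements = [
    ("P1", "is classified as a", "pump"),
    # Multiple classifications of the same individual thing are allowed.
    ("P1", "is classified as a", "centrifugal pump"),
    ("P1", "is located in", "Unit 100"),
]

# "Unit 100" is used in a relation but has no classification yet.
print(unclassified_individuals(statements, ["P1", "Unit 100"]))
# prints ['Unit 100']
```

In this sketch, any reported thing would need at least one classification relation before relations in which it is involved can be verified.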

Generalization of kinds of things

Knowledge, requirements and constraints are typically expressed as relations between kinds of things. When those kinds of things are arranged in a subtype-supertype hierarchy (taxonomy), the knowledge, requirements and constraints that are expressed for particular kinds are inherited by all the subtypes of those kinds. When an individual thing is classified by one of those subtypes, all the knowledge, requirements and constraints that are specified for its supertypes are inherited by the classifying subtype and thus apply to the classified individual thing.
Thus, to enable verification of whether individual things satisfy the knowledge, requirements and constraints, it is required that such a taxonomy is known. The second prerequisite for this verification is therefore that generalization relations are mandatory for all kinds of things (the classifiers of the individual things). In other words:
- Every kind (concept), including also every kind of relation, shall be defined by a specialization relation as a subtype of one or more supertype concepts (more generalized concepts).
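
The second prerequisite can be sketched in the same way. The fragment below assumes a hypothetical dictionary `SUPERTYPES` that maps each kind to its direct supertypes, and a single top concept called `anything` that is exempt from the rule. Both the data layout and the name of the top concept are illustrative assumptions, and the taxonomy is assumed to be free of cycles.

```python
# Hypothetical taxonomy: each kind maps to its direct supertypes.
SUPERTYPES = {
    "anything": [],  # assumed top concept, the only kind without a supertype
    "physical object": ["anything"],
    "pump": ["physical object"],
    "centrifugal pump": ["pump"],
}
TOP = "anything"


def kinds_without_supertype(supertypes, top=TOP):
    """Return the kinds that violate the mandatory generalization rule."""
    return sorted(kind for kind, parents in supertypes.items()
                  if not parents and kind != top)


def all_supertypes(kind, supertypes):
    """Collect every direct and indirect supertype of a kind."""
    found = set()
    stack = list(supertypes.get(kind, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(supertypes.get(parent, []))
    return found


print(kinds_without_supertype(SUPERTYPES))
# prints []
print(sorted(all_supertypes("centrifugal pump", SUPERTYPES)))
# prints ['anything', 'physical object', 'pump']
```

The set returned by `all_supertypes` identifies the kinds from which a subtype inherits knowledge, requirements and constraints.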

Thus specialization-generalization relations between concepts allow knowledge, requirements, constraints and definitions to be inherited by subtypes from their supertype concepts, whereas the classification relations determine the individual things to which that knowledge and those requirements and constraints apply. Those are the reasons why classification relations and specialization-generalization relations are prerequisites for semantic verification.
Note 1: Formalized English can be applied without these mandatory relations, but then many semantic verifications are impossible.
Note 2: Conventional data models usually do not include a taxonomy (subtype-supertype hierarchy) of kinds. In that case no taxonomy is available for determining whether specified knowledge, requirements and constraints are applicable to individual things.

For example, an individual thing that is denoted as Paris is used in an expression in a formalized language, such as ‘the Eiffel tower <is located in> Paris’. Thus the Eiffel tower and Paris are incorporated in the language. However, the validity of their involvement in relations and the applicability of knowledge, requirements and constraints can only be verified when they are defined by classification relations. Thus, for example, Paris is classified as a city. The validity of that classification relation can only be verified when the concept city is a valid classifier. This can be determined by it being defined by a specialization relation as a subtype of the more generic concept that is by definition allowed to be a classifier in a classification relation. Thus the concept city should become an element in a subtype-supertype hierarchy (taxonomy). Some concepts in that hierarchy will appear in definitions of other kinds of relations as being allowed as players of roles of various kinds in such relations. For example, physical objects are allowed to play a role as ‘locator’ in <is located in> relations. This allowed role is inherited via the taxonomy, e.g. by the concept ‘city’. The classification of Paris as a city thus specifies that statements such as ‘the Eiffel tower <is located in> Paris’ are valid statements.
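
The Paris example can be sketched in Python as follows. The dictionaries `SUPERTYPES`, `CLASSIFIERS` and `ALLOWED_ROLE_PLAYERS`, the kinds ‘tower’ and ‘colour’, and the assumption that both roles of <is located in> require a physical object are illustrative only. The taxonomy is truncated and assumed to be free of cycles.

```python
# Hypothetical, truncated taxonomy: kind -> direct supertypes.
SUPERTYPES = {
    "physical object": [],
    "city": ["physical object"],
    "tower": ["physical object"],
    "colour": [],
}
# Explicit classification relations: individual thing -> classifying kinds.
CLASSIFIERS = {
    "Paris": ["city"],
    "the Eiffel tower": ["tower"],
    "red": ["colour"],
}
# Assumed definition: both roles of <is located in> require a physical object.
ALLOWED_ROLE_PLAYERS = {"is located in": ("physical object", "physical object")}


def is_a(kind, ancestor):
    """True when kind equals ancestor or is a direct or indirect subtype of it."""
    if kind == ancestor:
        return True
    return any(is_a(parent, ancestor) for parent in SUPERTYPES.get(kind, []))


def conforms(thing, required_kind):
    """An unclassified thing cannot be verified, so it does not conform."""
    kinds = CLASSIFIERS.get(thing, [])
    return bool(kinds) and all(is_a(kind, required_kind) for kind in kinds)


def is_valid(left, relation, right):
    required_left, required_right = ALLOWED_ROLE_PLAYERS[relation]
    return conforms(left, required_left) and conforms(right, required_right)


print(is_valid("the Eiffel tower", "is located in", "Paris"))  # prints True
print(is_valid("the Eiffel tower", "is located in", "red"))    # prints False
```

The first statement is accepted because ‘city’ inherits the allowed role from ‘physical object’. The second is rejected because ‘colour’ is not a subtype of ‘physical object’ in this sketch.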

To enable the maintenance of a semantically consistent database, the following rule is applicable: If all classifications of an individual thing or all specialization relations of a kind are deleted or terminated (historicized), then that thing terminates its existence. In other words: then its being represented and being known in the language is terminated. Such a termination implies that all binary relations in which the thing appears shall be deleted or terminated (historicized) as well. This may cause a chain of deletions and terminations.
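
The chain of deletions can be illustrated with a small sketch. It assumes that statements are stored as (left, relation, right) triples and that classification and specialization relations are the only defining relations. The relation names, the function `terminate` and the sample data are hypothetical, and for brevity the sketch deletes relations instead of historicizing them.

```python
# Relations that define a thing: without any of them the thing ceases to exist.
DEFINING = {"is classified as a", "is a specialization of"}


def terminate(thing, relations):
    """Terminate a thing and cascade; return (terminated things, remaining relations)."""
    terminated = set()
    pending = [thing]
    remaining = list(relations)
    while pending:
        current = pending.pop()
        if current in terminated:
            continue
        terminated.add(current)
        # Things that were classified by, or specialized from, the current thing.
        affected = {left for left, relation, right in remaining
                    if right == current and relation in DEFINING}
        # Delete every binary relation in which the terminated thing appears.
        remaining = [s for s in remaining if current not in (s[0], s[2])]
        for candidate in affected:
            still_defined = any(left == candidate and relation in DEFINING
                                for left, relation, _ in remaining)
            if not still_defined:
                pending.append(candidate)
    return terminated, remaining


relations = [
    ("pump", "is a specialization of", "physical object"),
    ("centrifugal pump", "is a specialization of", "pump"),
    ("P1", "is classified as a", "centrifugal pump"),
    ("P1", "is located in", "Unit 100"),
    ("Unit 100", "is classified as a", "plant unit"),
]

terminated, remaining = terminate("pump", relations)
print(sorted(terminated))
# prints ['P1', 'centrifugal pump', 'pump']
print(remaining)
# prints [('Unit 100', 'is classified as a', 'plant unit')]
```

Terminating ‘pump’ removes the only specialization relation of ‘centrifugal pump’, which in turn removes the only classification of P1, so all three disappear together with the relations in which they appear.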

Redundancy – unnecessary duplicates

In Gellish Formalized English every thing shall have its own unique identifier (UID). Thus one thing may not have multiple UIDs, although every thing may have multiple names (synonyms). (Note that e.g. in RDF and OWL things may have multiple IRIs in different namespaces; thus IRIs differ from UIDs.) Furthermore, the same name may be used for different things (homonyms), provided that the identical names are specified as having their base in different language communities (naming contexts).

This means that duplicate things are defined as things that are the same but that have different UIDs.

Such duplicates can thus be detected when different UIDs are denoted by the same name in the same language community (although they may turn out not to be real duplicates when the identical names are caused by a naming error). Although it is difficult for software to prove that things are duplicates (e.g. in the case of twin babies), software can conclude that it is likely that different UIDs represent the same thing in the following situations:

When different UIDs are denoted by the same name in different language community contexts (homonyms), they are likely duplicates when they are individual things that are classified by the same kind (or by a near subtype of the other kind) or when they are kinds of things that are defined as being subtypes of (nearly) the same supertype. When they have the same name, but one is an individual thing and the other is a kind of thing, then they are somewhat suspect. To reduce the chance of such confusion it is recommended to denote individual things by names that start with a capital and to denote kinds of things by names that start with a lower case character.

When different UIDs are related to the same other role player while playing roles of the same kind in relations of the same kind, they are likely duplicates. This probability becomes stronger when cardinality constraints do not allow for that number of relations with the same role player. Note that the UIDs are not duplicates when they play roles of different kinds in a relation between each other (especially when another relation explicitly states that they are distinct things; e.g. when A and B are a daughter of C, whereas a relation states that A is a first born twin of B); or when a time difference between occurrences in which they are involved excludes that they are the same thing.
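
The first of these situations, identical names combined with the same classifying kind, can be sketched as follows. The record layout, the UID values and the function `likely_duplicates` are hypothetical. The sketch only compares kinds for equality and ignores the ‘near subtype’ refinement and the second situation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (uid, name, language community, classifying kind).
things = [
    (101, "Paris", "English", "city"),
    (102, "Paris", "French", "city"),
    (103, "Paris", "English", "person"),
    (104, "London", "English", "city"),
]


def likely_duplicates(records):
    """Return pairs of UIDs that share a name and a classifying kind."""
    by_name = defaultdict(list)
    # With identical kinds the language community does not change the outcome,
    # so it is read but not used in this simplified check.
    for uid, name, _community, kind in records:
        by_name[name].append((uid, kind))
    pairs = []
    for entries in by_name.values():
        for (uid_a, kind_a), (uid_b, kind_b) in combinations(entries, 2):
            if uid_a != uid_b and kind_a == kind_b:
                pairs.append((uid_a, uid_b))
    return pairs


print(likely_duplicates(things))
# prints [(101, 102)]
```

UID 103 is not reported, because a person and a city that share a name are homonyms and not duplicates. A reported pair is only a candidate that still needs human confirmation.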

Duplicate binary relations are relations that relate the same two things (UIDs) by a relation of the same kind (or by a subtype of the other kind of relation) and whose validity periods overlap, even if the names of the related things are different (synonyms).
For a duplicate higher order relation the same applies as for duplicate things. But in addition to that it holds that higher order relations are likely duplicates when each of them has involvement relations with the same involved objects, so that the same things are involved in two similar occurrences (activities) at the same time. However, this is not a proof, because occasionally the same object may be involved in two similar activities at the same time.
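
The test for duplicate binary relations can be sketched as below. The dictionary layout of a relation, the UID values and the treatment of validity periods as closed date intervals (with `None` meaning ‘still valid’) are assumptions. The sketch compares kinds of relations for equality only and does not cover the subtype case.

```python
from datetime import date


def periods_overlap(start_a, end_a, start_b, end_b):
    """Closed intervals; an end of None means the relation is still valid."""
    end_a = end_a or date.max
    end_b = end_b or date.max
    return start_a <= end_b and start_b <= end_a


def are_duplicates(rel_a, rel_b):
    """Same two UIDs, same kind of relation and overlapping validity periods."""
    same_things = (rel_a["left"] == rel_b["left"]
                   and rel_a["right"] == rel_b["right"])
    same_kind = rel_a["kind"] == rel_b["kind"]
    return (same_things and same_kind
            and periods_overlap(rel_a["start"], rel_a["end"],
                                rel_b["start"], rel_b["end"]))


# Hypothetical relations between UID 201 and UID 101.
a = {"left": 201, "kind": "is located in", "right": 101,
     "start": date(2000, 1, 1), "end": None}
b = {"left": 201, "kind": "is located in", "right": 101,
     "start": date(2010, 1, 1), "end": date(2015, 1, 1)}
c = {"left": 201, "kind": "is located in", "right": 101,
     "start": date(1990, 1, 1), "end": date(1995, 1, 1)}

print(are_duplicates(a, b))  # prints True
print(are_duplicates(a, c))  # prints False
```

Because the comparison uses UIDs and not names, synonyms for the related things do not affect the result.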

Note that in particular contexts databases might allow for the recording of different expressions of the same statements. For example, a database may record multiple identical opinions about the same topic expressed by different persons, possibly in the same language or in different languages. In those cases the things (UIDs) may still not be duplicated, because the expressions are about the same things. However, the expressions (utterances) should be distinguished from the statements (relations plus intention). In exceptional cases databases may allow for the recording that the same person expresses the same opinion several times. Those are still not duplicate expressions, because they are expressed at different moments in time.

Consistent typology

Individual things can be explicitly classified and kinds can have explicit specialization relations, whereas the fact that they are used in relations of particular kinds implies that they are (subtypes of) allowed kinds of role players for such relations. These various kinds should be consistent. This requirement results in the following rules:

Consistent typology (1): The language-defining ontology defines kinds of relations by the kinds of roles that are played in such relations and by specifying which kinds of things are allowed to play such roles. This implies that it also specifies things of kinds that are not allowed to play such a role. Thus, individual things (which shall always be classified) may not play a role in a relation that conflicts with the kind that is required for its role player. Thus the kind of the individual thing shall be identical to, or belong to the hierarchy of subtypes of, the allowed kind of role player for the kind of relation.

Consistent typology (2): An individual thing may be classified multiple times by different kinds, and a kind may be a subtype of various supertype kinds. In both cases the kinds are subtypes of a nearest common ancestor kind. That nearest common ancestor kind shall be allowed as a role player in the relations of the kind in which that thing or kind is involved. The typology is inconsistent when one of the kinds is not allowed in a relation of a kind in which the thing is involved.
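
A minimal sketch of the last sentence of rule (2) is given below: it reports the classifying kinds of a thing that are not allowed for a role. The taxonomy, the kind names and the function `conflicting_kinds` are hypothetical, and the taxonomy is assumed to be free of cycles. The sketch tests each kind directly and does not compute the nearest common ancestor.

```python
# Hypothetical, truncated taxonomy: kind -> direct supertypes.
SUPERTYPES = {
    "physical object": [],
    "city": ["physical object"],
    "capital": ["city"],
    "organization": [],
}


def is_a(kind, ancestor):
    """True when kind equals ancestor or is a direct or indirect subtype of it."""
    if kind == ancestor:
        return True
    return any(is_a(parent, ancestor) for parent in SUPERTYPES.get(kind, []))


def conflicting_kinds(classifying_kinds, allowed_kind):
    """Return the classifying kinds that are not allowed as role player."""
    return sorted(kind for kind in classifying_kinds
                  if not is_a(kind, allowed_kind))


# A thing classified as both a city and a capital is consistent.
print(conflicting_kinds(["city", "capital"], "physical object"))
# prints []
# A thing classified as a city and as an organization is not.
print(conflicting_kinds(["city", "organization"], "physical object"))
# prints ['organization']
```

A non-empty result signals an inconsistent typology for the relation that is being checked.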

Constraints and requirements verification

See also the related topic of modeling and verification of constraints and requirements.