Unique identifiers, synonyms and homonyms

It is widely recognized that there is a difference between things (or concepts) and the various names by which those things can be denoted. Nevertheless, in databases and data exchange messages, things are usually denoted only by names, which results in interpretation problems and difficulties for interoperability of systems and for data integration. This can be solved by using a formalized language within which unique identifiers represent the things (including concepts, aspects, relations, etc.), whereas multiple different names are allowed for usage by different parties. This article discusses the advantages of this solution over other approaches.

Things (objects as well as aspects and relations) are usually identified by names. However, names are unique identifiers only within a very limited context. People often denote the same thing by different terms (synonyms), and one term may denote different things (homonyms), depending on the context in which the term is used. Thus, in a normal business context for systems and databases, names are not unambiguous denotations for things. This becomes a problem in wider contexts, such as data integration and interoperability of systems. Several solutions have therefore been proposed for the creation of unique identifiers (UID’s).

Forced unique names

To address this issue, many systems do not allow the use of homonyms and prescribe that different (artificial) terms be used to distinguish the various meanings. For example, instead of having the term ‘building’ denote two different concepts, they may prescribe the use of ‘building (activity)’ as distinct from ‘building (object)’. Often systems also do not allow synonyms, so that only one particular term may be used to denote a concept. Both measures address the issue only within a small closed community, because in a wider context different people will make different choices for the allowed terminology, which hampers interoperability.
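The limitation described above can be illustrated with a minimal sketch. The class `ControlledVocabulary`, the concept descriptions and the terms chosen by the second community are all hypothetical; they only show how two communities that each enforce unique names can still end up with no terms in common.

```python
class ControlledVocabulary:
    """Hypothetical registry: one term per concept and one concept per term."""

    def __init__(self):
        self._term_to_concept = {}
        self._concept_to_term = {}

    def register(self, term, concept):
        # Reject a term that is already in use (no homonyms allowed).
        if term in self._term_to_concept:
            raise ValueError(f"homonym not allowed: {term!r} is already in use")
        # Reject a second term for a concept that already has one (no synonyms allowed).
        if concept in self._concept_to_term:
            raise ValueError(f"synonym not allowed: concept is already named "
                             f"{self._concept_to_term[concept]!r}")
        self._term_to_concept[term] = concept
        self._concept_to_term[concept] = term

    def terms(self):
        return set(self._term_to_concept)


# Community A follows the convention from the example in the text.
community_a = ControlledVocabulary()
community_a.register("building (object)", "the constructed object")
community_a.register("building (activity)", "the act of constructing")

# Community B (hypothetical) makes different but equally valid choices.
community_b = ControlledVocabulary()
community_b.register("edifice", "the constructed object")
community_b.register("construction", "the act of constructing")

# Each vocabulary is unambiguous on its own, yet the two share no terms.
shared_terms = community_a.terms() & community_b.terms()
```

Each vocabulary is internally consistent, but `shared_terms` is empty, so data exchanged between the two communities cannot be matched by name alone.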

Random unique numbers

If random numbers are generated from a very large domain, then the probability that identical numbers are generated is very small. This is usually regarded as sufficient reason to assume that the generated numbers are universally unique. It makes it possible to use UID’s without a central authority that manages the uniqueness of the identifiers. This is the basis on which randomly generated (probably) unique identifiers are used to uniquely denote things. Examples of such systems are the various versions of the Universally Unique Identifier (UUID) as standardized in ISO/IEC 11578:1996 (http://en.wikipedia.org/wiki/Universally_unique_identifier) and the Globally Unique Identifier (GUID) (http://en.wikipedia.org/wiki/Globally_unique_identifier). This solves the issue of homonyms in a general context, but it does not solve the synonyms issue, because a random number generating system allows multiple parties to create different UUID’s for the same thing. Separate statements are then required to specify that such UUID’s are synonymous. As synonymity is one of the issues that should be solved, UUID’s are only suitable for situations where handling of synonyms is not required, such as unique product coding.
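As a minimal sketch of this point, the following uses Python's standard `uuid` module to let two parties independently create an identifier for the same thing. The `same_as` list and the helper `denote_same_thing` are hypothetical names introduced here to stand for the separate synonymy statements mentioned above.

```python
import uuid

# Two parties independently create an identifier for the same real-world thing.
party_a_id = uuid.uuid4()
party_b_id = uuid.uuid4()

# The identifiers differ (with overwhelming probability), so they never clash
# as homonyms, but nothing records that they denote the same thing.

# Hypothetical synonymy statements: each set groups identifiers that are
# declared to denote the same thing.
same_as = [{party_a_id, party_b_id}]


def denote_same_thing(x, y, statements):
    """Return True if x and y are identical or declared synonymous."""
    return x == y or any(x in group and y in group for group in statements)
```

Without the entry in `same_as`, a system has no way to know that the two identifiers refer to the same thing.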

Namespaces

An approach for a wider community is to specify a unique identifier that consists of a combination of a ‘namespace’ and a name, where a name shall always be accompanied by a namespace. A constraint for a namespace is that all names within the namespace are unique (http://en.wikipedia.org/wiki/Namespace). Only then does a combination of namespace and name uniquely denote something. For example, namespace ‘architecture’ may contain the term ‘building’ and namespace ‘activity’ may also contain the term ‘building’. However, namespaces do not prevent different namespaces from using different names for the same thing without specifying that those denotations are synonymous. Thus it is uncertain whether ‘architecture building’ is the same thing as ‘activity building’ or not, although in this example the namespace names may suggest that they are different. To address synonymity, explicit relations are required which specify that such different ‘unique identifiers’ denote the same thing.
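The namespace approach can be sketched as follows. The type `QualifiedName`, the third identifier (‘construction’, ‘edifice’) and the `declared_synonymous` set are assumptions made for illustration; only the ‘architecture’ and ‘activity’ namespaces come from the example above.

```python
from typing import NamedTuple


class QualifiedName(NamedTuple):
    """A name that is unique only in combination with its namespace."""
    namespace: str
    name: str


architecture_building = QualifiedName("architecture", "building")
activity_building = QualifiedName("activity", "building")
# Hypothetical identifier from a third namespace for the same constructed object.
construction_edifice = QualifiedName("construction", "edifice")

# Explicit relations stating that two identifiers denote the same thing.
declared_synonymous = {
    frozenset({architecture_building, construction_edifice}),
}


def declared_same(a, b):
    """True if a and b are identical or explicitly declared synonymous.

    False means 'not declared', which is not the same as 'known to differ'.
    """
    return a == b or frozenset({a, b}) in declared_synonymous
```

The same name in two namespaces yields two distinct identifiers, and only the explicit declaration links identifiers that denote the same thing.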

Gellish UID’s

In the approaches with random unique numbers as well as namespaces, the things themselves are not directly represented. In those approaches (‘languages’) only terminology appears; the things that the terms denote do not appear. Formal English is a formalized language that explicitly distinguishes between the represented things and the terms that denote them. Each represented thing is uniquely represented in the (whole) language by its own Gellish UID (an arbitrary natural number), whereas the various terms are only meant for human users and enable them to find the intended UID’s. The fact that Formal English is a formalized language implies that its terminology (dictionary) and its UID’s are managed. The ranges for UID’s include a range for unknowns (in queries), a reserved general range, delegated managed ranges and a free range. This means that UID’s can be allocated to represent things and that multiple names (synonyms, including translations, abbreviations, codes, etc.) can be specified to denote the represented things via their UID’s.
Relations are then defined as relations between UID’s and not as relations between the denoting terms.
This solves the synonyms issue (for computers), because all synonymous terms denote the same UID. It also solves the homonyms issue, because the different meanings of a homonym have different UID’s.
Note that external usage of Gellish UID’s (e.g. in RDF expressions) is possible by using ‘Gellish’ as a namespace. For example, Gellish:40018 would refer to the concept ‘building’ in the namespace ‘building technology’. A direct URI reference, for example to concept 730000 and to the term ‘anything’, would be:
http://www.formalenglish.net/dictionary#730000 or http://www.formalenglish.net/dictionary#anything. Only the latter, term-based reference could find homonyms, if available.
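The separation between UID’s and terms might be sketched as below. This is not the actual Gellish data model: the UID’s 40018 and 730000 and the URI pattern are taken from the text, while UID 990001, the extra terms, the relation label and the direct relation between 40018 and 730000 are hypothetical.

```python
# Terms attached to each UID. 40018 ('building') and 730000 ('anything') come
# from the text; UID 990001 and the additional terms are hypothetical.
names_by_uid = {
    40018: {"building", "edifice"},        # the constructed object
    990001: {"building", "constructing"},  # the activity (a homonym of 'building')
    730000: {"anything"},
}

# Relations are stated between UID's, never between terms. The relation type
# is a plain string label here, purely for readability of the sketch.
relations = [
    (40018, "is a kind of", 730000),
    (990001, "is a kind of", 730000),
]


def uids_for_term(term):
    """Return every UID that the given term may denote."""
    return {uid for uid, names in names_by_uid.items() if term in names}


def uri_for_uid(uid):
    """Build a direct URI reference following the pattern shown in the text."""
    return f"http://www.formalenglish.net/dictionary#{uid}"
```

In this sketch `uids_for_term("edifice")` yields the single UID 40018, so synonyms collapse to one identifier, while `uids_for_term("building")` yields two UID’s, which makes the homonym visible instead of hiding it.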

The question remains how users can find the right UID’s, because users always denote things by names. This is solved by two mechanisms: 1. For each term a user can see the ‘language communities’ (namespaces) within which the term uniquely denotes something. Those language communities define the ‘home base’ for the terms, which means that they are the preferred terms to denote the things (UID’s) in that language community. 2. In Formal English, each concept (kind) is defined by a relation to its supertype concept(s), and each individual thing shall have one or more classification relations with classifying kinds. Those supertypes and classifiers are normally different for homonyms.
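The two mechanisms can be illustrated with a small lookup sketch. The pairing of ‘building’ with the language community ‘building technology’ and UID 40018 follows the text; the second naming, the UID 990001, the community ‘activities’ and both supertype names are hypothetical.

```python
from typing import NamedTuple, Optional


class Naming(NamedTuple):
    """States that a term denotes a UID within a language community."""
    term: str
    language_community: str
    uid: int


namings = [
    Naming("building", "building technology", 40018),  # from the text
    Naming("building", "activities", 990001),          # hypothetical homonym
]

# Hypothetical supertypes; homonyms normally have different supertypes.
supertype_of = {40018: "physical object", 990001: "activity"}


def candidates(term):
    """List (uid, language community, supertype) for every meaning of a term."""
    return [(n.uid, n.language_community, supertype_of[n.uid])
            for n in namings if n.term == term]


def resolve(term, language_community) -> Optional[int]:
    """Return the single UID a term denotes in a community, or None."""
    uids = {n.uid for n in namings
            if n.term == term and n.language_community == language_community}
    return uids.pop() if len(uids) == 1 else None
```

A user interface could present the result of `candidates("building")` so that the user picks the intended meaning by its language community or supertype, after which only the UID is stored.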

The Gellish UID’s are language independent, which means that expressions (relations) are natural language independent. This enables presentations of expressions in various languages (or automated translations between languages) as soon as dictionaries in the languages are available.
In a multi-lingual environment the language in use acts as a second kind of namespace for the terms, because for each term it is specified to which language it belongs.
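A minimal sketch of such language-independent expressions follows. The UID’s 40018 and 730000 come from the text; the relation type UID 990100, the Dutch terms, the relation phrases and the example expression itself are assumptions chosen for illustration.

```python
# Terms keyed by (UID, language): the language acts as a namespace for terms.
# UID 990100 for the relation type and all Dutch terms are hypothetical.
terms = {
    (40018, "English"): "building",
    (40018, "Dutch"): "gebouw",
    (730000, "English"): "anything",
    (730000, "Dutch"): "iets",
    (990100, "English"): "is a kind of",
    (990100, "Dutch"): "is een soort",
}

# An expression stored purely as UID's: (left object, relation type, right object).
expression = (40018, 990100, 730000)


def render(expr, language):
    """Present a UID-based expression using the terms of one language."""
    return " ".join(terms[(uid, language)] for uid in expr)
```

The stored expression never changes; `render(expression, "English")` and `render(expression, "Dutch")` only select different terms for the same three UID’s.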