L3S Entity Lifecycle Management Approach

What is an Entity?

We will definitely need a formal definition of an entity. For now, we only state that we probably do not want to consider entity descriptions as entities themselves (cf. the section “Distinguishing Data Model and Entity Representations”), since this prevents loops.

Granularity should be discussed, too. For example, is the arm of a person an entity? Is a person's finger an entity? And so on.

A definition attempt: “Everything which someone requests an Okkam ID for is an entity.”

An Entity and its Representations

We distinguish between an entity and its representations. An entity is the real-world object, which cannot be addressed (referred to) directly. Instead, humans use natural language representations to describe real-world entities. Similarly, in the semantic web, we use computer-readable representations (e.g., RDF statements).

Distinguishing Data Model and Entity Representations

Assuming that we use RDF to represent entities, we have to keep in mind that not every RDF resource represents an entity. That is, we must keep a clear distinction between actual entity representations and the RDF resources that are merely used to describe entities and their properties. Let us take the description of a person: Julien Gaugaz. In RDF this could be done by:

julienURI hasName julienNameURI.
julienNameURI first "Julien".
julienNameURI last "Gaugaz".

In the above statements we find two RDF resources (julienURI and julienNameURI). Whereas julienURI indisputably represents an entity, julienNameURI should probably not be considered an entity representation. This, however, still has to be made precise in the “What is an Entity?” section.
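The distinction can be sketched in a few lines of Python. This is only an illustration under an assumption the text does not settle: entity status is declared explicitly in a set (here called `entity_resources`, a hypothetical name) rather than inferred from the statements.

```python
# The three statements from the example, stored as (subject, predicate, object) tuples.
triples = [
    ("julienURI", "hasName", "julienNameURI"),
    ("julienNameURI", "first", "Julien"),
    ("julienNameURI", "last", "Gaugaz"),
]

# Assumption: which resources represent entities is declared explicitly.
entity_resources = {"julienURI"}


def resources(triples):
    """Return every resource that appears as the subject of a statement."""
    return {subject for subject, _, _ in triples}


def structural_resources(triples, entity_resources):
    """Return resources that help describe an entity but are not entities themselves."""
    return resources(triples) - entity_resources
```

With these definitions, `structural_resources(triples, entity_resources)` contains only julienNameURI, which is the resource the text suggests should not receive an Okkam ID.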

Evolution

The side-effect of time passing by is that things change. Since this is true for real-world entities, we should investigate how their representations may be affected.

Entity Evolution

A real-world entity like a person changes over time: she changes jobs, moves to a new city, gets married, gets divorced, etc. One of the central questions that has to be addressed in entity lifecycle management is: “Is an entity which changed still the same entity as before?”

This poses several questions:

  1. How to represent the different “versions” of an entity?
  2. Do we need to identify (i.e., to attribute an ID to) the different versions of an evolving entity? And if this is the case, how?
  3. Do we delete entities or their attributes at some point, or do we keep everything forever?
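One possible answer to these questions, given purely as a sketch: keep a single stable ID per entity, store dated snapshots of its attributes, and never delete anything. The class name `EntityHistory`, the ID format, and the dates and city names in the usage note are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EntityHistory:
    """One stable ID with a list of dated attribute snapshots (nothing is deleted)."""
    okkam_id: str
    versions: list = field(default_factory=list)  # list of (valid_from, attributes)

    def add_version(self, valid_from, attributes):
        """Record a new snapshot; ISO date strings sort chronologically."""
        self.versions.append((valid_from, dict(attributes)))
        self.versions.sort(key=lambda version: version[0])

    def version_at(self, date):
        """Return the attributes valid at the given date, or None if none exist yet."""
        current = None
        for valid_from, attributes in self.versions:
            if valid_from <= date:
                current = attributes
        return current
```

For example, after `add_version("2003-01-01", {"city": "City A"})` and `add_version("2008-01-01", {"city": "City B"})`, a call to `version_at("2005-06-01")` returns the City A snapshot. In this sketch a version is addressed by the pair (ID, date) and does not get an ID of its own, which is only one of the options raised in question 2.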

Entity Representation Evolution (i.e., Revision of decisions)

Even in a fixed (non-changing) real world, the descriptions of an entity are not all available at once. The information about an entity can come from different sources at different times.

As an example, let us consider two person entities having the same name: Peter Fankhauser. The web site of each of their employers (L3S, and an advertising company in Switzerland, respectively) mentions that “Peter Fankhauser” is working in the company. Now Okkam is asked to assign an ID to “Peter Fankhauser” (from the L3S web site) and to “Peter Fankhauser” (from the Swiss web site), without further information. Let us imagine that Okkam (wrongly) identifies the two persons as one and the same, and therefore assigns the same ID to both real-world entities. If Okkam later obtains their birth dates, which are different, it will be necessary to perform a revision, i.e., to split the entity representation into two entity representations having different OkkamIDs.

Similarly, one real-world entity could at first be detected by Okkam as two different ones and therefore be assigned two different OkkamIDs. Later on, Okkam learns that it is dealing with one and the same entity and has to merge the two representations. This could be the case when information about the same entity comes from two different web sites, like Fraunhofer ISI (last employer of Peter) and L3S (new employer of Peter). Since they describe a person with different employers, Okkam creates two different IDs. Later on, the information that Peter worked at Fraunhofer ISI before moving to L3S would let Okkam revise its previous decision and merge the two representations.
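The two kinds of revision described above (split and merge) can be sketched as operations on a small in-memory repository. Everything here is an assumption for illustration: the class `Repository`, the `okkam:<n>` ID format, and the choice to keep a redirect from a retired ID to the surviving one are not taken from any Okkam specification.

```python
import itertools


class Repository:
    """Toy repository mapping OkkamIDs to the descriptions collected for them."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.descriptions = {}  # OkkamID -> list of description dicts
        self.redirects = {}     # retired OkkamID -> surviving OkkamID

    def new_id(self, description):
        """Create a fresh ID for a description that matched nothing."""
        okkam_id = f"okkam:{next(self._counter)}"
        self.descriptions[okkam_id] = [description]
        return okkam_id

    def merge(self, keep_id, retire_id):
        """Revision: two IDs turned out to denote the same real-world entity."""
        self.descriptions[keep_id].extend(self.descriptions.pop(retire_id))
        self.redirects[retire_id] = keep_id

    def split(self, okkam_id, belongs_to_new):
        """Revision: one ID turned out to cover two real-world entities."""
        moved = [d for d in self.descriptions[okkam_id] if belongs_to_new(d)]
        kept = [d for d in self.descriptions[okkam_id] if not belongs_to_new(d)]
        self.descriptions[okkam_id] = kept
        new_id = f"okkam:{next(self._counter)}"
        self.descriptions[new_id] = moved
        return new_id

    def resolve(self, okkam_id):
        """Follow redirects so that retired IDs still lead to a live representation."""
        while okkam_id in self.redirects:
            okkam_id = self.redirects[okkam_id]
        return okkam_id
```

Note that a split cannot be handled by a redirect alone: a holder of the old ID cannot tell which of the two resulting representations was meant, so some form of notification is still needed.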

Propagating Evolution

Evolution of knowledge in the Okkam repository may require a certain kind of propagation. For now, we envision two different ways of propagation: Notifications and Implicit evolution.

Notifications. As an example, a former requester of Peter's Okkam ID may need to be informed that there are now two Peters and not one. (In a large-scale Okkam scenario this may not apply, but in business applications this kind of notification is needed.) Such notifications may be exploited for local purging on the client's side: updates of Okkam IDs should be propagated to such processes. E.g., if I used a (now wrong) Okkam ID for Peter in my local knowledge base, a local purging process could be fed with the fact that this ID is outdated and that the current ID should be used instead.
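A minimal sketch of such a notification mechanism follows. It assumes a callback-based design; the names `NotificationHub`, `id_replaced`, `local_kb`, and the example IDs are hypothetical and only serve to illustrate the purging idea.

```python
from collections import defaultdict


class NotificationHub:
    """Remembers who requested which OkkamID and informs them when it is replaced."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # OkkamID -> list of callbacks

    def subscribe(self, okkam_id, callback):
        self._subscribers[okkam_id].append(callback)

    def id_replaced(self, old_id, new_ids):
        """Notify all former requesters of old_id about the ID(s) replacing it."""
        for callback in self._subscribers.pop(old_id, []):
            callback(old_id, new_ids)


# Client side: a local knowledge base that stored an OkkamID (made-up value).
local_kb = {"colleague": "okkam:17"}
needs_review = []


def purge_local_kb(old_id, new_ids):
    """Local purging: rewrite the ID automatically only if the replacement is unambiguous."""
    if len(new_ids) == 1:
        for key, value in local_kb.items():
            if value == old_id:
                local_kb[key] = new_ids[0]
    else:
        # After a split the client has to decide which entity was meant.
        needs_review.append((old_id, new_ids))
```

A client would call `hub.subscribe("okkam:17", purge_local_kb)` once, and the repository would call `hub.id_replaced(...)` after each merge or split.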

Implicit Evolution. Evolution of explicit knowledge (e.g., that two persons are actually one, or that the affiliation of a person has changed) may require the evolution of related knowledge. One example concerns the merging of entity representations. Let us assume that there are two Peters. Each of them is married to a different entity (represented as a link to another person entity). If I merge these two Peters into a single Peter, shouldn't I also merge both wives?

Representational Variations (i.e., representing conflicting values)

Another problem with having different information sources about the same real-world entity is that they might provide incompatible information. Let us take a person as an example: Julien. One web page states that Julien was born in 1979, and another one says 1997. Assuming that Okkam has sufficient evidence to conclude that both pages refer to the same entity, it has to handle in some way the incompatible information that he was born in two different years: 1979 and 1997.

Taking into account only the real value and discarding the other (assuming that this can be done somehow) is not desirable. The error which led to the wrong value (in this example a typo where 9 and 7 have been swapped) might occur again in other information sources, and Okkam will have to recognise this and provide the correct OkkamID even though the entity description provided as a query is wrong, i.e., mentions a wrong value for the year of birth.

Distinguishing representational variations from collections

Considering the above, it is desirable that Okkam can deal with alternative attribute values. But the semantics of variations is not the same as that of collections of values. The need for collections of values arises as soon as entity descriptions contain properties whose relationship with the entity is of kind “1-n” (as happens with the property “author” of a paper, since a paper can be written by several authors). It must be clear that different values belonging to a collection are not alternatives. We therefore have to ensure that the value of an attribute in the representational model can be a list.
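One way to keep the two notions apart, shown here only as a sketch, is to store for each attribute a list of alternatives, where each alternative is itself a collection of values. The layout and the helper names (`value_matches`, `alternative_count`) are assumptions, and the author names are placeholders.

```python
# Hypothetical model: attribute name -> list of alternatives,
# where each alternative is a collection (frozenset) of values.
julien = {
    # Two competing alternatives, each holding a single value.
    "birth_year": [frozenset({1979}), frozenset({1997})],
}
paper = {
    # One alternative holding a collection of two values (a "1-n" property).
    "author": [frozenset({"Author A", "Author B"})],
}


def value_matches(representation, attribute, value):
    """True if the value occurs in at least one alternative of the attribute."""
    return any(value in alternative for alternative in representation.get(attribute, []))


def alternative_count(representation, attribute):
    """Number of competing alternatives recorded for the attribute."""
    return len(representation.get(attribute, []))
```

With this layout a query mentioning the mistyped year 1997 still matches Julien, while the two authors of the paper are not mistaken for conflicting values: `alternative_count(paper, "author")` is 1, whereas `alternative_count(julien, "birth_year")` is 2.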

Uncertainty in Entity Representations

We distinguish several kinds of uncertainty related to entity representations. More investigation is necessary to find out which are useful, how to represent them, and how and when to use them.

  • Confidence in the information source. Some information sources provide more reliable data than others. This might depend on entity ownership, but also on the user: I trust site A but not site B, while my colleague trusts site B more than site A.
  • Extraction confidence. The extraction process is inherently uncertain. The different uncertainties related to information extraction will have to be identified, and a way to represent, estimate and use them will have to be considered.
  • Popularity. It is not yet clear which kind of popularity we want to consider. We can distinguish the popularity of an entity, of one of its representation alternatives (see previous section), or of information sources. (Gianluca is interested in entity popularity. Define ways of measuring entity popularity, e.g., with click logs, link structure, … and use the popularity score in entity ranking.)

Information Sources

It is not yet clear where the information present in Okkam repositories comes from, nor how and when it arrives. It is useful to list the information sources and to characterise them. Decisions related to lifecycle management might depend on the individual information source, but also on the kind of information source. Up to now we have identified the following types of information sources:

  • Queries. Consider a person entity, Julien Gaugaz, born in 1979. Now Okkam receives a request for an ID regarding an entity, Julien Gaugaz, born in 1979 and working at L3S. The question of interest here is whether or not to add the information “works at L3S” to the Okkam repository.
  • OKKAM-enabled tools. For example, when editing a document, entities could be recognized and assigned an ID.
  • OKKAMizers. We can imagine the Okkam system crawling the web, or selected resources, to find new information about existing entities or to find new entities.
  • ECSSE. The entity-centric semantic search engine could also be a valuable source of new information about a given entity.

Careful consideration of spamming is required when treating queries, web sites, etc., as information sources.

Online vs. Offline Entity Lifecycle Management

For all the tasks of entity lifecycle management, we have the traditional options of executing them online, i.e., immediately and before responding to any other matching request, or offline, i.e., later, when computing resources allow. Tasks include the revision of entity identity decisions, the updating of attribute values, the adding or removing of attributes, etc.

Less radically, we can imagine intermediate solutions, like marking the object of the task (e.g., two entity descriptions to be merged) as pending and performing the task later on. This pending mark can be used in different ways. For example, the task could be executed when a request involving the object arrives, or later on offline if no such request arrives.

We sum up four possible (not exhaustive) approaches for when to apply the evolution changes:

  1. perform the entity lifecycle management (in the following “ELM”) task before the system responds to a new request (online)
  2. tag the changed resource and perform the ELM task as soon as the resource is considered for a new request (tag + online)
  3. tag the changed resource and perform the ELM task as soon as the system is not busy (tag + offline)
  4. perform the ELM task on the entire collection as soon as the system is not busy (offline)
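The four approaches can be sketched as one small scheduler that reacts to three events: a change arriving, a request arriving, and the system becoming idle. All names (`Strategy`, `ElmScheduler`, `run_task`) are hypothetical, and the online case is simplified by running the task as soon as the change arrives, which guarantees it is done before any later request is answered.

```python
from enum import Enum


class Strategy(Enum):
    ONLINE = 1       # approach 1
    TAG_ONLINE = 2   # approach 2
    TAG_OFFLINE = 3  # approach 3
    OFFLINE = 4      # approach 4


class ElmScheduler:
    def __init__(self, strategy, run_task):
        self.strategy = strategy
        self.run_task = run_task    # callable(resource_id) performing the ELM task
        self.pending = set()        # resources tagged as pending
        self.all_resources = set()  # the entire collection known so far

    def on_change(self, resource_id):
        """A change affecting resource_id has been detected."""
        self.all_resources.add(resource_id)
        if self.strategy is Strategy.ONLINE:
            self.run_task(resource_id)
        elif self.strategy in (Strategy.TAG_ONLINE, Strategy.TAG_OFFLINE):
            self.pending.add(resource_id)
        # OFFLINE: nothing to do until the system is idle.

    def on_request(self, resource_id):
        """A request involving resource_id arrives; call this before answering it."""
        if self.strategy is Strategy.TAG_ONLINE and resource_id in self.pending:
            self.pending.discard(resource_id)
            self.run_task(resource_id)

    def on_idle(self):
        """The system is not busy."""
        if self.strategy is Strategy.TAG_OFFLINE:
            for resource_id in sorted(self.pending):
                self.run_task(resource_id)
            self.pending.clear()
        elif self.strategy is Strategy.OFFLINE:
            for resource_id in sorted(self.all_resources):
                self.run_task(resource_id)
```

The hybrid described earlier (execute on request, or offline if no request arrives) would amount to letting the TAG_ONLINE strategy also drain `pending` in `on_idle`; it is left out here to keep the four cases separate.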

Non-Entities

Assuming that we have a definition of what an entity is, a strategy will have to be elaborated concerning what to do with requests regarding representations of things which are NOT entities.

Alternative IDs

Other Okkam-like systems

okkam/l3s_entity_lifecycle_management_approach.txt · Last modified: 2008/06/25 13:02 (external edit)