Wednesday, March 29, 2006

Multilanguage folksonomy

Phew, multilanguage tags are really a challenge. Designing a reasonably efficient tagging system is already hard enough, especially when it comes to generating (A) tag clouds and (B) related tags. We have optimized point B by keeping a table of cotags along with their valence, i.e., given a tag X and a tag Y we store how many objects are tagged with both X and Y. Note that this table of cotags is symmetric, so we can save half of the storage. It only helps to find related tags for one specific tag, though. For tag combos one would need an analogous table of n-tuples, which would get REALLY huge even though this n-tensor structure is only sparsely filled.
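A minimal sketch of such a cotag table, assuming the objects and their tags fit in a plain Python dict (the names `build_cotag_counts`, `related_tags` and the sample data are made up for illustration, not our production code). Each unordered pair is stored only once by sorting the two tags, which is where the symmetry saves half of the storage.

```python
from collections import Counter
from itertools import combinations


def build_cotag_counts(objects):
    """Count, for each unordered tag pair, how many objects carry both tags."""
    counts = Counter()
    for tags in objects.values():
        # sorted() gives a canonical order, so (X, Y) and (Y, X) share one key
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts


def related_tags(counts, tag):
    """Return (other_tag, valence) pairs for `tag`, highest valence first."""
    related = Counter()
    for (a, b), valence in counts.items():
        if a == tag:
            related[b] = valence
        elif b == tag:
            related[a] = valence
    return related.most_common()


# Hypothetical sample data: object id -> tags
objects = {
    "venue1": ["arts", "museum", "barcelona"],
    "venue2": ["arts", "museum"],
    "group1": ["arts", "barcelona"],
}
cotags = build_cotag_counts(objects)
print(related_tags(cotags, "museum"))
```

Extending the key from pairs to n-tuples (`combinations(..., n)`) is exactly what makes the table explode for tag combos.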

And I'm still ignoring tag clustering, which has to be computed offline due to its high complexity. But apart from the computational effort there are some conceptual issues to be taken into account for sites offered in several locales. So far we have 4 languages: English, Spanish, German and Catalan. Users from different countries will most likely tag the objects (in our case groups and venues) in the language in which the website is shown to them. So the tag cloud would probably contain tags such as "arts", "arte" and "kunst" side by side, which unnecessarily blows up the tag cloud. Also, an English-speaking user looking for "arts" would like to find venues which have been tagged with "kunst" by some German user.

See some ML tags in action here (note that the language you specify in the browser will be detected):

A straightforward "solution" would be to keep 4 different sets of tags, one for each language. Then the tag cloud would not be contaminated by foreign tags, but the search problem mentioned above would not be solved. Instead, we have decided to store for each tag its name in each of the 4 languages. If a user tags a venue with "arte", it will not be tagged a second time if it has been tagged with "arts" before, since both names refer to the same tag. Now comes the difficult part. This relies on the tags already being in the database in translated form, which is a somewhat unrealistic assumption unless complete dictionaries for translation between all languages are available. If a new tag appears it will be inserted in its raw form, i.e., in only one language, and has to be translated by hand. One might try to let users participate in this process and let them translate certain tags into their own language. This would even result in the collaborative generation of a tag dictionary and would be a good candidate for an open public project.
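The scheme can be sketched roughly as follows, assuming SQLite and a `tags` table with one column per language (column names `en`, `es`, `de`, `ct`). The `venue_tags` table, the helper names and the sample rows, including the Catalan entry, are hypothetical and only serve the illustration.

```python
import sqlite3

LANGS = ("en", "es", "de", "ct")  # one name column per supported language

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, en TEXT, es TEXT, de TEXT, ct TEXT);
    CREATE TABLE venue_tags (venue_id INTEGER, tag_id INTEGER,
                             UNIQUE (venue_id, tag_id));
    -- one already-translated tag; the sample values are illustrative
    INSERT INTO tags (en, es, de, ct) VALUES ('arts', 'arte', 'kunst', 'art');
""")


def resolve_tag(name, lang):
    """Return the id of the tag called `name` in any language, creating it if new."""
    if lang not in LANGS:
        raise ValueError(f"unsupported language: {lang}")
    row = conn.execute(
        "SELECT id FROM tags WHERE en = ? OR es = ? OR de = ? OR ct = ?",
        (name,) * 4,
    ).fetchone()
    if row:
        return row[0]
    # Unknown tag: store it raw, in the user's language only. The other
    # columns stay NULL until somebody translates them by hand.
    # `lang` was checked against LANGS above, so interpolating it is safe.
    return conn.execute(f"INSERT INTO tags ({lang}) VALUES (?)", (name,)).lastrowid


def tag_venue(venue_id, name, lang):
    """Attach a tag to a venue; a second name for the same tag is ignored."""
    conn.execute("INSERT OR IGNORE INTO venue_tags VALUES (?, ?)",
                 (venue_id, resolve_tag(name, lang)))


def tag_cloud(lang):
    """Usage count per tag, labelled in `lang` where a translation exists."""
    if lang not in LANGS:
        raise ValueError(f"unsupported language: {lang}")
    rows = conn.execute(
        f"SELECT COALESCE(t.{lang}, t.en, t.es, t.de, t.ct), COUNT(*) "
        "FROM venue_tags vt JOIN tags t ON t.id = vt.tag_id GROUP BY t.id"
    ).fetchall()
    return dict(rows)


tag_venue(1, "arts", "en")
tag_venue(1, "arte", "es")       # same tag, so venue 1 is not tagged twice
tag_venue(2, "kunst", "de")
tag_venue(2, "flohmarkt", "de")  # new tag, stored untranslated
print(tag_cloud("en"))
```

An untranslated tag falls back to whatever name is available, so it still shows up in every cloud until someone supplies the translation.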

Another thing to mention is that the indices of the DB grow considerably, since for SQL statements such as

SELECT id FROM tags WHERE en='arts' OR de='arts' OR es='arts' OR ct='arts';

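every one of the language columns has to be indexed (a single composite index over all four columns would not help, since the OR conditions do not share a leading column). Tags may be extremely convenient and intuitive for the user, but I'm really surprised how much care is needed to implement them efficiently.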
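A small sketch of the indexing cost, assuming SQLite and one single-column index per language column (a composite index over all four columns generally cannot serve an OR across them). The index names and the sample row are hypothetical, and other databases may plan the query differently.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE tags (id INTEGER PRIMARY KEY, en TEXT, de TEXT, es TEXT, ct TEXT)"
)
# One index per language column: every added language means another index.
for lang in ("en", "de", "es", "ct"):
    db.execute(f"CREATE INDEX idx_tags_{lang} ON tags ({lang})")
db.execute("INSERT INTO tags (en, de, es, ct) VALUES ('arts', 'kunst', 'arte', 'art')")

QUERY = "SELECT id FROM tags WHERE en = ? OR de = ? OR es = ? OR ct = ?"


def find_tag_ids(name):
    """Look up a tag by its name in any of the four languages."""
    return [row[0] for row in db.execute(QUERY, (name,) * 4)]


print(find_tag_ids("kunst"))

# Show how SQLite plans the lookup; the exact wording depends on the version.
for step in db.execute("EXPLAIN QUERY PLAN " + QUERY, ("arts",) * 4):
    print(step)
```

The four indices are what makes the index storage grow with the number of languages, on top of the table itself.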
