Automating translation of strings in OSM

I’ve been thinking a little bit about automating the translation of maps into multiple Indic languages ever since I saw the Kannada map at geoBLR in March.

I started some work on it today, and I have lots of interesting things to report. Right now I am transliterating rather than translating, but if a dictionary of common words/tags can be compiled, upgrading the script to translate as well should be doable.

Here’s the algorithm I followed:
  1. Get the nodes within a bounding box from OSM using overpy, the Python wrapper for Overpass. This returns a collection of nodes with their ID, tags, lat, lon and other attributes. The same can be done for ways by using the corresponding overpy query.
  2. Filter nodes that have tags
  3. From the result of the filter, identify nodes with Indic language tags, e.g. [“name:kn”]
  4. Transliterate the string value of tag[“name:kn”] into another language (I used Tamil) and store it in tag[“name:ta”]. I used the Indic transliterator APIs from SILPA for this
  5. Create a new changeset and upload the result (the node with tag[“name:ta”]) to OSM using osmapi
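
The steps above can be sketched roughly as follows. This is not the script itself, only an outline under a few assumptions: `transliterate` is a hypothetical callable standing in for the SILPA transliterator (its real interface is not shown here), `api` is assumed to be an already-authenticated `osmapi.OsmApi` instance, and the function names are made up for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Hypothetical signature: (text, source_lang, target_lang) -> transliterated text.
Transliterate = Callable[[str, str, str], str]


def fetch_nodes(south: float, west: float, north: float, east: float
                ) -> List[Tuple[int, Dict[str, str]]]:
    """Step 1: fetch every node inside a bounding box through Overpass."""
    import overpy  # imported lazily so the other steps run without it

    result = overpy.Overpass().query(
        f"node({south},{west},{north},{east});out;"
    )
    # Reduce overpy's node objects to plain (id, tags) pairs.
    return [(node.id, dict(node.tags)) for node in result.nodes]


def plan_transliterations(nodes: Iterable[Tuple[int, Dict[str, str]]],
                          source: str, target: str,
                          transliterate: Transliterate) -> Dict[int, str]:
    """Steps 2-4: keep tagged nodes that have name:<source>, then transliterate."""
    src_key, dst_key = f"name:{source}", f"name:{target}"
    planned = {}
    for node_id, tags in nodes:
        if not tags or src_key not in tags:
            continue  # untagged node, or no name in the source language
        if dst_key in tags:
            continue  # assumption: never overwrite a name someone already entered
        planned[node_id] = transliterate(tags[src_key], source, target)
    return planned


def upload(api: Any, planned: Dict[int, str], target: str, comment: str) -> None:
    """Step 5: write the new names to OSM inside a single changeset."""
    api.ChangesetCreate({"comment": comment})
    for node_id, new_name in planned.items():
        node = api.NodeGet(node_id)  # full node dict, including "version"
        node["tag"][f"name:{target}"] = new_name
        api.NodeUpdate(node)
    api.ChangesetClose()
```

Since `NodeGet` returns the node together with its current version, the dict handed to `NodeUpdate` already carries the version number the server expects.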

I did it only for one node: https://www.openstreetmap.org/edit?node=1118255762#map=19/12.99451/77.55430

Advantages
  • Indic to Indic transliterations – ✓ – The Indic transliterator APIs seem to convert quite effortlessly from one Indic language to another. Right now, support is available for Hindi, Tamil, Punjabi, Gujarati, Malayalam, Oriya, Bengali and Kannada. So, if a Kannada tag exists in OSM, the same text can be transliterated into multiple Indic languages using the naive algorithm I described above.
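
Fanning one existing name out to the other supported languages might look like the sketch below. The two-letter codes are the ISO 639-1 codes that OSM uses in `name:<code>` keys; `transliterate` is again a hypothetical stand-in for the SILPA call, and `missing_names` is a name invented for this example.

```python
from typing import Callable, Dict

# ISO 639-1 codes for the languages listed above.
SUPPORTED = {
    "hi": "Hindi", "ta": "Tamil", "pa": "Punjabi", "gu": "Gujarati",
    "ml": "Malayalam", "or": "Oriya", "bn": "Bengali", "kn": "Kannada",
}


def missing_names(tags: Dict[str, str], source: str,
                  transliterate: Callable[[str, str, str], str]) -> Dict[str, str]:
    """Build name:<code> tags for every supported language the node lacks."""
    text = tags[f"name:{source}"]
    return {
        f"name:{code}": transliterate(text, source, code)
        for code in SUPPORTED
        # Skip the source itself and any language that already has a name.
        if code != source and f"name:{code}" not in tags
    }
```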

Limitations

  • English to Indic transliterations – X – Though the Indic transliterator works for English to Indic transliterations as well, it is not very useful. This is because only English words that are in the CMU dictionary can be transliterated, which means that we can’t transliterate “Raajaajeenagar”, even if we had a custom tag for transliteration on OSM. When I emailed the developer of the transliterator about extending the English transliteration, I was told that adding words to the dictionary is one option. I am not sure how feasible this is, or whether it is any better than translating into one Indic language and transliterating+translating into the rest.
  • Translations of English Words – X – Right now, I am only able to transliterate words, but if a list of common words (I am guessing all the OSM tags, and other common words) could be compiled and translated into all the Indic languages, the translation process could be automated quite easily. This would require the algorithm to have two additional steps:
    1. From an Indic tag (i.e., an already translated tag), we would have to identify the portions that are in the translations list and leave them out of the transliteration process.
    2. For the word(s) identified in step 1, we must find a translation in the translations list for the language we are translating into. This must then be suffixed or prefixed to the transliterated portion. I am guessing a suffix will be the norm, while prefixes might occasionally be necessary.
  • Tracking node version numbers – X – Right now, I am unable to read the version attribute of a node using the overpy API, so I entered the version number manually. Not sure if I am missing something. This is a “need-to-figure-out” issue more than anything, but it matters for updating nodes automatically: if the version number passed to the API does not match the version number on the server, the server rejects the update.
  • Which Indic language to transliterate from – Issues might arise if a language like Tamil, where the same letter is used for ka, kha, ga, gha etc., is used as the source for transliterating into, say, Hindi. If we start from a language like Kannada or Hindi instead, this issue can probably be avoided.
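
The two additional translation steps described under “Translations of English Words” could be sketched like this. Everything here is an assumption made for illustration: the `TRANSLATIONS` table with its single sample entry for "road", the `convert_name` function, and the hypothetical `transliterate` callable standing in for the SILPA transliterator.

```python
from typing import Callable, Dict

# Hypothetical translations list: concept -> {language code -> word}.
TRANSLATIONS: Dict[str, Dict[str, str]] = {
    "road": {"kn": "ರಸ್ತೆ", "ta": "சாலை"},
}


def convert_name(name: str, source: str, target: str,
                 transliterate: Callable[[str, str, str], str],
                 translations: Dict[str, Dict[str, str]] = TRANSLATIONS) -> str:
    """Translate the words found in the translations list, transliterate the rest."""
    # Reverse index: source-language word -> target-language word.
    known = {
        words[source]: words[target]
        for words in translations.values()
        if source in words and target in words
    }
    converted = []
    for word in name.split():
        if word in known:
            # Steps 1 and 2: a listed word is translated, not transliterated.
            converted.append(known[word])
        else:
            converted.append(transliterate(word, source, target))
    # Word order is kept, so a suffix stays a suffix and a prefix a prefix.
    return " ".join(converted)
```

This only matches whole, space-separated words. A common word that is written joined to the name would need smarter matching than `split()`.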

The script is on GitHub. Feel free to fork it, use it, work on it, edit it and suggest changes, other languages, alternatives etc. Pull Requests very welcome. :)

This is my first time writing code in Python, so advice on improving code would be very welcome. Also, let me know if I’m missing something else, obvious or subtle.
