KIMA02: Towards a Sustainable Gazetteer

Benjamin of Tudela in the Sahara (author: Dumouza, 19th-century engraving). Source: Wikimedia Commons

The first round of the Pelagios Resource Development Grant enabled us to launch Kima, a Hebrew-script, attestation-based historical gazetteer. The resulting resource was a promising database, but it was still unbalanced and required more work to become usable as an encompassing, multipurpose gazetteer. We were thrilled, then, to hear that our application for the second round was successful.

The second RDG will enable us to consolidate the gazetteer through data entry: the OCR and OCR correction of two large print gazetteers, and the annotation, using Recogito, of the place names in two bilingual editions of medieval travel narratives. Beyond a rich resource in the Hebrew script, it will also enable us to offer a scalable contribution to any gazetteer, and an extension to Recogito, by developing workflows for gazetteer building through Recogito. We will expand here on three aspects of the work: populating the gazetteer; matching and geocoding; and, finally, opening and sustaining the gazetteer.

Populating

The present Kima collection contains a substantial body of ancient and late-antique attestations with a ‘tail’ of medieval and early modern attestations, as well as a very large, catalogue-based body of early modern and modern attestations. The gap in medieval sources stems partly from historical reasons and partly from the working procedures and policies of the Academy of the Hebrew Language, one of our main data providers. In the proposed phase of the project we would like to fill these gaps in order to make Kima a more balanced, representative and multilingual corpus. Hence, the data entry stage of the work comprises the digitization, OCR and OCR correction of two large print gazetteers, and the annotation, using Recogito, of the place names in two bilingual editions of medieval travel narratives.

The bulk of medieval Hebrew-script writing was done in non-Hebrew languages: first in Aramaic, the language of Rabbinic literature, and later in Judeo-Arabic, an Arabic language written in the Hebrew script. Both of these corpora are outside the purview of the Academy of the Hebrew Language. Fortunately, two expansive print gazetteers cover the relevant corpora; they can help us complement our corpus and achieve more historical, geographical and linguistic continuity with other Semitic-language gazetteers such as Syriaca and al-Thurayya:

Gottfried Reeg, Die Ortsnamen Israels nach der rabbinischen Literatur.
Ludwig Reichert Verlag, Wiesbaden 1983

Entries in the gazetteer include rich bibliographies, discussions on identification and suggestions for localization, as well as coordinates for the TAVO map. For reasons of copyright, only the names, variants and references to the primary sources where they appear will be entered into our database, and the index will be used to align the names with their Hebrew, Greek and transcribed forms.

Moshe Gil, Bemalchut Yishmael Betekufos Hageonim (Hebrew). Tel Aviv University and Bialik Publishing, Jerusalem 1997

The place index to the edition of Genizah manuscripts, in Volume 4, has normalized and vocalized forms leading to a document number. Both will be entered along with the original form as it appears in the (diplomatic) document page.

 

Another important addition to the corpus would be medieval itineraries. While the Academy of the Hebrew Language has managed to include in its annotated corpus every extant Hebrew text up to the end of the 11th century, the most fascinating geographical literature of the following centuries, including the famous travel narratives of Benjamin of Tudela and Petachia of Regensburg, is yet to be included. Translated editions of these texts are available in the public domain:

The Itinerary of Benjamin of Tudela; Critical Text, Translation and Commentary by Marcus Nathan Adler. London, Oxford University Press 1907

Travels of Rabbi Petachia of Ratisbon; Translated from the Hebrew, and published, together with the Original on opposite pages by Dr. A. Benisch. London, Messrs. Trubner & co. 1856

The work on the two editions would include scanning, OCR and OCR correction, encoding using Recogito, and basic structural annotation in XML. This would enable later alignment with Christian and Muslim travel and pilgrimage literature, and the development of a more elaborate schema for travel literature at a more advanced stage.

We also intend to use the encoding of these two texts as preparation for student assignments in the introductory DH courses that Sinai Rusinek will teach, starting November 2017, at Haifa University and Bar Ilan University.

Matching and Geocoding

Once the Aramaic and Judeo-Arabic additions to the KIMA ancient corpus have been made, we can proceed with the more challenging mission of matching the place names gathered in each collection with each other. Building on the lessons from the work on Kima Modern, parallel mappings of the collections to contemporaneous gazetteers will serve as quality assurance for the matching (see the Kima final blog post). This time the matching will be done against historical gazetteers: Pleiades, Trismegistos, TIR and Syriaca. The collaboration with the groups working on and maintaining these gazetteers will include data exchange, which was already partially discussed and agreed on in previous Pelagios meetings in Madrid and Leipzig.

Opening and Sustaining

While the data at the basis of the gazetteer is already available for download and reuse, and the code is available on GitHub, we aspire to advance the openness and sustainability of KIMA by making it available on a basic website as a database that can be queried through an API. This will enable KIMA to become a useful reference tool and application, and the Kima IDs to become persistent URLs for open use. In addition, the prototype Recogito plugin, which was developed by Dimid in consultation with Rainer Simon, should be further developed into a robust, generic plugin that can support the work of other gazetteers in the future. We propose doing this by adding the following features:

A. Recognition of composite place names
B. Fuzzy search options
C. Elaboration of the annotation and validation process to enable export of the results with statistics on true positives, false positives and false negatives.
D. Characterizing an additional component, which will enable updating the gazetteer with information conforming to our schema.
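
The statistics named in feature C can be illustrated with a minimal sketch. It assumes that each tag is reduced to a (start, end) character span and that the manually validated tags serve as the reference; the names `ValidationStats` and `validation_stats` are hypothetical and are not part of Recogito or of the plugin.

```python
from dataclasses import dataclass
from typing import Set, Tuple

Span = Tuple[int, int]  # (start offset, end offset) of a tagged place name


@dataclass
class ValidationStats:
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        # Share of automatic tags that the validator confirmed.
        found = self.true_positives + self.false_positives
        return self.true_positives / found if found else 0.0

    @property
    def recall(self) -> float:
        # Share of validated place names that were found automatically.
        expected = self.true_positives + self.false_negatives
        return self.true_positives / expected if expected else 0.0


def validation_stats(automatic: Set[Span], validated: Set[Span]) -> ValidationStats:
    """Compare automatic tags with manually validated ones (exact span match)."""
    return ValidationStats(
        true_positives=len(automatic & validated),
        false_positives=len(automatic - validated),
        false_negatives=len(validated - automatic),
    )


# Illustrative spans only, not taken from a real annotation.
automatic = {(0, 4), (10, 15), (20, 25)}
validated = {(0, 4), (20, 25), (30, 36)}
stats = validation_stats(automatic, validated)
print(stats, round(stats.precision, 2), round(stats.recall, 2))
```

Exact span matching is the strictest possible comparison; a real export might also want to count partial overlaps, which matters for composite place names (feature A).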

The last two features can be used for continuous manual updating of the gazetteer with the texts contributed by our data providers, and will strengthen the connections between Recogito and its hosted gazetteers, so that Recogito also serves as a feeding mechanism for the gazetteers or their underlying databases. The development will be done in concert with the annotation of the two travel narratives, which will serve as use cases for interactive annotation and gazetteer updating. Thus, a workflow will be modeled in which a series of questions is asked about each new tag: Was this place name automatically identified? If not, was a similar form offered, of which it is a variant? If yes, is the date of the annotated text within the registered date span of the variant? If it was not suggested by the gazetteer, could it be (manually) identified as a place that appears in the gazetteer? These questions will structure a decision tree for adding the annotated forms to the database as an attestation of an existing variant, as a new variant of an existing place, or as a new place in the gazetteer, which will then require additional verification and matching attempts.
The combination of features C and D will also support the training of named entity recognition, whether rule-based or through machine learning, and thus the continuous improvement of the plugin.
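
One possible reading of the decision tree described above can be sketched as follows. The field names, the `Outcome` labels and the separate outcome for an attestation that falls outside the registered date span are assumptions made for illustration; the actual workflow is still to be modeled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Outcome(Enum):
    ATTESTATION_OF_EXISTING_VARIANT = "attestation of an existing variant"
    ATTESTATION_OUTSIDE_DATE_SPAN = "attestation outside the variant's date span"
    NEW_VARIANT_OF_EXISTING_PLACE = "new variant of an existing place"
    NEW_PLACE = "new place (needs verification and matching)"


@dataclass
class Tag:
    auto_identified: bool        # was the form found in the gazetteer automatically?
    similar_form_offered: bool   # was a similar form offered, of which it is a variant?
    manually_identified: bool    # could the annotator link it to an existing place?
    text_year: int               # date of the annotated text
    variant_span: Optional[Tuple[int, int]] = None  # registered (first, last) year


def classify(tag: Tag) -> Outcome:
    """Walk the questions in order and return where the tagged form belongs."""
    if tag.auto_identified:
        span = tag.variant_span
        if span is not None and span[0] <= tag.text_year <= span[1]:
            return Outcome.ATTESTATION_OF_EXISTING_VARIANT
        # Assumption: a known form attested outside its span is flagged for review.
        return Outcome.ATTESTATION_OUTSIDE_DATE_SPAN
    if tag.similar_form_offered or tag.manually_identified:
        return Outcome.NEW_VARIANT_OF_EXISTING_PLACE
    return Outcome.NEW_PLACE


# Hypothetical tags; the years are placeholders, not dates from the gazetteer.
known = Tag(True, False, False, text_year=1180, variant_span=(1100, 1900))
unseen = Tag(False, False, False, text_year=1180)
print(classify(known).value)
print(classify(unseen).value)
```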

Kima phase one hangover: Georeferencing, Evaluation, Validation and more refining

This post, too, was first published on April 1st, 2017, on Pelagios Commons.

As mentioned in the previous post, georeferencing our places, which we expected to be a simple script run and some tweaking of the data, turned out to be a more complex and challenging mission, though one that came with a lesson and, as a bonus, a way of validating our work.

Approaching the task of georeferencing our places, we first had to admit that our attestation-based toponym corpus contains two very different datasets: the part based on tagged texts, mostly ancient, and the part based on printed book catalogues (hence, 15th century and later). Not only do the sources and methods differ; the two datasets represent very different phenomena: toponyms in discourse, and the world of Hebrew print. They should therefore not be used as one database. To avoid anachronistic misidentification, geocoding the two datasets also requires different types of work: ancient toponyms have to be matched against sources like Trismegistos, TIR and Pleiades, while modern ones are better matched against Geonames. We are still in the process of acquiring the ancient datasets for matching.

In the first attempt to match “Kima Modern” against Geonames, we tried to match the English primary forms of our places against the entire Geonames database. In a preliminary examination of a sample of the results, we noticed that multiple matches were suggested for each form. Aiming for more precision, we limited the search to the Geonames feature class P, which contains cities, towns and other types of settlements (as opposed to other types of geographic features, such as regions or rivers). We could do this because the place names in the Modern dataset were all places where books were printed.
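
The restriction to feature class P can be sketched as below. The column order follows the tab-separated layout of the Geonames dump files, but the sample rows are truncated to the first eight columns and invented for illustration, and the function names are hypothetical; this is not the script we actually ran.

```python
import csv
import io
from typing import Dict, List

# First eight columns of the tab-separated Geonames dump layout.
COLUMNS = ["geonameid", "name", "asciiname", "alternatenames",
           "latitude", "longitude", "feature_class", "feature_code"]

# Invented sample rows (ids and coordinates are placeholders).
SAMPLE = (
    "1\tKrakow\tKrakow\tCracow,Krakau\t50.06\t19.94\tP\tPPLA\n"
    "2\tVistula\tVistula\tWisla\t54.36\t18.95\tH\tSTM\n"
    "3\tAmsterdam\tAmsterdam\tAmstelodamum\t52.37\t4.89\tP\tPPLC\n"
)


def load_settlements(tsv_text: str) -> Dict[str, List[str]]:
    """Index every name of feature-class-P records by its case-folded form."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    index: Dict[str, List[str]] = {}
    for row in reader:
        record = dict(zip(COLUMNS, row))
        if record["feature_class"] != "P":
            continue  # skip rivers, regions and other non-settlement features
        names = {record["name"], record["asciiname"],
                 *record["alternatenames"].split(",")}
        for name in names:
            if name:
                index.setdefault(name.casefold(), []).append(record["geonameid"])
    return index


def match(primary_form: str, index: Dict[str, List[str]]) -> List[str]:
    """Return the ids of all settlements carrying this name (possibly several)."""
    return index.get(primary_form.casefold(), [])


index = load_settlements(SAMPLE)
print(match("Cracow", index), match("Vistula", index))
```

A list is returned rather than a single id because, as noted above, one form often has several candidate matches that still need disambiguation.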

We were still left with many unmatched place names, for which we attempted another method: matching all Hebrew variants against Geonames. In addition to the more than 900 places found with the first approach, 200 additional Hebrew variants were matched. This still left us with quite a few unmatched place names, as well as places with multiple matches. We were facing long hours of manual selection and retrieval of coordinates.

Recall of coordinates and multiple results were not, however, our only problems. In the course of perusing the results, we realized that many of the automatic matches were wrong. Whether the cause of the misidentification lay in our catalogue, in our failure to refine the data, or in a limitation of Geonames, we realized that we could no longer trust the data as we had hoped to.

The solution presented itself in retracing our steps: this time we matched all attestations against Geonames while at the same time matching the primary English form, and included in the script a comparison between the Geonames entries extracted by each method. This time the statistics were better: though 50% of the results were still matched by only one method, we managed to reduce the records that would have to be manually geocoded to 6% of the attestations (15,948 attestations, which correspond to over 1,500 variants and just under 600 places). Moreover, we obtained automatic validation for the identification of 44% of our attestations: over 100,000, corresponding to 500 places. The statistics here represent, in fact, the level of certainty in our identifications: the unknown, the certain, and the still to be validated.
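
The comparison between the two matching methods can be sketched roughly as follows. The three labels mirror the categories named above; treating two methods that both return results but disagree as "to_validate" is an assumption, and the ids are placeholders rather than real Geonames ids.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def certainty(by_attested_form: Set[str], by_primary_form: Set[str]) -> str:
    """Classify one attestation by how the two matching methods agree."""
    if not by_attested_form and not by_primary_form:
        return "unknown"        # no match at all: manual geocoding needed
    if by_attested_form & by_primary_form:
        return "certain"        # both methods returned the same entry
    # One method only, or (by assumption) two methods that disagree.
    return "to_validate"


def summarize(matches: Dict[str, Tuple[Set[str], Set[str]]]) -> Counter:
    """Count attestations per certainty level."""
    return Counter(certainty(a, b) for a, b in matches.values())


# Placeholder data: attestation id -> (ids via attested form, ids via primary form).
matches = {
    "att-1": ({"g100"}, {"g100"}),
    "att-2": ({"g200"}, set()),
    "att-3": (set(), set()),
    "att-4": ({"g300"}, {"g301"}),
}
summary = summarize(matches)
print(dict(summary))
```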

 

And now what?

Our next step (hopefully this coming week) will be to match our data again, this time against Wikidata. We will again use the power of cross-checking and parallel matching as validation, not only between Geonames and Wikidata, but as a way to validate as much of the data as possible.

We expect to have our data geocoded before the Passover holiday, and we will store it in this folder, where you can all view and download it. Until then, you can find here the yet-to-be-revalidated and yet-to-be-georeferenced data. To follow the code, look in Dimid’s GitHub here.

 

Aggregating Kima: an intermediate report

This post was first posted on November 29th, 2016, on Pelagios Commons.

With a not-so-fashionable delay, I am now reporting on the second phase of our work. This phase started in October when, in the midst of the Jewish holiday deluge, our data whiz, Glauco Mantegari, came to Jerusalem to work with us on KIMA, the Hebrew Gazetteer.

PLACES

Glauco at Gehenna, overlooking Mt. Zion

Having no center or institution to call our own, we worked in various chosen locations in Jerusalem and Tel Aviv, and the names of the places we frequented, with their stories, echoed our work. It is only appropriate, therefore, to start with the picture of Glauco Mantegari refining data on the slope of the notorious valley of the son of Hinnom, גיא בן הינום, which, over the ages, gave its name to the idea of hell and purgatory: Gehenna. The evening was warm and pleasant, but on the screen, the purgatory of Google Refine was hard at work in a heroic attempt to bring order and meaning to a gigantic mess of historical data.

OUR SOURCES

The intensive week of work in Jerusalem started with meetings with our data providers to discuss their data conventions. Each provider had its own ways of formulating temporal information, certainty, accuracy and historicity (e.g. the expressions “Not before”, “around”, “probable” or “cataloger addition”). First was the Academy of the Hebrew Language. The Academy was established in 1953 with the mission “to direct the development of Hebrew in light of its nature”. One of its core projects is the preparation of a Hebrew Historical Dictionary. For this purpose, it created a database of texts from the different historical strata of the language, in which each word in each historical text is analysed for its syntactic and semantic features. The corpus includes the texts of all extant Hebrew compositions from the time of the canonization of the Hebrew Bible until the end of the Geonic period, some medieval Hebrew texts, and large selections of Hebrew literature from the mid-18th century until the founding of the State of Israel.

Figuring out the schema of the Academy of the Hebrew Language

From this cornucopia of language we received a total of 68,108 textual attestations for 3,678 unique place names, in various forms and spellings. We complemented these with 6,351 attestations of place names from the Hebrew Bible, which were kindly extracted for us by Dirk Roorda and Martijn Naaijer from SHEBANQ. The biblical place names were collected from a digital version of the Biblia Hebraica Stuttgartensia (BHS), which is available in the text database of the Hebrew Bible behind SHEBANQ, a system for the study of the Hebrew Bible.

 

While the data we received from the Academy and from SHEBANQ was extracted from bodies of text, our second main data source is of a different nature altogether: the library catalog. Our provider here is the National Library of Israel, which made available two catalogs and one large thesaurus of authority files, titled “Agron”. The first catalog, the Bibliography of the Hebrew Book, is the fruit of many years of work on a project that documented over 100,000 records of known printed works in Jewish languages, found in collections in Israel and abroad, and printed from the time of the early press in the mid-15th century to 1960. The second is the catalog of the National Library itself, which amounts to 300,000 Hebrew records.

What makes these catalog records valuable for a gazetteer is the librarians’ faithful practice of documenting, in a designated field (MARC field 260, subfield a), the historical name of the place of publication as it was written at the time of printing, normally on the title page. Conveniently for us, in many cases a normalized form of the place name was entered by the cataloger in a separate field, and the most fortunate cases are those in which it was linked to a normalized authority-record place name, such as those kept in the library’s “Agron”.

Dimid Duchovny, Glauco Mantegari and Sinai Rusinek posing against the backdrop of Isaiah’s Qumran scroll in Mordecai Ardon’s windows, at the National Library. http://web.nli.org.il/sites/NLI/English/library/aboutus/past/buildings/Pages/ardon.aspx

THE SCHEMA

Having learnt our data providers’ various conventions, our next step was to conjure up a GeoJSON schema that would make it possible to translate the data from the various sources and aggregate it in one structure. The weeks that followed were dedicated to translating the data from the various sources into the schema, adjusting it when needed, and finally matching and joining the sources together. This is where the challenges of messy and big data surface, from hidden encoding variations (even within one encoding system!) to our computers protesting and fainting from the hard labor of computation. But we will prevail! The first version of the core of our gazetteer will be at Pelagios soon.
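
To give a rough idea of what aggregating places, variants and attestations in one GeoJSON structure might look like, here is a minimal sketch. Only the outer `Feature` layout is standard GeoJSON; every property name, the identifier and the attestation values are hypothetical and do not reproduce the actual Kima schema.

```python
import json
from typing import List, Optional, Tuple


def place_feature(place_id: str, primary_form: str, variants: List[dict],
                  coordinates: Optional[Tuple[float, float]] = None) -> dict:
    """Wrap one place, its variants and their attestations in a GeoJSON Feature."""
    geometry = None  # GeoJSON allows a null geometry for places not yet geocoded
    if coordinates is not None:
        lon, lat = coordinates  # GeoJSON order is longitude, then latitude
        geometry = {"type": "Point", "coordinates": [lon, lat]}
    return {
        "type": "Feature",
        "id": place_id,
        "geometry": geometry,
        "properties": {"primary_form": primary_form, "variants": variants},
    }


# Hypothetical record: the property names and values are illustrative only.
feature = place_feature(
    "kima:example-1",
    "Krakow",
    [{
        "form": "קראקוב",
        "attestations": [{
            "source": "catalog record (hypothetical)",
            "date": {"not_before": 1600, "certainty": "probable"},
        }],
    }],
)
print(json.dumps(feature, ensure_ascii=False, indent=2))
```

Keeping the temporal qualifiers ("not before", "probable") as structured fields rather than free text is what lets records from different providers be compared after aggregation.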

CONNECTING TO RECOGITO

While assisting the gazetteer-making process with scripting and parsing, our team’s developer, Dimid Duchovny, was also working with Pelagios’ Rainer Simon on adapting a plugin for Recogito to enable automatic as well as manual markup of Hebrew place names.

To test the plugin, I evaluated a preliminary automatic markup of a medieval text that I had manually marked up in advance: the Journeys of Rabbi Petachia of Regensburg. This revealed the predicaments of Hebrew NLP. First and foremost, the lack of vowel letters creates multiple ambiguities: the word אולם, for example, may be read as the name of the German city of Ulm, but also as “ulam”, the Hebrew word for “but”. This is a problem that could only be reduced by morphological analysis of each text, or by automatic addition of vowel diacritics.

A second interesting predicament, caused by Jewish and Israeli geographical history, is apparent in the markup: several personal names mentioned in the medieval text, such as Amazia, Tuval and Rabbi Petachia’s own name (all traditional, biblical personal names), are detected and marked as place names. Indeed, in the 20th century, newly established kibbutzim and moshavim were named after ancient kings and heroes, thus creating a challenge for linguistic disambiguation. And vice versa: many modern Hebrew personal names are taken from place names, whether biblical or not. This is a problem that could be solved in future versions of Recogito if users are given the option of selecting temporal subsets of the gazetteer.
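
The idea of selecting a temporal subset of the gazetteer can be sketched as a simple date-span filter. The `Variant` fields, the years and the place labels below are placeholders chosen for illustration; they are not taken from the Kima data, and this is not an existing Recogito feature.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variant:
    form: str         # the Hebrew-script form as attested
    place: str        # label of the place it belongs to
    first_year: int   # earliest registered attestation
    last_year: int    # latest registered attestation


# Placeholder entries; the date spans are invented for the example.
GAZETTEER = [
    Variant("אולם", "Ulm", 1100, 1900),
    Variant("אמציה", "Amazia (modern settlement)", 1955, 2000),
]


def candidates(form: str, text_year: int, gazetteer: List[Variant]) -> List[Variant]:
    """Return only the variants whose date span covers the date of the text."""
    return [v for v in gazetteer
            if v.form == form and v.first_year <= text_year <= v.last_year]


# For a 12th-century text (year assumed here as 1180), the modern settlement
# is no longer offered, so the personal name is not tagged as a place.
print(candidates("אמציה", 1180, GAZETTEER))
print(candidates("אמציה", 1990, GAZETTEER))
```

A date filter of this kind addresses only the anachronism problem; the homograph problem (אולם as "Ulm" or "but") still requires morphological analysis or manual validation.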

For both of these problems, at this point, we have to rely on manual correction through Recogito’s validation function. With the data openly available, however, we hope it will attract NLP scholars who will take up the challenge of training their named entity recognition software and applying it to KIMA.

Detail from the Ardon windows depicting the verse “Come let us go up to the Mountain of the Lord” (Isaiah II:2-4) in several languages and alphabets. http://web.nli.org.il/sites/NLI/English/library/aboutus/past/buildings/Pages/ardon.aspx

Introducing Kima

This was first posted on the Pelagios Commons blog on July 30, 2016

From the place-of-publication column of the Bibliography of the Hebrew Book database

I received the wonderful news about the Pelagios Commons Resource Development Grant just before leaving my home in the city of 70 (Hebrew) names to go to the wondrous DH2016 conference in beautiful Krakow. Krakow may have only one Hebrew name, written קראקוב, but if you check the imprints of books printed there in the Hebrew script over the last four centuries, you will find that it was also spelled קראקו, קראקוי, קראקוי, קראקע, קרקא, קרקו, קרקוי and קרקוב. And this is a relatively simple case: look at what happens to the Hebrew Amsterdam!

Spreadsheet functions like filter and sort may help in looking for variations of place names, and the wonderful OpenRefine can do wonders in cleaning large collections. Yet we need to know more in order to also connect, for example, קושטא (Kosta), קונשטנדינה (Konstandina), סטמבולי (Stamboli), קונסטנטינופול (Constantinopole) and איסטנבול (Istanbul): some of the more common historical names of a single city in Hebrew and in other Jewish languages, each of which occurs in various phonetic transmutations and spellings. This knowledge is out there, or perhaps it is more appropriate to say IN there: in printed books, paper maps or siloed databases. Our goal is to assemble it, link it and open it to human and machine readers alike.
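
A small sketch may help show why string comparison alone is not enough. It uses `difflib.SequenceMatcher` from the Python standard library; the 0.8 threshold and the function names are assumptions made for illustration. Spelling variants of one name score high, while different historical names of the same city do not, which is exactly where a gazetteer is needed.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Ratio between 0 and 1 of matching characters in the two strings."""
    return SequenceMatcher(None, a, b).ratio()


def close_variants(query: str, known_forms: List[str],
                   threshold: float = 0.8) -> List[Tuple[str, float]]:
    """Return known forms similar to the query, best matches first."""
    scored = [(form, similarity(query, form)) for form in known_forms]
    return sorted([pair for pair in scored if pair[1] >= threshold],
                  key=lambda pair: pair[1], reverse=True)


# Spelling variants of Krakow are caught; an Amsterdam form is not.
known_forms = ["קראקו", "קרקוב", "אמשטרדם"]
hits = close_variants("קראקוב", known_forms)
print(hits)

# Two names of the same city share almost no letters, so the score is low.
print(round(similarity("קושטא", "איסטנבול"), 2))
```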

Why Hebrew?

The 41 stations of the Israelites in the desert. Detail from a map by Avraham Ben Ya’akov Hager, Amsterdam 1712. Made available online by the National Library of Israel.

Hebrew place names are a complex, problematic and fascinating phenomenon that stretches far beyond the spaces in which Hebrew speakers dwell. Their history spans over three millennia, starting with the biblical place names that constitute a common world heritage. Mt. Sinai and Mt. Zion exist not only in Near Eastern topography, but also in the imagination of anyone who has ever read the Psalms; Golgotha in Jerusalem and the Sea of Galilee (a.k.a. the Kinneret or Lake Gennesaret) are depicted on the walls of thousands of churches throughout the world. The ancient Hebrew memory of space spreads further, from the lands of origin in Mesopotamia to the Egyptian and Babylonian captivities, which gave these two areas a parallel existence as places and as global metaphors for striving for freedom and for yearning to come home.

Since late antiquity and medieval times, centuries of Jewish diasporic existence have projected their own coordinate system onto the map of the world. Though muted as a spoken language, Hebrew was very much alive as a poetic and especially as a Halachic (religious-legal) language, which interacted with local contemporaneous place names whenever and wherever Jewish people dwelled: the spaces of local rabbinic authority, the network of responsa correspondence, and diverging cultural, Halachic and liturgical traditions created the cultural continents of Sepharad and Ashkenaz, categories at once anchored in geography and mutable along temporal and cultural dimensions. The importance of space in Jewish religious practice, together with meticulous attention to language, engendered a rich toponymic literature such as the “Shemot Gitin” lists of place names, required by Jewish divorce law. In modern history, the Pale of Settlement, with its unique Jewish life of the shtetl (small provincial town), its social welfare systems, Hasidic rabbinic courts and yeshiva education centers, was part of a European geography now extinct. To the south and east spread the history of Jewish life in the Arab world, which was also obliterated in the 20th century. Efforts to record, map and virtually reconstruct these spaces abound, and yet the resources collected are not yet available as open and linked geographical data, to which the surviving cornucopia of historical Hebrew-script texts, periodicals and literature could be connected.

A note on the languages:

The Hebrew script was used not only for pre-modern and modern Hebrew, but also for Aramaic and Yiddish, as well as for several families of Jewish languages, such as Judeo-Arabic, Judeo-Spanish (Ladino) and others: a variety engendered by diglossia and multiple contacts with local populations. Thus the Hebrew-script names (which will also be made available in transcription) may contain historical evidence pertinent to the toponymies of other languages as well.

We elaborate on our preliminary work plan for the Hebrew Historical Gazetteer in the proposal, which you may find here. At present, we are engaged in surveying print and digital sources and resources, figuring out which and how to digitize (without impinging on copyrights) and building a data model.

We will be grateful for any comments and suggestions!
