latest news

01.09.2015

An interactive demo of the Concordia system was released.

28.04.2015

The first version of the Concordia project is out!

24.04.2015

English version added.

21.04.2015

Changed main picture.

04.10.2013

Site update for winter semester 2013/2014.

29.03.2011

Site update.

02.10.2009

Site update for winter semester 2010/2011.

25.09.2009

Site created.

additional links

EduWiki system

contact

dr Rafał Jaworski,
rjawor at amu.edu.pl

Concordia

I am the developer of the Concordia project - a full-text search library best suited for Computer-Aided Translation. Be sure to visit its home page at http://tmconcordia.sourceforge.net/

publications

Be sure to check my profile at Google Scholar. Among other things, it provides the data below in BibTeX format. You can also visit my profile at ResearchGate.

2021

  • [31] Jaworski R., Dunđer I., Seljan S. (2021) Usability Analysis of the Concordia Tool Applying Novel Concordance Searching. In: Rocha Á., Ferrás C., López-López P.C., Guarda T. (eds) Information Technology and Systems. ICITS 2021. Advances in Intelligent Systems and Computing, vol 1330. Springer, Cham. https://doi.org/10.1007/978-3-030-68285-9_14

2020

  • [30] K. Stroński, R. Jaworski: "Purānī pahāṛii bhāṣāoṁ ke ḍijiṭal saṁsādhan", in: B. K. Joshi, B. K. Joshi (Eds.), Language Endangerment and Language Revitalization in Himalaya: Proceedings of the International Seminar on Endangered Languages of Himalaya, Almora 2018, pp. 57–67, Bishen Singh Mahendra Pal Singh, 2020
  • [29] K. Stroński, J. Tokaj, R. Jaworski: "Diachrony and typology of non-finites in Indo-Aryan", http://doi.org/10.14746/9788395414442, 2020

2019

  • [28] K. Kułak, R. Jaworski: "The effect of individual epistemological factors on attitudes to nonstandard language use in native speakers of Polish", Poznan Studies in Contemporary Linguistics, 55(1), pp. 135-156, 2019

2018

  • [27] O. Witczak, R. Jaworski: "CAT Tools Usability Test with Eye-Tracking and Key-Logging: Where Translation Studies Meets Natural Language Processing", Między Oryginałem a Przekładem, 24(41), pp. 49-74, 2018 [pdf]
  • [26] R. Jaworski, K. Jassem, K. Stroński: "Binary Classification Algorithms for the Detection of Sparse Word Forms in New Indo-Aryan Languages", Human Language Technology. Challenges for Computer Science and Linguistics, Lecture Notes in Artificial Intelligence, vol. 10930, ISBN: 978-3-319-93781-6, pp. 123-136, Springer 2018

2017

  • [25] R. Jaworski, K. Stroński: "Automatic Converb Detection in Early Braj", Human Language Technologies as a Challenge for Computer Science and Linguistics - Proceedings of the 8th Language & Technology Conference, November 17-19, Poznań, Poland, pp. 342-346 [pdf]
  • [24] R. Jaworski, S. Seljan, I. Dunđer: "Towards educating and motivating the crowd – a crowdsourcing platform for harvesting the fruits of NLP students' labour", Human Language Technologies as a Challenge for Computer Science and Linguistics - Proceedings of the 8th Language & Technology Conference, November 17-19, Poznań, Poland, pp. 332-336 [pdf]
  • [23] F. Graliński, R. Jaworski, P. Wierzchoń: "Towards Automatic Detection of Correct Domain Words in OCR Texts from Polish Digital Libraries", Human Language Technologies as a Challenge for Computer Science and Linguistics - Proceedings of the 8th Language & Technology Conference, November 17-19, Poznań, Poland, pp. 274-278 [pdf]
  • [22] R. Jaworski, M. Ogrodniczuk: "Expanding the functionalities of the Language Resources Switchboard by integrating a set of tools for the processing of Polish language", Proceedings of the CLARIN Annual Conference 2017 in Budapest, Hungary [pdf]
  • [21] F. Graliński, R. Jaworski, Ł. Borchmann, P. Wierzchoń: "The RetroC challenge: how to guess the publication year of a text?", Proceedings of the Digital Access to Textual Cultural Heritage conference (DATeCH2017), Göttingen, Germany, 2017 [pdf]
  • [20] R. Jaworski, K. Stroński: "Recognition and multi-layered analysis of converbs in early NIA", Proceedings of the 33rd South Asian Languages Analysis Round Table SALA-33, Poznań, Poland, pp. 55-56, 2017 [Full conference proceedings]
  • [19] O. Witczak, R. Jaworski: "CAT tools usability test with eye-tracking and key-logging: where Translation Studies meets Natural Language Processing", Points of View in Translation and Interpreting, Kraków, Poland, 2017 [pdf]

2016

  • [18] F. Graliński, R. Jaworski, Ł. Borchmann, P. Wierzchoń: "Vive la Petite Différence! Exploiting Small Differences for Gender Attribution of Short Texts", in: A. Horak, K. Pala, P. Rychly, A. Rambousek (Eds.) Proceedings of CBBLR 2016 Community-based Building of Language Resources, Brno, Czech Republic, pp. 9-15, 2016 [pdf]
  • [17] F. Graliński, R. Jaworski, Ł. Borchmann, P. Wierzchoń: "Vive la Petite Différence! Exploiting Small Differences for Gender Attribution of Short Texts", in: Petr Sojka, Ales Horak, Ivan Kopecek, Karel Pala (Eds.) Text, Speech and Dialogue - Proceedings of 19th International Conference TSD 2016, Lecture Notes in Artificial Intelligence vol. 9924, pp. 54-61, 2016 [pdf]
  • [16] A. Jaworska, R. Jaworski, D. Dzienisiewicz: "Zastosowanie archiwaliów i nowoczesnych technologii w służbie badania języka", XVIII OZSA: Przeszłość dla przyszłości, Poznań, 2016 [pdf]
  • [15] F. Graliński, R. Jaworski, Ł. Borchmann, P. Wierzchoń: "Gonito.net - Open Platform for Research Competition, Cooperation and Reproducibility", in: Branco, António, Nicoletta Calzolari and Khalid Choukri (eds.), Proceedings of the 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language, pp. 13-20, 2016 [pdf]
  • [14] R. Jaworski, K. Stroński: "New perspectives in annotating early New Indo-Aryan texts", Proceedings of the 32nd South Asian Languages Analysis Roundtable (SALA-32), pp. 66-68, 2016 [Full conference proceedings]

2015

  • [13] Ł. Borchmann, F. Graliński, R. Jaworski, P. Wierzchoń: "A semi-automatic method for thematic classification of documents in a large text corpus", Proceedings of the Workshop on Corpus-Based Research in the Humanities (CRH) pp.13-21, 2015 [Full conference proceedings]
  • [12] R. Jaworski, K. Jassem, K. Stroński: "Manual and Automatic Tagging of Indo-Aryan Languages", Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 550-554, 2015 [pdf]
  • [11] K. Jassem, F. Graliński, M. Junczys-Dowmunt, P. Skórzewski, R. Grundkiewicz, M. Walas, R. Jaworski, T. Dwojak: "PSI-Toolkit - an Extensible and Tightly Integrated Set of NLP Tools", Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 280-282, 2015 [pdf]
  • [10] R. Jaworski: "A novel method for finding and scoring valuable translation memory repetitions", Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 155-159, 2015 [pdf]
  • [9] K. Anderson, L. Duranti, R. Jaworski, H. Stancić, S. Seljan, V. Mateljan (eds.): The Future of Information Sciences: e-Institutions, Openness, Accessibility and Preservation, Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb, 2015.
  • [8] R. Jaworski: "Approximate sentence matching and its applications in corpus-based research", The Future of Information Sciences: e-Institutions, Openness, Accessibility and Preservation, pp. 21-30 (keynote paper), 2015 [docx]
  • [7] K. Jassem, R. Jaworski, K. Stroński: "IATagger – a Tool for Tagging Indo-Aryan Texts", Proceedings of the Poznań Linguistic Meeting Conference, 2015 [abstract]

2014

  • [6] R. Jaworski, R. Ziemlinska: "The translaide.pl system: an effective real world installation of translation memory searching and EBMT", Proceedings of the 17th Annual Conference of the European Association for Machine Translation EAMT2014 p.53, 2014 [Full conference proceedings]

2013

  • [5] Doctoral dissertation: "Algorytmy przeszukiwania i przetwarzania pamięci tłumaczeń" (Algorithms for searching and processing translation memories), 2013 [pdf]
  • [4] R. Jaworski: "Anubis - speeding up Computer-Aided Translation", Computational Linguistics – Applications, Studies in Computational Intelligence vol. 458, pp. 263-280, Springer-Verlag, 2013 [pdf]

2011

  • [3] R. Jaworski: "A sentence Clustering Algorithm for Specialized Translation Memories", Speech and Language Technology (SLT) vol. 12/13, pp. 97-103, 2011 [doc]

2010

  • [2] R. Jaworski, K. Jassem: "Building high quality translation memories acquired from monolingual corpora", Proceedings of the IIS 2010 Conference, pp. 157-168, 2010 [pdf]
  • [1] R. Jaworski: "Computing transfer score in Example-Based Machine Translation", Lecture Notes in Computer Science (LNCS), Springer-Verlag, pp. 406-416, 2010 [pdf]

scientific work

In my scientific work I conduct studies in computational linguistics, search algorithms and selected machine translation techniques. I would be glad to start scientific projects with, or to help, anyone interested in the following subjects:

Approximate searching - my main field of interest. Its task is to find occurrences of an input pattern in a text, where the occurrences need not match the pattern exactly. The technique is applied whenever the searched resource might contain errors (e.g. OCR corpora).
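
To make the idea concrete, here is a minimal sketch of approximate pattern matching based on edit distance (the classic dynamic-programming formulation in which a match may start anywhere in the text). It is only an illustration under simple assumptions: the function name `approximate_find` and the sample strings are hypothetical, and every insertion, deletion or substitution costs 1.

```python
def approximate_find(pattern, text, max_errors):
    """Return (end_index, distance) for every position in `text` where some
    substring ending there is within `max_errors` edits of `pattern`."""
    m = len(pattern)
    # prev[i] = smallest edit distance between pattern[:i] and any substring
    # of the text ending at the previous text position.
    prev = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text):
        # The empty pattern prefix matches anywhere at no cost, which is
        # what allows an occurrence to start at any text position.
        curr = [0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr.append(min(prev[i - 1] + cost,  # match or substitution
                            prev[i] + 1,         # extra character in the text
                            curr[i - 1] + 1))    # pattern character missing
        if curr[m] <= max_errors:
            hits.append((j + 1, curr[m]))
        prev = curr
    return hits


# An OCR-style error ("seorch" instead of "search") is still found
# when one error is allowed.
print(approximate_find("search", "OCR text: seorch engine", 1))
```

The running time of this sketch is proportional to the pattern length times the text length, so it is suitable only for small inputs; indexing large corpora requires more elaborate techniques.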

Example-Based Machine Translation (EBMT) - my former main field of interest and the subject of my Master's thesis. EBMT is translation based on the principle of analogy. The idea can be explained as follows:
An example is a pair of sentences (in the source and target language respectively) in which the target sentence is a good translation of the source sentence. A large set of examples is called a translation memory. EBMT of an input sentence is done in the following steps:

  1. Find an example in the translation memory whose source sentence is most similar to the input sentence.
  2. Modify the example so that its source sentence resembles the input sentence as closely as possible.
  3. Return the modified target sentence as translation result.

As you can see, some of the above steps are non-trivial. What does it mean for two sentences to be "similar"? How can a translation memory be acquired? And, most importantly, how should the example be modified? All of these problems are part of EBMT.
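
The three EBMT steps can be sketched in a few lines. This is a toy illustration, not the method of any particular system: the two-entry translation memory, the word dictionary and the function names are all hypothetical, similarity is measured with `difflib.SequenceMatcher` over word tokens (one possible choice among many), and the example is modified only by single-word substitution.

```python
from difflib import SequenceMatcher

# Hypothetical toy data, invented for this illustration only.
TRANSLATION_MEMORY = [
    ("I have a red car", "Mam czerwone auto"),
    ("It is raining", "Pada deszcz"),
]
DICTIONARY = {"red": "czerwone", "green": "zielone"}


def similarity(a, b):
    """Word-level similarity in [0, 1]; an assumed, very simple measure."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def translate(sentence):
    # Step 1: find the example whose source is most similar to the input.
    source, target = max(TRANSLATION_MEMORY,
                         key=lambda example: similarity(example[0], sentence))

    # Step 2: modify the example. Wherever exactly one source word was
    # replaced by one input word, swap the corresponding target word,
    # provided both words are known to the dictionary.
    src_tokens, in_tokens = source.split(), sentence.split()
    target_tokens = target.split()
    matcher = SequenceMatcher(None, src_tokens, in_tokens)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            old, new = src_tokens[i1], in_tokens[j1]
            if old in DICTIONARY and new in DICTIONARY:
                target_tokens = [DICTIONARY[new] if t == DICTIONARY[old] else t
                                 for t in target_tokens]

    # Step 3: return the modified target sentence.
    return " ".join(target_tokens)


print(translate("I have a green car"))
```

Even this toy version shows where the difficulty lies: word substitution ignores agreement, word order and multi-word changes, which is exactly why modifying the example is the hardest step.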

Parallel text corpora acquisition - an auxiliary technique. The acquired corpora can serve as translation memories for EBMT and for many other purposes (e.g. statistical translation). However, corpora acquisition is a very interesting task in itself. It consists in developing tools for automatic text extraction from documents acquired from the Internet or from any other source.
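
As a small illustration of text extraction, the sketch below pulls the visible text out of an HTML document using only Python's standard `html.parser` module. The class and function names are hypothetical, and real-world extraction (boilerplate removal, encoding detection, other file formats) needs considerably more than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that lies outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>News</h1><p>Site created.</p>"
        "<script>var x = 1;</script></body></html>")
print(extract_text(page))
```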

Sentence splitting - or "sensplitting". Another auxiliary technique which, contrary to what you might think, is not trivial. An algorithm that simply splits the text at full stops, exclamation marks and question marks is not sufficient, because the dot is also frequently used in abbreviations, enumerations and other parts of the text.
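
The sketch below contrasts a naive splitter with one that consults an abbreviation list. It is a simplified illustration: the abbreviation list is deliberately tiny and hypothetical, a boundary is assumed to be followed by whitespace and an uppercase ASCII letter, and enumerations are not handled at all.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (stored without the final dot).
ABBREVIATIONS = {"dr", "prof", "e.g", "i.e", "etc", "vol", "pp"}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


def split_sentences(text):
    """Split at '.', '!' or '?' unless the dot ends a known abbreviation."""
    sentences, start = [], 0
    # Candidate boundary: punctuation followed by whitespace and a capital letter.
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        last_word = text[start:end].split()[-1].rstrip(".!?").lower()
        if match.group() == "." and last_word in ABBREVIATIONS:
            continue  # the dot belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = ("Contact dr. Jaworski for details. Dots appear in abbreviations, "
          "e.g. Polish titles. Is that all? No!")
print(naive_split(sample))      # breaks after "dr." and "e.g."
print(split_sentences(sample))  # keeps the abbreviations inside their sentences
```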

Text aligning - yet another auxiliary technique, used among other things to prepare translation memories. Aligning is done on two texts in different languages which are each other's translations. Its goal is to pair the sentences of both texts in such a way that aligned sentences are each other's translations. This operation is necessary because in practice a translation of a text almost never contains the same number of sentences as the original. During aligning it is not uncommon to have two or more sentences in one language aligned to only one sentence in the other language. There are numerous text alignment algorithms. Given the nature of the problem, they are all quite complex.
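
To illustrate, here is a heavily simplified length-based aligner. It is loosely inspired by length-based approaches to sentence alignment but is not a faithful implementation of any published algorithm: the cost is simply the difference in character counts plus an assumed fixed penalty of 1 for merging sentences, only 1-1, 1-2 and 2-1 groupings are considered, and the sample sentences are invented.

```python
def align(src, tgt):
    """Align two lists of sentences using 1-1, 1-2 and 2-1 groupings,
    minimising the total difference in character length."""
    inf = float("inf")
    n, m = len(src), len(tgt)
    # cost[i][j] = cheapest alignment of src[:i] with tgt[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                # Assumed cost: length difference, plus 1 for any merge.
                step = abs(s_len - t_len) + (0 if (di, dj) == (1, 1) else 1)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("texts cannot be aligned with the allowed groupings")
    # Follow the back-pointers from the end to recover the groups.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]


english = ["This is Adam's house.", "It is old, but Adam likes it."]
polish = ["To jest dom Adama.", "Dom jest stary.", "Adam go jednak lubi."]
for source_group, target_group in align(english, polish):
    print(source_group, "<->", target_group)
```

In this sample the second English sentence is aligned to two Polish sentences, which is the one-to-many situation described above.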

Software engineering - apart from computational linguistics, I am quite interested in software engineering, which, put simply, is the knowledge of how to create good software. This knowledge is very useful to me from a practical point of view: I apply it whenever I design a new piece of software or develop an existing one.