Link: Language Technology in the Europe Media Monitor of the EC's Joint Research Centre.
Summary of the Activity
At the JRC, we have been using Language Technology since 1998 to fight information overload and to overcome the language barrier, in support of the European Commission and Member State institutions. To this end, commercially available and in-house tools are combined to build the IDoRA system (Intelligent Document Retrieval and Analysis). IDoRA is currently being integrated with the news gathering system Europe Media Monitor (EMM). The EMM-NewsExplorer shows some of our news analysis applications.
Our tool set consists of three main components with the following functionality:
Multilingual and cross-lingual retrieval of potentially user-relevant documents. (E.g. the OSILIA project on the automatic gathering and classification of news articles from online news sites; see also the projects IDoRA for OLAF and Breaking News)
Analysis of documents and extraction of different information aspects from these documents plus language-neutral representation of this information, where possible.
Examples of this kind of analysis are:
identifying the language a document is written in (Language recognition)
identifying the keywords for a document, both free monolingual indexing terms and controlled vocabulary cross-lingual indexing terms from the EUROVOC thesaurus
identifying named entities such as people's and organisations' names, geographical references, dates, currencies, etc.
products and product groups
similarity to other documents, including the identification of near-duplicate texts
detection of monolingual and cross-lingual document plagiarism
identification of document translations
clustering of documents
classification (categorisation) of documents
relevance-ranking of documents
subject-specific summarisation
terminology extraction from subject-specific text collections
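As an illustration of one item in the list above, the identification of near-duplicate texts, here is a minimal monolingual sketch. It is not a description of how IDoRA works: word shingling combined with Jaccard similarity is a generic, commonly used technique chosen here for illustration, and the function names, the shingle size `k` and the `threshold` value are all assumptions. Cross-lingual comparison would additionally need a language-neutral representation of the documents, which this sketch does not attempt.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word n-grams) of a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, k=3, threshold=0.8):
    """Flag two texts as near-duplicates if their shingle overlap is high.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold


if __name__ == "__main__":
    a = "The Commission adopted the proposal on Tuesday in Brussels"
    b = "The Commission adopted the proposal on Wednesday in Brussels"
    print(jaccard(shingles(a), shingles(b)))
    print(is_near_duplicate(a, a))
```

In this sketch a single changed word affects up to `k` shingles, so short texts are penalised more heavily than long ones; a smaller `k` or a lower threshold makes the check more tolerant.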
(...)
As we live and work in the multilingual and multicultural setting of the European Union, our work focuses on multilingual and cross-lingual applications. The ultimate goal is to give users cross-language access to information ‘hidden’ in large amounts of multilingual text, ideally in all official EU languages and more.
(Including, for instance, Estonian-Maltese translations.)
