Web Semantics: European Commission Language Technology

Link: Language Technology in the Europe Media Monitor of the EC's Joint Research Centre.

Summary of the Activity

At the JRC, we have been using Language Technology since 1998 to cope with information overload and to overcome the language barrier, in support of the European Commission and Member State institutions. To this end, commercially available and in-house tools are combined to build the IDoRA system (Intelligent Document Retrieval and Analysis). IDoRA is currently being integrated with the news gathering system Europe Media Monitor (EMM). The EMM-NewsExplorer shows some of our news analysis applications.

Our tool set consists of three main components with the following functionality:

Multilingual and cross-lingual retrieval of potentially user-relevant documents (e.g. the OSILIA project on the automatic gathering and classification of news articles from online news sites; see also the projects IDoRA for OLAF and Breaking News).

Analysis of documents and extraction of different information aspects from these documents plus language-neutral representation of this information, where possible.

Examples of the kinds of analysis are:

identifying the language a document is written in (Language recognition)

identifying the keywords for a document, both free monolingual indexing terms and controlled vocabulary cross-lingual indexing terms from the EUROVOC thesaurus

identifying named entities such as people's and organisations' names, geographical references, dates, currencies, etc.

products and product groups

similarity to other documents, including the identification of near-duplicate texts

detection of monolingual and cross-lingual document plagiarism

identification of document translations

clustering of documents

classification (categorisation) of documents

relevance-ranking of documents

subject-specific summarisation

terminology extraction from subject-specific text collections
(...)
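
To make one of the items above concrete, here is a minimal sketch of how near-duplicate texts could be flagged. It is an illustration only, not a description of the IDoRA or EMM implementation: it uses the common technique of comparing word shingles with the Jaccard coefficient, and the function names, the shingle size `k` and the `threshold` value are all assumptions chosen for this example.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text.

    Tokenisation is deliberately naive (lowercase + whitespace split);
    a real multilingual system would need language-aware tokenisation.
    """
    tokens = text.lower().split()
    if len(tokens) < k:
        # Texts shorter than k words yield a single shingle (or none).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a: str, doc_b: str, k: int = 3,
                      threshold: float = 0.8) -> bool:
    """Flag two texts as near-duplicates if their shingle overlap is high.

    The threshold of 0.8 is an arbitrary illustrative value, not a
    figure taken from the JRC tools.
    """
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold
```

With these assumed defaults, two copies of the same news sentence that differ only in capitalisation are flagged as near-duplicates, while two unrelated sentences are not. This sketch works within one language only; the cross-lingual similarity mentioned above would require a language-neutral representation of the documents first.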

Because we live and work in the multilingual and multicultural setting of the European Union, our work focuses on multilingual and cross-lingual applications. The ultimate goal is to give users cross-language access to information ‘hidden’ in large amounts of multilingual text, ideally in all official EU languages and more.

(Including, for instance, Estonian-Maltese translations.)

Acquis