Deep learning - the master key for document analysis? (Part 2)

Machine learning methods - the state of the art

Complex learning methods require powerful hardware: the underlying algorithms rely on massively parallel operations. Since the number of cores in today's CPUs is not yet sufficient for this, graphics cards (GPUs) that can execute several thousand floating-point operations simultaneously are increasingly used instead.

That this is indispensable can be illustrated with a small numerical example. Suppose a medium-sized company needs to analyze 100 million documents, a typical order of magnitude for companies or public authorities with several thousand employees. Even at only 1 second per document, sequential processing would take more than 3 years. This example quickly shows how essential it is to use effective methods and to draw on the latest technology.
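A quick back-of-the-envelope check of these figures (the 2000-fold parallelism below is an illustrative assumption, not a measured GPU figure):

```python
# Back-of-the-envelope calculation for the example above.
documents = 100_000_000          # documents to analyze
seconds_per_doc = 1.0            # assumed analysis time per document

sequential_seconds = documents * seconds_per_doc
sequential_years = sequential_seconds / (60 * 60 * 24 * 365)
print(f"Sequential: {sequential_years:.1f} years")      # ~3.2 years

# With hardware executing, say, 2000 analyses in parallel (illustrative):
parallel_days = sequential_seconds / 2000 / (60 * 60 * 24)
print(f"2000-way parallel: {parallel_days:.1f} days")   # ~0.6 days
```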

Neural networks - modern solutions for semantic analysis

For some years now, methods based on special neural networks, whose applications are often grouped under the term deep learning, have promised a major breakthrough.

Deep learning has enabled great advances in automatic translation, image analysis and the semantic analysis of texts. Approaches for applying neural networks to the semantic analysis of documents have existed for many years, but the field received a significant boost from 2013 onwards, in particular through publications by Google [1]. With a clever combination of methods that had long been known, Google provided a toolkit for text analysis (Word2Vec [2]) with which semantic relations between words can be determined very efficiently on modern hardware.
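As an illustration, here is a minimal sketch of training such a word model with gensim's Word2Vec implementation, one common open-source reimplementation of the method (gensim >= 4.0); the toy corpus and all parameter values are placeholders, not recommendations:

```python
# Minimal sketch: learn word vectors from tokenized sentences.
from gensim.models import Word2Vec

sentences = [
    ["document", "analysis", "with", "neural", "networks"],
    ["semantic", "analysis", "of", "documents"],
    # ... in practice: tokenized sentences from the document corpus
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Semantic relations between words fall out of the learned vectors:
print(model.wv.most_similar("analysis", topn=5))
```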

Despite all the euphoria, one should not forget that semantic analyses are diverse and no single toolkit can cover all purposes.

Document analyses can pursue quite different objectives, such as

  • Extraction of properties (metadata) for keywording and for providing filter criteria for search
  • Classification of documents into given categories
  • Finding semantic relationships between terms, topics and documents
  • Automatic creation of a company-specific thesaurus
  • Statistics on various properties of the document content
  • Automatic translation

and more.

Deep learning as a modern form of semantic analysis

Deep learning is a modern form of semantic analysis. The processing of digital documents has always included analysis (or, in modern jargon, "analytics"). Besides examining the document content and its other properties, this also covers analyzing how the documents are used (access frequencies, etc.). Texts are analyzed with a wide variety of methods, and meaningful terms (personal names, e-mail addresses, product names, order numbers, etc.) are extracted. In addition to neural networks, classic statistical and rule-based methods are also used for this.
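To make the rule-based side concrete, here is a minimal sketch of pattern-based term extraction; the e-mail pattern is simplified and the order-number format is a made-up example, not a general standard:

```python
import re

# Illustrative rule-based extractors; real systems use company-specific rules.
PATTERNS = {
    "email":        re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "order_number": re.compile(r"\bORD-\d{6}\b"),   # made-up example format
}

def extract_terms(text: str) -> dict[str, list[str]]:
    """Return all pattern matches per category."""
    return {name: rx.findall(text) for name, rx in PATTERNS.items()}

sample = "Please refer to order ORD-123456 or contact support@example.com."
print(extract_terms(sample))
# {'email': ['support@example.com'], 'order_number': ['ORD-123456']}
```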

The results of semantic analyses are stored in a knowledge graph, which forms the knowledge base for assistants and all other forms of user guidance. This makes it clear that analytics cannot be reduced to the visualization (charts) of statistical data.
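The article does not prescribe a storage format, but a minimal sketch of a knowledge graph as a set of subject-predicate-object triples might look like this (a production system would use a graph database or RDF store; all identifiers are illustrative):

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy triple store for analysis results."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))
        self.by_subject[subject].add((predicate, obj))

    def about(self, subject):
        """All known facts about a subject."""
        return self.by_subject[subject]

kg = KnowledgeGraph()
kg.add("doc:42", "hasCategory", "invoice")
kg.add("doc:42", "mentions", "ORD-123456")
print(kg.about("doc:42"))
```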

Analytical functions - adapted to application scenarios

All of these wonderful new possibilities that deep learning opens up need to be seamlessly integrated into easy-to-use applications. Different application scenarios accordingly require different semantic functions:

Discovery & Monitoring

To avoid carrying out the same activities over and over during ongoing research, it is advisable to automate recurring research via stored queries and to make the results automatically available in a dashboard, a cockpit or reports (e.g. "find everything" or "find anything new on topic XYZ"). The preceding analyses supply the relevant metadata and categories for such queries.
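A minimal sketch of such a stored query, assuming documents carry the metadata produced by the preceding analyses (all field names are illustrative):

```python
# Re-run a stored query periodically and report only documents
# that were not seen in earlier runs ("find new things on topic XYZ").
def run_stored_query(documents, query, seen_ids):
    """Yield new documents whose metadata matches the stored query."""
    for doc in documents:
        if doc["id"] in seen_ids:
            continue
        if all(doc["metadata"].get(k) == v for k, v in query.items()):
            seen_ids.add(doc["id"])
            yield doc

stored_query = {"topic": "XYZ"}
seen = set()
corpus = [
    {"id": 1, "metadata": {"topic": "XYZ"}},
    {"id": 2, "metadata": {"topic": "ABC"}},
]
new_hits = list(run_stored_query(corpus, stored_query, seen))
print([d["id"] for d in new_hits])  # [1]; a second run yields nothing new
```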

Exploration

To open up a large and largely unknown body of information, navigation based on hierarchical structures is required. Filter chains (facets) and visualizations of the structures, e.g. as hyperbolic trees, are very helpful. Such structures are difficult to derive automatically, but documents can be automatically assigned to given structures by means of classification.
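As a sketch of that classification step, here is a classic supervised pipeline with scikit-learn; the training data and category names are toy stand-ins for a company's given category structure:

```python
# Classify documents into given categories with a supervised text pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "invoice for order ORD-123456",
    "minutes of the project meeting",
    "invoice payment reminder",
    "meeting agenda for next week",
]
train_labels = ["invoice", "minutes", "invoice", "minutes"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

print(clf.predict(["second reminder for the open invoice"]))  # expected: ['invoice']
```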

Ad hoc search

The most common form of research is the spontaneous search for information that is presumed to exist. Such search processes can be guided very effectively by analysis results, with the help of assistants and recommendation mechanisms.

Methodological aspects in the acquisition of structural knowledge

The vocabulary-based computational-linguistic methods used so far are often blind to new aspects in the analyzed content.

In contrast, synonyms, related concepts and conceptual analogies can be learned automatically with the help of neural networks and used for user guidance and suggestion wizards.

The task is to reconcile automatically learned terms and relationships with known structural knowledge. Automatic processes must be grounded in company-specific aspects. There is no silver bullet for this, but there is a basic approach.

The focus is on the organizational structure, business processes, topics and, of course, the people in one's own company and at business partners. The first basic relationships are already contained in existing data structures; when the systems are integrated, they are mapped onto a cross-system information model. The organizational structure can also be derived from directory services or the like. Only a few basic concepts that cut across structures and cannot be obtained from existing data sources need to be explicitly stored in the information model, that is, maintained editorially.

Understanding terms in context

The application of the word models generated by deep learning is best illustrated with an example. If a user enters "Tor" when searching for information, this single search term does not reveal what is meant: in German, "Tor" can denote a goal in a soccer game, a garage door or gateway, or a sight such as a famous city gate. If all available information (in this example, newspaper articles) is analyzed before the search, these different meanings can be learned automatically, and the searcher can be offered the terms actually contained in the available information as suggestions for refining the search.

In the picture, terms from different contexts are shown in different colors: blue for other inflections of "Tor", red for "Tor" in the sense of door/entrance, and green for "Tor" in the context of football (a). Since there are many terms related to "Tor" in the football context, a purple cluster has formed, containing secondary terms from the football context that are only indirectly related to "Tor" itself.

If this purple context is of interest, clicking on one of the purple terms leads to an expanded word cloud (b), which offers even more terms for finding information about football. In this way, despite a search term too ambiguous for a targeted search, the searcher is automatically guided to the information available in the various contexts. This use case of deep learning can be easily integrated into a search and is very useful, because all the necessary context descriptions are learned automatically.

[Figure: secondary terms from the context of "football"]
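A sketch of how such refinement suggestions could be derived from the word model trained earlier (the `model` variable from the gensim sketch above); with a toy corpus the neighbor list will be empty or meaningless, so a real newspaper corpus is assumed:

```python
# Offer refinement terms for an ambiguous query word via nearest neighbors.
def refinement_suggestions(model, query, topn=10):
    """Nearest neighbors of the query word, offered as search refinements."""
    if query not in model.wv:
        return []
    return [word for word, _score in model.wv.most_similar(query, topn=topn)]

# Trained on newspaper articles, suggestions for "tor" would mix its
# contexts (football, door/entrance, ...), letting the user pick one:
print(refinement_suggestions(model, "tor"))
```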

Conclusion

Deep learning based on neural networks has great potential for analyzing digital content. Combined with other analysis methods, and thanks to the performance of today's hardware, increasingly economical solutions for supporting business processes can be implemented.

[1] T. Mikolov, K. Chen, G. Corrado and J. Dean: Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013
[2] https://en.wikipedia.org/wiki/Word2vec

(Published in DOK.magazin, issue 4/2016)