Duplicate detection

Field: IT
Location: Prague
Employment type: Internship
Required education: Other
Published: 7 April 2022

About the company

JetBrains is a technology-leading software development company specializing in the creation of intelligent, productivity-enhancing software. At JetBrains, code is our passion. For over 22 years we have strived to make the strongest, most effective developer tools on earth. By automating routine checks and corrections, our tools speed up production, freeing developers to grow, discover, and create. Our product catalog includes award-winning tools such as IntelliJ IDEA, CLion, and ReSharper, and a variety of companies have chosen our IntelliJ Platform as the basis for their own tooling, including Google for Android Studio. JetBrains is also the creator of the Kotlin programming language.

Job description

The project team is researching various approaches to detecting explicit and potential duplicates in natural-language text written in the IDE (starting with English).

Our IntelliJ plugin helps streamline the process of writing technical documentation for a software application inside the IDE. It supports the concept of ‘single source’, which means that a chunk of content can be written once and reused in multiple help articles or documentation outputs by including it by its ID.

The proposed inspection should be able to detect identical pieces of content and, more importantly, non-explicit duplicates (passages that say the same thing in slightly different words), so that they can be extracted to a library and reused. Such an inspection should help:

  • maintain consistency throughout sources
  • avoid making multiple updates when a UI changes
  • reduce review and editing effort
  • reduce localization costs
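
To make the "identical pieces of content" case concrete, here is a minimal Java sketch of one possible starting point: normalize each chunk and group chunks whose normalized text matches. The class name `ExactDuplicateFinder`, the normalization rules (lowercasing and whitespace collapsing), and the ID-to-text map are all assumptions made for illustration; they are not part of the plugin or of the IntelliJ Platform API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: groups chunks whose normalized text is identical. */
class ExactDuplicateFinder {

    /** Lowercases, collapses whitespace runs, and trims, so formatting differences do not hide a duplicate. */
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }

    /** Returns groups of chunk IDs (two or more per group) that share the same normalized text. */
    static List<List<String>> findGroups(Map<String, String> chunksById) {
        // Key: normalized text; value: IDs of the chunks that produce it, in insertion order.
        Map<String, List<String>> idsByText = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : chunksById.entrySet()) {
            idsByText.computeIfAbsent(normalize(entry.getValue()), k -> new ArrayList<>())
                     .add(entry.getKey());
        }
        List<List<String>> groups = new ArrayList<>();
        for (List<String> ids : idsByText.values()) {
            if (ids.size() > 1) {
                groups.add(ids);
            }
        }
        return groups;
    }
}
```

Because each chunk is visited once and looked up in a hash map, this handles explicit duplicates without pairwise comparison. It does not catch reworded passages, which are the harder part of the task.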

Comparing each chunk of text with all other chunks and suggesting duplicates based on the percentage of matches is not a task that can be run in the IDE at runtime on a large code base, because the number of comparisons grows quadratically with the number of chunks. We therefore expect you to research, try, and test different approaches, which may include ML, Elasticsearch, trigram search, the Apache Lucene engine, or any other approach you find applicable.
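
As one illustration of the trigram-search option, the following Java sketch builds an inverted index from character trigrams to chunk IDs, so that a query is scored only against chunks that share at least one trigram with it, not against every chunk. The class name `TrigramIndex`, the use of Jaccard similarity as the score, and the threshold parameter are assumptions chosen for this example; a real solution might use Lucene or Elasticsearch in their place.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: trigram inverted index for finding near-duplicate candidates. */
class TrigramIndex {
    private final Map<String, Set<String>> trigramsById = new HashMap<>();
    private final Map<String, Set<String>> idsByTrigram = new HashMap<>();

    /** Distinct character trigrams of the lowercased, whitespace-collapsed text. */
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            result.add(s.substring(i, i + 3));
        }
        return result;
    }

    /** Jaccard similarity: shared trigrams divided by all distinct trigrams; 0 if both sets are empty. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        int unionSize = a.size() + b.size() - common.size();
        return (double) common.size() / unionSize;
    }

    /** Indexes a chunk. This sketch assumes each ID is added only once. */
    void add(String id, String text) {
        Set<String> grams = trigrams(text);
        trigramsById.put(id, grams);
        for (String gram : grams) {
            idsByTrigram.computeIfAbsent(gram, k -> new HashSet<>()).add(id);
        }
    }

    /** IDs of indexed chunks whose similarity to the given text is at least the threshold. */
    List<String> similarTo(String text, double threshold) {
        Set<String> query = trigrams(text);
        // Candidate set: only chunks sharing at least one trigram with the query.
        Set<String> candidates = new HashSet<>();
        for (String gram : query) {
            candidates.addAll(idsByTrigram.getOrDefault(gram, Set.of()));
        }
        List<String> matches = new ArrayList<>();
        for (String id : candidates) {
            if (jaccard(query, trigramsById.get(id)) >= threshold) {
                matches.add(id);
            }
        }
        return matches;
    }
}
```

A possible usage: add every reusable chunk once with `add`, then call `similarTo` with the paragraph being edited and a threshold tuned on real documentation. Very common trigrams can still produce large candidate sets, so filtering them out or requiring several shared trigrams would be a natural refinement to evaluate.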

As a result of the internship project, we would expect you to deliver:

  • An inspection that can be run in the IDE, or in headless mode behind an external web interface, and that provides data on the potential duplicates
  • An intention action in the IDE that suggests extracting such duplicates to reusable chunks
  • Ideally, an inspection that analyzes duplicates in the background and suggests replacing content with an existing chunk as you type

What we are looking for

  • Java/Kotlin knowledge
  • Basic knowledge of natural language processing
  • English (pre-intermediate and above)