Peritus AI projects
A lot of my early software engineering skills came from my two years at Peritus AI. Here I give quick summaries of the projects I worked on.
-
For my first project, I built an annotation web application that handled labeling tasks for in-house NLP datasets. The frontend was written in React (a great feat for a high-school graduate) with a Python and Flask backend, all packaged in a Docker container and hosted on AWS. After deployment, we used it to label around 50 different datasets for data science work across the company. Needless to say, I was thrown into the deep end right away, and the end product actually worked and was used by people. Major success.
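To give a flavor of the backend, here is a minimal sketch of what a labeling endpoint in that kind of Flask service could look like. The routes, fields, and in-memory task store are made up for illustration, not the actual Peritus API:

```python
# Hypothetical sketch of a labeling endpoint, not the actual Peritus backend.
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stand-in for a real database of annotation tasks.
tasks = {1: {"text": "How do I provision a GPU instance?", "label": None}}

@app.route("/tasks/next", methods=["GET"])
def next_task():
    # Return the first unlabeled task, if any remain.
    for task_id, task in tasks.items():
        if task["label"] is None:
            return jsonify({"id": task_id, "text": task["text"]})
    return jsonify({"id": None}), 404

@app.route("/tasks/<int:task_id>/label", methods=["POST"])
def submit_label(task_id):
    # Store the label chosen by the annotator in the React frontend.
    tasks[task_id]["label"] = request.json["label"]
    return jsonify({"ok": True})

if __name__ == "__main__":
    app.run()
```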
-
After a summer internship, I wanted to try the shiny new field: machine learning. I started off by training BERT models (in 2020 they were the bomb) to generate text embeddings for a custom summarization algorithm. It was a difficult project for various reasons, ranging from my general lack of understanding of how machine learning works to the pain of provisioning NVIDIA GPU instances with CUDA (back when that was still quite painful).
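As a rough sketch of the embedding side of that work, here is how you can pull sentence vectors out of a pretrained BERT model with the Hugging Face transformers library. The mean-pooling over token vectors is just one common choice, not necessarily what we ended up using:

```python
# Hedged sketch: sentence embeddings from a pretrained BERT model.
# Assumes the transformers and torch packages are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    # Tokenize a batch of sentences with padding and truncation.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    # Mean-pool the token embeddings, ignoring padding tokens.
    mask = batch["attention_mask"].unsqueeze(-1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

vectors = embed(["How do I fix this CUDA driver error?",
                 "GPU instance fails to start after provisioning."])
print(vectors.shape)  # (2, 768) for bert-base-uncased
```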
-
We had an internal search engine based on SOLR, and we wanted to improve its performance. I used TF-IDF models to extract uncommon terms from the documents indexed in SOLR to serve as identifying tags, and built a term knowledge graph to resolve synonymous technical terms (e.g. technical acronyms). Overall, we saw a 10+% improvement in search performance after rolling out the change.
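The tag-extraction idea boils down to keeping each document's most distinctive terms. A minimal sketch using scikit-learn (a stand-in for whatever we actually ran in production, and with made-up example documents):

```python
# Hedged sketch: extracting distinctive terms per document with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "kafka consumer lag spikes after broker restart",
    "kubernetes pod stuck in CrashLoopBackOff after deploy",
    "postgres replication lag during vacuum",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# For each document, keep the top-scoring (most distinctive) terms as tags.
for i, doc in enumerate(docs):
    row = tfidf[i].toarray().ravel()
    top = row.argsort()[::-1][:3]
    print(doc, "->", [terms[j] for j in top])
```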
-
During the summer of 2020, I started a new project in which Peritus partnered with another startup building a custom summarization engine. My task was to build a production pipeline that would take in technical forum posts and documentation and generate summaries from them. This became a year-long project in which I built a text-vectorization pipeline that used several methods to embed text (TF-IDF, FastText, and BERT). We tried to find the most efficient and faithful vectorized representation of the text, one that would in theory help us produce better summaries. Then I built the next stage of the pipeline, which fed the vectors into a proprietary summarization engine and attached summaries to all the forum posts. The pipeline processed and summarized over 15 million documents.
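Conceptually, the vectorization stage just needs a uniform interface so the different embedding methods can be swapped behind it. A minimal sketch with a TF-IDF backend (the class names and interface are illustrative, not the actual Peritus pipeline; a FastText or BERT backend would implement the same embed() method):

```python
# Hedged sketch of a vectorization stage that can swap embedding backends.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class TfidfEmbedder:
    def fit(self, docs):
        self.vectorizer = TfidfVectorizer().fit(docs)
        return self

    def embed(self, docs):
        # Dense vectors so downstream stages see a uniform format.
        return self.vectorizer.transform(docs).toarray()

def vectorize_corpus(docs, embedder):
    # One stage of the pipeline: text in, fixed-size vectors out.
    return np.asarray(embedder.embed(docs))

docs = ["forum post about a failing build",
        "documentation page on configuring the cluster"]
vectors = vectorize_corpus(docs, TfidfEmbedder().fit(docs))
print(vectors.shape)
```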
-
In my last project at Peritus, I assembled a dataset of over 500 documents (forum posts, technical docs, etc.) and distributed annotation tasks to ~20 people in the company (including the CEO). After a few weeks of encouraging people to annotate, we had a good ground-truth dataset that we could use to evaluate the summaries. As part of the effort, I developed an evaluation framework and used the annotated dataset to score the quality of the summaries generated by my pipeline. This allowed us to perform hyper-parameter tuning and improve the summaries by 20-30%, which was a big qualitative jump.
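The core of that kind of evaluation is comparing each generated summary to its human-annotated reference and averaging the score over the dataset. The real framework was more involved, but a simple unigram-overlap F1 sketch shows the shape of the comparison (the example pair is made up):

```python
# Hedged sketch: scoring generated summaries against ground-truth annotations.
from collections import Counter

def overlap_f1(generated, reference):
    gen_tokens = Counter(generated.lower().split())
    ref_tokens = Counter(reference.lower().split())
    overlap = sum((gen_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)

# Average the score over the annotated dataset to compare pipeline settings,
# e.g. during hyper-parameter tuning.
pairs = [("restart the broker to clear consumer lag",
          "clearing consumer lag requires a broker restart")]
print(sum(overlap_f1(g, r) for g, r in pairs) / len(pairs))
```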
The experience I got at Peritus was truly one of a kind and incredibly valuable. My time there quite likely changed the course of my life for the better.