Peritus AI projects
A lot of my early software engineering skills came from my two years at Peritus AI. Here I give quick summaries of the projects I worked on.
-
For my first project, I built an annotation web application that handled labeling tasks for in-house NLP datasets. The frontend was written in React (a great feat for a high-school graduate) with a Python and Flask backend, all packaged in a Docker container and hosted on AWS. After deployment, we used it to label around 50 different datasets for data science work across the company. Needless to say, I was thrown into the deep end right away, and the end product actually worked and was used by people. Major success.
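To give a flavor of the backend, here is a minimal sketch of what a labeling endpoint in that kind of Flask service could look like. The routes, fields, and in-memory task store are made up for illustration, not the actual Peritus API:

```python
# Hypothetical sketch of a labeling endpoint, not the actual Peritus backend.
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stand-in for a real database of annotation tasks.
tasks = {1: {"text": "How do I provision a GPU instance?", "label": None}}

@app.route("/tasks/next", methods=["GET"])
def next_task():
    # Return the first unlabeled task, if any remain.
    for task_id, task in tasks.items():
        if task["label"] is None:
            return jsonify({"id": task_id, "text": task["text"]})
    return jsonify({"id": None}), 404

@app.route("/tasks/<int:task_id>/label", methods=["POST"])
def submit_label(task_id):
    # Store the label chosen by the annotator in the React frontend.
    tasks[task_id]["label"] = request.json["label"]
    return jsonify({"ok": True})

if __name__ == "__main__":
    app.run()
```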
-
After a summer internship, I wanted to try the shiny new field: machine learning. I started off by training BERT models (in 2020 they were the bomb) to generate text embeddings for a custom summarization algorithm. It was a difficult project for various reasons, ranging from my general lack of understanding of how machine learning works to the pain of provisioning NVIDIA GPU instances with CUDA (back when that was still quite painful).
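As a rough sketch of the embedding side of that work, here is how you can pull sentence vectors out of a pretrained BERT model with the Hugging Face transformers library. The mean-pooling over token vectors is just one common choice, not necessarily what we ended up using:

```python
# Hedged sketch: sentence embeddings from a pretrained BERT model.
# Assumes the transformers and torch packages are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    # Tokenize a batch of sentences with padding and truncation.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    # Mean-pool the token embeddings, ignoring padding tokens.
    mask = batch["attention_mask"].unsqueeze(-1)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return summed / mask.sum(dim=1)

vectors = embed(["How do I fix this CUDA driver error?",
                 "GPU instance fails to start after provisioning."])
print(vectors.shape)  # (2, 768) for bert-base-uncased
```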
-
We had an internal search engine based on SOLR, and we wanted to improve its performance. I used TF-IDF models to extract uncommon terms from the documents indexed in SOLR to serve as identifying tags, and built a term knowledge graph to resolve synonymous technical terms (e.g. technical acronyms). Overall, we saw a 10+% improvement in search performance after rolling out the change.
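The tag-extraction idea boils down to keeping each document's most distinctive terms. A minimal sketch using scikit-learn (a stand-in for whatever we actually ran in production, and with made-up example documents):

```python
# Hedged sketch: extracting distinctive terms per document with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "kafka consumer lag spikes after broker restart",
    "kubernetes pod stuck in CrashLoopBackOff after deploy",
    "postgres replication lag during vacuum",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# For each document, keep the top-scoring (most distinctive) terms as tags.
for i, doc in enumerate(docs):
    row = tfidf[i].toarray().ravel()
    top = row.argsort()[::-1][:3]
    print(doc, "->", [terms[j] for j in top])
```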
-
During the summer of 2020, I started a new project in which Peritus partnered with another startup building a custom summarization engine. My task was to build a production pipeline that would take in technical forum posts and documentation and generate summaries from them. This became a year-long project in which I built a text-vectorization pipeline that used several methods to embed text (TF-IDF, FastText, and BERT). We tried to find the most efficient and faithful vectorized representation of the text, one that would in theory help us produce better summaries. Then I built the next stage of the pipeline, which fed the vectors into a proprietary summarization engine and attached summaries to all the forum posts. The pipeline processed and summarized over 15 million documents.
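Conceptually, the vectorization stage just needs a uniform interface so the different embedding methods can be swapped behind it. A minimal sketch with a TF-IDF backend (the class names and interface are illustrative, not the actual Peritus pipeline; a FastText or BERT backend would implement the same embed() method):

```python
# Hedged sketch of a vectorization stage that can swap embedding backends.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class TfidfEmbedder:
    def fit(self, docs):
        self.vectorizer = TfidfVectorizer().fit(docs)
        return self

    def embed(self, docs):
        # Dense vectors so downstream stages see a uniform format.
        return self.vectorizer.transform(docs).toarray()

def vectorize_corpus(docs, embedder):
    # One stage of the pipeline: text in, fixed-size vectors out.
    return np.asarray(embedder.embed(docs))

docs = ["forum post about a failing build",
        "documentation page on configuring the cluster"]
vectors = vectorize_corpus(docs, TfidfEmbedder().fit(docs))
print(vectors.shape)
```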
-
In my last project at Peritus, I assembled a dataset of over 500 documents (forum posts, technical docs, etc.) and distributed annotation tasks to ~20 people in the company (including the CEO). After a few weeks of encouraging people to annotate, we had a good ground-truth dataset that we could use to evaluate the summaries. As part of the effort, I developed an evaluation framework and used the annotated dataset to score the quality of the summaries generated by my pipeline. This allowed us to perform hyper-parameter tuning and improve the summaries by 20-30%, which was a big qualitative jump.
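The core of that kind of evaluation is comparing each generated summary to its human-annotated reference and averaging the score over the dataset. The real framework was more involved, but a simple unigram-overlap F1 sketch shows the shape of the comparison (the example pair is made up):

```python
# Hedged sketch: scoring generated summaries against ground-truth annotations.
from collections import Counter

def overlap_f1(generated, reference):
    gen_tokens = Counter(generated.lower().split())
    ref_tokens = Counter(reference.lower().split())
    overlap = sum((gen_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)

# Average the score over the annotated dataset to compare pipeline settings,
# e.g. during hyper-parameter tuning.
pairs = [("restart the broker to clear consumer lag",
          "clearing consumer lag requires a broker restart")]
print(sum(overlap_f1(g, r) for g, r in pairs) / len(pairs))
```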
The experience I got at Peritus was truly one of a kind and incredibly valuable. My time there quite likely changed the course of my life for the better.