I have been developing and promoting websites for a long time. To understand how search engines rank sites, I wrote all kinds of scripts to parse information on the web. Being a researcher by nature and a physicist by education, I could not pass up the opportunity to experiment, even if the subject was text rather than nature. Thanks to older projects, I could afford to spend all my time on this research. Initially, my goal was not to create a search engine. I was interested in questions such as: how to determine the language of a text, how words are statistically distributed in different languages of the world, and which word combinations are most common in large volumes of text. These were not very difficult problems, and others had solved them long before, but I liked the process itself, and it gave me, and still gives me, pleasure.
Here is an example of one of my studies, which looked for a limit on the number of possible phrases. The graph below (for English) shows the number of scanned pages on the X-axis and the number of phrases of 1, 2, 3, 4 and 5 words on the Y-axis.
I assumed that at some point the curves would level off and approach some limiting values. For English, about 18 million phrases were found, while the curves for phrases of 3, 4 and 5 words showed no sign of slowing their growth. The server ran out of RAM (32 GB at the time), and the experiment had to be stopped.
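The counting code itself is not shown in this post. As a rough illustration only, an experiment of this kind could be sketched in C as follows. Everything named here (PhraseSet, set_add, count_phrases, TABLE_SIZE) is hypothetical. The sketch also makes two simplifying assumptions: it stores only a 64-bit hash of each phrase, so a rare hash collision would undercount, and it uses a small fixed-size table that simply stops accepting new phrases once it is half full.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_N 5                 /* longest phrase length, as in the experiment */
#define TABLE_SIZE (1u << 12)   /* hypothetical capacity, must be a power of two */
#define FNV_OFFSET 0xcbf29ce484222325ULL
#define FNV_PRIME  0x100000001b3ULL

/* Open-addressing set of 64-bit phrase hashes; 0 marks an empty slot. */
typedef struct {
    uint64_t slots[TABLE_SIZE];
    size_t count;               /* number of distinct hashes stored */
} PhraseSet;

/* Insert h. Returns 1 if it was new, 0 if present or the table is half full. */
static int set_add(PhraseSet *set, uint64_t h) {
    if (h == 0) h = 1;          /* keep 0 reserved for "empty" */
    size_t i = (size_t)h & (TABLE_SIZE - 1);
    while (set->slots[i] != 0) {
        if (set->slots[i] == h) return 0;
        i = (i + 1) & (TABLE_SIZE - 1);   /* linear probing */
    }
    if (set->count >= TABLE_SIZE / 2) return 0;
    set->slots[i] = h;
    set->count++;
    return 1;
}

/* Fold one word hash into a running phrase hash (FNV-1a over its 8 bytes). */
static uint64_t mix(uint64_t h, uint64_t word_hash) {
    for (int i = 0; i < 8; i++) {
        h ^= (word_hash >> (8 * i)) & 0xff;
        h *= FNV_PRIME;
    }
    return h;
}

/* Add every phrase of 1..MAX_N consecutive words in text to sets[n - 1]. */
void count_phrases(const char *text, PhraseSet sets[MAX_N]) {
    assert(text != NULL && sets != NULL);
    uint64_t window[MAX_N];     /* hashes of the most recent words, oldest first */
    size_t filled = 0;
    const char *p = text;
    while (*p) {
        while (*p && !isalnum((unsigned char)*p)) p++;   /* skip separators */
        if (!*p) break;
        uint64_t wh = FNV_OFFSET;                        /* hash one lowercased word */
        while (*p && isalnum((unsigned char)*p)) {
            wh ^= (uint64_t)tolower((unsigned char)*p);
            wh *= FNV_PRIME;
            p++;
        }
        if (filled == MAX_N) {                           /* drop the oldest word */
            memmove(window, window + 1, (MAX_N - 1) * sizeof window[0]);
            filled--;
        }
        window[filled++] = wh;
        /* Record each phrase of length n that ends at the current word. */
        for (size_t n = 1; n <= filled; n++) {
            uint64_t h = FNV_OFFSET;
            for (size_t k = filled - n; k < filled; k++)
                h = mix(h, window[k]);
            set_add(&sets[n - 1], h);
        }
    }
}
```

To use it, zero-initialize an array of MAX_N PhraseSet structures, call count_phrases once per page, and read sets[n - 1].count after each page to get one point on the curve for n-word phrases. A real run over millions of pages would need growable tables, which is exactly where memory becomes the limit.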
By then I already had well-working scripts for detecting the language of a text and for stripping tags from HTML pages. They were all written in C, which made them incredibly fast compared to PHP. I also noticed that it was more convenient and faster for me to work with data stored in files. As a result, I completely abandoned databases. This, of course, forced me to develop my own algorithms for searching through arrays and to implement the other technical details of storing information myself. In return, I got complete freedom and a wide field for experiments with text.
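The post does not describe the file formats or search algorithms that replaced the database. Purely as an assumed example of the general idea, here is a binary search over a file of fixed-size records sorted by key. The Record layout, KEY_LEN, and find_record are all hypothetical names chosen for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define KEY_LEN 32   /* hypothetical fixed key width, zero-padded */

/* One fixed-size record; the file holds these back to back, sorted by key. */
typedef struct {
    char key[KEY_LEN];
    unsigned int count;
} Record;

/* Binary search in a file opened in binary mode.
   Returns 1 and fills *out if key is found, 0 otherwise (or on I/O error). */
int find_record(FILE *f, const char *key, Record *out) {
    assert(f != NULL && key != NULL && out != NULL);
    if (fseek(f, 0, SEEK_END) != 0) return 0;
    long size = ftell(f);
    if (size < 0) return 0;
    long lo = 0;
    long hi = size / (long)sizeof(Record);   /* number of records */
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        Record r;
        if (fseek(f, mid * (long)sizeof(Record), SEEK_SET) != 0) return 0;
        if (fread(&r, sizeof r, 1, f) != 1) return 0;
        int c = strncmp(key, r.key, KEY_LEN);
        if (c == 0) { *out = r; return 1; }
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    return 0;
}
```

Because every record has the same size, the position of record i is simply i * sizeof(Record), so a lookup needs only about log2(N) seeks and no index structure beyond the sort order itself.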
Cleaning HTML pages is a challenge of its own. Many websites contain a large number of errors such as unclosed tags and unclosed quotation marks. With standard methods, HTML code starts leaking into the cleaned text.
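To make the problem concrete, here is a minimal sketch of a tag stripper that tolerates the two errors mentioned above. It is not the code used in the project; strip_tags is a hypothetical name, and the heuristics are assumptions: a quote inside a tag counts only if its closing quote appears before the next '<', and a '<' inside a tag means the previous tag was never closed. It does not handle scripts, styles, comments, or character entities.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy in to out without HTML tags. out must hold at least strlen(in) + 1
   bytes. Returns the length of the cleaned text. */
size_t strip_tags(const char *in, char *out) {
    assert(in != NULL && out != NULL);
    size_t n = 0;
    int in_tag = 0;
    for (const char *p = in; *p; p++) {
        if (!in_tag) {
            /* Treat '<' as a tag start only if a tag name, '/', '!' or '?'
               follows, so that text like "1 < 2" survives. */
            int starts_tag = *p == '<' && p[1] != '\0' &&
                (isalpha((unsigned char)p[1]) || strchr("/!?", p[1]) != NULL);
            if (starts_tag)
                in_tag = 1;
            else
                out[n++] = *p;
        } else if (*p == '>') {
            in_tag = 0;
        } else if (*p == '"' || *p == '\'') {
            /* Look ahead for the matching quote, giving up at the next '<'. */
            const char *q = p + 1;
            while (*q && *q != *p && *q != '<') q++;
            if (*q == *p)
                p = q;   /* properly quoted value: skip it, even if it has '>' */
            /* Otherwise the quote is unclosed and is ignored as a stray. */
        }
        /* A '<' while in_tag means the previous tag was never closed;
           we simply stay in tag mode for the new one. */
    }
    out[n] = '\0';
    return n;
}
```

With these rules, an input such as `<a href="x.html>link</a> text` yields `link text` instead of swallowing the rest of the page while waiting for a closing quote.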
It seemed to me that only a small step remained before I had a search engine, but how wrong I was. More than 3 years of hard work and over 20,000 lines of code went into the Kavunka search engine.