The Minerazzi project is a platform that lets you build topic-specific search engines without programming knowledge. The brainchild of Dr. Edel Garcia, the project aims to allow anyone to build small, on-topic search indexes. His hope is that anyone, regardless of technical background, can get involved in data mining and learn through discovery by building these search indexes.
The Minerazzi project was initially intended as an indexing project. When first conceived by Dr. Garcia, it was hosted at the Microsoft Innovation Center (MIC) of Inter American University of Puerto Rico. However, the project was diluted and changed numerous times. A few weeks after initially presenting the project concept at SES New York 2012, Dr. Garcia moved the project out of the MIC and redesigned it as a self-service search platform.
A little over a year later, the Minerazzi project is in beta testing. With the help of local librarians and developers, Dr. Garcia.
Once an index is built, users can start mining email addresses, phone numbers and other keywords straight from search result pages. Minerazzi also allows you to identify sets of keywords with common features such as number of occurrences, byte size, etc.
For businesses, Minerazzi allows an organization to build a small, searchable index relevant to any specific set of data. Indexes of products and services, market information or even competitors can be built quickly for employees to search and mine. Such a unique, topic-specific index can also be ideal for researchers to store, share and search information.
When released to the public, the service will require users to sign up and open an account. Once that account is open, you can start crawling.
Using it is relatively simple. Pick your vertical (news, sports, etc.) or use something more meaningful, like the local music scene or internal departmental resources, and Minerazzi helps you search and index documents on that topic. Minerazzi then crawls the Web in search of your documents. When it finds matches, it adds them to your index. That data can then be searched by friends, clients, co-workers or anyone else with whom you share access.
Minerazzi uses 11 different interactive search modes to help control the data that is crawled. Some modes are familiar, such as AND, which requires all terms in your search, and OR, which looks for documents that match any term specified. There are other search modes like NOT AND, NOR, EXCLUSIVE OR and even PROXIMITY, which lets you specify a number and two terms, and matches documents in which the two terms appear, in any order, separated by no more than the number you chose.
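To make these modes concrete, here is a minimal sketch of how such Boolean and proximity modes can be expressed with set operations over a tiny in-memory collection. It is not Minerazzi's implementation: the sample documents, the function names (`matching`, `search`, `proximity`) and the mode labels are hypothetical, and the proximity rule assumes the number counts the words allowed between the two terms.

```python
import re

# Hypothetical toy collection; ids and text are invented for illustration.
DOCS = {
    1: "live jazz music in san juan tonight",
    2: "jazz festival tickets on sale",
    3: "local music scene news",
    4: "sports news and scores",
}


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matching(term):
    """Return the ids of documents that contain the term."""
    return {doc_id for doc_id, text in DOCS.items() if term in tokens(text)}


def search(mode, a, b):
    """Combine the match sets of two terms according to a Boolean mode."""
    everything = set(DOCS)
    set_a, set_b = matching(a), matching(b)
    if mode == "AND":    # documents containing both terms
        return set_a & set_b
    if mode == "OR":     # documents containing at least one term
        return set_a | set_b
    if mode == "NAND":   # NOT AND: documents that do not contain both terms
        return everything - (set_a & set_b)
    if mode == "NOR":    # documents containing neither term
        return everything - (set_a | set_b)
    if mode == "XOR":    # EXCLUSIVE OR: documents containing exactly one term
        return set_a ^ set_b
    raise ValueError(f"unknown mode: {mode}")


def proximity(a, b, max_gap):
    """Documents where a and b occur in either order with at most
    max_gap words between them (assumed reading of PROXIMITY)."""
    hits = set()
    for doc_id, text in DOCS.items():
        words = tokens(text)
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        if any(abs(i - j) - 1 <= max_gap
               for i in pos_a for j in pos_b if i != j):
            hits.add(doc_id)
    return hits


print(search("AND", "jazz", "music"))
print(search("XOR", "jazz", "music"))
print(proximity("jazz", "tickets", 1))
```

Each Boolean mode reduces to one set operation on the per-term match sets, which is why a single index can offer many modes cheaply. Proximity is the only mode here that needs word positions rather than just document membership.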
The science behind these modes is sound. Two metrics, the ratio of AND to OR search results and the ratio of EXACT to AND results, provide some important signals. In addition to helping with mining content from your index, these ratios also provide important clues about the nature of a search engine index and its content.
“In general, we can compute other types of search mode results ratios to extract very useful information,” Garcia said. “With some of these ratios we can estimate the organic/inorganic incompatibility of keywords in a collection.”
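As a rough illustration of the two ratios mentioned above, the sketch below computes AND/OR and EXACT/AND from result counts. The function name `mode_ratios` and the sample counts are hypothetical, and the code assumes EXACT means an exact-phrase match. Under the usual semantics, every AND match is also an OR match and every exact-phrase match is also an AND match, so both ratios fall between 0 and 1.

```python
def mode_ratios(and_count, or_count, exact_count):
    """Return (AND/OR, EXACT/AND) for one query, or None where the
    denominator is zero and the ratio is undefined."""
    and_or = and_count / or_count if or_count else None
    exact_and = exact_count / and_count if and_count else None
    return and_or, exact_and


# Hypothetical counts for a two-term query against some index.
and_or, exact_and = mode_ratios(and_count=50, or_count=200, exact_count=10)
print(and_or)     # share of OR matches that contain both terms
print(exact_and)  # share of AND matches that contain the exact phrase
```

A value near 1 for AND/OR suggests the terms nearly always appear together in the collection. A value near 0 suggests they rarely co-occur. How Minerazzi itself interprets these ratios is not described here.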
Garcia emphasizes that Minerazzi places users at the center of the search experience. Instead of limiting users to staring down a list of links and clicking, Minerazzi lets them interact with the returned data.
“In my book, that is a technology waste. It is like sending your eyes to ‘window shopping’ across an oversized digital mall. Boring!” Garcia told Search Engine Watch. “With Minerazzi, users interact at query time with search result pages, extracting information that matter to them, and doing something with that information.”
Minerazzi is still in beta testing, with no official public launch date at this time. Garcia and his team are hoping to have it available within the coming weeks.