Tuesday, December 14, 2004

Google

I have nothing to add to all the praise Google gets. I don't think it's worth our valuable time to post here praising Google, simply because you all already know how good it is. So, instead, I'll try to explain why you do not want to use Google for your company, and why you would fare much better with a different approach.

It's no secret that the Internet and the Intranet are different worlds. Indexing the Internet, apart from a capacity problem, is easy. Don't run away yet. What Google cracked is how to index the Internet rapidly and reliably with some cost-effectiveness (plus all those fantastic applications), while keeping it simple for us.

But that does not mean it will work out for you inside your company.

If you're not familiar with Google's page ranking algorithm, please read this first (or this one for the patented Pigeon Ranking TM).

The reasons why you will not be "breaking through" by implementing a full-text search engine (even if it is Google) in your company can be broken down as follows:

Page Ranking

Google's page ranking is built for a distributed network made of thousands of independent sites that link to each other. And it's exactly the way Google's page rank works that will make it fail in your company. In a typical company search, you need to find the documents and knowledge that are hidden from you. Most of the time you know quite a lot about what type of document(s) you're searching for, you know who could have created it, you know to which project it applies, and you don't want to have press releases in your results.

Intranets do not have links to documents, except for those you already read because, well, there's a link to it on the Intranet... Unless you're working for IBM, where there will be about 3000 unofficial Intranet web pages running on old desktops hidden under tables and DEMO AIX servers. ;-)
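To make the point concrete, here is a minimal sketch of a simplified, textbook-style PageRank (not Google's actual implementation, which is not public). The tiny "intranet" below, its page names and the damping value are all assumptions made up for illustration. It shows that documents nobody links to all end up with the same lowest score, while the press release linked from the home page ranks above them.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1 - damping) / n + damping * dangling / n for page in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical intranet: the home page links to two pages,
# while three documents on a file share have no links at all.
intranet = {
    "home": ["news", "press-release"],
    "news": ["home"],
    "press-release": ["home"],
    "project-plan.doc": [],
    "budget.xls": [],
    "meeting-notes.doc": [],
}

scores = pagerank(intranet)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page:20s} {score:.3f}")
```

The three unlinked documents are indistinguishable to a link-based ranking, which is exactly the situation for most of what sits on a company file server.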

Security Constraints

Implementing a search engine within an organisation is not simple (trust me, this is the voice of experience), and very seldom will the problems lie within the search application. Most of the time the problem is deciding how much knowledge you want to give to your users. Knowledge-based companies, such as a pharma company where we ran a pilot for search technology, just want to share everything.

That's nice, you might say.

Well, think again. When they want to share everything, including their own mailboxes, things start to get shady and politics come into play.

Others want to index all their systems, including their ultra-confidential HR system. Then security policies come into play, because you must make sure to respect the access controls of the "hosting" application.

This basically means that you have to:
  1. Authenticate your user before he searches
  2. Use a proxy to search the index(es) and/or
  3. Hide results that he should not have access to
  4. Show the result list
Easier said than done.
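The four steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: the in-memory index, the token table, the group names and every function name are hypothetical, and a real deployment would ask each "hosting" application for its own access rules instead of keeping a copy of them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    title: str
    source: str
    allowed_groups: frozenset  # groups allowed to see it in the hosting application


@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)


# Hypothetical directory, session tokens and index, for illustration only.
USERS = {"alice": User("alice", {"staff"}), "bob": User("bob", {"staff", "hr"})}
TOKENS = {"token-alice": "alice", "token-bob": "bob"}
INDEX = [
    Document("Salary review 2004", "hr-system", frozenset({"hr"})),
    Document("Salary survey summary", "intranet", frozenset({"staff"})),
    Document("Project kickoff minutes", "notes-db", frozenset({"staff"})),
]


def authenticate(token):
    """Step 1: turn a session token into a known user, or refuse."""
    name = TOKENS.get(token)
    if name is None:
        raise PermissionError("unknown or expired token")
    return USERS[name]


def search_index(query):
    """Step 2: the proxy queries the index; it sees everything."""
    needle = query.lower()
    return [doc for doc in INDEX if needle in doc.title.lower()]


def secure_search(token, query):
    user = authenticate(token)
    hits = search_index(query)
    # Step 3: hide results the user may not see in the hosting application.
    visible = [doc for doc in hits if doc.allowed_groups & user.groups]
    # Step 4: show the result list.
    return [doc.title for doc in visible]


print(secure_search("token-alice", "salary"))
print(secure_search("token-bob", "salary"))
```

Note that the filtering happens after the search, so the index itself still contains the confidential titles; that is one of the reasons the politics start.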

Format constraints

You know your network better than I do, I must assume. So, you'll know that knowledge is not only in Word, Excel and PowerPoint. It's in Exchange public folders, Notes databases, Oracle databases, PeopleSoft, text files, home-grown applications, MS-Access, web sites (internal & external), PDF, ZIP files, XML files, SAP, Siebel, Domino.doc, Documentum, Tridion, whatever-that-freeware-was-called-again, etc, etc, etc...

Try tapping into that with Google. Well, it will work for most of them.
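One common way to cope with that zoo of sources is to give every system its own small connector behind a shared interface. The sketch below is an assumption of how that could look, not how any particular product does it: the registry, the decorator and the two fake connectors are hypothetical and return canned text instead of talking to a real Notes or Oracle server.

```python
CONNECTORS = {}  # maps a source type to the function that reads it


def connector(source_type):
    """Register a reader function for one type of data source."""
    def register(func):
        CONNECTORS[source_type] = func
        return func
    return register


@connector("notes")
def crawl_notes(location):
    # Stand-in for a real Notes database reader.
    yield (f"notes://{location}/1", "Minutes of the project kickoff")


@connector("oracle")
def crawl_oracle(location):
    # Stand-in for a real SQL query against an Oracle schema.
    yield (f"oracle://{location}/row/42", "Customer complaint about late delivery")


def crawl(sources):
    """Yield (document id, text) pairs from every configured source."""
    for source_type, location in sources:
        reader = CONNECTORS.get(source_type)
        if reader is None:
            raise ValueError(f"no connector registered for {source_type!r}")
        yield from reader(location)


for doc_id, text in crawl([("notes", "teamdb"), ("oracle", "crm")]):
    print(doc_id, "->", text)
```

Adding SAP or Documentum then means writing one more function, without touching the indexing code.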

Language constraints

And then, the major blocking issue: Language. No self-respecting European organization will have documents in one language only. No way, that would be too simple. And you, as a self-respecting European, will also not be able to read one language only. Yet, you must search in one language at a time if you're using a simple text search engine (even if it is Google).
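Here is a toy illustration of the difference, assuming a hand-made translation table and four made-up documents (real products use proper linguistic resources, not a four-line dictionary). A plain text search for "contract" only finds the English document; expanding the query with translations finds the French and German ones too.

```python
# Hypothetical translation table and documents, for illustration only.
TRANSLATIONS = {
    "contract": {"fr": "contrat", "de": "vertrag", "pt": "contrato"},
    "invoice": {"fr": "facture", "de": "rechnung", "pt": "fatura"},
}

DOCUMENTS = {
    "doc-en": "Signed contract for the new supplier",
    "doc-fr": "Contrat signé avec le nouveau fournisseur",
    "doc-de": "Vertrag mit dem neuen Lieferanten",
    "doc-pt": "Fatura do mês de dezembro",
}


def expand(term):
    """Return the term plus its known translations."""
    term = term.lower()
    return {term} | set(TRANSLATIONS.get(term, {}).values())


def search(term, multilingual=True):
    """Return the ids of documents containing the term (or a translation of it)."""
    terms = expand(term) if multilingual else {term.lower()}
    return sorted(
        doc_id
        for doc_id, text in DOCUMENTS.items()
        if terms & set(text.lower().split())
    )


print(search("contract", multilingual=False))
print(search("contract"))
```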


So, in other words, if you're looking for a search engine solution for your Intranet, don't settle for Google. But don't rush to Retrievalware either; it's too bloated and expensive for what you probably need.

If you need a search engine that can:
  • Support multiple languages
  • Respect distributed authentication
  • Link to multiple data sources using an open plugin or plugin-like architecture
  • Do hierarchical and non-hierarchical search (search within specific domains or in general)
  • Categorize and intelligently group documents, subjects and authors
  • Be implemented without costing too much (eh eh)
Well, your only solution is to look around and maybe settle for a smaller, but stable, player. Retrievalware might be what you need: it meets every requirement in the list above except the last one. But there are other players in the market.

So, I went and searched for them. Here are the ones I settled on; they show good promise:

Search/Categorization
Sinequa
Lingway

Personally, I preferred the attitude of the Sinequa people; they seem to have a more "can-do" approach to projects, which is always nice. Technically, both systems seem to be equally effective.

Data Mining
Intellixir

Intellixir is very impressive in the data mining field. Definitely worth taking a look at.

And that's it for today. Still couldn't finish my report though... ;-)

Nuno
