Promotion of Work Experience and Research: IITK Search Engine

Monday, August 17, 2009

IITK Search Engine

Primary Proposal for IITK Search Engine to be developed by Students

Introduction: Search has become a powerful tool for finding accurate information quickly. Searching is done through a tool called a search engine, which crawls websites and retrieves the relevant information.

Motivation: The IIT Kanpur website currently hosts a basic search engine powered by Google (a US-based private firm). This search tool is inaccurate and imprecise for IITK content, so there is an urgent need to develop a search engine that provides IITK-specific results. Search is also a very interesting field that will attract students, faculty and researchers to come together and build expertise in this domain. Hence we aim to start an IITK Search Engine Project.

Expected Attributes of Search Engine:

User Interface: The proposed Search Engine will have a user interface like this:

Input

· Text field for searching
Features: advanced Boolean search
Specific domain search
Restrict a particular domain

· Options:
Specific domains to be included
Which keywords to use
How many results to display
Which results to exclude later
Search query syntax similar to Google's [site:..., *, +, "..."]
Directory search and classification of search results
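
The Google-like query input mentioned above could be handled by a small parser. What follows is only a minimal sketch, not a decided design: the function name parse_query and the returned dictionary are hypothetical, and treating a leading "-" as "exclude this term" is an assumption added here (the proposal only lists site:, *, + and quotes).

```python
import re

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]+)"|(\S+)')

def parse_query(query):
    """Split a raw query into site filters, phrases, required, excluded and plain terms."""
    parsed = {"sites": [], "phrases": [], "required": [], "excluded": [], "terms": []}
    for phrase, word in TOKEN_RE.findall(query):
        if phrase:
            parsed["phrases"].append(phrase.lower())
        elif word.lower().startswith("site:") and len(word) > 5:
            parsed["sites"].append(word[5:].lower())
        elif word.startswith("+") and len(word) > 1:
            parsed["required"].append(word[1:].lower())
        elif word.startswith("-") and len(word) > 1:
            # Assumption: "-" marks a term to exclude, as in common web search syntax.
            parsed["excluded"].append(word[1:].lower())
        else:
            parsed["terms"].append(word.lower())
    return parsed

example = parse_query('site:cse.iitk.ac.in "search engine" +crawler -spam ranking')
```

The parsed parts can then be passed to the back end separately, for instance the site filter to restrict the domain and the phrases to an exact-match step.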

Display

· Relevance ranked results

Results from other search engines

· Options:
Grouping of results
Similar searches and suggestions
Show partial information when the user hovers the mouse over a query
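
The proposal does not fix a ranking method. As one possible starting point, relevance ranking could use TF-IDF scoring; the sketch below assumes that choice, and the function name rank and the sample pages are made up for illustration.

```python
import math
from collections import Counter

def rank(query, docs):
    """Return the ids of matching documents, sorted by a simple TF-IDF score (best first)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sample pages, only to show the call.
pages = {
    "a": "hostel mess menu",
    "b": "library timings and library rules",
    "c": "hostel rules",
}
ranked = rank("hostel rules", pages)
```

A real engine would combine such a text score with other signals, for example link analysis or user feedback, both of which are listed among the back-end features.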

Back End:

· Technical details (features):

- Stay updated with new developments

- Keep track of changing server addresses

- Hypertext indexing and web mining

- Rank the results

- Display similar searches

- Give suggestions to the user

- Search based on specific domain

- Search based on format of output desired

- Image tagging and document indexing

- User feedback to improve search results

- Which data structures to use for indexing and searching
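
On the question of data structures, one common choice is an inverted index, which maps each term to the set of documents containing it. The sketch below assumes that structure and shows how it supports simple Boolean AND / OR search; the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_and(index, query):
    """Return the documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def search_or(index, query):
    """Return the documents that contain at least one query term."""
    result = set()
    for term in query.lower().split():
        result |= index.get(term, set())
    return result

# Hypothetical sample documents, only to show the calls.
sample = {1: "IITK hostel rules", 2: "IITK library rules", 3: "library timings"}
idx = build_index(sample)
both = search_and(idx, "iitk rules")
either = search_or(idx, "hostel timings")
```

Set intersection and union keep the Boolean operations short; a production index would more likely store sorted posting lists on disk.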

Fields that need to be covered:

· Basic logical components: crawling, indexing, searching, evaluation (monitors and measures the efficiency and effectiveness of search results)

· Basic goals of the architecture: effectiveness (quality of results) and efficiency (response time and throughput)

· Features:
Recall and precision improvement
How to decide upon the relevance of the searched documents
Query improvement techniques such as query suggestion, query expansion and relevance feedback, spell checking
Efficient searching and indexing
Coverage and freshness
Growing with data and users
Tuning for applications
Avoiding spam, for example multiple indexing of the same content generated in many differently named dynamic pages
Updating the indexes while processing queries (also a design issue)
Text acquisition: conversion of a variety of documents into a consistent format, plus the metadata; conversion of the encoding
Distributed processing for efficiency
Document crawlers for enterprise and desktop search: follow links and scan directories (we might try to build this first)
Text transformation, such as stopping, stemming and link analysis
Information extraction, classifiers, document statistics, inversion
Performance analysis
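
Precision and recall, mentioned in the list above, can be measured for a single query once someone has judged which documents are relevant. A minimal sketch, assuming such a set of judged documents is available; the function name and the sample ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 results returned, 3 documents judged relevant, 2 in common.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
```

Averaging these two numbers over a fixed set of test queries gives a simple way to compare test versions of the engine.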

Flow of Work:

Floating of Project: 17 Aug

Last date for submission of search engine ideas (if you have no ideas right now but are interested, please respond anyway): 7 Sep

Selection of team, idea and students: 8 Sep

Project Commencement: 9 Sep

Release of different test versions: Depending upon the project proposal

Certain Queries:

You are free to choose the language and database you will use, but you have to justify your choice.

As mentioned earlier, the search will be domain specific, restricted to the internal and external sites of IIT Kanpur.

What you have to submit:

You have to come up with new ideas that you can add to this proposal, and you have to suggest how you will approach the problem. All of this must be submitted in writing (printed).

Incentives:

IIT Kanpur site will be hosting a search engine developed by you.

You will be certified for this work.

You may get some financial benefit in terms of stipend.

End Notes:

Developing a search engine is an evolutionary process. Search engines are creating new ways of using the internet, and they are themselves changing greatly. So we hope that IITK will also make its presence felt in this domain in the coming years.

We expect dedicated students to come forward for this challenge and make our presence felt in this field.

We also suggest that you think about the whole search paradigm from a different perspective, one that may not exist yet. Who knows, it may be the next big hit :)

Interested students, please respond to power@iitk.ac.in.

The team size for the project will be 3 students. The project is supported by the Office of the Dean, Research and Development, so participating students may get financial incentives and recognition :)

best wishes.

Team PoWER

7 comments:

orangerinds, August 31, 2009 at 12:26 AM
Pretty well planned.
However a few obstacles.
1. Only the needs are laid out. A lot more planning will be needed on the methodology for the coding process.
2. Natural language processing is very difficult to implement, especially when you want something comparable to Google. They have more than 100 parameters going into every result.
3. The team size is too small for the entire target. Since the team size will have to be increased, development methods that have not yet been publicly employed and tested within IIT will have to be used, e.g. version control with Subversion, agile methods, MVC patterns etc.
orangerinds, August 31, 2009 at 12:33 AM
Another suggestion, which could be much easier and more effective.

The layout and organisation of data on the entire IIT server network is highly inefficient. There is probably no proper semantic metadata attached to any file, page, user, student, employee, document etc. Adding this alone will make the entire architecture friendly to Google search, which is undoubtedly one of the best search engines.
Rahul Agrawal, August 31, 2009 at 12:41 AM
All the best :)
Raj K, August 31, 2009 at 2:25 AM
Dear Folks
It sounds really interesting that you want to develop the kind of search engine that Tim Berners-Lee envisioned in his proposal Information Management back in 1990. He talked about the semantic web. This is only possible when we have semantic data, meaning that every page or piece of content we have should be properly tagged with XML. XML is a powerful tool; let us talk about this. With it we can design an RDF framework based on a specific ontology. Then we can do text mining or web mining for the search engine. Hope you can invite me to some wonderful jobs out there. Waiting.
utkarsh shukla, August 31, 2009 at 2:26 AM
What's the stipend?
veerender kumar, August 31, 2009 at 3:55 AM
Dear all,

The attributes defined here are an extensive set covering all the complexities of a search engine that will be developed over the years. Also, the idea put forward here is a preliminary one and requires a lot of modification and improvement, which again will be done collectively.

The pages at IITK are not tagged properly, I accept. That is why we are very soon launching an Intensive Contact Program (ICP) to create profiles of students, alumni, labs, faculty and other researchers at IITK [Innovation Database ID]. It will probably be announced tomorrow :)

Regarding semantics-based search: let us make a start. Things will proceed to their best.

best wishes
veerender

Regarding the stipend, it will be announced in October after the first work package; hope for the best. What is more important is that you are doing something that will be great for IITK and may be followed or used by many universities later.

The primary team size will be 3 and will be extended if required.
kshitiz, August 31, 2009 at 10:05 AM
Use Apache Lucene / Nutch as starting points. These are existing open source search engine frameworks. Since pages are not properly tagged with metadata, use keyword extraction techniques like KEA (again, open source implementations are available). All of these can be patched together using some JavaScript and client-side Mozilla plugins (look up the Greasemonkey framework for developing Mozilla plugins).