Crawler (ID:4524)

Project Creator: nilsson
FC Member For 6104 Days
Credits 20
Completed Proj. Num. 0 / 0
Total payment USD
Avg Daily Online 0.00 h (From 21/5/2007)
Available on MSN/Skype No
Last Login 6/10/2008
Peers Rating 0.00%
      
Budget: Not Sure/Confidential
Created: 6/10/2008 3:01:57 AM EST
Bidding Ends: 8/9/2008 3:01:57 AM EST
( Expired )
Development Cycle: 14 Days
Bid Count: 4
Average Bid: 850.25  
Project Description:

I am only interested in providers with experience in crawler development.

Specification:

1) Search for all sites in a particular country that meet the specific subject/search terms. This could be achieved by querying Google to get a list of sites to crawl. This will then create a list of sites to crawl regularly.

2) Crawl the list of sites from step 1 and search for specific types of items on each crawled web site.

3) If the item type is found on the web site, extract the data from the web pages in as clean a way as possible. This process will remove as many HTML, CSS and other tags as possible so that only relevant data is acquired.

4) Write the extracted content to a table in a MySQL database.
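
As an illustration of the tag-stripping part of step 3, here is a minimal sketch. It is written in Python for brevity only (the deliverable itself is specified as PHP below), and the names `TextExtractor` and `clean_html` are hypothetical. It assumes that "clean" means keeping visible text and dropping markup plus the contents of script and style blocks; a real crawler would need site-specific rules on top of this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def clean_html(html):
    """Return the visible text of an HTML fragment as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting string is what would be written to the database in step 4; splitting it into fields such as name, address and phone is a separate, site-dependent step.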

All code is to be PHP 5+ compatible and must run as a Windows executable or as a PHP script.

1. The crawler is to crawl yellow-pages-type sites (indexes and directories) in 20-30 different countries and grab information about company names, emails, addresses, phones, logos etc. Where possible, the information grabbed must be structured/indexed in categories, as it is in a phone book.
2. The crawler identifies the customer care email address and the email address where a prospective customer can inquire about becoming a customer or ask a question.
3. The crawler sends an email with a question to see whether the email gets answered.
4. The crawler finds the URL/web address of the companies from step 1.
5. The crawler goes to the company sites and searches for prices and products in selected categories: Bank, Insurance, Pension, Mobile, Mortgage and others.
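
Steps 1 and 2 above could be approached roughly as follows. This is a sketch in Python for illustration only, not the required PHP implementation. The regular expression is a deliberate simplification of valid email syntax, and `CARE_HINTS` is a hypothetical keyword list; both would need tuning per country and language.

```python
import re

# Simplified pattern: local part, "@", then at least two dot-separated labels.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Hypothetical hints that an address belongs to customer care.
CARE_HINTS = ("care", "support", "service", "help")


def extract_emails(text):
    """Return unique, lower-cased email addresses in order of first appearance."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result


def pick_customer_care(emails):
    """Return the first address whose local part contains a hint, else None."""
    for email in emails:
        local = email.split("@", 1)[0]
        if any(hint in local for hint in CARE_HINTS):
            return email
    return None
```

For example, `pick_customer_care(extract_emails(page_text))` would give a candidate address for the test email described in step 3, where `page_text` is the cleaned text of a listing page.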

ADMIN CONSOLE
The software must have a management console enabling the following functions:
1. Must be able to deal with any site and extract any type of information.
2. Must be able to customize regional options so it can be told to crawl sites in a specific domain using the words for the domain.
3. Automated Scheduling of crawler for target site [hourly, daily, weekly, bi-weekly, etc] for the crawler to run.
Reporting of crawl progress, results, log.
Exception handling: providing details of items not crawled.
Duplicate email address handling is paramount: listings with duplicate email addresses must be deleted.
4. Deletion handling to recognize that previously crawled listings/email addresses are no longer listed on the target site, and to handle these by moving the listings into an inactive or archive table separate from the main listings.
5. Backup functions to enable all of the database to be backed up.
6. The ability to easily search for and edit and remove email address records.
7. It should be possible to configure the crawler for different sites. Ease-of-use is important.
8. The crawler should be very fast, not slow due to bad programming.
9. The web crawler must be able to put all of the companies into categories.
10. The crawler should be able to access sites through different proxies, so that the sites do not detect suspicious behavior.

QUESTIONS
1. Which solution do you recommend?
2. Can you provide a demo?

For scanning forums:
Search forums by thread and try to determine what each thread is about using sets of keywords with associated weights. Thus a thread containing words like summer, bike and canoe could be weighted as being about leisure. The keywords and their weights have to be easily configurable. If a thread has enough weight to be considered fully about a category, it is to be recorded in a table under that category.
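
The weighting scheme described above could look roughly like this. The sketch is in Python for illustration only. The keyword sets, the weights and the threshold of 4.0 are made-up values, and it assumes that every occurrence of a keyword adds its weight to the category score.

```python
import re

# Hypothetical, easily editable configuration: category -> {keyword: weight}.
CATEGORY_KEYWORDS = {
    "leisure": {"summer": 1.0, "bike": 2.0, "canoe": 2.0},
    "finance": {"bank": 1.0, "mortgage": 2.0, "pension": 2.0},
}

# Hypothetical minimum score for a thread to count as "about" a category.
THRESHOLD = 4.0


def score_thread(text, keywords=CATEGORY_KEYWORDS):
    """Return {category: total weight of its keywords found in the text}."""
    words = re.findall(r"[a-z]+", text.lower())
    return {
        category: sum(weights.get(word, 0.0) for word in words)
        for category, weights in keywords.items()
    }


def classify_thread(text, keywords=CATEGORY_KEYWORDS, threshold=THRESHOLD):
    """Return the best-scoring category if it reaches the threshold, else None."""
    scores = score_thread(text, keywords)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

Keeping the keywords and weights in a plain mapping means they could be loaded from a database table or config file and edited from the admin console without touching the scoring code.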



Job Type C/C++, Java, PHP, Visual Basic, Other 
Attached Files: N/A

 
Contact*:
Email*:
Telephone:
(Include country code)
Enquiry*