Computer

Friday, April 13, 2007

Web Search Engines: Part 1

In 1995, when the number of "usefully searchable" Web pages was a few tens of millions, it was widely believed that "indexing the whole of the Web" was already impractical or would soon become so due to its exponential growth. A little more than a decade later, the GYM search engines—Google, Yahoo!, and Microsoft—are indexing almost a thousand times as much data and between them providing reliable subsecond responses to around a billion queries a day in a plethora of languages.

If this were not enough, the major engines now provide much higher-quality answers. For most searchers, these engines do a better job of ranking and presenting results, respond more quickly to changes in interesting content, and more effectively eliminate dead links, duplicate pages, and off-topic spam.

In this two-part series, we go behind the scenes and explain how this data processing "miracle" is possible. We focus on whole-of-Web search but note that enterprise search tools and portal search interfaces use many of the same data structures and algorithms.

Search engines cannot and should not index every page on the Web. After all, thanks to dynamic Web page generators such as automatic calendars, the number of pages is infinite.

To provide a useful and cost-effective service, search engines must reject as much low-value automated content as possible. In addition, they can ignore huge volumes of Web-accessible data, such as ocean temperatures and astrophysical observations, without harm to search effectiveness. Finally, Web search engines have no access to restricted content, such as pages on corporate intranets.

What follows is not an inside view of any particular commercial engine—whose precise details are jealously guarded secrets—but a characterization of the problems that whole-of-Web search services face and an explanation of the techniques available to solve these problems.

INFRASTRUCTURE

Figure 1 shows a generic search engine architecture. For redundancy and fault tolerance, large search engines operate multiple, geographically distributed data centers. Within a data center, services are built up from clusters of commodity PCs. The type of PC in these clusters depends upon price, CPU speed, memory and disk size, heat output, reliability, and physical size (labs.google.com/papers/googlecluster-ieee.pdf). The total number of servers for the largest engines is now reported to be in the hundreds of thousands.

Within a data center, clusters or individual servers can be dedicated to specialized functions, such as crawling, indexing, query processing, snippet generation, link-graph computations, result caching, and insertion of advertising content. Table 1 provides a glossary defining Web search engine terms.

Large-scale replication is required to handle the necessary throughput. For example, if a particular set of hardware can answer a query every 500 milliseconds, then the search engine company must replicate that hardware a thousandfold to achieve throughput of 2,000 queries per second. Distributing the load among replicated clusters requires high-throughput, high-reliability network front ends.
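The replication arithmetic above can be checked in a few lines of Python. This is only a sketch: the helper name `replicas_needed` is hypothetical, and it assumes each set of hardware answers one query at a time.

```python
import math

def replicas_needed(latency_seconds: float, target_qps: float) -> int:
    """Number of identical hardware sets needed to reach target_qps,
    assuming each set answers exactly one query at a time."""
    per_replica_qps = 1.0 / latency_seconds
    return math.ceil(target_qps / per_replica_qps)

# 500 ms per query means 2 queries per second per hardware set.
print(replicas_needed(0.5, 2000))  # 1000
```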

Currently, the amount of Web data that search engines crawl and index is on the order of 400 terabytes, placing heavy loads on server and network infrastructure. Allowing for overheads, a full crawl would saturate a 10-Gbps network link for more than 10 days. Index structures for this volume of data could reach 100 terabytes, leading to major challenges in maintaining index consistency across data centers. Copying a full set of indexes from one data center to another over a second 10-gigabit link takes more than a day.
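A rough calculation shows where these transfer times come from. The function below is a hypothetical helper, and the `efficiency` parameter is an assumed stand-in for protocol and scheduling overheads that the article does not quantify.

```python
def transfer_days(terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move `terabytes` of data over a link of `link_gbps`,
    where `efficiency` is the assumed fraction of line rate actually achieved."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400

print(round(transfer_days(400, 10), 1))  # 3.7 days at full line rate
print(round(transfer_days(100, 10), 2))  # 0.93 days at full line rate
```

At full line rate the 400-terabyte crawl would take under four days, so the figure of more than 10 days corresponds to an effective throughput of roughly a third of the nominal link speed.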

CRAWLING ALGORITHMS

The simplest crawling algorithm uses a queue of URLs yet to be visited and a fast mechanism for determining if it has already seen a URL. This requires huge data structures—a simple list of 20 billion URLs contains more than a terabyte of data.

The crawler initializes the queue with one or more "seed" URLs. A good seed URL will link to many high-quality Web sites—for example, www.dmoz.org or wikipedia.org.

Crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue. When the crawler fetches the page, it scans the contents for links to other URLs and adds each previously unseen URL to the queue. Finally, the crawler saves the page content for indexing. Crawling continues until the queue is empty.
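The simple algorithm can be sketched as follows. To keep the example self-contained, a small in-memory dictionary (`FAKE_WEB`, a made-up stand-in) replaces real HTTP requests, and a regular expression stands in for proper HTML link extraction.

```python
import re
from collections import deque

# Hypothetical in-memory "Web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://seed.example/": '<a href="http://a.example/">A</a> <a href="http://b.example/">B</a>',
    "http://a.example/": '<a href="http://b.example/">B</a>',
    "http://b.example/": '<a href="http://seed.example/">home</a>',
}

HREF = re.compile(r'href="([^"]+)"')

def fetch(url):
    """Stand-in for an HTTP GET; returns None when the page is unavailable."""
    return FAKE_WEB.get(url)

def crawl(seeds):
    queue = deque(seeds)   # URLs yet to be visited
    seen = set(seeds)      # fast "have we seen this URL?" check
    saved = {}             # page content kept for indexing
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        for link in HREF.findall(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        saved[url] = page
    return saved

print(list(crawl(["http://seed.example/"])))
```

At Web scale the `seen` set and the queue would not fit in one machine's memory, which is why the real data structures are far more elaborate.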

Real crawlers

In practice, this simple crawling algorithm must be extended to address the following issues.

Speed. If each HTTP request takes one second to complete—some will take much longer or fail to respond at all—the simple crawler can fetch no more than 86,400 pages per day. At this rate, it would take 634 years to crawl 20 billion pages. In practice, crawling is carried out using hundreds of distributed crawling machines.
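The 634-year figure, and the effect of parallel fetching on it, can be reproduced with a short calculation. The function name and its parameters are illustrative only.

```python
def crawl_years(pages: float, seconds_per_fetch: float = 1.0,
                parallel_fetchers: int = 1) -> float:
    """Years needed to fetch `pages` pages when `parallel_fetchers`
    requests are in flight at once, each taking `seconds_per_fetch`."""
    seconds = pages * seconds_per_fetch / parallel_fetchers
    return seconds / (86_400 * 365)

print(int(crawl_years(20e9)))                                 # 634
print(round(crawl_years(20e9, parallel_fetchers=100), 1))     # 6.3
```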

A hashing function determines which crawling machine is responsible for a particular URL. If a crawling machine encounters a URL for which it is not responsible, it passes it on to the machine that is responsible for it.
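One way such an assignment might look is sketched below. Hashing the host name rather than the full URL is an assumption made here so that all pages of a site land on the same machine; the function names and the `outboxes` structure are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit

def responsible_machine(url: str, num_machines: int) -> int:
    """Map a URL to a crawling machine by hashing its host name (assumed policy)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines

def route(url, my_id, num_machines, local_queue, outboxes):
    """Keep the URL if this machine owns it, otherwise hand it to the owner."""
    owner = responsible_machine(url, num_machines)
    if owner == my_id:
        local_queue.append(url)
    else:
        outboxes.setdefault(owner, []).append(url)
    return owner
```

A stable hash such as SHA-1 is used instead of Python's built-in `hash`, whose value for strings changes between processes and so could not be shared across machines.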

Even hundredfold parallelism is not sufficient to achieve the necessary crawling rate. Each crawling machine therefore exploits a high degree of internal parallelism, with hundreds or thousands of threads issuing requests and waiting for responses.
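The idea of many threads issuing requests and waiting for responses can be illustrated with Python's standard thread pool. In this sketch `slow_fetch` merely sleeps to simulate network latency; it is a placeholder, not a real HTTP client.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_fetch(url: str) -> str:
    """Placeholder for an HTTP request: sleeps to simulate network latency."""
    time.sleep(0.05)
    return f"content of {url}"

def fetch_all(urls, max_threads=100):
    """Fetch all URLs using up to `max_threads` concurrent worker threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return dict(zip(urls, pool.map(slow_fetch, urls)))

urls = [f"http://site{i}.example/" for i in range(20)]
pages = fetch_all(urls, max_threads=20)
print(len(pages))  # 20
```

Because the threads spend nearly all their time waiting, the 20 simulated fetches finish in roughly the time of one, rather than 20 times as long.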

Politeness. Unless care is taken, crawler parallelism introduces the risk that a single Web server will be bombarded with requests to such an extent that it becomes overloaded. Crawler algorithms are designed to ensure that only one request to a server is made at a time and that a politeness delay is inserted between requests. It is also necessary to take into account bottlenecks in the Internet; for example, search engine crawlers have sufficient bandwidth to completely saturate network links to entire countries.
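A politeness delay can be sketched as a per-host record of the earliest time at which the next request is allowed. The class below is a hypothetical illustration that covers only the delay between requests, and it takes the current time as an argument so that its behavior is easy to follow.

```python
from urllib.parse import urlsplit

class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, delay_seconds: float):
        self.delay = delay_seconds
        self.next_allowed = {}  # host -> earliest permitted request time

    def wait_time(self, url: str, now: float) -> float:
        """Seconds the crawler should still wait before requesting `url`."""
        host = urlsplit(url).hostname or ""
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_request(self, url: str, now: float) -> None:
        """Note that a request to this URL's host was made at time `now`."""
        host = urlsplit(url).hostname or ""
        self.next_allowed[host] = now + self.delay

gate = PolitenessGate(delay_seconds=2.0)
gate.record_request("http://a.example/page1", now=100.0)
print(gate.wait_time("http://a.example/page2", now=100.5))  # 1.5
print(gate.wait_time("http://b.example/", now=100.5))       # 0.0
```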

Excluded content. Before fetching a page from a site, a crawler must fetch that site's robots.txt file to determine whether the webmaster has specified that some or all of the site should not be crawled.
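Python's standard library includes a robots.txt parser, which makes this check easy to illustrate. In the sketch below the robots.txt content, the crawler name `ExampleBot`, and the site URLs are all made up, and the file is parsed from a string rather than fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_checker(robots_text: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

checker = make_checker(ROBOTS_TXT)
print(checker.can_fetch("ExampleBot", "http://site.example/private/report.html"))  # False
print(checker.can_fetch("ExampleBot", "http://site.example/index.html"))           # True
```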

Figure 1. Generic search engine architecture. Enterprise search engines must provide adapters (top left) for all kinds of Web and non-Web data, but these are not required in a purely Web search.

Duplicate content. Identical content is frequently published at multiple URLs. Simple checksum comparisons can detect exact duplicates, but when the page includes its own URL, a visitor counter, or a date, more sophisticated fingerprinting methods are needed.
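The difference between a checksum and a more tolerant fingerprint can be illustrated with word shingles and set overlap. Shingling is used here only as one well-known example of a fingerprinting approach; the article does not say which methods the engines use, and the sample pages are invented.

```python
import hashlib

def checksum(text: str) -> str:
    """Exact-duplicate detector: any change at all alters the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences found in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def resemblance(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets (1.0 means identical sets)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

page_a = ("Welcome to our site. You are visitor number 1041. "
          "Read the latest news about search engines and crawlers here")
page_b = page_a.replace("1041", "1042")

print(checksum(page_a) == checksum(page_b))  # False
print(resemblance(page_a, page_b))           # 0.7
```

The visitor counter defeats the checksum, but the two pages still share most of their shingles, so a threshold on the resemblance score would flag them as near duplicates.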

Crawlers can save considerable resources by recognizing and eliminating duplication as early as possible because unrecognized duplicates can contain relative links to whole families of other duplicate content.

Search engines avoid some systematic causes of duplication by transforming URLs to remove superfluous parameters such as session IDs and by casefolding URLs from case-insensitive servers.
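A URL transformation of this kind might look like the following sketch. The list of session parameter names is an assumption chosen for illustration, and whether a server is case-insensitive is passed in as a flag because it cannot be determined from the URL alone.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed examples of session-ID parameter names; real lists would be longer.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def normalize(url: str, case_insensitive_path: bool = False) -> str:
    """Drop session parameters and fragments; optionally casefold the path."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    path = parts.path.lower() if case_insensitive_path else parts.path
    # Scheme and host are always case-insensitive, so they are lowercased.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       urlencode(query), ""))

print(normalize("HTTP://Example.COM/Docs/Page.html?id=7&PHPSESSID=abc123"))
# http://example.com/Docs/Page.html?id=7
```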

Continuous crawling. Carrying out full crawls at fixed intervals would imply slow response to important changes in the Web. It would also mean that the crawler would continuously refetch low-value and static pages, thereby incurring substantial costs without significant benefit. For example, a corporate site's 2002 media releases section rarely, if ever, requires recrawling.

Interestingly, submitting the query "current time New York" to the GYM engines reveals that each of these engines crawls the www.timeanddate.com/worldclock site every couple of days. However, no matter how often the engines crawl this site, the search result will always show the wrong time.

To increase crawling bang-per-buck, a priority queue replaces the simple queue. The URLs at the head of this queue have been assessed as having the highest priority for crawling, based on factors such as change frequency, incoming link count, click frequency, and so on. Once a URL is crawled, it is reinserted at a position in the queue determined by its reassessed priority. In this model, crawling need never stop.
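The priority-queue model can be sketched with Python's `heapq` module. The class name, the numeric priorities, and the `reassess` callback are all hypothetical; how a real engine combines change frequency, link counts, and clicks into a priority is not described in the article.

```python
import heapq
import itertools

class CrawlFrontier:
    """Priority queue of URLs; the highest priority is crawled first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: earlier pushes win

    def push(self, url: str, priority: float) -> None:
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self) -> str:
        _, _, url = heapq.heappop(self._heap)
        return url

def crawl_step(frontier: CrawlFrontier, reassess) -> str:
    """Crawl the top URL, then reinsert it with its reassessed priority."""
    url = frontier.pop()
    # Fetching and indexing of `url` would happen here.
    frontier.push(url, reassess(url))
    return url

frontier = CrawlFrontier()
frontier.push("http://news.example/", 0.9)
frontier.push("http://archive.example/2002/", 0.1)
frontier.push("http://blog.example/", 0.5)
print([crawl_step(frontier, lambda url: 0.2) for _ in range(3)])
```

Because every crawled URL goes back into the queue, the loop never runs out of work, and a low-priority page such as the 2002 archive is simply visited less often.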

Spam rejection. Primitive spamming techniques, such as inserting misleading keywords that are invisible to the viewer (for example, white text on a white background, zero-point fonts, or meta tags), are easily detected. In any case, they are ineffective now that rankings depend heavily upon link information (www-db.stanford.edu/pub/papers/google.pdf).

Modern spammers create artificial Web landscapes of domains, servers, links, and pages to inflate the link scores of the targets they have been paid to promote. Spammers also engage in cloaking, the process of delivering different content to crawlers than to site visitors.

Search engine companies use manual and automated analysis of link patterns and content to identify spam sites that are then included in a blacklist. A crawler can reject links to URLs on the current blacklist and can reject or lower the priority of pages that are linked to or from blacklisted sites.

FINAL CRAWLING THOUGHTS

The full story of Web crawling must include decoding hyperlinks computed in JavaScript; extraction of indexable words, and perhaps links, from binary documents such as PDF and Microsoft Word files; and converting character encodings such as ASCII, Windows codepages, and Shift-JIS into Unicode for consistent indexing (www.unicode.org/standard/standard.html).
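The character-encoding step can be illustrated with Python's built-in codecs. The sketch assumes the encoding of each page is already known (for example from an HTTP header); detecting it when it is missing or wrong is a separate and harder problem.

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode raw page bytes into a Unicode string, replacing invalid bytes."""
    return raw.decode(declared_encoding, errors="replace")

# The same pipeline handles pages stored in different legacy encodings.
japanese_page = "検索".encode("shift_jis")
western_page = "café".encode("cp1252")

print(to_unicode(japanese_page, "shift_jis"))  # 検索
print(to_unicode(western_page, "cp1252"))      # café
print(to_unicode(b"search", "ascii"))          # search
```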

Engineering a Web-scale crawler is not for the unskilled or fainthearted. Crawlers are highly complex parallel systems, communicating with millions of different Web servers, among which can be found every conceivable failure mode, all manner of deliberate and accidental crawler traps, and every variety of noncompliance with published standards. Consequently, the authors of the Mercator crawler found it necessary to write their own versions of low-level system software to achieve required performance and reliability (www.research.compaq.com/SRC/mercator/papers/www/paper.html).

It is not uncommon to find that a crawler has locked up, ground to a halt, crashed, burned up an entire network traffic budget, or unintentionally inflicted a denial-of-service attack on a Web server whose operator is now very irate.

Part two of this two-part series (Computer, How Things Work, Aug. 2006) will explain how search engines index crawled data and how they process queries.

Tuesday, March 27, 2007

Microsoft Office Professional 2007

Microsoft Office Professional 2007 is a complete suite of productivity and database software that will help you save time and stay organized. Powerful contact management features help you manage all customer and prospect information in one place. Develop professional marketing materials for print, e-mail, and the Web, and produce effective marketing campaigns in-house. Create dynamic business documents, spreadsheets, and presentations, and build databases with no prior experience or technical staff. You will learn new features rapidly using improved menus that present the right tools when you need them.

Office Professional 2007 includes:

Microsoft Access 2007
With its improved interface and interactive design capabilities that do not require deep database knowledge, Microsoft Office Access 2007 helps you track and report information with ease. Get started quickly with prebuilt applications that you can modify or adapt to changing business needs. Collect information through forms in e-mail or import data from external applications. Create and edit detailed reports that display sorted, filtered, and grouped information in a way that helps you make sense of the data for informed decision-making. Share information by moving your Office Access 2007 files to a Windows SharePoint Services Web site, where you can audit revision history, recover deleted information, set data access permissions, and back up your information at regular intervals.

Microsoft Excel 2007
Microsoft Office Excel 2007 is a powerful tool you can use to create and format spreadsheets, and analyze and share information to make more informed decisions. With the new results-oriented interface, rich data visualization, and PivotTable views, professional-looking charts are easier to create and use. Office Excel 2007, combined with Excel Services, a new technology that will ship with Microsoft Office SharePoint Server 2007, provides significant improvements for sharing data with greater security. You can share sensitive business information more broadly with enhanced security with your coworkers, customers, and business partners. By sharing a spreadsheet using Office Excel 2007 and Excel Services, you can navigate, sort, filter, input parameters, and interact with PivotTable views directly on the Web browser.

Microsoft Outlook 2007
Microsoft Office Outlook 2007 provides an integrated solution for managing your time and information, connecting across boundaries, and remaining in control of the information that reaches you. Office Outlook 2007 delivers innovations you can use to quickly search your communications, organize your work, and better share your information with others -- all from one place.

Microsoft PowerPoint 2007
Microsoft Office PowerPoint 2007 enables users to quickly create high-impact, dynamic presentations, while integrating workflow and ways to easily share information. From the redesigned user interface to the new graphics and formatting capabilities, Office PowerPoint 2007 puts the control in your hands to create great-looking presentations.

Microsoft Publisher 2007
Microsoft Office Publisher 2007 helps you create, personalize, and share a wide range of publications and marketing materials in-house. New and improved capabilities guide you through the process of creating and distributing in print, Web, and e-mail so you can build your brand, manage customer lists, and track your marketing campaigns -- all in-house.

Microsoft Word 2007
Word 2007 is a powerful authoring program that gives you the ability to create and share documents by combining a comprehensive set of writing tools with an easy-to-use interface. Office Word 2007 helps information workers create professional-looking content more quickly than ever before. With a host of new tools, you can quickly construct documents from predefined parts and styles, as well as compose and publish blogs directly from within Word. Advanced integration with Microsoft Office SharePoint Server 2007 and new XML-based file formats make Office Word 2007 the ideal choice for building integrated document management solutions.

Create High-Quality Documents
Office Professional 2007 has new graphics capabilities, formatting galleries, and an improved user interface that exposes commonly used commands, making it even easier for you to produce high-quality documents that you can be proud of. New features and improvements include:

A results-oriented user interface that makes it easier to find and use product features.
More stable bullets and numbering to help you consistently format documents.
Enhanced text effects, SmartArt diagrams, and graphics and charting galleries that provide more formatting choices.
Document Themes that help you create a consistent appearance across Microsoft Office system programs.

Work with enhanced reliability and security features
With an improved junk e-mail filter and anti-phishing tools, automatic document recovery, and Document Inspector for removing personally identifiable information from your documents, Office Professional 2007 enables you to work with more confidence and security. New features and improvements include:

A junk e-mail filter that helps significantly reduce spam e-mail messages.
Anti-phishing tools that alert users to suspicious and potentially fraudulent e-mail messages.
Automatic document recovery tool that helps retrieve Microsoft Office documents after a system stops responding.
Document Inspector that detects and removes personally identifiable information, comments, and tracked changes from documents.

Find commands and help with ease
Office Professional 2007 has a new streamlined user interface and an enhanced Help system, including online tutorials with step-by-step instructions, so you can quickly learn about the programs and find answers to your questions. New features and improvements include:

Command tabs on the results-oriented Ribbon that display commonly used commands that previously appeared only in lengthy drop-down menus.
An improved Help system that offers a smooth transition between the Help menu in the Microsoft Office system and Help on the Internet (when connected). Larger, more informative enhanced ScreenTips provide help about commands.
Command tabs that are context-sensitive and change automatically depending upon the task that you are trying to complete.
Online tutorials that provide step-by-step instructions for common tasks.

Organize your time and communications
Office Professional 2007 includes Microsoft Office Outlook 2007, which is now an even more complete and easy-to-use e-mail and appointment manager, leaving you more time to do the things you want to do. You can quickly search throughout your e-mail, share your calendar with the people you care about, and get the latest news from your favorite Web sites using Really Simple Syndication (RSS). New features and improvements include:

Instant Search that helps you quickly find information in any of the Outlook modules such as e-mail, calendar, and contacts.
A To-Do Bar that brings together tasks, appointments, and flagged e-mail messages in one place.
Color Categories that help you quickly differentiate e-mail messages.
An RSS aggregator that offers you the opportunity to subscribe to and read Web content in Office Outlook 2007. (A separate fee-based RSS subscription is required.)

Create professional marketing materials and campaigns in-house
Create and distribute professional and compelling marketing materials and campaigns entirely in-house with Office Professional 2007. Create designer-quality marketing materials for print, e-mail, and the Web using Office Publisher 2007. Use Office Outlook 2007 with Business Contact Manager and Office Publisher 2007 together to track and manage marketing campaign activities such as compiling mailing lists, distributing materials, and tracking results. You also can use the library of customizable templates in Microsoft Office PowerPoint 2007 to create professional-looking presentations. Office Professional 2007 enables you to:

Create and publish a wide range of marketing materials for print, e-mail, and the Web with your own brand elements including logo, colors, fonts, and business information using Office Publisher 2007.
Use hundreds of professionally designed and customizable templates, and more than 100 blank publication types provided by Office Publisher 2007.
Reuse text, graphics, and design elements, and convert content from one publication type to another with Office Publisher 2007.
Use Office Publisher 2007 to combine and filter mailing lists and data from multiple sources -- including Office Excel 2007, Office Outlook 2007, Office Outlook 2007 with Business Contact Manager, and Microsoft Office Access 2007 -- to create personalized print and e-mail materials, and build custom collateral such as catalogs and datasheets.
Create, manage, and track marketing campaigns using Office Outlook 2007 with Business Contact Manager.
Create more dynamic presentations from an extensive library of customizable themes and slide layouts using Office PowerPoint 2007.
Create powerful charts, SmartArt diagrams, and tables, and then quickly preview formatting changes using the new graphics tools in Office Word 2007, Office Excel 2007, and Office PowerPoint 2007.

Friday, March 9, 2007

history of computer

It is difficult to identify any one device as the earliest computer, partly because the very definition of a computer has changed over time. Many devices once called "computers" would no longer qualify as such by today's standards.

Originally, the term "computer" referred to a person who performed numerical calculations (a human computer), often with the aid of a mechanical calculating device. Examples of early mechanical computing devices included the abacus, the slide rule and arguably the astrolabe and the Antikythera mechanism (which dates from about 150-100 BC). The end of the Middle Ages saw a re-invigoration of European mathematics and engineering, and Wilhelm Schickard's 1623 device was the first of a number of mechanical calculators constructed by European engineers.

However, none of those devices fit the modern definition of a computer because they could not be programmed. In 1801, Joseph Marie Jacquard made an improvement to the textile loom that used a series of punched paper cards as a template to allow his loom to weave intricate patterns automatically. The resulting Jacquard loom was an important step in the development of computers because the use of punched cards to define woven patterns can be viewed as an early, albeit limited, form of programmability.

In 1837, Charles Babbage was the first to conceptualize and design a fully programmable mechanical computer that he called "The Analytical Engine". Due to limited finances, and an inability to resist tinkering with the design, Babbage never actually built his Analytical Engine.

Large-scale automated data processing of punched cards was performed for the US Census in 1890 by tabulating machines designed by Herman Hollerith, whose company later became part of the Computing Tabulating Recording Corporation, which in turn was renamed IBM. By the end of the 19th century a number of technologies that would later prove useful in the realization of practical computers had begun to appear: the punched card, Boolean algebra, the vacuum tube (thermionic valve) and the teleprinter.

During the first half of the 20th century, many scientific computing needs were met by increasingly sophisticated analog computers, which used a direct mechanical or electrical model of the problem as a basis for computation. However, these were not programmable and generally lacked the versatility and accuracy of modern digital computers.

A succession of steadily more powerful and flexible computing devices was constructed in the 1930s and 1940s, gradually adding the key features that are seen in modern computers. The use of digital electronics (largely invented by Claude Shannon in 1937) and more flexible programmability were vitally important steps, but defining one point along this road as "the first digital electronic computer" is difficult (Shannon 1940). Notable achievements include:

EDSAC was one of the first computers to implement the stored program (von Neumann) architecture.

Konrad Zuse's electromechanical "Z machines". The Z3 (1941) was the first working machine featuring binary arithmetic, including floating point arithmetic and a measure of programmability. In 1998 the Z3 was shown to be Turing complete, which would make it the world's first operational computer.
The Atanasoff-Berry Computer (1941), which used vacuum tube based computation, binary numbers, and regenerative capacitor memory.
The secret British Colossus computer (1944), which had limited programmability but demonstrated that a device using thousands of tubes could be reasonably reliable and electronically reprogrammable. It was used for breaking German wartime codes.
The Harvard Mark I (1944), a large-scale electromechanical computer with limited programmability.
The US Army's Ballistic Research Laboratory ENIAC (1946), which used decimal arithmetic and was the first general purpose electronic computer, although it initially had an inflexible architecture which essentially required rewiring to change its programming.

Several developers of ENIAC, recognizing its flaws, came up with a far more flexible and elegant design, which came to be known as the stored program architecture or von Neumann architecture. This design was first formally described by John von Neumann in the paper "First Draft of a Report on the EDVAC", published in 1945. A number of projects to develop computers based on the stored program architecture commenced around this time, the first of these being completed in Great Britain. The first to be demonstrated working was the Manchester Small-Scale Experimental Machine (SSEM) or "Baby". However, the EDSAC, completed a year after SSEM, was perhaps the first practical implementation of the stored program design. Shortly thereafter, the machine originally described by von Neumann's paper—EDVAC—was completed but didn't see full-time use for an additional two years.

Nearly all modern computers implement some form of the stored program architecture, making it the single trait by which the word "computer" is now defined. By this standard, many earlier devices would no longer be called computers by today's definition, but are usually referred to as such in their historical context. While the technologies used in computers have changed dramatically since the first electronic, general-purpose computers of the 1940s, most still use the von Neumann architecture. The design made the universal computer a practical reality.

Microprocessors are miniaturized devices that often implement stored program CPUs.

Vacuum tube-based computers were in use throughout the 1950s, but were largely replaced in the 1960s by transistor-based devices, which were smaller, faster, cheaper, used less power and were more reliable. These factors allowed computers to be produced on an unprecedented commercial scale. By the 1970s, the adoption of integrated circuit technology and the subsequent creation of microprocessors such as the Intel 4004 caused another leap in size, speed, cost and reliability. By the 1980s, computers had become sufficiently small and cheap to replace simple mechanical controls in domestic appliances such as washing machines. Around the same time, computers became widely accessible for personal use by individuals in the form of home computers and the now ubiquitous personal computer. In conjunction with the widespread growth of the Internet since the 1990s, personal computers are becoming as common as the television and the telephone and almost all modern electronic devices contain a computer of some kind.

Wednesday, February 28, 2007

hcditrading

Product Specifications:

Dell's OptiPlex series combines power, functionality, versatility and ease of maintenance all in one machine. Built with business in mind, yet well suited for the home user, the OptiPlex series is a computer that offers a perfect combination of reliability, power and affordability. HCDI Product Code 16040package

Item Description:

Dell OptiPlex GX260 Desktop

Intel Pentium 4 2400MHz

512MB of RAM Installed

Two (2) DIMM slots

40GB IDE Hard Drive

IDE DVD ROM Drive Installed (Various speeds)

1.44MB Floppy Drive Installed

Windows XP Pro installed with Media and Restore

Integrated Video Installed

Two (2) PCI Slots and (1) half height AGP Slot

Two (2) PS/2 Ports

Six (6) USB Ports / 4 Back and 2 Front

One (1) Serial Port and One (1) Parallel Port

Onboard Network and Audio Installed

PS/2 Keyboard and Mouse Included

15" Dell LCD Monitor (not model specific)

FREE Dell Speakers

* All units are thoroughly tested and must meet the demanding quality standards of HCDI before leaving our warehouse. Systems are guaranteed to be in excellent working condition and are unconditionally warranted for 30 days from purchase. Longer warranties are available. Operating systems can be added. Products are refurbished and/or used and may have some minor cosmetic blemishes, such as (but not limited to) small scratches on the case.
