Search Engines and Current Web Programming Development Tools

A number of search tools are available today that allow users to find information on the Web quickly and easily. Two basic approaches have evolved in response to the need to organise and locate information on the World Wide Web: directories and search engines.
A directory offers a hierarchical arrangement of hyperlinks to Web pages, broken down into topics and subtopics. A search engine, on the other hand, is a set of programs that searches for information within a specific realm and collates that information in a database. Although "search engine" really names a general class of programs, the term is often used specifically for Internet search engines such as Google, AltaVista and Excite, which enable users to search for documents on the World Wide Web, FTP servers and USENET newsgroups.
Search engines can also be devised for offline content, such as a library catalogue, the contents of a personal hard drive, or a catalogue of museum collections. Generally search engines help people to organise and display information in a way which makes it readily accessible.
Search Tools
A search tool is software that enables a user to gain access to information quickly and easily. The collection of search tools is constantly evolving, with new ones coming on the scene and others disappearing. Two basic approaches have evolved in response to the need to organise and locate information on the World Wide Web: directories and search engines. Both approaches rely on a database of information about Web pages, created either manually or by special programs that traverse the Web, which can then be accessed quickly and easily. A request for information is answered by the search tool retrieving results from its already-constructed database of indexed Web details. Other terms that relate to searching for information on the Web are as follows:
Search Terminology
Search tool: This refers to any mechanism for locating information on the Web. Examples include search or metasearch engine, and directory.
Metasearch engine: This refers to an all-in-one search engine that performs a search by calling on more than one other search engine to do the actual work.
Query: This refers to the information entered into a form on a search engine’s Web page that describes the information being sought.
Query syntax: This refers to the set of rules describing what constitutes a legal query; on some search engines, special symbols may be used in a query.
Query semantics: This refers to the set of rules that defines the meaning of a query.
Hit: This refers to a URL that a search engine returns in response to a query.
Match: This is a synonym for hit.
Relevancy score: This refers to a value indicating how close a match a URL was to a query; it is usually expressed as a value from 1 to 100, with higher scores meaning more relevant.
The first method of finding and organising Web information, as stated earlier, is the directory approach. A directory offers a hierarchical arrangement of hyperlinks to Web pages, broken down into topics and subtopics. The hierarchy can descend many levels; the specific number of levels is determined by the taxonomy of topics. Examples of popular general directories are LookSmart, Lycos, Dmoz and Yahoo!.
Search Engines
The second approach to organising information and locating information on the Web is a search engine, which is a computer program that does the following:
1. allows a user to submit a form containing a query that consists of a word or phrase describing the specific information of interest to be located from the Web
2. searches its database to try to match your query
3. collates and returns a list of clickable URLs containing presentations that match the user's query; the list is usually ordered with the better matches appearing at the top
4. permits a user to revise and resubmit a query.
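The four steps above can be illustrated with a small sketch in Python. The database, URLs and ranking rule here are invented purely for the example; a real search engine would use a far larger index and a proprietary scoring algorithm.

```python
# A toy sketch of the four steps above. The "database", URLs and
# ranking rule are invented for this example.

# Step 2's database: a prebuilt mapping of keywords to URLs.
TOY_DB = {
    "python": ["https://example.org/python-intro", "https://example.org/snakes"],
    "tutorial": ["https://example.org/python-intro"],
}

def search(query):
    """Accept a query (step 1), match it against the database (step 2),
    and return hits with the better matches at the top (step 3)."""
    scores = {}
    for word in query.lower().split():
        for url in TOY_DB.get(word, []):
            scores[url] = scores.get(url, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

# Step 4: the user simply revises the query string and calls search() again.
print(search("python tutorial"))
```

Here the page matching both query words ranks above the page matching only one, mirroring the ordering described in step 3.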
A survey of Web search engine market share carried out by Net Marketshare in December 2010 showed:
• Google: 84.65%
• Yahoo: 6.69%
• Baidu: 3.39%
• Bing: 3.29%
• Others: 1.98%
Components of a Search Engine
Search engines have the following components:
  1. User Interface
  2. Database
  3. Robot or Spider Software
1. User Interface
The user interface is the mechanism by which users submit queries to the search engine by typing keywords or phrases into a text box. When the form is submitted, the data typed into the text box is sent to a server-side script that searches the database using the keywords entered.
Afterwards, search results are displayed in the browser as a list of information, such as the URLs of Web pages that meet the users' criteria. This result set is formatted with a link to each page along with additional information that might include the page title, a brief description, the first few lines of text, the size of the page and a relevancy score for each hit. This way, the user is able to make an informed choice as to which hyperlinks to follow. Hyperlinks to help files are usually displayed prominently, and advertisements should not hinder a reader's use of the search engine. The order in which pages are displayed may depend on paid advertisement, alphabetical order, and link popularity. Each search engine has its own policy for ordering the search results, and these policies can change over time.
2. Database
A database is a collection of information organised so that its contents can easily be accessed, managed and updated. Database management systems (DBMSs) such as Oracle, Microsoft SQL Server, Informix, MySQL or IBM DB2 are used to configure and manage the database. The databases associated with search engines are extremely large, holding indexed details of vast numbers of pages, and require highly efficient search strategies to retrieve information from them. Computer scientists have spent years developing several efficient searching and sorting strategies, which are implemented in search engines. The information displayed as the results of your search usually comes from the database accessed by the search engine site. Some search engines, such as AOL and Netscape, use a database provided by Google.
3. Robot or Spider Software
A robot (sometimes called a spider) is a program that automatically traverses the hypertext structure of the Web by retrieving a Web page document and following the hyperlinks on the page. It moves like a robotic spider across the Web, accessing and documenting Web pages. It requests pages from a website in the same way that Internet Explorer, Firefox or any other browser does. The spider does not collect images or formatting details; it is only interested in text, links and the URL from which they come.
The spider categorises the pages and stores information about the Web site and the Web pages in a database. Various robots may work differently, but in general they access and may store important information from Web pages such as the title, meta tag keywords, meta tag description, and some of the text on the page (usually either the first few sentences or the text contained in heading tags). For multimedia elements in Web pages to be indexed, the "alt" attribute should be used so that search engines have text to index.
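The parsing step of a spider can be illustrated with Python's standard html.parser module. The sketch below extracts only what the text describes, namely the title, hyperlinks and image "alt" text, from a sample page; a real robot would first fetch the page over HTTP and then queue the discovered links for further crawling.

```python
from html.parser import HTMLParser

class SpiderParser(HTMLParser):
    """Collects what a robot typically keeps from a page: the title,
    hyperlink URLs, and the alt text of images (formatting is ignored)."""
    def __init__(self):
        super().__init__()
        self.links, self.alts = [], []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])     # follow these later
        elif tag == "img" and "alt" in attrs:
            self.alts.append(attrs["alt"])       # indexable text for media
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

page = ('<html><head><title>Demo</title></head>'
        '<body><a href="/next">next</a>'
        '<img src="x.png" alt="a chart"></body></html>')
parser = SpiderParser()
parser.feed(page)
print(parser.title, parser.links, parser.alts)
```

A crawler built around this class would fetch each URL in `parser.links` in turn, repeating the process for every new page found.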
The spider software works in conjunction with the index software, which uses the information collected by the spider. The spider takes the information it has gathered about a Web page and sends it to the index software, where it is analysed and stored. The index makes sense of the mass of text, links and URLs using an algorithm, a complex mathematical procedure that indexes the words, pairs of words and so on.
The algorithm analyses the pages and links for word combinations to determine what the Web pages are all about, that is, what topics are being covered. Then, scores are assigned that allow the search engine to measure how relevant or important the Web pages (and URLs) might be to the user or visitor. Major search engines such as Google, Yahoo or Bing use proprietary algorithms for scoring.
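As an illustration of indexing and scoring, the following sketch builds a small inverted index and assigns each URL a relevancy score on the 1-100 scale mentioned earlier, here simply the percentage of query words found on the page. The pages and the scoring rule are invented for the example; real engines use far more sophisticated, proprietary algorithms.

```python
def build_index(pages):
    """pages: {url: page text}. Returns an inverted index mapping
    each word to the set of URLs that contain it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def relevancy(index, query, url):
    """A toy relevancy score: the percentage of query words that
    appear on the page (0 when nothing matches, 100 for all)."""
    words = query.lower().split()
    matched = sum(1 for w in words if url in index.get(w, set()))
    return round(100 * matched / len(words))

pages = {
    "https://example.org/se": "how web search engines index pages",
    "https://example.org/food": "recipes and cooking pages",
}
idx = build_index(pages)
print(relevancy(idx, "search engines", "https://example.org/se"))   # 100
print(relevancy(idx, "search engines", "https://example.org/food")) # 0
```

Sorting URLs by this score would put the most relevant pages at the top of the result list, as described above.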
Listing in a Search Engine and Search Index
The components of a search engine (robot, database and search form) work together to obtain information about Web pages, store information about Web pages, and provide a graphical user interface to facilitate searching for and displaying a list of Web pages relevant to given keywords. In recent times, search engines have become one of the top methods used to drive traffic to e-commerce sites. Though very effective, it is not always easy to get listed in a search engine or search directory. Recently, there has been a trend away from free listing in search engines; current trends entail paying for listing consideration in a search engine or directory. These approaches include paying for an express submit or express inclusion, paying for preferential placement in search engine displays (called sponsoring or advertising), and paying each time a visitor clicks the search engine's link to your site. Yahoo calls its programme Sponsored Results, while Google's is AdWords. In these programs, payment is made when the site is submitted for review. If accepted, the site has a listing, usually at the top or right margin of the search results. In addition to the initial fee, the Web site owners must pay each time a visitor clicks on the search engine link to their site; this is called cost-per-click (CPC).
A Web search engine is designed to search for information on the World Wide Web, FTP servers, USENET newsgroups, and so on. The search results, which may consist of Web pages, images, information and other types of files, are generally presented as a list of results and are often called hits. Some search engines also mine data available in databases or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or with a mixture of algorithmic and human input. Search engines use automated software programs to survey the Web and build their databases. Web documents are retrieved by these programs and analysed, and data collected from each Web page are then added to the search engine index. Each search engine uses a proprietary algorithm to create its indices such that, ideally, only meaningful results are returned for each query. The best URLs are then returned to the user as hits, ranked with the best results (as judged by the search engine's algorithm) at the top.
Current Web Programming Development Tools
Advances in Internet technology have led to the release of several tools for Web development. Many of these tools are easy to use and are made available to the public free of charge. A popular example is the LAMP (Linux, Apache, MySQL, PHP) stack, which is usually distributed free of charge. The availability of free tools has greatly influenced the rate at which people around the globe set up new Web sites daily. Easy-to-use software for Web development includes, amongst others, Adobe Dreamweaver, NetBeans, WebDev, Microsoft Expression Studio and Adobe Flex.
Using such software, virtually anyone can develop a Web page in a matter of minutes. Knowledge of Hypertext Markup Language (HTML) or other programming languages is not usually required, but is recommended for professional results. Newer generations of Web development tools build on the strong growth of LAMP, Java Platform, Enterprise Edition and Microsoft .NET technologies to make the Web a place to run applications online. Web developers now help to deliver applications as Web services that were traditionally only available as applications on a desktop computer. Thus, instead of running executable code on a local computer, users can interact with online applications to create new content. This has enabled new methods of communication and created many opportunities to decentralise information and media distribution. In this unit, we shall discuss other technologies, models and tools that enhance easy development of Web applications.
Web Services
There is no need to reinvent the wheel with every new project. With Web services, developers can use existing software solutions to create other feature-rich applications. A Web service is a self-describing, self-contained application that provides some business functionality through an Internet connection. For example, an organisation could create a Web service to facilitate information exchange with its partners or vendors. Web services make software functionality available over the Internet so that programs written in PHP, ASP, JSP, JavaBeans, COM objects, and other favourite technologies can make a request to a program running on another server (a Web service) and use that program's response in a website, Wireless Application Protocol (WAP) service, or other application. Universal Description, Discovery and Integration (UDDI) is a directory of Web service interfaces and is used for storing information about Web services. It also provides a method of describing a service, invoking a service, and locating available services.
A Web service itself is described by the Web Services Description Language (WSDL), while UDDI communicates via the Simple Object Access Protocol (SOAP). SOAP is a communication protocol that is language independent and based on XML; WSDL is likewise based on XML and is used to describe and locate Web services. Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed over HTTP with an XML serialisation in conjunction with other Web-related standards.
Although UDDI is built into the Microsoft .NET platform, it is a standard backed by a number of technology companies, including IBM, Microsoft and Sun Microsystems. The incorporation of Web services into new programs allows the speedy development of new applications. The use of Web Application Programming Interfaces (Web APIs) is a current trend in Web 2.0 development, where emphasis has been moving away from SOAP-based services towards representational state transfer (REST) based communications.
REST services do not require XML, SOAP, or WSDL service-API definitions. Web API is typically a defined set of Hypertext Transfer Protocol (HTTP) request messages along with a definition of the structure of response messages, usually expressed in an Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format. It allows the combination of multiple Web services into new applications known as mashups.
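As a small illustration of REST-style communication, the sketch below builds a request URL with query parameters and parses a JSON response body using only Python's standard library. The endpoint, parameters and response shown are hypothetical, invented for the example; a live call would use urllib.request to fetch the body over HTTP.

```python
import json
from urllib.parse import urlencode

# Hypothetical REST endpoint and parameters, purely for illustration.
BASE = "https://api.example.com/weather"
url = BASE + "?" + urlencode({"city": "Lagos", "units": "metric"})
print(url)

# In a live call, urllib.request.urlopen(url).read() would return a
# JSON body; here we parse a canned response of the same shape.
body = '{"city": "Lagos", "temp_c": 31, "conditions": "sunny"}'
data = json.loads(body)
print(data["temp_c"])
```

Note that the whole exchange is just an HTTP request and a structured response; no SOAP envelope or WSDL contract is involved, which is what makes REST APIs simple to combine into mashups.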
Fundamentally, Web services are all about having a service, publishing an API for use by other services on the network, and encapsulating implementation details. The following essential operations are expected in any service-oriented environment such as Web services:
• A Web service needs to be created, and its interfaces and invocation methods must be well defined.
• A Web service needs to be published to one or more repositories (intranet or Internet) for potential users to locate.
• A Web service needs to be located to be invoked by potential users.
• A Web service needs to be invoked to be of any benefit.
• A Web service may need to be unpublished when it is no longer available or needed.
Cloud Computing
Cloud computing refers to the use and access of multiple server-based computational resources via a digital network (WAN, Internet connection using the World Wide Web, and so on). Cloud users may access the server resources using a computer, netbook, pad computer, smart phone, PDA, or other devices. In cloud computing, applications are provided and managed by the cloud server and data are stored remotely in the cloud configuration. Users do not download and install applications on their own device or computer; all processing and storage is maintained by the cloud server.
These online services are usually offered by a cloud provider or by a private organisation. Before the advent of cloud computing, tasks such as word processing were not possible without the installation of application software on a user's computer. A user would need to purchase a license for each application from a software vendor and obtain the right to install the application on one computer system.
As computer technologies advanced and local area networks (LANs) brought more networking capabilities, the client-server model of computing was born, in which server computers with enhanced capabilities and large storage devices host application services and data for a large workgroup. In a client-server computing environment, a network-friendly client version of the application was required on client computers, and it utilised the client system's resources (memory and CPU for processing), even though the resultant application data files (such as word processing documents) were stored centrally on the data servers. In this model, many users on a network purchased multiple-user licenses of an application. Cloud computing differs from the classic client-server model discussed in module one of this course material by providing applications from a server that are executed and managed by a client's Web browser, with no installed client version of an application required.
Cloud computing provides computation, software, data access, and storage services that do not require end-user knowledge of the physical location and configuration of the system that delivers the services. One may compare this scenario with the concept drawn from the electricity grid, wherein end-users consume power without needing to understand the component devices or infrastructure required to provide the service. The reason behind centralisation is to give cloud service providers complete control over the versions of the browser-based applications provided to clients, which removes the need for version upgrades or license management on individual client computing devices.
Blogs
The term "blog" is a blend of "Web log." A blog is a type of Website or part of a Website. Many blogs provide commentary or news on a particular subject; others function as more personal online diaries. A typical blog combines text, images, and links to other blogs, Web pages, and other media related to its topic. The ability of readers to leave comments in an interactive format is an important part of many blogs. Most blogs are primarily textual, although some focus on art (art blogs), photographs (photoblogs), videos (video blogging), music (MP3 blogs), and audio (podcasting).
Microblogging is another type of blogging, featuring very short posts. Most blogs are interactive, allowing visitors to leave comments and even communicate with each other via widgets on the blogs; this interactivity distinguishes them from other, static websites. Entries are commonly displayed in reverse-chronological order. Many blogs are hosted at dedicated blog communities.
RSS
Really Simple Syndication or Rich Site Summary (RSS) is commonly used to create newsfeeds from blog postings and other Web sites. RSS feeds contain a summary of new items posted to the site. Web feeds benefit publishers by letting them syndicate content automatically. They benefit readers who want to subscribe to timely updates from favoured websites or to aggregate feeds from many sites into one place. RSS feeds can be read using software called an "RSS reader", "feed reader" or "aggregator", which can be web-based, desktop-based or mobile-device-based. Some browsers, such as Firefox, Safari and Internet Explorer 7, can display RSS feeds. A standardised XML file format allows the information to be published once and viewed by many different programs.
The user subscribes to a feed by entering the feed's URL into the reader or by clicking a feed icon in a Web browser, which initiates the subscription process. The RSS reader checks the user's subscribed feeds regularly for new work, downloads any updates that it finds, and provides a user interface to monitor and read the feeds. RSS allows users to avoid inspecting all of the websites they are interested in manually; instead they subscribe to Websites so that all new content is pushed onto their browsers when it becomes available. By providing up-to-date, linkable content for anyone to use, RSS enables website developers to draw more traffic. It also allows users to get news and information from many sources easily, and it reduces content developers' time. RSS simplifies importing information from portals, weblogs and news sites. Any piece of information can be syndicated via RSS, not just news.
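Because RSS is standardised XML, a feed can be read with an ordinary XML parser. The sketch below extracts item titles and links from a minimal hand-written RSS 2.0 document using Python's standard library; a real reader would download the feed from its URL and poll it regularly for new items.

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 document, written by hand for this example.
rss = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>First post</title><link>https://example.org/p1</link></item>
  <item><title>Second post</title><link>https://example.org/p2</link></item>
</channel></rss>"""

root = ET.fromstring(rss)
# Each <item> summarises one new posting: keep its title and link.
items = [(i.findtext("title"), i.findtext("link"))
         for i in root.iter("item")]
for title, link in items:
    print(title, "->", link)
```

An aggregator is essentially this loop run periodically over many subscribed feed URLs, presenting any items it has not shown before.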
Podcasts
Podcasts are typically audio files delivered by an RSS feed on the Web. They may also be made available by recording an MP3 file and providing a link on a Web page. They usually take the format of an audio blog, interview or radio show. These files can be saved to your computer or to an MP3 player (such as an iPod) for later listening.
Wikis
A wiki is a Web site that allows immediate updating by visitors, using a simple form on a Web page, at any time. Some wikis are designed to serve a small group of people, such as the members of an organisation. The most powerful and popular wiki is Wikipedia, an online encyclopaedia which can be updated by any registered user at any time. A wiki is a form of social software in action, where visitors sharing their collective knowledge create a resource freely used by all. Though there have been isolated cases of practical jokes and occasionally inaccurate information posted on Wikipedia, the information and resources provided are still good enough as a starting point when exploring a topic.
Microformats
A microformat is a standard format for representing aggregates of information so that they can be understood by computers, thereby enabling easier access and retrieval of information. It could also lead to new types of applications and services on the Web. Some people consider the Web as containing loose information, while others see logical aggregates: business cards, resumes, events, and so on.
The need to organise information on the Web cannot be overemphasised. Microformat standards encourage sites to organise their information in ways that increase interoperability and accessibility. For example, to create an event or an events calendar, one could use the hCalendar microformat. Other available microformats include adr for address information, hresume for resumes and xfolk for collections of bookmarks.
Resource Description Framework (RDF)
The Resource Description Framework (RDF), developed by the World Wide Web Consortium (W3C), is one way of making the Web more meaningful. It is based on XML and is used to describe content in a way that is understood by computers. RDF helps connect isolated databases across the Web with consistent semantics. The structure of any expression in RDF is a collection of triples. An RDF triple consists of two pieces of information (a subject and an object) and a linking fact (a predicate).
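The triple structure can be illustrated with a minimal sketch in Python, storing each triple as a (subject, predicate, object) tuple and querying with wildcards. The names and facts here are invented for the example; practical RDF work would use a dedicated RDF library and serialisations such as RDF/XML.

```python
# Each fact is a (subject, predicate, object) triple; the data is
# invented for this illustration.
triples = [
    ("Alice", "worksFor", "Acme"),
    ("Acme", "locatedIn", "Lagos"),
    ("Alice", "knows", "Bob"),
]

def match(s=None, p=None, o=None):
    """Return every triple matching the given parts; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(s="Alice"))        # everything asserted about Alice
print(match(p="locatedIn"))    # every location fact
```

Chaining such pattern queries is how consistent semantics let separate databases be joined: any store that exposes its facts as triples can answer the same wildcard questions.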
Advances in Internet technologies make it possible for items on the Web to be organised in such a way that meaning can easily be derived from them. Ontologies are ways of organising and describing related items, and are used to represent semantics. They serve as a means of cataloguing Internet content in a way that can be understood by computers. RDF and OWL (the Web Ontology Language) are designed for expressing ontologies.
Application Programming Interfaces (APIs)
Application Programming Interfaces (APIs) provide applications with access to external services and databases. For example, a traditional programming API, like Sun's Java API, allows programmers to use already-written methods and functions in their programs. In addition, Web services have APIs that permit their functionality and information to be shared or used across the Internet. Most major Web 2.0 companies (for example, eBay, Amazon, Google, Yahoo! and Flickr) provide APIs to encourage use of their services and data in the development of mashups, widgets or gadgets.
Mashups
A mashup is a means of combining content or functionality from existing Web services, Websites, RSS feeds or other sources to serve a new purpose. For example, a skilled developer could mash up Google Maps with a tourist site to create more exciting services or sites on the Internet. The use of APIs saves a great deal of time and money in the mashup process of combining two or more applications; it is possible to build a useful mashup in a day. Note, however, that a mashup may rely on one or more third-party services, so if an API provider experiences downtime, the mashup will be unavailable as well because of that dependence. One way out is to design mashups that detect unavailable sources and work around them.
Widgets and Gadgets
Widgets are commonly referred to as gadgets. They are mini applications designed to run either standalone or as add-on features in Web pages. Widgets can be used for the personalisation of a user's Internet experience. Some personalised services include the display of real-time weather conditions, viewing of maps, receiving event reminders, providing easy access to search engines, aggregating RSS feeds, and so on. The robustness of Web services, APIs and other related tools makes it easy to develop widgets. Several catalogues of widgets exist online, the most all-inclusive being Widgipedia, which provides an extensive collection of widgets and gadgets for a variety of platforms.
Web 2.0
The term "Web 2.0" is associated with Web applications that facilitate participatory information sharing, interoperability, user-centred design, and collaboration on the World Wide Web. A Web 2.0 site allows users to interact and collaborate with each other in a social media dialogue as creators (prosumers) of user-generated content in a virtual community, in contrast to websites where users (consumers) are limited to the passive viewing of content that was created for them. Examples of Web 2.0 include social networking sites, blogs, wikis, video sharing sites, hosted services, web applications, mashups and folksonomies.
Web 2.0 websites allow users to do more than just retrieve information. Building on what was already possible in Web 1.0, they provide the user with richer user interfaces, software and storage facilities, all through the browser. Users can provide the data held on a Web 2.0 site and exercise some control over that data. These sites may have an "architecture of participation" that encourages users to add value to the application as they use it. Web 2.0 offers all users the same freedom to contribute.
Web 2.0 Tools
The client-side (Web browser) technologies used in Web 2.0 development include Asynchronous JavaScript and XML (Ajax), Adobe Flash and the Adobe Flex framework, and JavaScript/Ajax frameworks such as the Dojo Toolkit, MooTools and jQuery. Ajax programming uses JavaScript to upload and download new data from the Web server without undergoing a full page reload. To allow users to continue to interact with the page, communications such as data requests going to the server are separated from data coming back to the page (asynchronously).
Otherwise, the user would have to routinely wait for the data to come back before being able to do anything else on that page, just as a user has to wait for a page to complete a reload. This also increases the overall performance of the site, as the sending of requests can complete more quickly, independent of the blocking and queueing required to send data back to the client. The data fetched by an Ajax request is typically formatted in XML or JSON (JavaScript Object Notation), the two widely used structured data formats. Since both of these formats are natively understood by JavaScript, a programmer can easily use them to transmit structured data in a Web application. When this data is received via Ajax, the JavaScript program then uses the Document Object Model (DOM) to dynamically update the Web page based on the new data, allowing for a rapid and interactive user experience. In short, using these techniques, Web designers can make their pages function like desktop applications. For example, Google Docs uses this technique to create a Web-based word processor.
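To see why both formats suit Ajax responses, the sketch below serialises the same small record as JSON and as XML using Python's standard library (the record and its field names are invented for the example). The JSON string could be consumed directly by JavaScript's JSON.parse, while the XML version would be walked with DOM calls.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical server-side record to be sent back to the page.
record = {"user": "ada", "unread": 3}

# JSON: natively understood by JavaScript.
as_json = json.dumps(record)
print(as_json)

# XML: the same data, which client code would traverse with the DOM.
root = ET.Element("response")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
as_xml = ET.tostring(root, encoding="unicode")
print(as_xml)
```

A server-side script would emit one of these strings as the HTTP response body, and the page's JavaScript would update only the affected elements rather than reloading.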
Adobe Flex is another technology often used in Web 2.0 applications. Compared to JavaScript libraries like jQuery, Flex makes it easier for programmers to populate large data grids, charts, and other heavy user interactions. Applications programmed in Flex are compiled and displayed as Flash within the browser. Flash is capable of many things that were not possible, before HTML5, in HTML, the language used to construct Web pages. Of the many capabilities of Flash, the one most commonly used in Web 2.0 is its ability to play audio and video files. This has allowed for the creation of Web 2.0 sites where video media is seamlessly integrated with standard HTML. In addition to Flash and Ajax, JavaScript/Ajax frameworks have recently become a very popular means of creating Web 2.0 sites. At their core, these frameworks use no technology different from JavaScript, Ajax, and the DOM.
What frameworks do is smooth over inconsistencies between Web browsers and extend the functionality available to developers. Many of them also come with customisable, prefabricated "widgets" that accomplish such common tasks as picking a date from a calendar, displaying a data chart, or making a tabbed panel. On the server side, Web 2.0 uses many of the same technologies as Web 1.0. Languages such as PHP, Ruby, Perl, Python, JSP and ASP are used by developers to dynamically output data using information from files and databases. What has begun to change in Web 2.0 is the way this data is formatted. In the early days of the Internet, there was little need for different websites to communicate with each other and share data. In the new "participatory web", however, sharing data between sites has become an essential capability. To share its data with other sites, a website must be able to generate output in machine-readable formats such as XML (Atom, RSS, etc.) and JSON. When a site's data is available in one of these formats, another website can use it to integrate a portion of that site's functionality into itself, linking the two together. This is one of the hallmarks of the philosophy behind Web 2.0.
XHTML
eXtensible Hypertext Markup Language (XHTML) is the newer version of HTML. XHTML combines the formatting strengths of HTML with the data-structuring and extensibility strengths of XML to deploy applications for device-independent Web access. XHTML uses the tags and attributes of HTML along with the syntax of XML. Using HTML to write applications that run on devices with fewer resources, such as a personal digital assistant (PDA) or mobile phone, can be problematic. This can be accomplished in XHTML, since it is a stricter, more descriptive language than HTML.
Summary & Conclusion
A Web search engine is designed to search for information on the World Wide Web, FTP servers, USENET newsgroups, and so on. The search results, which may consist of Web pages, images, information and other types of files, are generally presented as a list of results and are often called hits. Some search engines also mine data available in databases or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or with a mixture of algorithmic and human input.
Search engines use automated software programs to survey the Web and build their databases. Web documents are retrieved by these programs and analysed, and data collected from each Web page are then added to the search engine index. Each search engine uses a proprietary algorithm to create its indices such that, ideally, only meaningful results are returned for each query. The best URLs are then returned to the user as hits, ranked with the best results (as judged by the search engine's algorithm) at the top.
The Internet is playing a great role in the delivery of content to users all across the world, and much research goes on every day to make it more accessible, available, interactive, meaningful and responsive to users' needs.
The information in this discourse has been presented to keep you up to date with current Internet and Web programming developments.