Sunday, April 26, 2009

How You Can Use reCAPTCHA to Stop Comment Spam

Comment spam is more than a nuisance. It is also costly for the businesses that have to deal with it.

But today I found an effective solution that is not only free (that magic word), it is also easy to install. And 'easy' is, of course, the other magic word we all love.

First, I created a new WordPress blog at Harmedia.com - and it was as easy as falling off a log because I used Fantastico, which is built into the web host's cPanel.

Right now I am finishing things off with a few additional plugins. That includes reCAPTCHA, which is available free of charge. And, as I said, it is easy to install.

Step 1. You download the plugin from recaptcha.net and unzip the file.

Step 2. You upload the two .php files to your WordPress plugins folder.

Step 3. Then you go to your Plugins page and activate reCAPTCHA by clicking on it, and enter your so-called public and private keys, which are unique to your site. That's all. You're done.
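For the technically curious, here is roughly what those keys are for. When a visitor submits a comment, the plugin sends your private key, the visitor's IP and the visitor's answer to reCAPTCHA's verification server, and only lets the comment through if the answer checks out. The sketch below is in Python rather than the plugin's actual PHP, and the endpoint and field names are my recollection of the old reCAPTCHA v1 API, so treat them as assumptions rather than gospel:

    # Rough sketch of a reCAPTCHA (v1-era) server-side check.
    # The URL and field names are assumptions based on the old API docs.
    import urllib.parse
    import urllib.request

    def recaptcha_ok(private_key, remote_ip, challenge, answer):
        data = urllib.parse.urlencode({
            "privatekey": private_key,  # the private key from your reCAPTCHA account
            "remoteip": remote_ip,      # the commenter's IP address
            "challenge": challenge,     # identifies which puzzle was shown
            "response": answer,         # what the visitor typed
        }).encode()
        with urllib.request.urlopen("http://api-verify.recaptcha.net/verify", data) as reply:
            return reply.read().decode().splitlines()[0] == "true"

The plugin handles all of this for you; the point is simply that the private key stays on the server, while the public key is the one that appears in your pages.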

WHY I LIKE reCAPTCHA

- It's free.

- It works.

- There's no messing around with copying code into your blog template.

- There are no file permissions to change. (Chmod isn't hard once you know what you're doing, but it befuddles many beginners before they learn it, I would say.)

- A feel-good bonus is the fact that you are helping the Internet Archive to improve the accuracy of optical character recognition software.

How so? The words you are shown come directly from old books that are currently being digitized. Words that the OCR software cannot read are fed into the captchas, and when multiple users like you and me type the same answer, that result is fed back to correct the OCR output.

So not only is your blog protected from spam bots, your commenters are also helping old books get digitized correctly.

Do you use captchas to protect your blog from comment spam yet? It's free and easy. What are you waiting for?

By the way, CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart", although, ironically, the recaptcha.net site leaves out the word "public" in its explanation of the term. So... that would make it CATCHA? No P.

Friday, January 5, 2007

Black Hat Methods

"Black hat" SEO are methods to try to improve rankings that are disapproved of by the search engines and/or involve deception. This can range from text that is "hidden", either as text colored similar to the background or in an invisible or left of visible div, or by redirecting users from a page that is built for search engines to one that is more human friendly. As a general rule, a method that sends a user to a page that was different from the page the search engined ranked is Black hat. One well known example is Cloaking, the practice of serving one version of a page to search engine spiders/bots and another version to human visitors.

Search engines can and do penalize sites they discover using black hat methods, either by reducing their rankings or eliminating their listings from their databases altogether. Such penalties can be applied either automatically by the search engines' algorithms, or by a manual review of a site.

One infamous example was the February 2006 Google removal of both BMW Germany and Ricoh Germany for use of deceptive practices. However, both companies quickly apologized, fixed the offending pages, and were restored to Google's list.

Blocking Robots

Webmasters can instruct spiders not to crawl certain files or directories through the standard robots.txt file in the root directory of the domain. Additionally, a page can be explicitly excluded from a search engine's database by using a robots meta tag.
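For illustration, a robots.txt that keeps compliant crawlers out of a shopping-cart directory and an internal-search directory might look like this (the paths are placeholders):

    User-agent: *
    Disallow: /cart/
    Disallow: /search/

The per-page alternative is a robots meta tag placed in the page's head section:

    <meta name="robots" content="noindex, nofollow">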

When a search engine visits a site, the robots.txt located in the root folder is the first file crawled. The robots.txt file is then parsed, and only pages not disallowed will be crawled. As a search engine crawler may keep a cached copy of this file, it may on occasion crawl pages a webmaster does not wish crawled.
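Crawler authors rarely parse the file by hand. Python's standard library, for example, ships a robots.txt parser, so a polite crawler's check against the example rules above looks roughly like this (example.com is a placeholder):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    # A well-behaved crawler asks before fetching each URL.
    print(rp.can_fetch("MyCrawler", "http://www.example.com/index.html"))     # True if allowed
    print(rp.can_fetch("MyCrawler", "http://www.example.com/cart/checkout"))  # False if disallowed

Note that nothing forces a crawler to obey these rules; robots.txt is a convention, which is one reason sensitive pages should be protected by more than a Disallow line.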

Pages typically prevented from being crawled include login specific pages such as shopping carts and user-specific content such as search results from internal searches.

Getting into Search Engines' Databases

Today's major search engines, by and large, do not require any extra effort to submit to, as they are capable of finding pages via links on other sites.

However, Google and Yahoo offer submission programs, such as Google Sitemaps, for which an XML type feed can be created and submitted. Generally, however, a simple link from a site already indexed will get the search engines to visit a new site and begin spidering its contents. It can take a few days or even weeks from the acquisition of a link from such a site for all the main search engine spiders to begin indexing a new site, and there is usually not much that can be done to speed up this process.
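For reference, the XML feed mentioned above is small and simple. A one-page sitemap following the sitemaps.org format looks roughly like this (the URL and date are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2007-01-05</lastmod>
      </url>
    </urlset>

The file is then submitted through the search engine's webmaster interface.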

Once the search engine finds a new site, it uses a crawler program to retrieve and index the pages on the site. Pages can generally only be found when they are linked to with visible hyperlinks, although some search engines are starting to read links created in Flash (Google, for example).
Search engine crawlers may look at a number of different factors when crawling a site, and many pages from a site may not be indexed by the search engines until they gain more PageRank, links or traffic. Distance of pages from the root directory of a site may also be a factor in whether or not pages get crawled, as well as other importance metrics. Cho et al. described standards for deciding which pages a crawler should visit and send to be included in a search engine's index.

A few search engines, such as Yahoo!, operate paid submission services that guarantee crawling for either a set fee or cost per click (CPC). Such programs usually guarantee inclusion in the database, but do not guarantee specific rankings within the search results.

The Relationship between SEO and the Search Engines

The first mentions of Search Engine Optimization do not appear on Usenet until 1997, a few years after the launch of the first Internet search engines. The operators of search engines recognized quickly that some people from the webmaster community were making efforts to rank well in their search engines, and even manipulating the page rankings in search results. In some early search engines, such as Infoseek, ranking first was as easy as grabbing the source code of the top-ranked page, placing it on your website, and submitting a URL to instantly index and rank that page.

Due to the high value and targeting of search results, there is potential for an adversarial relationship between search engines and SEOs. In 2005, an annual conference named AirWeb was created to discuss bridging the gap and minimizing the sometimes damaging effects of aggressive web content providers.

Some more aggressive site owners and SEOs generate automated sites or employ techniques that eventually get domains banned from the search engines. Many search engine optimization companies, which sell services, employ long-term, low-risk strategies, and most SEO firms that do employ high-risk strategies do so on their own affiliate, lead-generation, or content sites, instead of risking client websites.

Some SEO companies employ aggressive techniques that get their client websites banned from the search results. The Wall Street Journal profiled a company that allegedly used high-risk techniques and failed to disclose those risks to its clients. Wired reported the same company sued a blogger for mentioning that they were banned. Google's Matt Cutts later confirmed that Google did in fact ban Traffic Power and some of its clients.

Some search engines have also reached out to the SEO industry, and are frequent sponsors and guests at SEO conferences and seminars. In fact, with the advent of paid inclusion, some search engines now have a vested interest in the health of the optimization community. All of the main search engines provide information/guidelines to help with site optimization: Google's, Yahoo!'s, MSN's and Ask.com's. Google has a Sitemaps program to help webmasters learn if Google is having any problems indexing their website and also provides data on Google traffic to the website. Yahoo! has Site Explorer that provides a way to submit your URLs for free (like MSN/Google), determine how many pages are in the Yahoo! index and drill down on inlinks to deep pages. Yahoo! has an Ambassador Program and Google has a program for qualifying Google Advertising Professionals.

Development of more Sophisticated Ranking Algorithms

Google was started by two PhD students at Stanford University, Sergey Brin and Larry Page, and brought a new concept to evaluating web pages. This concept, called PageRank, has been important to the Google algorithm from the start. PageRank is an algorithm that weights a page's importance based upon the incoming links. PageRank estimates the likelihood that a given page will be reached by a web user who randomly surfed the web, and followed links from one page to another. In effect, this means that some links are more valuable than others, as a higher PageRank page is more likely to be reached by the random surfer.
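The random-surfer idea translates into a short calculation. The sketch below runs the textbook power-iteration version of PageRank, in Python, on a made-up three-page link graph with the commonly cited 0.85 damping factor; it illustrates the concept rather than Google's production implementation.

    # Toy PageRank by power iteration on a hypothetical three-page web.
    # links[p] lists the pages that p links out to.
    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    damping = 0.85
    n = len(links)
    rank = {p: 1.0 / n for p in links}

    for _ in range(50):  # iterate until the scores effectively stop changing
        rank = {
            p: (1 - damping) / n
               + damping * sum(rank[q] / len(links[q]) for q in links if p in links[q])
            for p in links
        }

    print(rank)  # pages with more (and better-ranked) inbound links end up with higher scores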

The PageRank algorithm proved very effective, and Google began to be perceived as serving the most relevant search results. On the back of strong word of mouth from programmers, Google quickly became the most popular and successful search engine. Because PageRank measured an off-site factor, Google felt it would be more difficult to manipulate than on-page factors.

Although PageRank was more difficult to game, webmasters had already developed link-building tools and schemes to influence the Inktomi search engine, and these methods proved similarly applicable to gaining PageRank. Many sites focused on exchanging, buying, and selling links, often on a massive scale. This spawned an online industry, one that survives to this day, focused on selling links designed to improve PageRank and link popularity rather than to drive human visitors, with links from higher-PageRank pages selling for the most money.

A proxy for the PageRank metric is still displayed in the Google Toolbar, though the displayed value is rounded to be an integer, and the toolbar is believed to be updated less frequently and independently of the value used internally by Google. In 2002 a Google spokesperson stated that PageRank is only one of more than 100 algorithms used in ranking pages, and that while the toolbar PageRank is interesting for users and webmasters, "the value to search engine optimization professionals is limited" because the value is only an approximation. Many experienced SEOs recommend ignoring the displayed PageRank.

Google — and other search engines — have, over the years, developed a wider range of off-site factors they use in their algorithms. The Internet was reaching a vast population of non-technical users who were often unable to use advanced querying techniques to reach the information they were seeking and the sheer volume and complexity of the indexed data was vastly different from that of the early days. Combined with increases in processing power, search engines have begun to develop predictive, semantic, linguistic and heuristic algorithms. Around the same time as the work that led to Google, IBM had begun work on the Clever Project, and Jon Kleinberg was developing the HITS algorithm.
A search engine may use hundreds of factors in ranking the listings on its SERPs; the factors themselves and the weight each carries can change continually, and algorithms can differ widely, so a web page that ranks #1 in one search engine could rank #200 in another, or even in the same search engine a few days later.

Google, Yahoo, Microsoft and Ask.com do not disclose the algorithms they use to rank pages. Some SEOs have carried out controlled experiments to gauge the effects of different approaches to search optimization. Based on these experiments, often shared through online forums and blogs, professional SEOs attempt to form a consensus on what methods work best, although consensus is rarely, if ever, actually reached.

SEOs widely agree that the signals that influence a page's rankings include:

1- Keywords in the title tag.

2- Keywords in links pointing to the page.

3- Keywords appearing in visible text.

4- Link popularity (PageRank for Google) of the page.


There are many other signals that may affect a page's ranking, indicated in a number of patents held by various search engines, such as historical data.
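Purely as an illustration of how signals like the four listed above could be combined, here is a toy scoring function. The weights and the whole formula are invented for the example and do not reflect any real engine's algorithm.

    # Toy, invented scorer combining the four signals listed above.
    # Weights are arbitrary and for illustration only.
    def toy_score(query, title, anchor_texts, body_text, link_popularity):
        q = query.lower()
        return (3.0 * (q in title.lower())                         # keywords in the title tag
                + 2.0 * sum(q in a.lower() for a in anchor_texts)  # keywords in inbound link text
                + 1.0 * body_text.lower().count(q)                 # keywords in visible text
                + 5.0 * link_popularity)                           # PageRank-style link popularity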

Search engine optimization (SEO)

Search engine optimization (SEO) as a subset of search engine marketing seeks to improve the number and quality of visitors to a web site from "natural" ("organic" or "algorithmic") search results. The quality of visitor traffic can be measured by how often a visitor using a specific keyword leads to a desired conversion action, such as making a purchase or requesting further information. In effect, SEO is marketing by appealing first to machine algorithms to increase search engine relevance and secondly to human visitors. The term SEO can also refer to "search engine optimizers", an industry of consultants who carry out optimization projects on behalf of clients.

Search engine optimization is available as a stand-alone service or as a part of a larger marketing campaign. Because SEO often requires making changes to the source code of a site, it is often most effective when incorporated into the initial development and design of a site, leading to the use of the term "Search Engine Friendly" to describe designs, menus, Content management systems and shopping carts that can be optimized easily and effectively.

A range of strategies and techniques are employed in SEO, including changes to a site's code (referred to as "on page factors") and getting links from other sites (referred to as "off page factors"). These techniques include two broad categories: techniques that search engines recommend as part of good design, and those techniques that search engines do not approve of and attempt to minimize the effect of, referred to as spamdexing. Some industry commentators classify these methods, and the practitioners who utilize them, as either "white hat SEO", or "black hat SEO". Other SEOs reject the black and white hat dichotomy as an over-simplification.

Paid Advertising

For a fee, many larger companies choose to advertise their sites on other popular sites. This e-marketing usually takes the form of:

1- Banner advertising: Banner impressions are sold by the thousand, with the price per thousand impressions referred to as the CPM (cost per mille). As of 2004, prices range from $1 CPM for a run-of-network buy to about $50 CPM or more for specialized targeted runs. Most popular web sites sell banner advertising space, with the notable exception of Google. (A quick worked cost comparison follows this list.)

2- Pay per click: Advertisers "buy" keywords or key phrases by bidding on them against other advertisers. The so-called pay-per-click engines sell their premium spots in the search results to the highest-paying advertisers. Google sells paid advertising through its AdWords and AdSense systems, which place sponsored links on search pages. Overture, now owned by Yahoo!, is one of the most popular pay-per-click advertising venues.
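As promised above, a quick worked comparison of the two pricing models. All the numbers here (impressions, CPM, click-through rate and bid) are made up for the example:

    # Hypothetical campaign numbers: CPM banner buy vs. pay-per-click buy.
    impressions = 500_000
    cpm = 5.00                                 # dollars per 1,000 impressions
    banner_cost = impressions / 1000 * cpm     # 500 * $5 = $2,500

    click_through_rate = 0.002                 # 0.2% of impressions become clicks
    clicks = impressions * click_through_rate  # 1,000 clicks
    bid_per_click = 0.40                       # dollars per click
    ppc_cost = clicks * bid_per_click          # 1,000 * $0.40 = $400

    print(banner_cost, ppc_cost)               # 2500.0 400.0

Which model works out cheaper depends entirely on those assumed rates, which is why advertisers tend to track conversions rather than raw exposure.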

As users got used to seeing banners, some companies chose to make the advertisements more intrusive – pop-up ads became particularly popular to attract attention. However, most people consider pop-ups a nuisance and several software companies offer free pop-up blockers. Even Microsoft included a pop-up blocker in Service Pack 2 of Windows XP.

Increasing Web Traffic

Web traffic can be increased by placement of a site in search engines and purchase of advertising, including bulk e-mail, pop-up ads, and in-page advertisements. Web traffic can also be increased by purchasing non-internet based advertising.
If a web page is not listed in the first pages of a search, the odds of someone finding it diminish greatly (especially if there is other competition on the first page). Very few people go past the first page, and the percentage that goes to subsequent pages is substantially lower. Consequently, getting proper placement on search engines is as important as the web site itself.
There are a number of other things you can do to increase your web traffic, including but not limited to building link popularity, webrings, offering free e-books or articles and classified advertisements.
Of the above mentioned items, perhaps the easiest one is building link popularity. This can be accomplished by writing e-mails to sites similar to yours and asking if they would link to your site. A second way of increasing your web traffic is writing for e-zines or free article sites. Many sites will accept your written material; the catch is that you are giving it away for free. The benefit, however, is that you get to include a link to your site in the article, meaning every time someone clicks on that link, it is free traffic for your site. Pixel ads were also quite popular for bringing traffic, though not a very targeted audience.

Measuring Web Traffic

Web traffic is measured to see the popularity of web sites and individual pages or sections within a site.
Web traffic can be analysed by viewing the traffic statistics found in the web server log file, an automatically-generated list of all the pages served. A hit is generated when any file is served. The page itself is considered a file, but images are also files, thus a page with 5 images could generate 6 hits (the 5 images and the page itself). A page view is generated when a visitor requests any page within the web site – a visitor will always generate at least one page view (the main page) but could generate many more.
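As a rough illustration of the hit versus page view distinction, the sketch below counts both from a few server-log lines in the common log format. The sample lines are invented, and the rule "a page view is a request ending in .html" is a simplification made for the example.

    # Count hits vs. page views from (simplified) common-log-format lines.
    # Every served file is a hit; only requests for pages count as page views.
    sample_log = [
        '1.2.3.4 - - [05/Jan/2007:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
        '1.2.3.4 - - [05/Jan/2007:10:00:01 +0000] "GET /logo.gif HTTP/1.1" 200 980',
        '1.2.3.4 - - [05/Jan/2007:10:00:01 +0000] "GET /photo1.jpg HTTP/1.1" 200 20480',
    ]

    hits = len(sample_log)
    page_views = sum(1 for line in sample_log
                     if line.split('"')[1].split()[1].endswith(".html"))

    print(hits, page_views)  # 3 hits, 1 page view: one page plus two images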
Tracking applications external to the web site can record traffic by inserting a small piece of HTML code in every page of the web site.
Web traffic is also sometimes measured by packet sniffing and thus gaining random samples of traffic data from which to extrapolate information about web traffic as a whole across total Internet usage.
The following types of information are often collated when monitoring web traffic:
1- The number of visitors
2- The average number of page views per visitor – a high number would indicate that the average visitors go deep inside the site, possibly because they like it or find it useful. Conversely, it could indicate an inability to find desired information easily.
3- Average visit duration – the total length of a user's visit
4- Average page duration – how long a page is viewed for
5- Domain classes – the top level domain of the ISP a visitor uses, useful for finding out geographical statistics
6- Busy times – the most popular viewing time of the site would show when would be the best time to do promotional campaigns and when would be the most ideal to perform maintenance
7- Most requested pages – the most popular pages
8- Most requested entry pages – the entry page is the first page viewed by a visitor and shows which are the pages most attracting visitors
9- Most requested exit pages – the most requested exit pages could help find bad pages, broken links or the exit pages may have a popular external link
10- Top paths – a path is the sequence of pages viewed by visitors from entry to exit, with the top paths identifying the way most customers go through the site
11- Referrers – the host can track the (apparent) source of the links and determine which sites are generating the most traffic for a particular page.
Web sites like Alexa Internet produce traffic rankings and statistics based on those people who access the sites while using the Alexa toolbar. The difficulty with this is that it's not looking at the complete traffic picture for a site. Large sites usually hire the services of companies like Nielsen Netratings, but their reports are available only by subscription.

Web Traffic

Web traffic is the amount of data sent and received by visitors to a web site. It is a large portion of Internet traffic. This is determined by the number of visitors and the number of pages they visit. Sites monitor the incoming and outgoing traffic to see which parts or pages of their site are popular and if there are any apparent trends, such as one specific page being viewed mostly by people in a particular country. There are many ways to monitor this traffic and the gathered data is used to help structure sites, highlight security problems or indicate a potential lack of bandwidth – not all web traffic is welcome.
Some companies offer advertising schemes that, in return for increased web traffic (visitors), pay for screen space on the site. Sites also often aim to increase their web traffic through inclusion on search engines and through SEO.

Internet Traffic

Internet traffic is the flow of data around the Internet. It includes web traffic, which is the amount of that data that is related to the World Wide Web, along with the traffic from other major uses of the Internet, such as electronic mail and peer-to-peer networks.

Internet Protocol

In this context, there are three layers of protocols:
At the lowest level is IP (Internet Protocol), which defines the datagrams or packets that carry blocks of data from one node to another. The vast majority of today's Internet uses version four of the IP protocol (i.e. IPv4), and although IPv6 is standardised, it exists only as "islands" of connectivity, and there are many ISPs who don't have any IPv6 connectivity at all.

Next come TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) - the protocols by which one host sends data to another. The former makes a virtual 'connection', which gives some level of guarantee of reliability. The latter is a best-effort, connectionless transport, in which data packets that are lost in transit will not be re-sent.

On top comes the application protocol. This defines the specific messages and data formats sent and understood by the applications running at each end of the communication.

Unlike older communications systems, the Internet protocol suite was designed to be independent of the underlying physical medium. Any communications network, wired or wireless, that can carry two-way digital data can carry Internet traffic. Thus, Internet packets flow through wired networks like copper wire, coaxial cable, and fibre optic, and through wireless networks like Wi-Fi. Together, all these networks, sharing the same protocols, form the Internet.
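Going back to the transport layer for a moment, the difference between the two transports shows up even in a few lines of code. The Python sketch below (with a placeholder host and ports) opens a TCP connection, which involves a handshake and retransmission of lost data, and then fires off a single UDP datagram, which is simply sent with no connection and no guarantee of arrival.

    import socket

    HOST = "example.com"  # placeholder host for illustration

    # TCP: connection-oriented. connect() performs the handshake, and the
    # protocol retransmits anything lost in transit.
    tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    tcp.connect((HOST, 80))
    tcp.sendall(b"hello over a reliable byte stream")
    tcp.close()

    # UDP: connectionless, best-effort. The datagram is sent once and may never arrive.
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp.sendto(b"hello in a single datagram", (HOST, 5353))
    udp.close()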
The Internet protocols originate from discussions within the Internet Engineering Task Force (IETF) and its working groups, which are open to public participation and review. These committees produce documents that are known as Request for Comments documents (RFCs). Some RFCs are raised to the status of Internet Standard by the IETF process.
Some of the most-used application protocols in the Internet protocol suite are DNS, POP3, IMAP, SMTP, HTTP, HTTPS and FTP, and there are many other important ones besides.
All services on the Internet make use of defined application protocols. Of these, e-mail and the World Wide Web are among the best known, and other services are built upon these, such as mailing lists and blogs. There are many others that are necessary 'behind the scenes' and yet others that serve specialised requirements.
Some application protocols were not created out of the IETF process, but initially as part of proprietary commercial or private experimental systems. They became much more widely used and have now become de facto or actual standards in their own right.

Creation of the Internet

The USSR's launch of Sputnik spurred the United States to create the Advanced Research Projects Agency (ARPA, later known as the Defense Advanced Research Projects Agency, or DARPA) in February 1958 to regain a technological lead. ARPA created the Information Processing Technology Office (IPTO) to further the research of the Semi Automatic Ground Environment (SAGE) program, which had networked country-wide radar systems together for the first time. J. C. R. Licklider was selected to head the IPTO, and saw universal networking as a potential unifying human revolution.
In 1950, Licklider moved from the Psycho-Acoustic Laboratory at Harvard University to MIT where he served on a committee that established MIT Lincoln Laboratory. He worked on the SAGE project. In 1957 he became a Vice President at BBN, where he bought the first production PDP-1 computer and conducted the first public demonstration of time-sharing.
Licklider recruited Lawrence Roberts to head a project to implement a network, and Roberts based the technology on the work of Paul Baran who had written an exhaustive study for the U.S. Air Force that recommended packet switching (as opposed to Circuit switching) to make a network highly robust and survivable. After much work, the first node went live at UCLA on October 29, 1969 on what would be called the ARPANET, one of the "eve" networks of today's Internet. Following on from this, the British Post Office, Western Union International and Tymnet collaborated to create the first international packet switched network, referred to as the International Packet Switched Service (IPSS), in 1978. This network grew from Europe and the US to cover Canada, Hong Kong and Australia by 1981.
The first TCP/IP wide area network was operational by 1 January 1983, when the United States' National Science Foundation (NSF) constructed a university network backbone that would later become the NSFNet. (This date is held by some to be technically that of the birth of the Internet.) It was then followed by the opening of the network to commercial interests in 1985. Important, separate networks that offered gateways into, then later merged with, the NSFNet include Usenet, Bitnet and the various commercial and educational X.25 networks such as Compuserve and JANET. Telenet (later called Sprintnet), was a large privately-funded national computer network with free dialup access in cities throughout the U.S. that had been in operation since the 1970s. This network eventually merged with the others in the 1990s as the TCP/IP protocol became increasingly popular. The ability of TCP/IP to work over these pre-existing communication networks, especially the international X.25 IPSS network, allowed for a great ease of growth. Use of the term "Internet" to describe a single global TCP/IP network originated around this time.
The network gained a public face in the 1990s. On August 6, 1991, CERN, which straddles the border between France and Switzerland, publicized the new World Wide Web project, two years after Tim Berners-Lee had begun creating HTML, HTTP and the first few Web pages at CERN.
An early popular Web browser was ViolaWWW based upon HyperCard. It was eventually replaced in popularity by the Mosaic Web Browser. In 1993 the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign released version 1.0 of Mosaic and by late 1994 there was growing public interest in the previously academic/technical Internet. By 1996 the word "Internet" was coming into common daily usage, frequently misused to refer to the World Wide Web.
Meanwhile, over the course of the decade, the Internet successfully accommodated the majority of previously existing public computer networks (although some networks such as FidoNet have remained separate). This growth is often attributed to the lack of central administration, which allows organic growth of the network, as well as the non-proprietary open nature of the Internet protocols, which encourages vendor interoperability and prevents any one company from exerting too much control over the network.

Internet and WWW

The Internet and the World Wide Web are not synonymous: the Internet is a collection of interconnected computer networks, linked by copper wires, fiber-optic cables, wireless connections, etc.; the Web is a collection of interconnected documents and other resources, linked by hyperlinks and URLs. The World Wide Web is accessible via the Internet, as are many other services including e-mail, file sharing, and others described below.
The best way to define and distinguish between these terms is with reference to the Internet protocol suite. This collection of standards and protocols is organized into layers such that each layer provides the foundation and the services required by the layer above. In this conception, the term Internet refers to computers and networks that communicate using IP (Internet Protocol) and TCP (Transmission Control Protocol). Once this networking structure is established, then other protocols can run “on top.” These other protocols are sometimes called services or applications. Hypertext Transfer Protocol, or HTTP, is the application layer protocol that links and provides access to the files, documents and other resources of the World Wide Web.
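One way to see this layering in practice is to speak HTTP by hand over a plain TCP socket: TCP over IP carries the bytes, and the text written into the connection is the application-layer protocol. A minimal Python sketch, using example.com as a stand-in web server:

    import socket

    # Transport layer: open a TCP connection to the web server.
    sock = socket.create_connection(("example.com", 80))

    # Application layer: an HTTP/1.1 request is just structured text.
    sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")

    response = b""
    while chunk := sock.recv(4096):
        response += chunk
    sock.close()

    print(response.split(b"\r\n")[0].decode())  # status line, e.g. "HTTP/1.1 200 OK"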