New structure of Choralwiki websites - November 29, 2009

Post by **choralia** » 29 Nov 2009 11:20

To manage the ever-increasing amount of traffic at ChoralWiki, two types of website have been introduced today:

- the contributors' website, intended for users who want to upload new scores, edit pages, or add translations, i.e., everything that requires "writing" new information to the website. This website requires users to register and log in to view most pages;

- the users' websites, intended for users who only need to search for information at ChoralWiki or download scores, i.e., everything that can be done with "read-only" privileges. These websites can be viewed in their entirety without registering or logging in, so they are completely unrestricted for anonymous users.

When visiting http://www.cpdl.org, unregistered users (or registered users whose CPDL login has expired) are automatically directed to one of the users' websites. Registered users with an active (unexpired) CPDL login are automatically directed to the contributors' website instead.

Users visiting a users' website can go to the contributors' website at any time (for example, because they have decided to edit a page) by simply following the link shown at the top of each page.

Currently, there is one contributors' website and two users' websites. The users' websites are synchronized daily with the contributors' website, so that the newest additions are available on all websites. We hope this solves the traffic congestion problems experienced recently. The architecture is scalable, so additional users' websites can be added to accommodate further traffic growth in the future.

This new structure is still experimental, so any feedback about problems observed or suggestions for improvement is welcome.

Max

Post by **joachim** » 03 Dec 2009 07:17

Hi Max

First of all, thanks for / congrats on your work at CPDL. The new dual-site system does indeed seem to have alleviated the pressure on the website.
Still, I can't help but notice the huge number of people still logging in to the contributors' site without actually contributing scores. Unless I've misunderstood the strategy, I thought people merely consulting the database were being referred to the users' site. Could that mean the message at the top of the page is not quite effective enough yet?
Given how crucial it is to keep a lid on server requests, wouldn't it be time to revise the main page (more radically than what is currently being considered), sticking to a general welcome and a multilingual referral to either the contributor or the user site, and thus *postponing* score listings, search options, etc. to a second page?

Just my 5 cts.

Joachim

Post by **choralia** » 03 Dec 2009 17:46

Thank you for your consideration, Joachim.

Yesterday (December 2nd), the contributors' website (www) handled about 52% of all page requests, while the two users' websites handled 12% (www1) and 36% (www2). Traffic to the users' websites is intentionally distributed in a 1:3 ratio at the moment, as one of the two servers appears to be more powerful than the other. The ideal distribution from an IT viewpoint would be 43% each on www and www2, and 14% on www1. We are not very far from that.
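As a rough illustration of how such a weighted split works out, the sketch below turns relative weights into percentage shares and picks a mirror at random in proportion to those weights. The 1:3 weights come from the figures above; reading the "ideal" 43/43/14 distribution as a 3:3:1 weighting is only an assumption, and the function names are hypothetical, not the actual CPDL configuration.

```python
import random

# Relative weights for the two users' websites (1:3, as stated in the post).
MIRROR_WEIGHTS = {"www1": 1, "www2": 3}

# Assumed weighting behind the "ideal" split: www and www2 equal, www1 a third of each.
IDEAL_WEIGHTS = {"www": 3, "www2": 3, "www1": 1}


def expected_shares(weights):
    """Convert relative weights into percentage shares of the traffic."""
    total = sum(weights.values())
    return {name: 100.0 * weight / total for name, weight in weights.items()}


def pick_mirror(rng, weights=MIRROR_WEIGHTS):
    """Pick one host at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    print(expected_shares(MIRROR_WEIGHTS))
    print({name: round(share) for name, share in expected_shares(IDEAL_WEIGHTS).items()})
    print(pick_mirror(random.Random()))
```

With 48% of all requests going to the users' websites, a 1:3 split corresponds to the 12% and 36% reported for www1 and www2.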

You correctly understood the strategy, i.e., users who merely consult the database should go to the users' websites and browse them anonymously, while the contributors' website should be used to actually contribute. However, the above figures show that most non-contributing users prefer to stick to the "traditional" www URL, and the mandatory registration/login is not a deterrent for them: they prefer to register and log in rather than browse the users' websites anonymously.

I proposed introducing a "welcome" page that was supposed to steer traffic to the various websites more effectively, but the other CPDL admins/managers didn't like the idea. We also have to consider that most users "land" on ChoralWiki directly on a work or composer page from search engines, so in most cases the "welcome" page would be bypassed anyway.

Discussions are still in progress on these matters, and any feedback/suggestions are really welcome.

Max

Post by **EJG** » 26 May 2010 22:44

Looking at http://www2.cpdl.org/robots.txt, it seems that crawlers are not permitted to access the visitor sites, whereas http://www.cpdl.org/robots.txt contains directives that limit crawler traffic without stopping it altogether (thus reducing the server load). Given that the contributor site requires login for most pages, though, this presumably means that most pages will not be indexed by search engines. Would it be possible to use directives similar to those in http://www.cpdl.org/robots.txt on one of the visitor sites, in order to maintain the presence of composer/work pages in search engines (which is likely to be a source of human visitors to the site) without putting too much load on the servers?
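To make the difference between the two approaches concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The two robots.txt bodies are hypothetical examples written for illustration (including the `/private/` path and the 10-second delay); they are not copies of the actual CPDL files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical "keep all crawlers out" policy, as on a visitor site.
MIRROR_ROBOTS = """\
User-agent: *
Disallow: /
"""

# Hypothetical "throttle but allow" policy, as on the contributor site.
THROTTLED_ROBOTS = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


mirror_rules = load_rules(MIRROR_ROBOTS)
throttled_rules = load_rules(THROTTLED_ROBOTS)

if __name__ == "__main__":
    page = "http://www2.cpdl.org/wiki/index.php/Main_Page"
    print(mirror_rules.can_fetch("ExampleBot", page))     # blocked entirely
    print(throttled_rules.can_fetch("ExampleBot", page))  # allowed
    print(throttled_rules.crawl_delay("ExampleBot"))      # requested pause in seconds
```

Note that `Crawl-delay` is a non-standard directive that only some crawlers honour, so a throttled policy relies on the crawler's cooperation.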

Regards,
Cydonia.

Post by **choralia** » 06 Jun 2010 19:42

Hi,

All main crawlers have "backdoor" access to the contributor wiki (www.cpdl.org), so all pages at www are regularly indexed. This is confirmed by Google's report: today it shows that some 17,000 pages are indexed at www. This is the reason why we keep crawlers out of www1 and www2 through robots.txt: we want only www to be indexed, to avoid duplicates. Unregistered users who perform a search and request a page at www are then automatically redirected to the corresponding page on either www1 or www2.
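The routing described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the crawler allow list, the user-agent substring check, and the `mirror_index` parameter are all hypothetical, and the post does not say how the backdoor actually recognises a crawler.

```python
CONTRIBUTOR_HOST = "www.cpdl.org"
MIRROR_HOSTS = ("www1.cpdl.org", "www2.cpdl.org")

# Hypothetical allow list; the real list of crawlers with backdoor access is not given.
ALLOWED_CRAWLER_TOKENS = ("googlebot",)


def route(path, has_active_login, user_agent, mirror_index=0):
    """Return the URL that should serve a request made to www for `path`."""
    agent = user_agent.lower()
    is_allowed_crawler = any(token in agent for token in ALLOWED_CRAWLER_TOKENS)
    if has_active_login or is_allowed_crawler:
        # Contributors and backdoor crawlers stay on the contributor wiki.
        host = CONTRIBUTOR_HOST
    else:
        # Anonymous visitors are sent to the same page on one of the mirrors.
        host = MIRROR_HOSTS[mirror_index % len(MIRROR_HOSTS)]
    return f"http://{host}{path}"


if __name__ == "__main__":
    page = "/wiki/index.php/Main_Page"
    print(route(page, has_active_login=False, user_agent="Mozilla/5.0", mirror_index=1))
```

Keeping the path unchanged is what lets a search result that points at www lead an anonymous visitor to the corresponding page on a mirror.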

For your information, in May www handled about 337,000 page impressions for "human" visitors, and about 1,121,000 for crawlers. So about 77% of the traffic at www is to keep search engines updated, while only 23% is for actual contributors. In the same period, www1 handled about 1,249,000 page impressions for human visitors (93%), and only about 97,000 (7%) for crawlers (apparently some crawlers intentionally ignore the robots.txt request not to index this website). Similarly, www2 handled about 7,858,000 page impressions for human visitors (94%), and about 495,000 (6%) for crawlers. Total traffic (including crawlers) is 13% on www, 12% on www1, and 75% on www2.
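For anyone who wants to check the percentages, the short calculation below recomputes them from the page-impression figures quoted above. Only the numbers come from the post; the variable and function names are just for illustration.

```python
# Page impressions for May, as quoted in the post.
IMPRESSIONS = {
    "www": {"human": 337_000, "crawler": 1_121_000},
    "www1": {"human": 1_249_000, "crawler": 97_000},
    "www2": {"human": 7_858_000, "crawler": 495_000},
}


def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


def per_site_split(impressions):
    """Human/crawler percentage split for each site."""
    return {
        site: {kind: percent(count, sum(counts.values())) for kind, count in counts.items()}
        for site, counts in impressions.items()
    }


def share_of_total(impressions):
    """Each site's percentage of all traffic, crawlers included."""
    totals = {site: sum(counts.values()) for site, counts in impressions.items()}
    grand_total = sum(totals.values())
    return {site: percent(total, grand_total) for site, total in totals.items()}


if __name__ == "__main__":
    print(per_site_split(IMPRESSIONS))
    print(share_of_total(IMPRESSIONS))
```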

I take this opportunity to inform users that a new, more powerful server (similar to www2) has just been set up to replace the current www1 server, and thus increase our traffic capacity in preparation for the traditional traffic peak expected from October to December. The handover from the old server to the new server will occur next week, and, if everything goes as expected, it will be completely transparent to users.

Max

Post by **AlexMyltsev** » 04 Jul 2013 10:22

Hello,

it seems that, as of 2013, only Google has adequate access to the content of CPDL. It is able to find pages on CPDL and show search snippets (sample query).

Bing finds pages on CPDL, but does not show snippets for them because "it is not allowed by the web site" (sample query).

Non-US search engines don't seem to index CPDL at all, probably because of the redirections from http://www.cpdl.org to www{0,1,2,3}.cpdl.org and the restrictive robots.txt on the mirror sites (sample queries: Yandex, Naver, Seznam).

Could you please explain the procedure to get "backdoor" crawler access, or give an update on how crawlers are supposed to work with CPDL?

Post by **choralia** » 08 Jul 2013 18:10

Due to the very large number of pages on CPDL, crawlers may use a lot of server resources and even cause outages for normal users. At present, even with a very limited number of backdoors for the most popular crawlers such as Google, their traffic represents some 75% of the total traffic on the contributor server (www). Opening further backdoors may increase the traffic to a critical level, so this should be justified by the number of user visits that originate from the associated search engines.

These forums are open to all crawlers, and most user visits from search engines come from Google (data for June 2013):

[Attachment: crawlers.gif, table of user visits by search engine, June 2013]

If the main CPDL website were also open to all crawlers, we would probably still receive some 95% of the visits from Google anyway.

As you correctly remarked, Bing was excluded from the list of allowed crawlers, while msnbot, its predecessor, was still present. This was a mistake. I've added Bing now. Thank you for the heads-up. However, the additional visits originating from Bing will probably amount to less than 2%, as shown in the table above.

AlexMyltsev wrote: Could you please explain the procedure to get "backdoor" crawler access, or give an update on how crawlers are supposed to work with CPDL?

A crawler can register a user account and access CPDL through it. Alternatively, a backdoor can be opened. However, this should be justified by the number of visits from the associated search engine; otherwise we just get more crawler traffic and no additional user traffic.

Max

Post by **choralia** » 09 Jul 2013 19:01

Addendum: a nice European search engine crawler named Acoon attacked www1 today, and the hosting provider has suspended the site for several hours because of the heavy resource usage it caused. This happened in spite of the robots.txt file that asks crawlers not to index any of our visitor websites. We have already diverted all visitor traffic to the other mirrors, but this shows that certain crawlers make life quite difficult for websites like ours.

Max

Post by **vaarky** » 12 Jul 2013 03:43

Did the Acoon crawler use a registered account, out of curiosity?

Post by **choralia** » 12 Jul 2013 07:34

vaarky wrote: Did the Acoon crawler use a registered account, out of curiosity?

No, it did not. The problem affected the www1 mirror, which is open to anonymous visitors. If a registered account had been used, the crawler would have been automatically redirected to the www contributor website instead.

Max

Post by **AlexMyltsev** » 16 Jul 2013 15:35

choralia wrote: Opening further backdoors may increase the traffic to a critical level, so this should be justified by the number of user visits that originate from the associated search engines.

Could you please open a backdoor for Yandex? The robot will obey your Crawl-delay (http://help.yandex.com/search/?id=1112639), so you can control the additional server load that you incur.
The User-Agent is "Mozilla/5.0 (compatible; Yandex...)" as described on http://yandex.com/bots.
The authenticity of the robot can be checked with a reverse DNS lookup: http://help.yandex.com/search/?id=1112029.

Post by **choralia** » 22 Jul 2013 18:27

A backdoor for the Yandex crawler has been in place since July 17. Authentication is performed using both forward and reverse DNS lookups. So far, logs show that the crawler is able to enter through the backdoor. Its activity is quite limited, and much lower than the maximum allowed by the directives in robots.txt.
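For readers curious about what a forward and reverse DNS check involves, here is a minimal sketch using Python's standard `socket` module. It is not the code running on CPDL: the function name is hypothetical, and the list of accepted host suffixes is an assumption based on Yandex's published guidance. The lookup functions are passed in as parameters so that the check can be exercised without network access.

```python
import socket

# Assumed domain suffixes for genuine Yandex crawler hosts (not taken from this thread).
YANDEX_SUFFIXES = (".yandex.ru", ".yandex.net", ".yandex.com")


def is_verified_crawler(
    ip,
    suffixes=YANDEX_SUFFIXES,
    reverse_lookup=socket.gethostbyaddr,
    forward_lookup=socket.gethostbyname_ex,
):
    """Return True only if `ip` passes both the reverse and the forward DNS check."""
    try:
        # Step 1: reverse lookup, IP address -> host name.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False
    # Step 2: the host name must belong to one of the expected domains.
    if not hostname.lower().endswith(suffixes):
        return False
    try:
        # Step 3: forward lookup, host name -> list of IP addresses.
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # Step 4: the original IP must be among them, otherwise the PTR record is spoofed.
    return ip in addresses
```

The forward step matters because whoever controls an IP range also controls its reverse DNS records, so a reverse lookup alone can be faked; the User-Agent string can be faked even more easily.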

Please let us know if there are any issues with this backdoor as seen from the Yandex side.

Max
