YAPC::Europe: Thursday

Rija Ménagé

Sep 1, 2007 • 5 min read

These are just my notes, with obvious spelling errors fixed and URLs added for the interesting bits. They may contain errors. I'll post proper, scoped articles later.

Unicode (by Juerd Waalboer)

characters are not bytes

in an 8-bit encoding, one char maps to one byte
that means you can have at most 256 different values
enough for Roman and Russian
enough for Roman and Greek
not enough for Roman, Russian and Greek

multi-byte encodings

more bytes => more characters
fixed width, variable width
Unicode encodings are all multi-byte
UTF-8 is very popular on the Internet
UTF-16 is the internal encoding in MS Windows
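
To make "more bytes => more characters" concrete, here is a minimal sketch using the core Encode module. The euro sign is my own example, not one from the talk.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: U+20AC EURO SIGN (chosen here for illustration only).
my $char = "\x{20AC}";

# The same character in two Unicode encodings.
my $utf8    = encode("UTF-8",    $char);   # variable width
my $utf16be = encode("UTF-16BE", $char);   # 2 or 4 bytes per character

printf "characters:     %d\n", length $char;
printf "UTF-8 bytes:    %d\n", length $utf8;
printf "UTF-16BE bytes: %d\n", length $utf16be;
```

length() counts characters on the text string and bytes on the encoded strings, which is the whole point of the "characters are not bytes" slide.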

"Character Set"
a character set is character <-> number
Unicode is a charset
an encoding is number <-> bytes
UTF-8 is an encoding
MIME calls them both "charset"
Perl calls them both "encoding"
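
A small sketch of the two mappings, assuming the core Encode module: ord/chr handle character <-> number, encode handles number <-> bytes. The sample character is my own choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{E9}";                       # LATIN SMALL LETTER E WITH ACUTE

# Character set: character <-> number (code point).
my $code_point = ord $char;
my $same_char  = chr $code_point;

# Encoding: number <-> bytes. Same code point, different byte sequences.
my $as_utf8   = unpack "H*", encode("UTF-8",      $char);
my $as_latin1 = unpack "H*", encode("ISO-8859-1", $char);

printf "code point: U+%04X\n", $code_point;
print  "UTF-8:      $as_utf8\n";
print  "ISO-8859-1: $as_latin1\n";
```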

2 kinds of strings:
perl has one string type
the universe has several
"text string" and "binary string"
a.k.a. "character string" and "byte string"
the computer doesn't know the difference
you should know

Unicode perl

text strings are Unicode strings, not UTF-8
ISO-8859-1 maps to 0..255, useful!
perl keeps strings as ISO-8859-1 as long as possible
if that doesn't work, it upgrades to UTF-8 internally
if you mix the two kinds, UTF-8 wins.
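
The internal upgrade can be observed with utf8::is_utf8, which reports the internal storage format. This is only a sketch for peeking under the hood; normal code should not need to care about the flag.

```perl
use strict;
use warnings;

# All characters fit in 0..255, so perl can keep the compact internal format.
my $latin      = "caf\x{E9}";
my $latin_flag = utf8::is_utf8($latin) ? 1 : 0;

# Appending a character above 255 forces the internal upgrade to UTF-8.
my $mixed      = $latin . "\x{263A}";      # WHITE SMILING FACE
my $mixed_flag = utf8::is_utf8($mixed) ? 1 : 0;

print "before: internal UTF-8 = $latin_flag, length = ", length($latin), "\n";
print "after:  internal UTF-8 = $mixed_flag, length = ", length($mixed), "\n";
```

The length stays a character count in both cases; only the storage changes.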

Prime rule:
Do not mix byte strings with text strings
except if you explicitly convert between them

decoding: bytes -> characters (binary to text)
encoding: characters -> bytes (text to binary)

first slide:
All communication with the "outside world" is in bytes
something has to decode your binary input to text
something has to encode your text output to binary

read input
decode input
process data
encode output
write output

Neat trick:

Perl lets you use code points (character numbers) that do not yet officially exist
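
A tiny sketch of the trick. U+0378 is my own pick of a code point that I assume is still unassigned; any unassigned number would do.

```perl
use strict;
use warnings;

# Assumption: U+0378 has no character assigned to it.
my $unassigned = chr 0x0378;

# perl still treats it as one ordinary character with that number.
my $round_trip = sprintf "U+%04X", ord $unassigned;
print "length: ", length($unassigned), ", code point: $round_trip\n";
```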

In practice part 1:

use Encode;

my $text = decode("UTF-8", $binary_input);
my $output = encode("UTF-8", $text);

you have to encode, otherwise the output will be characters, not binary
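
Putting the snippet above into the decode / process / encode flow, as a minimal sketch. shout_bytes and the Greek sample input are hypothetical, made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical processing step: upper-case the text.
sub shout_bytes {
    my ($binary_input) = @_;
    my $text      = decode("UTF-8", $binary_input);   # bytes -> characters
    my $processed = uc $text;                         # work on characters only
    return encode("UTF-8", $processed);               # characters -> bytes
}

my $input  = "\xCE\xB1\xCE\xB2";   # UTF-8 bytes for Greek small alpha, beta
my $output = shout_bytes($input);
print unpack("H*", $output), "\n"; # hex dump of the encoded result
```

The uc call only does the right thing because it sees decoded characters; run on the raw bytes it would leave them untouched.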

part 2:
let perl do the hard work!

binmode STDIN, ":encoding(ISO-8859-1)";
binmode STDOUT, ":encoding(UTF-8)"; # don't forget the hyphen

print while <>;
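
The same layers can be tried without touching STDIN/STDOUT by opening in-memory filehandles. This is a sketch for experimenting, not what the talk showed; the sample bytes are mine.

```perl
use strict;
use warnings;

my $latin1_bytes = "caf\xE9\n";            # "café" as ISO-8859-1 bytes
my $utf8_bytes   = "";

# The layers decode on read and encode on write, so the loop only sees text.
open my $in,  "<:encoding(ISO-8859-1)", \$latin1_bytes or die "open in: $!";
open my $out, ">:encoding(UTF-8)",      \$utf8_bytes   or die "open out: $!";

print {$out} $_ while <$in>;

close $out or die "close out: $!";
close $in;

print unpack("H*", $utf8_bytes), "\n";     # hex dump of what was written
```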

Unicode semantics

perl has Unicode semantics
lc, uc, lcfirst, ucfirst
case insensitivity
character classes like \w
perl also has ASCII semantics :(
hard to tell which semantics will be used for some operation
utf8::upgrade($your_string) to ensure Unicode semantics
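
A sketch of the difference, assuming a perl where the unicode_strings feature and locales are not in effect (the default for a plain script). The variable names are mine.

```perl
use strict;
use warnings;

my $byte_string = "\x{E9}";                # e-acute, stored in the compact format

# ASCII semantics: not a word character, and uc leaves it alone.
my $before_w  = $byte_string =~ /\w/          ? 1 : 0;
my $before_uc = uc($byte_string) eq "\x{C9}"  ? 1 : 0;

# Same character, but stored as UTF-8 internally after the upgrade.
my $text_string = $byte_string;
utf8::upgrade($text_string);

# Unicode semantics: it is a letter, and uc gives capital E-acute.
my $after_w  = $text_string =~ /\w/           ? 1 : 0;
my $after_uc = uc($text_string) eq "\x{C9}"   ? 1 : 0;

print "before upgrade: \\w=$before_w uc=$before_uc\n";
print "after upgrade:  \\w=$after_w uc=$after_uc\n";
```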

in perl 5.9.5: perlunitut
in perl 5.9.5: perlunifaq
http://juerd.nl/site.plp/perluniadvice

Don't use encoding.pm
it is broken and cannot be fixed. Using it will hurt.

encoding::stdio

http://juerd.nl/perlunitut.html

http://www.cafepress.com/perl5/

Making of ibeatgarry.com (by Karlheinz Zoechling)

Garry Kasparov

The Oracle of Bacon at Virginia

do the same with chess instead of movies
Garry Kasparov instead of Kevin Bacon
The question: How many hops are needed to defeat Garry, at least transitively

data source:

Chessbase Megabase 2005 (2007)
3 507 786 chess games
proprietary data format
but can export to PGN (Portable Game Notation, clear text format)

problem 1: max export is 2 GB files -> need to split the export

Chess::PGN::Parse
from PGN to PostgreSQL
ID is created

most logic in SQL

206 650 players

drawn games are discarded
short games of less than 5 moves are discarded too (defaulted games, drunk players, other silly stuff ...)

discard games that aren't tournament games
leaves 2 385 622 games

-> graph problem -> don't know graph theory -> CPAN!
-> Shortest Path problem: Graph::*
-> interesting: Dijkstra algorithm
-> said to be inefficient for graphs where all edges have equal length
-> finding: seems to be true, long wait
-> rumor: breadth-first search should be the best
-> no breadth-first search on CPAN
-> rolls his own
-> first use hashes
-> inefficient for graphs where all edges have equal length
-> use arrays (improves performance by 1 order of magnitude)
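
A minimal sketch of a breadth-first search over an array-based "who beat whom" graph. This is not the speaker's code; the player names, ids and the @beat layout are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data: $beat[$winner] lists the player ids that winner defeated.
my @names = ("Amateur", "Club player", "Master", "Grandmaster", "Champion");
my @beat  = ([1, 2], [2], [3], [4], []);

# Breadth-first search: returns the shortest chain of player ids
# from $from to $to, or an empty list if no chain exists.
sub shortest_chain {
    my ($from, $to) = @_;
    my @previous;                    # $previous[$id] = player we reached $id from
    $previous[$from] = $from;
    my @queue = ($from);
    while (@queue) {
        my $player = shift @queue;
        if ($player == $to) {
            # Walk the back-pointers to rebuild the chain.
            my @chain = ($to);
            unshift @chain, $previous[ $chain[0] ] while $chain[0] != $from;
            return @chain;
        }
        for my $loser (@{ $beat[$player] }) {
            next if defined $previous[$loser];   # already visited
            $previous[$loser] = $player;
            push @queue, $loser;
        }
    }
    return;
}

my @chain = shortest_chain(0, 4);
print join(" beat ", map { $names[$_] } @chain), "\n";
print "hops: ", scalar(@chain) - 1, "\n";
```

Because every edge counts as one hop, the first time the search reaches the target it has already found a shortest chain, which is why no edge weights or priority queue are needed.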

-> approaching 2 pi

-> his Garry Kasparov number is 4

problem for making web site
18M Storable takes seconds to freeze

-> put the graph into RAM: mod_perl

performance: 0.1s average per query (array version): not too bad

not good: as the number of instances increases, RAM usage explodes
-> didn't find a way to share the graph across children

other problem: player names are not unique in Chessbase

e.g. games with the same player name appear before 1900 and after 1990, this can't be.

solution: players who have a "gap" in their playing records of more than 40 years will be treated as 2 (or more) players. (Assumption)
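
The gap rule could look roughly like this in Perl. The talk kept most logic in SQL, so this is only an illustration; split_by_gap and the sample years are made up.

```perl
use strict;
use warnings;

my $MAX_GAP_YEARS = 40;   # threshold mentioned in the talk

# Split the game years recorded under one name into separate "players"
# whenever two consecutive games are more than $MAX_GAP_YEARS apart.
sub split_by_gap {
    my @years = sort { $a <=> $b } @_;
    my @players;
    for my $year (@years) {
        if (@players && $year - $players[-1][-1] <= $MAX_GAP_YEARS) {
            push @{ $players[-1] }, $year;   # same player carries on
        }
        else {
            push @players, [$year];          # gap too large: new player
        }
    }
    return @players;
}

my @players = split_by_gap(1851, 1862, 1870, 1992, 1995, 2001);
printf "%d players\n", scalar @players;
print "  years: @$_\n" for @players;
```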

-> rework tables, rebuild Storable freeze

-> build caching into the front end
computing chains takes time, so queries are stored in a table when they appear for the first time
(added benefit: data for statistics)

-> redirect uncached queries to backend

-> fill the cache with "Kasparov queries", for a head start

can link everyone to everyone

Zoechling, Karlheinz
Anderssen, Adolf is the 1st world champion

Anatoly Karpov has a Kasparov number of 1 and a Bacon number of 2!

hits: a couple of thousand a day

Building Scalable Data Collection (by mock)

huge quantity of data from Akamai from various sources, sometimes all at once

cron based db insertion sucks
insert email

steal good ideas from

perlbal, memcached, mogilefs, db shards

glue together with POE the wrong way

from Akamai up into db and MogileFS

scalable fast architecture

queue -> reader -> storage

larger lumps of data are faster to process and transport
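
One way to read "larger lumps" is a reader that batches records before handing them to storage. A rough sketch with plain closures (the talk used POE); make_batcher and the storage stand-in are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical batching reader: collects records and hands them to
# the storage callback in lumps of $batch_size.
sub make_batcher {
    my ($batch_size, $store) = @_;
    my @pending;
    my $flush = sub {
        return unless @pending;
        $store->([@pending]);        # one write for the whole lump
        @pending = ();
    };
    my $add = sub {
        push @pending, @_;
        $flush->() if @pending >= $batch_size;
    };
    return ($add, $flush);
}

my @stored_batches;                  # stand-in for the storage layer
my ($add, $flush) = make_batcher(3, sub { push @stored_batches, $_[0] });

$add->("record $_") for 1 .. 7;
$flush->();                          # write out the final partial lump

printf "7 records in %d writes\n", scalar @stored_batches;
```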

MogileFS data store

distributed load balanced storage
uses mysql - too many inserts is bad

JSON as compromise record encoding
aggregate data in large gzipped files
index position of records in sql db

(JSON access is fast)
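
A sketch of the "JSON records in a gzipped lump plus a position index" idea, using core modules only. The record fields are invented, a hash stands in for the SQL index, and I am assuming the index stores offsets into the uncompressed data; the talk did not say.

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json decode_json);
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical records.
my @records = (
    { id => "a1", bytes => 1200 },
    { id => "b2", bytes => 300  },
    { id => "c3", bytes => 4500 },
);

my $lump = "";   # aggregated, uncompressed data
my %index;       # stand-in for the SQL index: id -> [offset, length]
for my $record (@records) {
    my $json = encode_json($record);
    $index{ $record->{id} } = [ length($lump), length($json) ];
    $lump .= "$json\n";
}

my $gzipped;
gzip(\$lump => \$gzipped) or die "gzip failed: $GzipError";

# Look a record up by id: decompress the lump, then cut out the indexed slice.
sub fetch_record {
    my ($gz_data, $id) = @_;
    my ($offset, $length) = @{ $index{$id} };
    my $plain;
    gunzip(\$gz_data => \$plain) or die "gunzip failed: $GunzipError";
    return decode_json(substr $plain, $offset, $length);
}

my $found = fetch_record($gzipped, "b2");
print "b2 -> $found->{bytes} bytes\n";
```

This version decompresses the whole lump for every lookup, which a real system would presumably avoid.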

2-3 months of data -> 60GB

db reads scale with clusters
but db writes don't scale with clusters

solution -> DB shards
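
The core of sharding is a stable mapping from a key to one of several databases, so writes spread out. A minimal sketch; the shard names and the checksum-based choice are my assumptions, not how mock's DBIx::Class changes work.

```perl
use strict;
use warnings;

# Hypothetical shard names.
my @shards = ("db0", "db1", "db2", "db3");

# Pick a shard from a simple checksum of the key's bytes.
sub shard_for {
    my ($key) = @_;
    my $checksum = unpack "%32C*", $key;   # sum of the key's byte values
    return $shards[ $checksum % @shards ];
}

print "$_ -> ", shard_for($_), "\n" for qw(abc customer-42 abc);
```

The same key always lands on the same shard, so reads know where to look without a central lookup.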

mock modified DBIx::Class to work with sharded databases (not yet on CPAN, but it's planned)

other implementations:

Apache/mod_perl (faster in some ways but doesn't handle loads of transactions very well)
Event::Lib (not mature)

issue of asynchronous work flow -> need locking

mogilefs:
weirdness with small records
not that fast with writes

Akamai: services to push back data to content provider

pre-sharded version of pgsql

commercial alternative: Sybase IQ

all the nodes are load-balanced with perlbal

mail: [email protected]
web: http://sketchfactory.com

use JSON::XS
(doesn't like unicode)

Perl sucks and what to do about it (by Mark Fowler)

Installing Perl programs is hard

-> PAR

perl -MCPAN -e 'install PAR::Packer'

pp -o hellow hellow.pl

exec time
perl 0.35s
par 0.60s

-> alternative -> build own perl and ship it with the app
-> problem when moving to a different machine (paths are hard coded so are different)
-> blead to the rescue

when configuring perl add -Duserelocatableinc

perl exception handling

die means die, not capture exception
eval

if (blessed($@) && $@->isa("NoCheeseException")) {

}
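
A fuller sketch of the eval + blessed check above. The NoCheeseException class body and make_sandwich are hypothetical and written just for this example.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

{
    # Hypothetical exception class.
    package NoCheeseException;
    sub new     { my ($class, %args) = @_; return bless {%args}, $class }
    sub message { return $_[0]{message} }
}

sub make_sandwich {
    die NoCheeseException->new(message => "the fridge is empty");
}

my $result;
eval { make_sandwich(); $result = "sandwich"; 1 } or do {
    my $error = $@;                  # copy $@ before anything can clobber it
    if (blessed($error) && $error->isa("NoCheeseException")) {
        $result = "handled: " . $error->message;
    }
    else {
        die $error;                  # not ours: rethrow
    }
};

print "$result\n";
```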

try {

throw NoCheeseException "redo";
}
catch NoCheeseException with {

};

above is perl code

(see Error.pm)

-> problem (same as with eval)

in try{

return "this doesn't return from foo";
}
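
The same effect is easy to see with a plain eval block, sketched here. A return inside the block only leaves the block, not the enclosing sub.

```perl
use strict;
use warnings;

sub foo {
    eval {
        return "this doesn't return from foo";   # only leaves the eval block
    };
    return "foo carried on after the block";
}

my $value = foo();
print "$value\n";
```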

replace return by rreturn

and add return allowed after the catch

I hate the way perl programs are just scripts

Template Toolkit tpage
solution 1: source filter

solution: build your own executable

I want to programmatically manipulate my code

PPI

can't tell the difference between certain perl constructs (like subroutine prototypes)
but reliable

MAD

when configuring perl: -Dmad=y

B::Generate
can be used to create opcodes

optimize.pm

real programming languages can do compile-time checking

use typesafety;

typesafety::check()

Perl worst practices