YAPC::Europe: Thursday
These are just my notes, with obvious spelling errors fixed and URLs added for the interesting bits. They may contain errors. I'll post proper, focused articles later.
Unicode (by Juerd Waalboer)
characters are not bytes
in an 8-bit encoding, one character maps to one byte
that means you can have at most 256 different values
enough for roman and Russian
enough for roman alphabet and Greek
not enough for roman and Russian and Greek
multi-byte encodings
more bytes => more characters
fixed width, variable width
unicode encodings are all multi-byte
UTF-8 is very popular on the Internet
UTF-16 is the internal encoding in MS Windows
"Character Set"
a character set is character <-> number
Unicode is a charset
an encoding is number <-> bytes
UTF-8 is an encoding
MIME calls them both "charset"
Perl calls them both "encoding"
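A small sketch of the charset/encoding distinction above, assuming the core Encode module. The helper hex_bytes is a hypothetical name introduced only for this illustration: the character set gives the number, the encoding turns that number into bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf("%02X", ord($_)) } split //, $bytes;
}

# Character set: character <-> number (code point).
my $char       = "\x{20AC}";      # EURO SIGN
my $code_point = ord($char);      # 0x20AC

# Encoding: number <-> bytes. Same code point, different bytes.
my $utf8_bytes  = encode("UTF-8",    $char);
my $utf16_bytes = encode("UTF-16BE", $char);

printf "code point: U+%04X\n", $code_point;
printf "UTF-8:      %s\n", hex_bytes($utf8_bytes);    # E2 82 AC
printf "UTF-16BE:   %s\n", hex_bytes($utf16_bytes);   # 20 AC
```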
2 kinds of strings:
perl has one string type
the universe has several
"text string" and "binary string"
a.k.a. "character string" and "byte string"
the computer doesn't know the difference
you should know
Unicode perl
text strings are unicode strings, not UTF-8
ISO-8859-1 maps to 0..255, useful!
perl keeps strings as ISO-8859-1 as long as possible
if that doesn't work, it upgrades to UTF-8 internally
if you mix the two kinds, UTF-8 wins.
Prime rule:
Do not mix byte strings with text strings
except if you explicitly convert between them
decoding: bytes -> characters (binary to text)
encoding: characters -> bytes (text to binary)
first slide:
All communication with the "outside world" is in bytes
something has to decode their binary input to text
something has to encode your text output to binary
read input
decode input
process data
encode output
write output
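The five steps above can be sketched as follows, assuming the core Encode module. The input is hard-coded here instead of being read from a file or socket, and the "processing" (counting and upper-casing) is just a stand-in to show that the work happens on characters, not bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "caf" followed by e-acute encoded as UTF-8 (two bytes, C3 A9).
my $binary_input = "caf\xC3\xA9";                # read input (hard-coded)

my $text       = decode("UTF-8", $binary_input); # decode input
my $char_count = length $text;                   # process data: 4 characters,
my $shouted    = uc $text;                       #   although the input had 5 bytes

my $binary_output = encode("UTF-8", $shouted);   # encode output
binmode STDOUT;                                  # write output as raw bytes
print $binary_output, "\n";
```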
Neat trick:
Perl lets you use code points (character numbers) that do not yet officially exist
In practice part 1:
use Encode;
my $text = decode("UTF-8", $binary_input);
my $output = encode("UTF-8", $text);
you have to encode, otherwise the output will be characters, not binary
part 2:
let perl do the hard work!
binmode STDIN, ":encoding(ISO-8859-1)";
binmode STDOUT, ":encoding(UTF-8)"; # don't forget the hyphen
print while <>;
Unicode semantics
perl has unicode semantics
lc, uc, lcfirst, ucfirst
case insensitivity
character classes like \w
perl also has ASCII semantics :(
hard to tell which semantics will be used for some operation
utf8::upgrade($your_string) to ensure Unicode semantics
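A sketch of the two semantics, using a single e-acute (code point 0xE9). The "no feature" line is my addition, not from the talk: it only makes sure the legacy behaviour described above is what runs, since newer perls can switch it off with the unicode_strings feature.

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # keep the legacy behaviour described in the talk

# One character, code point 0xE9, stored internally as a single byte.
my $native    = "\xE9";
my $before_uc = uc $native;                 # unchanged: ASCII semantics
my $before_w  = $native =~ /\w/ ? 1 : 0;    # 0: not a word character

# Same character after upgrading the internal representation.
my $upgraded = $native;
utf8::upgrade($upgraded);
my $after_uc = uc $upgraded;                # "\xC9": Unicode semantics
my $after_w  = $upgraded =~ /\w/ ? 1 : 0;   # 1: now a word character

# The two strings still compare equal; only the semantics differ.
print "same string\n" if $native eq $upgraded;
```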
in perl 5.9.5: perlunitut
in perl 5.9.5: perlunifaq
http://juerd.nl/site.plp/perluniadvice
Don't use encoding.pm
it is broken and cannot be fixed. Using it will hurt.
http://juerd.nl/perlunitut.html
http://www.cafepress.com/perl5/
Making of ibeatgarry.com (by Karlheinz Zoechling)
Garry Kasparov
The Oracle of Bacon at Virginia
do the same with chess instead of movies
Garry Kasparov instead of Kevin Bacon
The question: How many hops are needed to defeat Garry, at least transitively
data source: Chessbase Megabase 2005 (2007)
3 507 786 chess games
proprietary data format
but can export to PGN (Portable Game Notation, clear text format)
problem 1: max export is 2 GB files -> need to split the export
Chess::PGN::Parse
from PGN to PostgreSQL
ID is created
most logic in SQL
206 650 players
drawn games are discarded
short games of less than 5 moves are discarded too (defaulted games, drunk players, other silly stuff ...)
discard games that aren't tournament games
leaves 2 385 622 games
-> graph problem -> don't know graph theory -> CPAN!
-> Shortest Path problem: Graph::*
-> interesting: Dijkstra algorithm
-> said to be inefficient for graphs with edges of equal length
-> finding: seems to be true, long wait
-> rumor: Breadth first search should be the best
-> No breadth first in CPAN
-> rolls his own
-> first use hashes
-> inefficient for graphs with edges of equal length
-> use array (improves performance by one order of magnitude)
-> approaching 2 pi
-> his Garry Kasparov number is 4
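A minimal sketch of a breadth-first search over an array-based graph, as described above. This is not the speaker's code: the data layout (player IDs as array indices, each entry listing the players that ID has beaten) and the name shortest_chain are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Hypothetical data: $beats[$id] lists the IDs of players that $id has beaten.
my @beats = (
    [1],      # player 0 beat player 1
    [2, 3],   # player 1 beat players 2 and 3
    [4],      # player 2 beat player 4
    [4],      # player 3 beat player 4
    [],       # player 4 (the target) beat nobody in this data
);

# Breadth-first search: returns the chain of IDs from $start to $target
# with the fewest hops, or an empty list if no chain exists.
sub shortest_chain {
    my ($graph, $start, $target) = @_;
    my @previous;                  # $previous[$id] = ID we reached $id from
    $previous[$start] = $start;
    my @queue = ($start);
    while (@queue) {
        my $current = shift @queue;
        if ($current == $target) {
            # Walk back from the target to the start.
            my @chain = ($target);
            unshift @chain, $previous[ $chain[0] ] while $chain[0] != $start;
            return @chain;
        }
        for my $next (@{ $graph->[$current] }) {
            next if defined $previous[$next];   # already visited
            $previous[$next] = $current;
            push @queue, $next;
        }
    }
    return;
}

my @chain = shortest_chain(\@beats, 0, 4);
print "chain: @chain (", scalar(@chain) - 1, " hops)\n";   # chain: 0 1 2 4 (3 hops)
```

Because every edge counts as one hop, the first time the search reaches the target it has already found a shortest chain, which is why no edge weights or priority queue are needed.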
problem for making web site
18M Storable takes seconds to freeze
-> put the graph into RAM: mod_perl
performance: 0.1s average per query (array version): not too bad
not good: as the number of instances increases, RAM usage explodes
-> didn't find a way to share the graph across children
other problem: Player names are not unique in Chessbase
e.g. the same player name appears in games before 1900 and after 1990; this can't be.
solution: players who have a "gap" in their playing records of more than 40 years will be treated as 2 (or more) players. (Assumption)
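The gap rule can be sketched like this. The talk said most of the logic lives in SQL, so this Perl version is only an illustration of the rule, and split_by_gap and the sample years are hypothetical.

```perl
use strict;
use warnings;

# Split one name's game years into separate "players" wherever two
# consecutive games are more than $max_gap years apart.
sub split_by_gap {
    my ($max_gap, @years) = @_;
    my @players;
    my $last_year;
    for my $year (sort { $a <=> $b } @years) {
        if (!defined $last_year || $year - $last_year > $max_gap) {
            push @players, [];          # start a new player
        }
        push @{ $players[-1] }, $year;
        $last_year = $year;
    }
    return @players;
}

# Hypothetical data: one name with games in the 1860s and again around 2000.
my @players = split_by_gap(40, 1862, 1870, 1995, 2001, 1866);
printf "%d players\n", scalar @players;   # 2 players
```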
-> rework tables, rebuild Storable freeze
-> build caching into the front end
computing chains takes time, so queries are stored in a table when they appear for the first time
(added benefit: data for statistics)
-> redirect uncached queries to backend
-> fill the cache with "Kasparov queries", for a head start
can link everyone to everyone
Zoechling, Karlheinz
Anderssen, Adolf is the 1st world champion
Anatoly Karpov has a Kasparov number of 1 and a Bacon number of 2!
hits: couple of thousands a day
Building Scalable Data Collection (by mock)
huge quantity of data from Akamai from various sources, sometimes all at once
cron based db insertion sucks
insert email
steal good ideas from perlbal, memcached, mogilefs, db shards
glue together with POE the wrong way
from Akamai up into db and MogileFS
scalable fast architecture
queue -> reader -> storage
larger lumps of data are faster to process and transport
MogileFS data store
distributed load balanced storage
uses mysql - too many inserts is bad
JSON as compromise record encoding
aggregate data in large gzipped files
index position of records in sql db
(JSON access is fast)
2-3 months of data -> 60GB
db reads scale with clusters
but db writes don't scale with clusters
solution -> DB shards
mock modified DBIx::Class to work with sharded databases (not yet on CPAN, but it's planned)
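The notes don't say how records were assigned to shards, so the following is only a generic sketch of the idea, not mock's DBIx::Class work. The shard names, the key and the byte-sum rule in shard_for are all assumptions chosen to keep the example short; a real system would want a better-distributed hash.

```perl
use strict;
use warnings;

# Hypothetical shard names; in practice each would be a separate database.
my @shards = qw(shard0 shard1 shard2 shard3);

# Map a record key to a shard number. The same key always lands on the
# same shard, while different keys spread the writes across servers.
sub shard_for {
    my ($key, $shard_count) = @_;
    my $sum = 0;
    $sum += ord($_) for split //, $key;   # naive checksum: sum of character codes
    return $sum % $shard_count;
}

my $key   = "abc";                                  # hypothetical record key
my $shard = $shards[ shard_for($key, scalar @shards) ];
print "$key -> $shard\n";                           # abc -> shard2
```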
other implementations:
Apache/mod_perl (faster in some ways but doesn't handle loads of transactions very well)
Event::Lib (not mature)
issue of asynchronous work flow -> need locking
mogilefs:
weirdness with small records
not that fast with writes
Akamai: services to push back data to content provider
pre-sharded version of pgsql
commercial alternative: Sybase IQ
all the nodes are load-balanced with perlbal
mail: [email protected]
web: http://sketchfactory.com
use JSON::XS (doesn't like unicode)
Perl sucks and what to do about it (by Mark Fowler)
Installing perl programs is hard
-> PAR
perl -MCPAN -e 'install PAR::Packer'
pp -o hellow hellow.pl
exec time
perl 0.35s
par 0.60s
-> alternative -> build own perl and ship it with the app
-> problem when moving to a different machine (paths are hard coded so are different)
-> bleed to the rescue
when configuring perl add -Duserelocatableinc
- perl exception handling
- die means die, not capture exception - eval
use Scalar::Util qw(blessed);
if (blessed($@) && $@->isa("NoCheeseException")) {
}
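A fuller sketch of the eval-based style, assuming only core modules. NoCheeseException here is a hypothetical, hand-rolled class (a blessed hash), and make_sandwich is a made-up function that throws it.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical exception class: just a blessed hash.
package NoCheeseException;
sub new { my ($class, %args) = @_; return bless { %args }, $class }

package main;

# Hypothetical function that throws the exception object.
sub make_sandwich { die NoCheeseException->new(message => "no cheese") }

my $caught;
eval { make_sandwich(); 1 } or do {
    my $error = $@;                       # copy $@ before anything can clobber it
    if (blessed($error) && $error->isa("NoCheeseException")) {
        $caught = $error->{message};      # handle our own exception
    }
    else {
        die $error;                       # not ours: rethrow
    }
};

print "caught: $caught\n";   # caught: no cheese
```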
try {
throw NoCheeseException "redo";
}
catch NoCheeseException with {
};
the above is perl code (see Error.pm)
-> problem (same as with eval)
in try {
return "this doesn't return from foo";
}
replace return by rreturn
and add return allowed after the catch
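The return problem is easy to reproduce with plain eval, which the notes say behaves the same way. In this sketch (foo is just an example name) the return only leaves the eval block, and foo() carries on afterwards.

```perl
use strict;
use warnings;

sub foo {
    my $value = eval {
        return "from inside eval";   # leaves the eval block, not foo()
    };
    # Execution continues here, with the "returned" value in $value.
    return "foo carried on (eval gave: $value)";
}

print foo(), "\n";   # foo carried on (eval gave: from inside eval)
```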
- I hate the way perl programs are just scripts
Template Toolkit tpage
solution 1: source filter
solution 2: build your own executable
- I want to programmatically manipulate my code
PPI
can't tell the difference between certain perl constructs (like subroutine prototypes)
but reliable
MAD
when configuring perl -Dmad=y
B::Generate
can be used to create opcodes
optimize.pm
real programming languages can do compile time checking
use typesafety;
typesafety::check()
Perl worst practices