Acts not constituting infringements of copyright in works (Australia)

February 8, 2010

I post these here for the purpose of future reference and use.

…for purpose of criticism or review

…for purpose of parody or satire

…for the purpose of, or is associated with, the reporting of news by means of a communication…

…for the purposes … of a report of a judicial proceeding.

They are “Acts not constituting infringements of copyright in works”. I’m reminding myself of these rights so that any time I wish to use material covered by them, I can grab the text and link from here and note it next to the copy. Of course this is not really relevant here because this blog is hosted by a US company, but it would be helpful if I were to one day decide to self-host in Australia.



A Look into the myschool.edu.au Data

February 7, 2010

After overcoming a few problems I managed to write a scraper for the myschool.edu.au data. Unfortunately they chose to put the data in HTML, so the scraping process may have introduced some unknown errors into my data. I publish (see bottom) the scraped data as I believe that per the IceTV v Nine Network [2009] HCA 14 case, any data that my scraper produces as output from the HTML input is not subject to the copyright of the original HTML content (this also means that I cannot publish the HTML pages).

I wish I could bzip2 up all those HTML pages and give them to you to save you the download. The myschool.edu.au site doesn’t compress its pages even when I tell it I accept gzip over HTTP, so it took almost 2GB of quota to download all the HTML pages. Oh well.

Some preliminary statistics from the data… (but they may be a bit wrong. UPDATE: I’ve realised I may have missed out a chunk of schools. I’ll rerun the scraper when I can spare the download.)

  • There are a total of 9316 schools. Of these,
    • 1538 are Secondary (of which 30% are non-government and 70% are government)
    • 1407 are Combined (of which 68% are non-government and 32% are government)
    • 6054 are Primary (of which 23% are non-government and 77% are government)
    • 317 are Special (of which 15% are non-government and 85% are government)
  • So,
    • 6451 are Government (69%),
    • 2865 are Non-government (31%)
  • These 9316 schools contain a total of 3 366 351 students of which,
    • 1 745 224 are male (51%)
    • 1 651 127 are female (49%)
  • The most schools in 1 postcode is 40, which are all in the postcode 2480.
  • The average student attendance rate is 92.007%
    • 91.870% for Government, 92.335% for Non-government
    • 89.205% for Secondary, 92.982% for Primary, 90.675% for Combined, 89.170% for Special.
  • There are a total of 265 960 teaching staff (full time equivalent of 241 408) and 124 117 non-teaching staff (full time equivalent of 86 511.9).

I could report a lot more stats like these; all you need is a basic knowledge of SQL. But as much as I enjoy working out these stats, I find graphs and graphics much more intuitive, so that is up next. Because the data has so many dimensions you can make all kinds of graphs, so what would be best is a system that draws graphics dynamically and lets the user decide what is graphed. That takes more work, so it is on the todo list.
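As a rough illustration of the kind of query behind the figures above, here is a minimal sketch using Python’s built-in sqlite3 module. The table name `schools` and its columns (`name`, `type`, `sector`) are assumptions made up for this example, not the real layout of the scraped data, and the four rows are toy data.

```python
import sqlite3


def sector_breakdown(conn):
    """Return {school_type: {sector: count}} from the hypothetical `schools` table."""
    rows = conn.execute(
        "SELECT type, sector, COUNT(*) FROM schools GROUP BY type, sector"
    ).fetchall()
    breakdown = {}
    for school_type, sector, count in rows:
        breakdown.setdefault(school_type, {})[sector] = count
    return breakdown


# Toy in-memory data; the schema and rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, type TEXT, sector TEXT)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?, ?)",
    [
        ("School A", "Primary", "Government"),
        ("School B", "Primary", "Non-government"),
        ("School C", "Primary", "Government"),
        ("School D", "Secondary", "Government"),
    ],
)
breakdown = sector_breakdown(conn)
print(breakdown)
```

Dividing each count by the total for its school type would give percentages like the ones listed above.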

I’ve also looked into doing some heatmaps using the geographical location of the schools. I could have used Google Maps, or OpenStreetMap with libchamplain. Both have pros and cons… For now I used Google Maps because their API is simple and I’ve always wanted to experiment with it; the downside is I’m not sure about the copyright of their maps and subsequently any derivative works. This image is just a test showing a dot for each school in the system, but it’s very easy to change the colour, size and opacity of the dots based on features of the school.

Schools in Sydney Map

Source code? Well, the code I used to get all this data and get this far is a bit ad hoc, because it started off as a test but I kept extending it more and more. So instead of posting code that doesn’t really work, I’ll clean it up and then post it along with a more complete dataset. In the meantime here is a ‘|’ separated file which contains much of the myschool.edu.au data.

Update: I may have missed a bunch of schools in the data, I’m looking into this now.

Update: Source code, http://github.com/andrewharvey/myschool


anonymous FTP

January 22, 2010

So I’ve started reading a book about networks, and to complement this I’ve been taking a closer look at my network traffic in Wireshark (a really great tool, by the way).

So I pick an ftp site that I know, ftp://download.nvidia.com/ and see what happens in Wireshark when I visit it in Firefox. At the FTP application level this is what happens,

ftpsite to me: 220 spftp/1.0.0000 Server [69.31.121.43]\r\n
me to ftpsite: USER anonymous\r\n
ftpsite to me: 331 Password required for USER.\r\n
me to ftpsite: PASS mozilla@example.com\r\n
ftpsite to me: 230- \r\n
               230- ---------------------------------------------------------------------------\r\n
               230- WARNING:  This is a restricted access system.  If you do not have explicit\r\n
               230-           permission to access this system, please disconnect immediately!\r\n
               230 ----------------------------------------------------------------------------\r\n

But Firefox does not disconnect. So I did some more research and I found no references to “anonymous” users in either RFC 959 (FTP) or RFC 3659 (extensions to FTP). (Though there are references in later RFCs; see RFC 2228.)
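To make the structure of that exchange a bit clearer, here is a small sketch that groups raw reply lines into whole replies. It relies on the multi-line reply convention from RFC 959: a line starting “NNN-” opens a reply and the first line starting with the same “NNN ” (code then a space) closes it. The function name `parse_ftp_replies` is my own invention, and the sample lines are copied from the capture above (with the run of dashes shortened).

```python
def parse_ftp_replies(lines):
    """Group raw FTP reply lines into (code, [text lines]) tuples.

    A line of the form "NNN-text" opens a multi-line reply, which ends at
    the first line of the form "NNN text" carrying the same code.
    """
    replies = []
    pending_code = None  # code of the multi-line reply being collected
    pending_text = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if pending_code is None:
            code, sep, text = line[:3], line[3:4], line[4:]
            if sep == "-":
                pending_code, pending_text = code, [text]
            else:
                replies.append((code, [text]))
        elif line.startswith(pending_code + " "):
            # Closing line of the multi-line reply.
            pending_text.append(line[4:])
            replies.append((pending_code, pending_text))
            pending_code, pending_text = None, []
        else:
            # Continuation line; it may or may not repeat the "NNN-" prefix.
            if line.startswith(pending_code + "-"):
                line = line[4:]
            pending_text.append(line)
    return replies


server_lines = [
    "220 spftp/1.0.0000 Server [69.31.121.43]\r\n",
    "331 Password required for USER.\r\n",
    "230- \r\n",
    "230- WARNING:  This is a restricted access system.  If you do not have explicit\r\n",
    "230-           permission to access this system, please disconnect immediately!\r\n",
    "230 " + "-" * 20 + "\r\n",  # dashes shortened for the example
]
replies = parse_ftp_replies(server_lines)
for code, text in replies:
    print(code, text)
```

The point of interest is that the warning arrives as part of the 230 reply, that is, the server says “login successful” and “please disconnect” in the same breath.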

The thing I find disconcerting is that I don’t think I have “explicit permission” to access this system. I (or rather Firefox) just guessed a username and password and they happened to let me in (what if I guessed a different username and password that wasn’t anonymous and it let me in?). If the RFC specified that a user of anonymous requires no password, or any password, then I would assume that the FTP server is granting me permission, but I assume rather people just started using anonymous as the user and it caught on…

The real problem here is that there are laws which govern such areas, and it doesn’t help that I don’t understand what PART 6 – COMPUTER OFFENCES of the CRIMES ACT 1900 (NSW) is saying.



To fix broken audio, unplug faulty USB device.

January 19, 2010

How weird is this: just recently when I started up my computer lots of stuff was broken. There was no audio (and /proc/asound/cards was empty, normally it has “0 [Intel          ]: HDA-Intel - HDA Intel\nHDA Intel at 0xfa100000 irq 22”), libsensors wasn’t reporting any values (eg. no CPU temp reported), eth0 disappeared from NetworkManager, and probably a host of other things that I didn’t notice. Restarting didn’t fix it.

Well, long story short, it turns out that everything magically fixed itself when I unplugged a USB hard drive that was plugged in. I had seen a lot of concerning messages sent to /var/log/messages from the kernel about it,

Jan 19 09:45:00 host kernel: [  564.100026] usb 1-3: reset high speed USB device using ehci_hcd and address 2
Jan 19 09:45:00 host kernel: [  564.237716] sd 8:0:0:0: [sdd] Unhandled error code
Jan 19 09:45:00 host kernel: [  564.237719] sd 8:0:0:0: [sdd] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK

that repeated every so often, but I never thought that a dodgy USB device would stop the kernel from doing some of its job.


A Maths Problem: Transformed Horizontal Lines

January 7, 2010

This is the kind of post that I originally envisioned that I would post about when I started this blog. But after trying to complete this post I realised why I don’t do this very much, because I can’t always solve the problems I come up with. Anyway…

You can generate a funky kind of grid by taking a Cartesian coordinate system and joining lines from (0, t) to (1 - t, 0) for some values of t between 0 and 1. Here are some examples,

Transformed Grid Montage

If you draw lots of lines you get something like,

Transformed Grid (50 lines)

Generated using code at http://github.com/andrewharvey/cairomisc/blob/bd41c8feb1beba38849878aed566c6f45a70856b/curved_grid_simple.pl

This is also what you get if you take a bunch of horizontal lines from x = 0 to x = 1 (where the horizontal lines are equally spaced above each other), and take all the endpoints on the line x = 1 and rotate them 90^\circ about the point (1, 0).

The thing I was interested in was as you draw more and more of these lines it looks like a curve emerges on the boundary. I imagined that if you drew infinitely many lines like these you would get a nice smooth curve. I want to know what is the formula for that curve. But as I started to try to work it out, it didn’t seem so simple.

I tried a lot of approaches, none of which seemed to work. So after a few initial set backs I tried to take a parametric approach taking t to be a value between 0 and 1 where this t indicates the line with start point (0, t). The point on the curve for this t is some point on that line. I tried to get that point via the intersection with the next line, ie. the point on this line that is also on the curve is between the intersection of that line and the line for t + \phi and t - \phi for some really small \phi. But when I tried this approach as you make \phi zero, then we get infinitely many points of intersection.

That didn’t work so easily, but then I realised that if the point is on this line then I have the gradient (although I have not proved this, it seems obvious from the picture).

So all those lines as shown above have equation y = \frac{-t}{\left ( 1 - t \right )}x + t. (Except for t = 1 where we’ll just use a y value of 1). We can use this same t to define a point on the curve (which I call f from here on) parametrically. So I assumed that the gradient of f is given as f'(t) = \frac{-t}{1-t}. But now I’m not so sure that I have enough rigour here.

But then I got stuck again. I can try to do some integrals but this won’t work because you don’t know the relation between increasing t and the length along the curve you have moved. As you could have two different parametric functions which both have the same derivative function (ignoring the +c constant that disappears when you differentiate), just knowing the function defining the derivative of f parametrically won’t tell me the equation of the original curve.

Moving on I now tried to calculate the area under the curve. I could partition it like how a Riemann integral is done.

discret_areas_axis_withdots

We can easily calculate the area of any of these trapezoids (bounded by red). A = \frac{x_n - x_{(n-1)}}{2}(y_{x_n} + y_{x_{(n-1)}}). We can get the x values by finding the point of intersection of the 2 lines that intersect at that x (and have the largest y value if there are several points of intersection for that x). Each line for some value t will have a point of intersection of the line before and after it (based on the t value). When I say the area of t = some number, I mean the area of the trapezoid starting with the intersection of the previous t line and ending with the intersection of the next t line. So the area of t = 1 is zero (because x0 and x1 are the same). The diagram above has \phi = 0.125. So,

Point A is the intersection of y = \frac{-t}{\left ( 1-t \right )}x + t and y = \frac{- \left ( t + \phi \right )}{1-\left ( t + \phi\right )}x + t + \phi, which is,

x_A = (1-t)(1-t-\phi)
y_A = t(t+\phi)

Point B is the intersection of y = \frac{-t}{\left ( 1-t \right )}x + t and y = \frac{- \left ( t - \phi \right )}{1-\left ( t - \phi\right )}x + t - \phi, which is,

x_B = (1-t)(1-t+\phi)
y_B = t(t-\phi)

So the area of this trapezoid is \frac{x_B - x_A}{2}(y_A + y_B), which is 2t^2\phi(1-t)

But then I got stuck here. I can compute a value for the approximate area.

phi = 0.0001;
area = 0;
for (t = 1; t > 0; t -= phi) {
   area += 2*t*t*phi*(1-t);
}
print area;

Which gives a value very close to 1/6, and if I integrate that area equation for t = 0..1 I get \frac{1}{6}\phi. But I don’t want the area, I want the formula that defines the area from x = 0 to some value x so that I can then differentiate this to get the equation of the original curve. So this is where I give up, and leave this for another day. If you work it out please post in the comments!
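For anyone who wants to check the working so far, here is a small Python sketch. It recomputes points A and B from the general intersection of two of the lines, compares the trapezoid area against 2t^2\phi(1-t), and repeats the sum from the loop above. The function names are just ones I picked for this example; the only formulas used are the ones already derived in this post.

```python
def line_intersection(t, s):
    """Intersection of the grid lines for parameters t and s (t != s).

    Each line joins (0, t) to (1 - t, 0), i.e. x/(1 - t) + y/t = 1.
    """
    return (1 - t) * (1 - s), t * s


def trapezoid_sum(phi):
    """Sum the trapezoid areas 2*t^2*phi*(1 - t) for t = phi, 2*phi, ..., 1."""
    steps = round(1 / phi)
    return sum(2 * (k * phi) ** 2 * phi * (1 - k * phi) for k in range(1, steps + 1))


t, phi = 0.5, 0.125
x_a, y_a = line_intersection(t, t + phi)  # point A
x_b, y_b = line_intersection(t, t - phi)  # point B
area_direct = (x_b - x_a) / 2 * (y_a + y_b)
area_formula = 2 * t**2 * phi * (1 - t)
total = trapezoid_sum(0.0001)
print(area_direct, area_formula, total)
```

Using an integer step counter rather than repeatedly subtracting phi from t avoids floating point drift deciding whether the last step is included.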

Oh and there is some rough code I wrote to make those images here. And a nice animation too.


Law + Revision Control + Wiki

January 4, 2010

What happens when you mix a service like AustLII with a version control system like Git and a wiki-like editing system, and deliver it to the people through the web?

Well I haven’t tried, but it sounds like a good idea. You get a service that,

  • allows anyone to propose changes to laws (and work on branches) or draft new laws, and
  • keeps track of the law and when it was changed (and which politicians/parties introduced those changes, who voted for them, etc…).

An Idea for a Media-Centreish Interface for a UNIX terminal/shell

December 18, 2009

Back in July or August this year when I was going through the notes on unix shells for COMP2041 I came up with the idea of doing a shell/terminal interface that looked like an interface for a media centre, ie. rather than looking like this,

manual page for man in xterm

it would look “like” this (obviously not exactly the same but a similar feel),

XBMC skin MediaStream by Team Razorfish. http://xbmc.org/wordpress/wp-content/gallery/mediastream/viewoptions.jpg

The key principles I had in mind were,

  • nice aesthetics
  • interface similar to a game or media centre
  • features easily discoverable for new users

My original motives were that I was just learning all these coreutils commands (ls, cat, mkdir, cp, mv…) and I found that although the shell had tab completion and apropos, it didn’t categorise these or give them in a list of common commands. Then I came up with more abstract ideas,

  • categorise common commands and give help on them. eg. File System: ls, cd, cd .., mkdir. Filters: cat, wc, grep…
  • parse commands and their argument lists based on common styles (eg. GNU style, short -las and long -l --all --size) and provide contextual information (eg. hovering over an --argument gives a one line message about what that argument does, perhaps by parsing the man file to get this info). Also auto-layout the command line as per the argument style.
  • it could also parse the pipelines and display these much more visually so it’s easier to see what’s piping into what, and allow the user to easily change the order/flow of the pipeline.
  • process management. don’t force the user to remember Ctrl+C and Ctrl+Z and bg and fg commands, show these as pause and stop icons.
  • redirection of output should be easily changed in the interface rather than just adding a < or > to the command line (and allow one to redirect STDOUT to a file AFTER the command has already run (because currently you would need to run the command again, or copy and paste and put up with the new lines that gnome-terminal puts in))
  • bookmarking commands (including arguments) so that those common ones you use that you haven’t remembered yet are quick and easy to use.
  • colour STDERR in red.
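To give a feel for the argument parsing idea in the list above, here is a very rough Python sketch that labels each token of a command line in the GNU style. The function name `classify_args` and the label strings are made up for this example. It also assumes short options never take an attached value (so something like -n5 would be split wrongly), which a real implementation would need per-command knowledge to get right.

```python
def classify_args(argv):
    """Label each token after the command name as a short option, long option or operand."""
    labelled = []
    options_done = False
    for token in argv[1:]:
        if options_done or token == "-" or not token.startswith("-"):
            labelled.append(("operand", token))
        elif token == "--":
            # Conventional marker: everything after this is an operand.
            options_done = True
            labelled.append(("end-of-options", token))
        elif token.startswith("--"):
            # Long option, possibly written as --name=value.
            labelled.append(("long", token[2:].partition("=")[0]))
        else:
            # Bundled short options such as -las.
            labelled.extend(("short", flag) for flag in token[1:])
    return labelled


labels = classify_args(["ls", "-las", "--all", "--size", "/tmp"])
for kind, value in labels:
    print(kind, value)
```

An interface could then attach a one line description to each labelled option, wherever that description ends up coming from.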

I haven’t really thought about it on a technical level, but it may not be as portable as, say, gnome-terminal. I don’t know the real differences among the different shells out there, so I don’t know how dependent this is on bash or even whether it ties bash and the terminal together, but from a beginner user’s perspective I don’t care about this.

The cloudy idea I have in my mind is basically a GUI/CLI hybrid but I think such a program would need to be careful not to go too far, because it could be made so that after doing an ls -la you could click on a file in the list and rename it, but then we are turning into a file manager in list mode (like Dolphin or Nautilus) which is unnecessary as those tools already exist.

I’m aiming to come up with a more detailed list of requirements and a set of activity and use case scenarios, along with some wire-frame prototypes for such an interface soon. But for now I just needed to get it all out of my head and onto paper (and also public (in case someone tries to patent a concept)).


The Features of My Utopian Music Player

December 11, 2009

Ideally I would like to write my own music player because I don’t really like any that are currently available (Amarok 1.4, Amarok 2, Songbird, Rhythmbox, Banshee, Exaile). I like features from each but none seem to fit all my needs. All the time I keep rethinking what I should do and I still cannot decide. Anyway this is what my ideal music player would be like…

  • Backend Database
    • The backend metadata would be stored in an external PostgreSQL database, with the option of using SQLite for people who don’t want to set up and run PostgreSQL.
    • The schema should be good and documented, so that a user can read and write into the database. If not at least give an interface to allow this.
    • Full playback information. I want my music player to store the timestamps of every time a given song has been played. I want history too, for instance the times of when the song rating was changed.
  • Collection Manager
    • I want the music player to be the library not just the librarian. I want to give it a file (say an MP3), along with details such as song title, artist, etc. and I want it to take that file and store it on the hard disk in a nice file structure (like iTunes does). Amarok 1.4 attempts to do this but it’s really hard, because initially it will just add the file to your playlist and not move it across to your collection, and even then if you change the details, say the artist, it will not correct this in the folder structure used to store that file.
    • Tagging songs. Amarok does this well.
  • Web scraper
  • Acoustic Analysis
    • Surely there are algorithms to guess the BPM (beats per minute) of a song. I want that integrated into the music player.
    • I need a moodbar so I can navigate a song, and to gather contextual information on how the style of the music varies over the song.
    • I don’t know much about acoustics, but there must be other algorithms which give meaningful measures of audio. These should be used to group songs and find similar ones.
    • This must be done locally, I don’t want to send things to web services (MusicBrainz, http://echonest.com/).
  • Navigation
    • I want a concept of a “Library” rather than a Playlist. Amarok only has playlists, but 99.9% of the time I want a list of all my songs.
  • Statistics
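The “full playback information” item under Backend Database above could look something like the following sketch, using SQLite through Python’s sqlite3 module. The table and column names (`song`, `play_event`, `rating_change`) are purely hypothetical, just one way the idea might be laid out, and timestamps are stored as ISO 8601 text so they sort correctly as strings.

```python
import sqlite3

# Hypothetical schema: one row per play and one row per rating change.
SCHEMA = """
CREATE TABLE song (
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    artist TEXT NOT NULL
);
CREATE TABLE play_event (
    song_id   INTEGER NOT NULL REFERENCES song(id),
    played_at TEXT NOT NULL
);
CREATE TABLE rating_change (
    song_id    INTEGER NOT NULL REFERENCES song(id),
    rating     INTEGER NOT NULL,
    changed_at TEXT NOT NULL
);
"""


def record_play(conn, song_id, played_at):
    """Store one playback of a song; nothing is ever overwritten."""
    conn.execute("INSERT INTO play_event VALUES (?, ?)", (song_id, played_at))


def play_times(conn, song_id):
    """Return every recorded playback time for a song, oldest first."""
    rows = conn.execute(
        "SELECT played_at FROM play_event WHERE song_id = ? ORDER BY played_at",
        (song_id,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO song (id, title, artist) VALUES (1, 'Example Song', 'Example Artist')"
)
record_play(conn, 1, "2009-12-11T10:00:00")
record_play(conn, 1, "2009-12-10T09:30:00")
history = play_times(conn, 1)
print(history)
```

Keeping events as rows rather than a single play count column is what makes the history (and any later statistics) possible.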

Now for the solution. I could write my own music player from scratch that implements all of that, but I gave up on that after I could not decide what programming language to use (C, C++, Java, Perl, Python), what GUI widget toolkit to use (Qt, GTK+, wxWidgets), or what graphics API to use for nice graphs (Cairo, raw OpenGL, OpenGL behind Clutter, R’s graph drawing, Processing, or some CPAN Perl module for drawing nice graphs). I can mix a few, but the core app needs one programming language and one core GUI toolkit. There is too much choice and I don’t have enough experience to know beforehand what is best and what I will find easiest and simplest to use.

I could try to capture playback statistics by looping last.fm and audioscrobbler.com to localhost and capturing the data that Amarok sends. Or I could just write a script for Amarok which captures playback, but this only solves part of the problem and then I’m stuck using a certain application. Alternatively I could just take an existing program and fork it to suit my needs.

There should be more to come on this as I start experimenting.


A Perl Script to Pause/Resume Amarok 1.4 Playback on Screensaver/Screenlock

December 11, 2009

I’ve just uploaded to GitHub a script to pause Amarok 1.4 playback when the screensaver/screenlock starts and unpause it again when the screensaver is closed/unlocked. It addresses the issue I was having with the script at http://nxsy.org/getting-amarok-to-pause-when-the-screen-locks-using-python-of-course where the script would start Amarok if it was not running and it would restart playback on screensaver end/unlock regardless of whether it was playing when the screensaver started.

You could start the script on start-up or plug it into Amarok’s script engine to only be active when Amarok is active.

(Oh and in the future I’ll try to avoid posts that just duplicate items from other RSS/Atom feeds and don’t add much extra value.)


Saving the WordPress.com Export File and The Linked Media Files (and wget’s strictness)

December 7, 2009

So I’ve been wanting a way to automatically back up my wordpress.com export file. I decided to go for a bash and wget mix to do this work. But I soon hit a problem: wget won’t save cookies that have a path different to the file you are downloading. This is a problem because of what I basically do to get the export file, which is as follows.

Grab wp-login.php. This will issue a cookie that WP looks for as proof that I can indeed store cookies.

Next I post login credentials to wp-login.php. This will issue a bunch of authentication cookies. Specifically,

Set-Cookie: wordpress_test_cookie=WP+Cookie+check; path=/; domain=.wordpress.com
Set-Cookie: wordpress=some_string; path=/wp-content/plugins; domain=.wordpress.com; httponly
Set-Cookie: wordpress=some_string; path=/wp-admin; domain=.wordpress.com; httponly
Set-Cookie: wordpress_logged_in=some_string; path=/; domain=.wordpress.com; httponly
Set-Cookie: wordpress_sec=some_string; path=/wp-content/plugins; domain=.wordpress.com; secure; httponly
Set-Cookie: wordpress_sec=some_string; path=/wp-admin; domain=.wordpress.com; secure; httponly

The problem is that wget will refuse to save numbers 2, 3, 5 and 6 (only saving wordpress_test_cookie and wordpress_logged_in). It refuses the rest because it requires the cookie path to match the path of the file you are requesting. Using --debug, wget says,

cdm: 1 2 3 4 5 6 7 8Attempt to fake the path: /wp-content/plugins, /wp-login.php
cdm: 1 2 3 4 5 6 7 8Attempt to fake the path: /wp-admin, /wp-login.php
cdm: 1 2 3 4 5 6 7 8Attempt to fake the path: /wp-content/plugins, /wp-login.php
cdm: 1 2 3 4 5 6 7 8Attempt to fake the path: /wp-admin, /wp-login.php

Specifically to get the export file I need the wordpress_sec cookie for the path /wp-admin. I can’t just request /wp-admin and try to get the cookie from there because only wp-login.php will let me post credentials.
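To see why those cookies get thrown away, here is a small sketch of a prefix-style path check. This is not wget’s actual code; it is my own approximation written along the lines of the path matching rule in RFC 6265 (section 5.1.4), and the function name `path_matches` is made up for this example.

```python
def path_matches(cookie_path, request_path):
    """Return True if a cookie with cookie_path may apply to request_path.

    The cookie path must equal the request path or be a prefix of it that
    ends on a "/" boundary.
    """
    if cookie_path == request_path:
        return True
    if request_path.startswith(cookie_path):
        # Here request_path is strictly longer, so the index below is safe.
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


request_path = "/wp-login.php"
results = {
    path: path_matches(path, request_path)
    for path in ["/", "/wp-admin", "/wp-content/plugins"]
}
print(results)
```

Under a check like this only the path=/ cookies are acceptable for a response from /wp-login.php, which lines up with wget keeping wordpress_test_cookie and wordpress_logged_in and dropping the other four.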

Possible solutions are A) write a hacky solution that just grabs the cookie value using grep/sed and manually add this to the cookies file, B) recompile wget to accept some other argument that will accept these cookies, or C) don’t use wget.

I took a look at the source for wget, and it was easy to identify the problem area. I could simply remove this segment,

/* The cookie sets its own path; verify that it is legal. */
if (!check_path_match (cookie->path, path))
  {
    DEBUGP (("Attempt to fake the path: %s, %s\n",
             cookie->path, path));
    goto out;
  }

But then my download script wouldn’t be as portable and I’ll have to make sure I use and have the patched wget available.

I ended up using curl for some parts, but I probably could have done option A.

Anyhow, the script is here. It should grab the export xml file as well as any media files that it references and were uploaded to that wordpress.com blog.