Warning: fwrite(): supplied argument is not a valid stream resource in /var/www/www.schuirink.net/www/xml/headlines.php on line 383

Warning: fclose(): supplied argument is not a valid stream resource in /var/www/www.schuirink.net/www/xml/headlines.php on line 384
planet_postgreql @ the web & the world :: hundreds of fresh newsfeeds on schuirink.net
schuirink.net
main destinations: home | the web & the world | out of here
Google

news headlines

News headlines collected from 498 newsfeeds.

Planet PostgreSQL

url: http://planetpostgresql.org

Devrim Gündüz: Redirecting planetpostgresql.org to planet.postgresql.org -- please update your bookmarks


When I started Planet PostgreSQL project  about 6.5 years ago, we had only very few (almost none!) bloggers -- but we had to start at some point. As time moved on, lots of people started blogging, and we had invaluable blog posts about various PostgreSQL related topics.


When community decided to fork the planet project because of valid reasons, it increased the visibility of blogs, because it was also feeding postgresql.org main page. It also has more bloggers. However, we had tons of people who were following "my" planet, because they already bookmarked it and never ever heard about new planet.


Anyway, I'll keep it short: As of today, I will be stopping updating this planet. This will be the latest post on this planet. Please use http://planet.postgresql.org  as the new Planet PostgreSQL URL.


I will redirect domains in a few months after making sure that (almost) everyone updated their bookmarks.


There will be no changes in people.planetpostgresql.org.


Regards, Devrim



Selena Deckelmann: 2010 retrospective


I’m barely back from a 2-week vacation, and still easing my way back into Portland living.

Looking back on 2010, the highlights were:

  • Listed as a major contributor to PostgreSQL. In typical community-style, I found out by chance by looking at the contributors website one day, weeks after the change was made. Sneaky devils. :)
  • Gave talks at LinuxConf.AU in New Zealand, LibrePlanet in Boston, PgCon in Ottawa, CouchCamp in the Bay area, Grace Hopper Celebration of Women in Computing in Atlanta, LinuxFest Northwest in Bellingham, WA, Google Summer of Code Mentor’s Summit in Mountain View, and keynoted DevNation in Portland.
  • Helped run the first ever IgniteGov at GOSCON in Portland!
  • Spoke at and participated in KiwiFoo and FooCamp. KiwiFoo changed my life. At some point, I’d love to live in New Zealand.
  • Made stickers with Reid and distributed OSFY, FSFY and OALFY stickers across the world.
  • Lead the second successful Open Source Bridge conference. The second edition brought in Danny O’Brien, as he started his work with the Committee to Protect Journalists, Leigh Honeywell, who chronicled the rise of hackerspaces around the world, and Mayor Sam Adams. Danny’s talk was an inspiration to me, and I’m looking for opportunities to create better and more secure software for people who fight political oppression.
  • Started work at Emma, a lovely group of people and tons of interesting problems to work on.
  • Learned Python. Well, still learning python. :)
  • Took a couple vacations – Fourth of July weekend in Nashville, TN and two weeks in St Croix in December.
  • Spent a week in Montana with my mom for the first time in many years and had Thanksgiving with family at her mountain-man hideout near Tally Lake.
  • Drove to and from from Montana. I love road trips.
  • Started the PDX11 site to help publicize what the city and citizens are working together to create an even better and more inviting software community in Portland.
  • Worked on trolluniversity.com with Duke and Bart. Lots of scotch, ontologies and awkward jokes expected in the future.
  • Saw one of the world’s largest tesla coils turned on and run in a giant warehouse. Twice. And learned a tiny bit more about lightning, Tesla and high voltage.
  • Connected with a ton of new friends in open source communities, and hope to make many more in the new year.

I’m sure I left a few things out. I wish I would have written this post last year for 2009, as it was also an amazing year, full of travel, new experiences and wonderful people. Maybe I’ll do that and just backdate it for myself. I forgot that I’d done a picture inspiration last year for 2010. I substituted scuba diving for flying – cheaper and quite a bit safer. :) Otherwise.. mostly all came to pass.

Another thing about this year: I’m still not sure what it is that I want to become. I’ve had many conversations over the last six months with women who are at a similar stage of life – stable work/career, considering having children, loving travel, obsessed with software and communication.

What comforted me is that they were sitting with the ambiguity of it all as well, and finding ways of sifting through probabilities to home in on what’s truly important. I think happiness is an important factor in decision making about life’s goals, but there’s a big part of me that’s more apt to go for interestingness instead. So, while sometimes things have gone off the rails or been incredibly difficult in the last year, all the experiences I listed above have made the sacrifices and contemplative time worth it.

Related posts:

  1. Foocamp 2010: lovely, expectant, reflective



Hubert Lubaczewski: Waiting for 9.1 ? Unlogged tables


On 29th of December, Robert Haas committed interesting patch, which does: Support unlogged tables.   The contents of an unlogged table aren't WAL-logged; thus, they are not available on standby servers and are truncated whenever the database system enters recovery. Indexes on unlogged tables are also unlogged. Unlogged GiST indexes are not currently supported. (edited [...]

Endpoint Team: Version Control Visualization and End Point in Open Source


Over the weekend, I discovered an open source tool for version control visualization, Gource. I decided to put together a few videos to showcase End Point's involvement in several open source projects.

Here's a quick legend to help understand the videos below:

The branches and nodes correlate to directories and files, respectively. In the case of the image to the left, the repository has a main directory with several files and three directories. One of the child directories has one file and the other two have multiple files.
A big dot represents a person, and a flash connecting the person and a file signifies a commit.
White + blue dots represent current End Point employees.
White + grey dots represent former End Point employees.
White dots represent other people, out there!

The Videos

Interchange from endpoint on Vimeo.

pgsi from endpoint on Vimeo.

Spree from endpoint on Vimeo.

Bucardo from endpoint on Vimeo.

One of the articles that references Gource suggests that the videos can be used to visualize and analyze the community involvement of a project (open source or not). One might also be able to qualitatively analyze the stability of project file architecture from a video, but this won't reveal anything definitive about the code stability since external factors can influence file structure. For example, since I am intimately familiar with the progress of Spree, I can identify when Spree transitioned to Rails 3 in the video, which required reorganization of the Spree core functionality (read more about this here and here).

In the case of this article, I wanted to highlight End Point's involvement in a few open source projects where we've had various levels of involvement. We've contributed to Interchange since 2000. We've been involved in Spree less lately, but had more presence in early 2009. In the smaller projects Bucardo and pgsi, End Point employees have worked on a team to be the primary contributors to the projects in addition to a few external contributors. Open source is important to End Point, and it's great to see our presence demonstrated in these cute videos.



PostgreSQL Weekly News: PostgreSQL Weekly News January 2nd 2011


PostgreSQL 9.1alpha3 released! Get it here: http://developer.postgresql.org/pgdocs/postgres/release-9-1-alpha.html Here's how to test it: http://wiki.postgresql.org/wiki/HowToBetaTest

Pavel Stehule: pltoolbox released


Hello
I released new version of pst collection. This version was enhanced about a regress tests and documentation. At the same time I renamed this package to PL toolbox, because functions from this package are intended for using in stored procedures.
Happy New Year

Andrew Dunstan: What was in your stocking?


One of my Christmas gifts this year was this charming little statue of Ganesha, the Hindu God with an elephant's head. Some years ago I was going to create a piece of software named for Ganesha, a Postgres-backed blog, but as often happens life and other things got in the way. He's used in the logo for Dunslane Consulting LLC, and is a very appropriate God to be associated with PostgreSQL, being recognized by Jains and Buddhists as well as Hindus (so he has wide appeal), and is a patron of arts and sciences and deva of intellect and wisdom, according to Wikipedia.

And he's cute.

Pavel Stehule: Bitmapset for PL/pgSQL


PostgreSQL has a nice functions for operations over bitmapset. A bitmapset is set of positive integers. It should be smaller than array, and, what is important for me. it support very fast test if some value is in set or not. PostgreSQL internally use this functions, but there are not possibility to use it from SQL (PL/pgSQL). I wrote a wrapper for this functionality and moved it to PST collection - http://pgfoundry.org/frs/download.php/2909/pstcoll-10-12-29.tgz.so there is some preview:
pavel=# select pg_column_size(pst.bitmapset '{1,2,3,4}');
pg_column_size
----------------
8
(1 row)

pavel=# select pg_column_size(array[1,2,3,4]);
pg_column_size
----------------
40
(1 row)

pavel=# select pst.add_members('{}', 1, 2, 4,8);
add_members
-------------
{1,2,4,8}
(1 row)

pavel=# select pst.is_member(pst.add_members('{}', 1, 2, 4,8), 8);
is_member
-----------
t
(1 row)

pavel=# select pst.bitmapset_union('{1,2,3}','{6,2,9}');
bitmapset_union
-----------------
{1,2,3,6,9}
(1 row)
Without bitmapsets we have to use a arrays. But bitmaps are better for storing some flags - it's more adequate tool.
pavel=# select count(*) from omega;
count
---------
1010000
(1 row)

pavel=# select bitmapset_collect(a) from omega;
bitmapset_collect
--------------------------
{0,1,2,3,4,5,6,7,8,9,10}
(1 row)

Time: 558.667 ms

pavel=# select array_agg(distinct a) from omega;
array_agg
--------------------------
{0,1,2,3,4,5,6,7,8,9,10}
(1 row)

Time: 3859.567 ms
Using a bitmapset is about 7x faster.
pavel=# select del_members('{2,3,4,5,1}',2,3);
del_members
-------------
{1,4,5}
(1 row)

pavel=# select bitmapset_is_subset('{1,2,3}','{1,3}');
bitmapset_is_subset
---------------------
f
(1 row)

pavel=# select bitmapset_overlap('{1,2,3}','{1,3}');
bitmapset_overlap
-------------------
t
(1 row)

pavel=# select bitmapset_difference('{1,2,3}','{1,3}');
bitmapset_difference
----------------------
{2}
(1 row)
Probably this simple wrapper can be enhanced - some GiST or GIN index can be nice. The bitmapset is limited only on integers. There is not simple way to use it together with enums for example.

Leo Hsu and Regina Obe: PostgreSQL 9 High Performance Book Review


PostgreSQL 9.0 High Performance In a prior article we did a review of PostgreSQL 9 Admin Cookbook, by Simon Riggs and Hannu Krosing. In this article we'll take a look at the companion book PostgreSQL 9 High Performance by Greg Smith.

Both books are published by Packt Publishing and can be bought directly from Packt Publishing or via Amazon. Packt is currently running a 50% off sale if you buy both books (e-Book version) directly from Packt. In addition Packt offers free shipping for US, UK, Europe and select Asian countries.

For starters: The PostgreSQL 9 High Performance book is a more advanced book than the PostgreSQL 9 Admin Cookbook and is more of a sit-down book. At about 450 pages, it's a bit longer than the PostgreSQL Admin Cookbook. Unlike the PostgreSQL 9 Admin Cookbook, it is more a concepts book and much less of a cookbook. It's not a book you would pick up if you are new to databases and trying to feel your way thru PostgreSQL, however if you feel comfortable with databases in general, not specific to PostgreSQL and are trying to eek out the most performance you can it's a handy book. What surprised me most about this book was how much of it is not specific to PostgreSQL, but in fact hardware considerations that are pertinent to most relational databases. In fact Greg Smith, starts the book off with a fairly shocking statement in the section entitled PostgreSQL or another database? There are certainly situations where other database solutions will perform better. Those are words you will rarely hear from die-hard PostgreSQL users, bent on defending their database of choice against all criticism and framing PostgreSQL as the tool that will solve famine, bring world peace, and cure cancer if only everyone would stop using that other thing and use PostgreSQL instead:). That in my mind, made this book more of a trustworthy reference if you came from some other DBMS, and wanted to know if PostgreSQL could meet your needs comparably or better than what you were using before.

In a nutshell, if I were to contrast and compare the PostgreSQL 9 Admin Cookbook vs. PostgreSQL High Performance, I would say the Cookbook is a much lighter read focused on getting familiar with and getting the most out of the software (PostgreSQL), and PostgreSQL High Perofrmance is focused on getting the most out of your hardware and pushing your hardware to its limits to work with PostgreSQL. There is very little overlap of content between the two and as you take on more sophisticated projects, you'll definitely want both books on your shelf. The PostgreSQL 9 High Perofrmance book isn't going to teach you too much about writing better queries,day to day management, or how to load data etc, but it will tell you how to determine when your database is under stress or your hardware is about to kick the bucket and what is causing that stress. It's definitely a book you want to have if you plan to run large PostgreSQL databases or a high traffic site with PostgreSQL.

PostgreSQL 9 High Performance is roughly about 25% hardware and how to choose the best hardware for your budget, 40% in-depth details about how PostgreSQL works with your hardware and trade-offs made by PostgreSQL developers to get a healthy balance of performance vs. reliability, and another 35% about various useful monitoring tools for PostgreSQL performance and general hardware performance. Its focus is mostly on Linux/Unix, which is not surprising since most production PostgreSQL installs are on Linux/Unix. That said there is some coverage of windows such as FAT32/NTFS discussion and considerations when deploying terabyte size databases on Windows and issues with shared memory on Windows.

Full disclosure: I got a free e-Book copy of this book just as I did with PostgreSQL 9 Admin Cookbook.


Continue reading "PostgreSQL 9 High Performance Book Review"

Magnus Hagander: Feedback from PGDay.EU the final part - the venue and registration


The big change for PGDay.EU this year really was the switch from a university venue (first Monash University in Prato, then ParisTech in Paris) to a hotel venue (The Millennium Hotel in Stuttgart). We believe that much of the rest of the conference was an improvement over previous years - but it was an incremental improvement, whereas the change of venue was rather drastic. Looking at the feedback on this, I think we can conclude that this change was in general a positive one:

1139387658_2VXMu-O.png

We're seeing a total of 75% who rate the venue as a 4 or a 5. Looking at the freetext comments, a large majority of them are very positive, but there are a few ones that stand out:

  • Several people mentioned it was bad that the two sets of rooms (Berlin vs non-Berlin rooms) were very far apart. This is definitely something that we noted, and will attempt to avoid next year.
  • A few people mentioned that it would be nice if the hotel was closer to the city center. This is definitely true - unfortunately, closer to the city center means higher prices. We hope to find something closer to a city center at a reasonable price next year - by making sure we start to look and book early enough.
  • A few people commented that we shouldn't hold this in northern/central Europe in December due to weather (snow anyone?). Our goal is to move the conference back to an earlier date during the autumn - again, the main reason we ended up in December this year was that we started looking for a venue too late.
  • A couple of people commented that the hotel room rates were too high at the Millennium. There were cheaper hotels around to use - but of course, those aren't as convenient. This wasn't helped by the fact that the hotel group rate dropped off the hotel website twice, causing some people to get their reservations at a higher rate.
  • Isolated people commented that they did not like the hotel - "too big, unpersonal" and "feels like a prison".

Amongst the positive ones we find a large number of comments saying that the "integrated venue" or "all inclusive" venue was a great step up.

Closely related to the venue, is the food. Unlike the big north American conferences PGCon and PG-East/West, we have for the past two years tried to provide proper lunches and not just sandwiches/boxed lunches. This obviously costs more money, but we believe it's worth it, and we think our visitors do. Last year we had a catering firm bring us assorted food, mainly cold cuts, at the conference venue, and this year we got proper lunch buffets (including multiple choices for dessert, of course..) at one of the hotel restaurants. I think the ratings speak for themselves - I would encourage those other conferences to look into improving their lunches as well!

1139387657_HLpDJ-O.png

A full 82% rated the food as 4 or 5. In the end, the cost for paying for a lunch "on ones own bill" would probably have cost more than half the conference fee - so we think we managed to provide some very good value. In fact, several people rated the food as being the best part of the conference(!)

There was, however, one person who said the food was one of the worst things about the conference - if you recognize that was you, we would very much like to know exactly why (no details were included) - please send me an email or write a comment here!

A few people commented on the large amount of food left over from lunch on at least one of the days - it is up to the hotel to decide what to do about that, but it is our belief that they do something "reasonable" with it - and not just throw it away. We know that the caterers last year delivered all leftovers to a nearby homeless shelter, for example. For next year, we will attempt to again get a specification from the catering/restaurant as to what happens to leftovers.

We feel that the overwhelming majority of our visitors found the changes an improvement, and we will therefor pursue something similar as our primary option for next year. We are always interested in improving further, of course, so if you have any other ideas - let us know! The final question we asked about the venue was where to hold the conference next year. Many were quite ambiguous in their suggestions ("big city in Europe" is in, "Hawaii" is out because we want to stick to Europe). Summarizing what we could gave us the following:

1139392945_EeBeh-O.png

  • Obviously, we see a bias towards Germany - since we were in Germany this time. However, we are only going back to Germany next year as a last resort - we want to move around. We will eventually come back to Germany of course - but not next year.
  • Some people commented that they will not be able to attend in a country other than Germany because they wouldn't understand the language of the talks. To deal with this, we are considering adding non-local-or-english talks as well for next year independent of where it is - where German talks (along with French and maybe Spanish) would be included even if the conference isn't in Germany.
  • Our Germany community is also looking into creating a specific PGDay Germany next year, which will be a smaller event focused on the local market - something we as PostgreSQL Europe will help and encourage.
  • I'm surprised to find Stockholm so high up on the list - I promise I didn't put any of those votes in there myself!
  • It's good to note that all the cities having 2 or more suggestions were already on our list of places to look at for next year.
  • We will consider this input and start looking for venues. This time we will not attempt to decide and announce a city first and find a venue later, we'll do it in the other order.

The final part of our evaluation was considering the conference website and registration:

1139387653_s4bHN-O.png 1139387651_Yoeab-O.png

In general these are very good rates. I'm happy to see that more than 50% rate the website overall experience as 4 or 5 - that's a much better rating than it's being given by the people who edit the content on it! Same for registration, with very few people rating it really low. There's clearly some room for improvement though:

  • A few people commented they wanted non-paypal registration options. While the paypal system we use actually allow you to do a credit card payment without the need to sign up for paypal (which some people did not realize and thus sent us an email before registering asking about it), not everybody has a credit card (this is not America - or Sweden). We'd be very happy to hear suggestions for what to do here though - we've looked at many different options, and paypal turned out to be by far the best one. We need something that supports automation and is reasonably fast. We did also support bank transfer in extraordinary cases - but that's not something that can be automated (unless you are a much bigger customer to the bank than we are), and it takes a long time for some payments, since they have to cross borders. So - any suggestions are welcome, and our core registration system is designed to support multiple payment methods.
  • Nobody actually wrote in the conference feedback that we lack a good interface for bulk registration, but we are aware of this - we had a few (less than 10 in total) entities wanting to register more than 2-3 persons at the same time for a single invoice, and our current system does not provide a reasonable way of dealing with this. This is definitely something we need to work on for next year.
  • It's been suggested we add a "skill level" entry to each talk, to make it easier for an attendee to know if it's a beginner or advanced talk. This is definitely something we'll look at doing for next year.
  • One suggestion is we include a full list of all attendees including their email address in the conference handouts, to make it easier to contact each other. This is not something we're going to do as a general thing, since we don't want to go distributing such lists. But we may consider adding it as an opt-in feature, where you can choose on registration if you want to be included in such a list.
  • Several people suggested adding videos of the talks - either as realtime streaming or as downloadables. We're not likely to add a real-time streaming, but we are considering doing talk recording. It does add a fairly large amount of work though, so we'll be needing more volunteers to cope with it...
  • We need to make it more clear that 5 is the best and 1 is the worst on the feedback forms. We know a few people filled them in wrong (we hope it meant they gave us bad rates when they meant good, but we don't know that), and it was also mentioned in the feedback.

In summary, here are some reasons in graphical and textual forms why you should already put attendance to next years PostgreSQL Conference Europe in your budget:

1139387650_axwxQ-O.png

Freetext comments:
  • "The overall organization of that event was excellent."
  • "Very good organization, great people, interesting talks, vibrant community in general. Lots of core dev presents, high level of knowledge."
  • "Great organization from beginning (registration at the website, information prior to the event), arriving and registering (internet access already available, great t-shirt and backpack) to the conference itself (sessions, warning speakers about how much time is left), good food and drinks at the breaks and at lunch. Kudos to the organizers and everyone who helped make this happen."
  • "I think the organisation was perfect. There where many people and all know where they had to go to."
  • "The huge amount of information, inspiration and positive energy. Actually I hacked my first patch on the way back."
  • "The people especially the staff :-) Both keynotes were stimulating good dsicussions with my peers"
  • "Very good conference. I felt really cosy there. As a noob to PG, I got a lot of information and I lost the fear of asking the experts (either on the mailing list or on IRC)."
  • "The organization was really great. Maybe the best PostgreSQL conference I've attended so far."

That concludes my summaries of the feedback from this years PGDay.EU conference. If your specific comments haven't been called out here, don't worry - we still read them all and will consider them all for next year!

Finally, thanks again to all who helped make this conference great!

See you again next year!



PostgreSQL Weekly News: PostgreSQL Weekly News December 26th 2010


Happy Boxing Day from the PostgreSQL Weekly News!

Leo Hsu and Regina Obe: String Aggregation in PostgreSQL, SQL Server, and MySQL


Question: You have a table of people and a table that specifies the activities each person is involved in. You want to return a result that has one record per person and a column that has a listing of activities for each person separated by semicolons and alphabetically sorted by activity. You also want the whole set alphabetically sorted by person's name.

This is a question we are always asked and since we mentor on various flavors of databases, we need to be able to switch gears and provide an answer that works on the client's database. Most often the additional requirement is that you can't install new functions in the database. This means that for PostgreSQL/SQL Server that both support defining custom aggregates, that is out as an option.

Normally we try to come up with an answer that works in most databases, but sadly the only solution that works in most is to push the problem off to the client front end and throw up your hands and proclaim -- "This ain't something that should be done in the database and is a reporting problem." That is in fact what many database purists do, and all I can say to them is wake up and smell the coffee before you are out of a job. We feel that data transformation is an important function of a database, and if your database is incapable of massaging the data into a format your various client apps can easily digest, WELL THAT's A PROBLEM.

We shall now document this answer rather than trying to answer for the nteenth time. For starter's PostgreSQL has a lot of answers to this question, probably more so than any other, though some are easier to execute than others and many depend on the version of PostgreSQL you are using. SQL Server has 2 classes of answers neither of which is terribly appealing, but we'll go over the ones that don't require you to be able to install .NET stored functions in your database since we said that is often a requirement. MySQL has a fairly simple, elegant and very portable way that it has had for a really long time.


Continue reading "String Aggregation in PostgreSQL, SQL Server, and MySQL"

Christophe Pettus: Extra columns when doing .distinct() in a Django QuerySet


tl;dr: If you are doing a .distinct() query and limiting the results using .values() or .values_list(), you may be in for a surprise if your model has a default ordering using the Meta value ordering. You probably want to clear the ordering using .order_by() with no parameters.


If a model is ordered, either by .order_by() on the QuerySet or a Meta ordering value, it will always include that field in the QuerySet. This is true even if the query uses .distinct(). To quote the documentation:

Any fields used in anorder_by() call are included in the SQL SELECT columns. This can sometimes lead to unexpected results when used in conjunction with distinct().

(The documentation as written implies that is only problem with related models, but as we’ll see, it’s a problem in general. A documentation patch is probably in order here.)

By way of illustration, let’s assume you have the following models:

from django.db import models

class Publisher(models.Model):
    name = models.TextField()

    class Meta:
        ordering = [ 'name', ]

class Book(models.Model):
    title = models.TextField()
    topic = models.TextField()
    publisher = models.ForeignKey(Publisher)

    class Meta:
        ordering = [ 'title', ]

And we create some rows:

pub = Publisher(name="Strange But True Publications")
pub.save()

And some books:

book1 = Book(title="New Topics in Industrial Meringue Production",
             topic="Cooking",
             publisher=pub)
book1.save()

book2 = Book(title="Your Chicken's First Song Book",
             topic="Animal Husbandry",
             publisher=pub)
book2.save()

Now, we want to get the list of IDs of the publishers, and we’re using the cool optimization that I described earlier, with the optimization a commenter suggested (thanks!):

>>> q = Book.objects.values_list('publisher_id', flat=True).distinct()
>>> print q
[1, 1]

Um, wait. That’s not right. Why would it return 1 twice when we said .distinct()? Let’s look at the SQL (you are doing a tail -f on the PostgreSQL logs while you develop, right?):

LOG:  statement: SELECT DISTINCT "x_book"."publisher_id", "x_book"."title" FROM "x_book" ORDER BY "x_book"."title" ASC LIMIT 21

And there we have it. It includes the title field in the query, even though it doesn’t return it. Since the DISTINCT thus applies to both, we have two distinct rows, rather than one.

The fix, fortunately, is easy; just clear the ordering with a .order_by() without any parameters:

>>> q = Book.objects.values_list('publisher_id', flat=True).distinct().order_by()
>>> print q
[1]

And the query:

LOG:  statement: SELECT DISTINCT "x_book"."publisher_id" FROM "x_book" LIMIT 21


Magnus Hagander: Feedback from PGDay.EU - the speakers


The next issue of my "pie-chart-overflow blog posts about PGDay feedback" is about our speakers. The speakers are, if that's not obvious, the reason that people come to the conference. Having good speakers is an absolute requirement if we want to keep up the quality of the conference. Other things like venue and price are certainly important, but nothing compares to the actual content of the conference - which is provided by our speakers.

I'm very happy to say that we seem to have manage to keep the very high numbers for Speaker Quality that we had from last year (differing less than 3% which is well within the margin of error). The same goes for the scores our speakers got on their knowledge of the topic - indicating that we've managed to attract some of the most skilled speakers in the world. Which is not surprising given that in many cases, we the person speaking about a feature is actually the guy who wrote it. What is more surprising is that these same people are rated as very good speaker - which we all know isn't always true about your stereotypical developer.

1134342926_xzhmk-O.png 1134342924_gbA6T-O.png

Just like last year, we're not going to post the complete list of speaker ratings, given that they are easy to read wrong. But here is a list of our top speakers, excluding any that had less than 5 ratings. Any speakers who have fewer than 10 should be considered a very uncertain number, and I've again included the standard deviation to determine the uncertainty. We had a lot more speakers this year, so I have only included those scoring 4 or above this time around. Each speaker has received his own detailed score, of course.

Place Speaker Quality Score Standard deviation Number of votes
1 Dimitri Fontaine 4.8 0.5 8
2 Mason Sharp 4.7 0.9 11
2 Magnus Hagander 4.7 0.7 29
4 Simon Riggs 4.6 0.7 52
4 Simon Phipps 4.6 0.9 45
6 Andreas Scherbaum 4.5 0.7 34
6 Ed Boyajian 4.5 1.1 33
8 Bruce Momjian 4.4 0.9 54
8 Gianni Ciolli 4.4 0.8 38
8 Tim Bunce 4.4 1.0 10
11 Jan Aleman 4.2 1.0 11
12 Tim Child 4.1 0.8 9
12 Michael Meskes 4.1 1.2 10
14 Bernd Helmle 4.0 0.6 6
14 Heikki Linnakangas 4.0 0.8 30
14 Linas Virbalas 4.0 0.9 10

The list based on Speaker Knowledge looks slightly different, but not very much. Given that our speaker knowledge has been rated even higher than speaker quality, I've only included those who scored 4.6 or higher (which is a fantastically high cutoff)

Place Speaker Knowledge Score Standard deviation Number of votes
1 Tim Child 5 0 9
2 Joe Conway 4.9 0.3 10
3 Simon Riggs 4.8 0.7 52
3 Linas Virbalas 4.8 0.4 9
3 Magnus Hagander 4.8 0.8 29
3 Dimitri Fontaine 4.8 0.5 8
7 Andreas Scherbaum 4.7 0.8 34
7 Bruce Momjian 4.7 1.0 53
9 Mason Sharp 4.6 1.2 11
9 Heikki Linnakangas 4.6 0.8 30
9 Simon Phipps 4.6 1.0 45
9 Gianni Ciolli 4.6 0.8 38
9 Tim Bunce 4.6 1.3 10
9 David Fetter 4.6 0.6 16

A great big thanks to all our speakers - you did a fantastic job.

We will need to work hard to keep up our recruiting of speakers for next years. If you were considering but decided not to submit a talk for some reason - please let us know why, so we can improve! Or if you have any ideas in general on our processes around this. For example, we had no female speakers at all this year - we know you're out there, and we certainly want you there, so what do we need to change to make this more interesting for you as a potential speaker? The same goes for other groups that we were missing of course: now is the time to let us know so we have the time to change things before next year!



Magnus Hagander: Feedback from PGDay.EU - the contents


This blog seems to be turning into a PGDay blog rather than a general PostgreSQL blog. But I promise I'll get back to some more technical content soon - or at least that I'll try.

A couple of days ago we closed the feedback system from PGDay.EU 2010, and have been busy tallying the result. It turns out that my constant nagging on people to please fill out the feedback worked - we got a lot more feedback this year than last year. That also means there's a lot more work in going through mainly all the freetext comments - that's the price I have to pay, I guess. In total we had around 60 people who left "full conference feedback", which is almost double from last year. It's still only just over 25% of the attendees, so it could certainly be even better yet. We also had 86 people who left session feedback (this is around 40% and a much better number of course) for a total of 570 session feedback entries.

So what did the feedback say - time for some pie charts! We've actually seen a slight decrease in the ratings for topic importance. This may well be because we've broadened the topics more. We're still seeing very good grades for content quality, which reinforces my feeling that our speakers deliver very valuable content to the attendees, and that the conference is well worth attending. (As a note to readers - I've had several people point out to me that german people are used to rating 1 being the highest and 5 being the lowest, so there may be some skewing in the voting because of this. Even though the pages very clearly stated that 5 is the highest, this is something we need to make even more clear for next year)

1134327279_P3NvS-O.png 1134331393_Gk3nY-O.png

We spent a lot of time trying to put together the puzzle that is the schedule for so many talks over so short time. It turns out that we did a good job in general, but there was a large amount of overlap where people wanted to go to many talks at the same time. We also received a lot of comments in the freetext fields about this, and this is definitely something that we will consider for next year. It would probably have been better content-wise to have three tracks spread over three days (maybe not entirely complete) rather than four tracks over two days, but that would also have increased many of the costs with 33% which is a lot of money...

1134332171_UQHa3-O.png 1134332230_n5q7E-O.png

Of course, the "Hallway track" is a very important part of any conference like this, and this year we collected specific feedback on this side. I'm very happy to see that more than two thirds of our attendees rated the learning part of the hallway track as 4 or 5, and well over half found it a good way to connect with other people in the community!

1134329207_QMR5T-O.png 1134329160_XnGKu-O.png

If these numbers don't make you interested in next years PostgreSQL Conference Europe then, really, you're reading them wrong...

That's enough pie-charts for one post. I will follow this up with more feedback summary on our speakers and on our venue once it's ready.



Christophe Pettus: Getting the ID of Related Objects in Django


tl;dr: Don’t retrieve a whole row just to get the primary key you had anyway. Don’t iterate in the app; let the database server do the iteration for you.


There’s a couple of bad habits I see a lot in Django code (including, sadly, my own), which is abuse of a ForeignKey field. Let’s take the classic example:

class Publisher(Model):
    # We accept the default 'id' column
    name = TextField()
    ...

class Book(Model):
    # Likewise
    title = TextField()
    topic = TextField()
    publisher = ForeignKey(Publisher)
        # Remember this creates a publisher_id column

Now, let’s say we have a book:

b = Book.objects.get(title="Interior Landscapes")

And we want the ID of the publisher.

Don’t do this:

pub_id = b.publisher.id

This works, but it’s absurd: It does a separate select to fetch the entire Publisher object, and then extracts the ID. But, of course, it already had the ID, because that’s how it retrieved the publisher object. Instead, just go straight to the created ID field:

pub_id = b.publisher_id

Next, don’t use iteration to build lists if you can get the data directly out of the database. For example, suppose we want the list of publishers who publish books with topic “Surreal Architecture”. Far too often, I see this:

surreal_books = Books.objects.filter(topic="Surreal Architecture")

surreal_publishers = set([book.publisher.id for book in surreal_books])

In this case, Django will send one query to get the list of books, and then do a separate query for each publisher to get the publisher id… even though they’re already in memory.

surreal_publishers = set([book.publisher_id for book in surreal_books])

This is better, since it doesn’t have to retrieve each publisher, but far better is to make the database do all the work:

surreal_publishers_qs = Books.objects.filter(topic="Surreal Architecture").values('publisher_id').distinct()

The result set, in this case, is a bit of an odd duck: It’s a list of dictionaries, each dict being of the form { 'publisher_id': <id value> }. Of course, Python being Python, it’s not hard to transform that into a set:

surreal_publishers = set([entry['publisher_id'] for entry in surreal_publishers_qs])

And we didn’t have to do any raw SQL!



Leo Hsu and Regina Obe: PL/R Part 3: Sharing Functions across PL/R functions with plr_module


In Part 2 of PL/R we covered how to build PL/R functions that take arrays and output textual outputs of generated R objects. We then used this in an aggregate SQL query using array_agg. Often when you are building PL/R functions you'll have R functions that you want to reuse many times either inside a single PL/R function or across various PL/R functions.

Unfortunately, if you wanted to call a PL/R function from another PL/R function, this is not possible unless you are doing it from spi.execute call. There is another way to embed reusable R code in a PostgreSQL database. In order to be able to share databases stored R code across various PL/R functions, PL/R has a feature called a plr_module. In this tutorial we'll learn how to create and register shareable R functions with plr_module. In the next part of this series we'll start to explore generating graphs with PL/R.


Continue reading "PL/R Part 3: Sharing Functions across PL/R functions with plr_module"

PostgreSQL Weekly News: PostgreSQL Weekly News December 19th 2010


PostgreSQL 9.0.2, 8.4.6, 8.3.13, 8.2.19, 8.1.23 updates released. If you are using Hot or Warm Standby, or PITR, you should upgrade as soon as possible. The 8.1 series is now end-of-life, and users are encouraged to upgrade to a newer major version as soon as possible. http://www.postgresql.org/docs/9.0/static/release.html

Andrew Dunstan: Merry Christmas (or whatever you celebrate)


Quite a number of people have been asking me for a very long time to make the code that runs the PostgreSQL Build Farm server available. For various reasons I have not done so, but today I have made it available. It is published in a new git repo at https://github.com/PGBuildFarm/server-code.

At the same time, I have moved the client code to git and published it under the same github organization. It can be seen at https://github.com/PGBuildFarm/client-code. The old CVS repo on pgfoundry will no longer be used and will receive no further updates. However, I will continue to publish client releases there.

Enjoy!

Hubert Lubaczewski: Two years of explain.depesz.com


First of all – just today I committed patch for Pg::Explain – which is the workhorse behind explain.depesz.com. This patch fixes calculation of exclusive time for explain nodes, and the best thing about it is – I didn’t write it. It’s full patch provided by someone else – Filip Rembia?kowski – my former colleague, friend, [...]

Christophe Pettus: Comparing NULLs Considered Silly


tl;dr: You can’t compare NULLs. A nullable primary key is a contradiction in terms. You can’t join on NULL, so a NULL foreign key refers to nothing, by definition. NULL doesn’t do what you think it does, no matter what you think it does.


NULL in SQL is annoyingly complex.

There’s really no conceptual model of NULL that will not end up surprisingly you in unpleasant ways. Jeff Davis, last year, wrote a great blog post that, if could be so bold, could be paraphrased as “conceptual models of NULL considered harmful.”

Thus, it’s not surprising that some… well, surprising ideas about NULL sometimes pop up.

Recently, on the Django developers’ list, the phrase “nullable primary key” caught my eye. This inspired me to write these thoughts about NULL, and in particular NULL being used as keys.

First:

It ie meaningless to compare two NULL values.

I’ve noticed application programmers often treat NULL as a magic value that any type can possess (I’ve been quite guilty of this, too). While this is somewhat true, it’s also a dangerous path to go down, because:

NULL = NULL

… is NULL, not true. Whatever else you can say about NULL, a NULL value means you can make no claims about what value it is. Saying, “I have no idea what this value is, and I have no idea about what that value is, but are they equal?” is, I would hope, pretty self-evidently meaningless.

Now, this immediately implies:

You cannot join on NULL.

If a foreign key column is NULL, you can’t do an inner join on it to another table, even if key column(s) being referred to is NULL. This follows directly from the fact that you can’t compare NULL values; joining is just built around comparison, after all.

Yes, you can do things like:

SELECT a.*
    FROM a
    INNER JOIN b
        ON (b.col = a.col) OR ( (b.col IS NULL) AND (a.col IS NULL) )

Setting aside that you’ve pretty much committed yourself to a nested loop at this point (and thus a very expensive operation), the fact that you have to jump through this hoop should be an indication that the wrong path is being trod.

So, please remember: NULL in a foreign key field does not mean “This refers to rows in the other table that have a matching NULL,” because there’s no such thing as a “matching NULL.”

Moving on to primary keys:

A primary key is a combination of columns whose values, taken together, uniquely specify a row.

Thus, a nullable primary key is equally meaningless, as the whole point of a primary key is for it it to be compared to other values to determine uniqueness. (The SQL standard prohibits NULLs in primary key columns, so it’s not just a good idea, it’s the law, or at least the recommendation.)


The SQL standard calls for NULL to be thrown up in places where it really should require an error. For example:

SELECT SUM(col) FROM t WHERE FALSE

… returns NULL, as the result of any aggregate function over zero rows is NULL. But the sum of no numbers is 0, not “unspecified” (or whatever you want to call NULL).

Worse:

SELECT AVG(col) FROM t WHERE FALSE

is NULL, while:

SELECT 0/0

… much more rationally gives a divsion-by-zero error.

My guess is that the SQL standards committee is loathe to have the spec require errors for more-or-less common operations, and that’s where a lot of the stranger cases of NULL come from, as a way of having a normal-but-flagged return from an edge case.

It’s really a shame that NULL is so complex and counterintuitive, but there’s really no hope for it except to learn the rules, and not try to abuse NULL to do things it wasn’t designed for.



Selena Deckelmann: Vacation, Dec 18-Jan 3


I’m taking an extended vacation. While it’s possible that I may check my email while I’m out, it is not likely. :)

Sorry for the short notice! I just realized today that I need a bit of a break from email. In fact, I may just delete everything out of my inbox rather than spend hours trying to sort through it when I get back.

So, while I’m out, if there’s something you need my help with, here’s what I suggest you do:

* If it’s PostgreSQL related, contact pdxpug@postgresql.org for local stuff, or pgsql-advocacy@postgresql.org for other advocacy related things.
* If it has to do with some other form of volunteer activity, I don’t have a good backup for me and it will probably just have to wait.
* If I’m really the only person that can help, queue up an email draft and send me a message when I return (After Jan 3, 2011).

Have a lovely holiday season! And we’ll see you in the new year.

Related posts:

  1. Q&A about Hot Standby



Joshua D. Drake: PgEast 2011: NYC Call for papers


December 16th, 2010: Celebrating 15 years of PostgreSQL, early.

Following on the smashing success of PostgreSQL Conference West, PostgreSQL Conference West, The PostgreSQL Conference for Developers, End Users and Decision Makers, is being held at the Hotel Pennsylvania, in New York City from March 22nd through 25th 2011. Please join us in continuing to make this the largest PostgreSQL Conference series!

  • PostgreSQL Conference
  • Call for papers

    Thank you to our sponsors:

  • Command Prompt, Inc.
  • EnterpriseDB

    Time line:

    Dec 16th: Talk submission opens
    Feb 10th: Talk submission closes
    Feb 15th: Speaker notification
    This year we will be continuing our trend of covering the entire PostgreSQL ecosystem. We would like to see talks and tutorials on the following topics:

          * General PostgreSQL: 
                  * Administration 
                  * Performance 
                  * High Availability 
                  * Migration 
                  * GIS 
                  * Integration 
                  * Solutions and White Papers 
          * The Stack: 
                  * Python/Django/Pylons/TurboGears/Custom 
                  * Perl5/Catalyst/Bricolage 
                  * Ruby/Rails 
                  * Java (PLJava would be great)/Groovy/Grails 
                  * Operating System optimization
                    (Linux/FBSD/Solaris/Windows) 
                  * Solutions and White Papers 
    



  • Bruce Momjian: Conference Report


    I just returned from conferences in Germany and Italy which, as usual, allowed me to easily learn about new technologies. I have linked to these excellent talks in hopes others can benefit:

    • Devrim's talk about Red Hat Cluster software was very interesting. Postgres relies on OS-specific tools for auto-failover and, though I have recommended Red Hat Cluster several times, I did not know much about it before Devrim explained it.
    • Similarly, David Fetter has been talking about Writeable Common Table Expressions (CTE) for several years, but I only understood their value after seeing his talk.
    • Another eye-opener for me was Jean-Paul Argudo's talk about Pgpool-II and its many features.
    • The surprise talk was Gianni Ciolli's talk (photo, details) about writing Postgres code to play chess.


    Andrew Dunstan: Still on CVS?


    I was under the impression that Postgres was one of the last major FOSS projects using CVS before we switched to using git. Certainly that's what Josh Berkus said in a recent article. But today I found here that both MinGW and Cywin, both fairly substantial projects, are still using CVS. So I wonder how many others there are, and if they are considering moving.

    On a related note, the WIndows git installation is amazingly big. It's about 172Mb of stuff. By contrast, an entire Mingw/Msys suite sufficient to build Postgres takes less than 1Gb. I know disk space is cheap, but there's no reason to be totally profligate with it. We've seen a few cases recently where buildfarm runs have encountered out of space conditions, and I'm looking at ways to have the client clean up more as it goes.

    Andrew Dunstan: And the good news is ...


    Today my new server machine arrived. I'll play around with it a bit, until CentOS 6 is released, when I'll start setting it up in earnest.

    Also today I managed to get mostly to the bottom of the Mingw issue I talked about the other day. I have a tentative fix, and I think we just need to make sure of one or two things before we make it right. Basically I used git bisect to narrow down the issue to one particular commit. It was quite cool. Next time I'll write a script for bisect to run, though.


    Christophe Pettus: Using Server-Side PostgreSQL Cursors in Django


    This is a follow-up to the previous post, in which we talked about ways of handling huge result sets in Django.

    Two commenters (thanks!) pointed out that psycopg2 has built-in support for server-side cursors, using the name option on the .cursor() function.

    To use this in Django requires a couple of small gyrations.

    First, Django wraps the actual database connection inside of the django.db.connection object, as property connection. So, to create a named cursor, you need:

    cursor = django.db.connection.connection.cursor(name='gigantic_cursor')
    

    If this is the first call you are making against that connection wrapper object, it’ll fail; the underlying database connection is created lazily. As a rather hacky solution, you can do this:

    from django.db import connection
    
    if connection.connection is None:
        cursor = connection.cursor()
           # This is required to populate the connection object properly
    
    cursor = connection.connection.cursor(name='gigantic_cursor')
    

    You can then iterate over the results using the standard iterator or cursor.fetchmany() method, and that will grab results in from the server in the appropriate chunks.



    Andrew Dunstan: Frustration


    In attempting to get a new Mingw buildfarm member working, I discovered that there's a rather dramatic and nasty failure when it's built with the latest mingw toolset. If the client passes connection options to the server the server fails miserably with
    FATAL: parameter "port" cannot be changed without restarting the server
    It doesn't matter what the options setting is. Anything valid at all triggers it.

    I've followed several false trails in trying to find what's causing this, and generally wasted a huge amount of time on it. So far I do know that it's apparently not a client problem (and thus not related to putenv, as several of us thought it might be), not related to compiler optimization, not related to compiler version, and not related to the system's getopt version.

    For now I'm going to have to give up and do some actual paying work. I'll try to return to it later. Given other demands, that could be some weeks away.

    Christophe Pettus: Very Large Result Sets in Django using PostgreSQL


    tl;dr: Don’t use Django to manage queries that have very large result sets. If you must, be sure you understand how to keep memory usage manageable.


    One of the great things about modern interpreted, garbage-collected languages is that most of the memory management happens behind the scenes for you. Unfortunately, sometimes, the stage equipment comes crashing through the backdrop in the middle of the performance.

    In Django, this frequently happens when manipulating tables that contain a very large number of rows. Here are some tips on how to not end up with the “behind the scenes” machinery landing in the audience’s lap.

    For purposes of this discussion, let’s define “very large” as being bigger than is comfortable to keep in memory for the appplication.

    When Django executes a query and reads the results, memory is being taken up several places to hold the results of the query:

    1. On the database server, it needs to keep around structures holding the result of the query. Most database servers are good about not keeping any more rows of the result in memory than they absolutely have to, and in any event, it’s pretty much out of Django’s control what the database server stores. So, we shall trust the database to do the right thing, and move on.

    2. Inside of the Django application, some set of the rows that will ultimately be the result of the query need to be stored while Django processes them.

    3. The Django QuerySet object can (although it does not always) cache some of the results of the query as Model objects.

    4. And, of course, the application might hang on to some of the objects that come back (for example, for display on the web page). Of course, this is directly under the control of the application author. We’ll trust you to do the right thing, and move on.

    First, let’s talk about Django’s caching in the QuerySet object.

    Query Set Caching

    Django’s QuerySet objects serve two roles: They’re data structures representing an SQL query, and an API to access the results of the query. There’s no explicit “do this query now, please” operation in Django (although some operations have that as, shall we say, a strong implication); by and large, Django waits until you try to get the results of a query before executing it. So, until you get the first object out of the query set, Django won’t have even executed the query.

    Django also has a caching mechanism built into QuerySet. This cache stores the objects that are manufactured from the rows as they come back from the database, so that multiple accesses to the same object from the same QuerySet will return the cached object instead of a new copy.

    Note, however, the emphasis on from the same QuerySet. A surprisingly large number of operations clone the QuerySet before operating on it. For example:

    qs = ExampleModel.objects.filter(name='Fred')
    x = qs[2]
    x = qs[2]
    

    This will do two queries. qs[2], under the hood, clones the query set, applies a limit of [2:3] to it, executes the query, returns the resulting object, and throws the limited QuerySet away. Slicing does exactly the same thing.

    However, there is an exception. If you do this:

    qs = ExampleModel.objects.filter(name='Fred')
    list(qs)
    x = qs[2]
    x = qs[2]
    

    … the access pattern will be very different. list(qs) forces the evaluation of the query set, so Django will send the query to the database server, and populate the QuerySet (and its cache) with the result. Then, the qs[2] operaitons don’t copy the QuertSet; it just hits the cache.

    Note, though, that this came at the expense of retrieving every row that matched the query from the server, and creating objects for it. If you force the QuerySet to be evaluated, Django creates objects for everything that matches the query.

    When you iterate over a QuerySet, the behavior is slightly different. The QuerySet cache is always built from the first object that matches the query on up; it’s not sparse (for example, you’ll never have the situation where qs[4] and qs[1000] are in the cache, but the objects between them aren’t). As you iterate over a QuerySet, if the cache is not already populated, Django grabs the rows in chunks (currently hard-coded to be 100) and fills the cache ahead of the iterator. This does mean that if you do a query, then only iterate over the first few elements, the cache doesn’t fill up with stuff you are never going to look at.

    You can defeat the caching by using .iterator(). For example:

    qs = ExampleModel.objects.filter(name='Fred')
    for x in qs.iterator():
       do_something_wonderful(x)
    

    This will execute the query, and return each resulting object back, but without filling the cache. (It also won’t return cached objects if they already exist; .iterator() forces a reexecution of the query.) As the Django documentation says, this can be handy if it is a huge result set.

    So, let’s say you for some reason want to process 100 million rows. You know for sure that you won’t be able to hold all 100 million Model objects in memory, so you dutifully do:

    qs = GiganticTableModel.objects.all()
    for giant in qs.iterator():
        # And BANG, you get an out of memory exception right here.
    

    But what happened? Why did you run out of memory before you even saw a single object out of .iterator()?

    Daddy, Where Do Model Objects Come From?

    Let’s take a moment and trace down the code path that gets executed here:

    • Creating the QuerySet doesn’t touch the database at all, as noted above.

    • After a certain amount of fussing around, .iterator() calls the underlying backend query machinery to perform the query.

    • The backend machinery executes the query, and creates an iterator over the resulting rows. That iterator (in this case) grabs a chunk of rows at a time using .fetchmany(), and returns them one at a time. (At it happens, that chunk is hardcoded at 100 rows.)

    • That iterator is called by the actual iterator returned by .iterator(), so the iteration (pfew!) proceeds as: Call to get a row (which refills if the last grab of 100 is exhausted), create a new object, and return it up. Create an object from that row, return it to the caller.

    So, why are we getting an out of memory condition? Even though there are 100 million rows in the result, there should only be 100 in memory at any one time, right?

    Sadly, wrong. At the moment that the backend machinery executes the query, all 100 million rows are returned by the database server at once.

    To quote the psycopg2 documentation:

    When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process. If the query returned an huge amount of data, a proportionally large amount of memory will be allocated by the client.

    This is true even if you do a .fetchone() or .fetchmany(), not just a .fetchall(). And there’s no way, while staying entirely within the standard Django QuerySet machinery, to change this behavior.

    So, what do we do?

    “Doctor, It Hurts When I Do That.”

    “So, don’t do that.”

    If at all possible, don’t process very large result sets directly in Django. Even setting aside the memory consumption, it’s a horribly inefficient use of pretty much every part of the toolchain. Much more appealing options include:

    1. Use .update() to push the execution into the server.

    2. Use a stored procedure or raw SQL.

    Modern database servers are designed to crunch large result sets; leave the data on the server and do it there.

    Take Smaller Bites

    If there is a way of partitioning the data up into smaller chunks, do that. (For example, processing by day, or ID range.) Although I wouldn’t exactly call it “best practice,” you could iterate through the rows by using ranges of the primary key, assuming a standard Django serial integer PK:

    i = 0
    while True:
        qs = GiganticTableModel.objects.filter(pk__gte=i*1000, pk__lt=(i+1)*1000)
        try:
            for giant in qs:
                do_something_wonderful(giant)
        except GiganticTableModel.DoesNotExist:
            break
    
        i += 1
    

    There’s also an example here of constructing an iterator that does much the same thing.

    Use a Database-Side Cursor

    The way that databases really deal with this problem is cursors. Not the Python DBI cursor, in this case; server-side cursors are a structure which holds the result of a query and allows the client to read portions of it at will without having the whole thing shipped across.

    They’re wonderful, and Django should use them. It doesn’t. However, you can, using direct SQL.

    To create a cursor in PostgreSQL in the server, first, we need to have a transaction open. For the full details about Django transaction management, check out some of my earlier blog posts. This is required because the type of cursors we’ll be using will only persist for the duration of the transaction.

    Now, the SQL sequence looks something like this. Instead of saying:

    SELECT * FROM app_gigantictablemodel;
    

    We say:

    DECLARE gigantic_cursor BINARY CURSOR FOR SELECT * FROM app_gigantictablemodel;
    

    (The BINARY keyword allows it to use the more-efficient binary protocol between the database server and the application.)

    Then, to get results, we can just say:

    FETCH 1000 FROM gigantic_cursor;
    

    … or however many rows we want to get.

    And then, we can just iterate over them (of course, we’re getting the rows as rows, rather than objects):

    cursor = connection.cursor()
        # Remember that this 'cursor' is a different thing than the server-side cursor!
    cursor.execute("DECLARE gigantic_cursor BINARY CURSOR 
                        FOR SELECT * FROM app_gigantictablemodel")
    
    while True:
        cursor.execute("FETCH 1000 FROM gigantic_cursor")
        rows = cursor.fetchall()
    
        if not rows:
            break
    
        for row in rows():
            ...
    

    Now, there’s something that should work great, but doesn’t. In 1.2, Django introduced raw SQL queries that return a RawQuerySet. So, one could in theory do this:

    qs = GiganticTableModel.objects.raw("FETCH 1000 FROM gigantic_cursor")
    

    Except we get an exception:

    Raw queries are limited to SELECT queries. Use connection.cursor directly for other types of queries.
    

    Presumably to guard against weird errors, raw queries do a hardcoded check against the query string, making sure it starts with SELECT. It would be nice if this were liberalized to allow FETCHes.


    So, if you must, you can process gigantic result sets in Django. But, ideally, you should design your application to make it unnecessary. If you must process large result sets, do it on the database server; that’s what it’s there for.



    Robert Hodges: Interested in Sponsoring Tungsten Open Source Features?


    Over the last few months I have been pleasantly surprised by the number of people using open source builds of Tungsten.  My company, Continuent, has therefore started to offer support for open source users and will likely expand these services to meet demand.

    There have also been a number of requests to add specific features to open source builds, especially for replication. We have added a few already but are now considering pushing even more features into open source if we can find sponsors.  These add to a number of great features already in open source like global transaction IDs, MySQL 5.0/5.1, basic drizzle replication, transaction filtering, and many others. 

    Do you have special replication or clustering features you would like to see added to Tungsten? Specialized MySQL to PostgreSQL replication?  Management and monitoring commands?  Cool parallel replication problems?  High-performance logging?  Weird multi-master topologies?  Talk to us about sponsoring new open source features.  We're happy to do projects that solve interesting problems, benefit the open source databases community, and help grow Tungsten as a product. 

    Visit the Continuent website or send email directly to robert dot hodges at continuent dot com.  

    PostgreSQL Weekly News: PostgreSQL Weekly News December 12th 2010


    The Call for Papers is still open for the PostgreSQL room at FOSDEM, which will be held in Brussels, Belgium, on February 5-6, 2011. Send your proposals to fosdem AT postgresql DOT eu.

    Hubert Lubaczewski: Waiting for 9.1 ? KNNGIST


    On 4th of December, Tom Lane committed really cool patch: KNNGIST, otherwise known as order-by-operator support for GIST. This commit represents a rather heavily editorialized version of Teodor's builtin_knngist_itself-0.8.2 and builtin_knngist_proc-0.8.1 patches. I redid the opclass API to add a separate Distance method instead of turning the Consistent method into an illogical mess, fixed some [...]

    Guillaume Lelarge: Attendees of PGDay.eu 2010, send your feedback please


    Now that PGDay.eu 2010 is over, this is a good time for us to think about what we did right and, perhaps more important, what we did wrong. You can help us on that work by telling us what you think about the venue, the caterer, and the talks (and I mean, each of them, including the keynote).

    Thanks.



    Andreas Scherbaum: PGDay.EU 2010 - Kart racing


    Author
    Andreas 'ads' Scherbaum

    On tuesday evening, after the conference, we went to a kart racing arena near Stuttgart. This event was announced in the first mail to all participants, although there was no reminder because all available places in the cars (for transportation to the kart racing) were already reserved after a short time.

    We had 20 drivers, splitted up in 3 groups. Each group had 10 minutes for qualifying and a 10 minute race. Although someone gave me all the papers with the results, there are no names on most of the sheets. So I cannot say who won each of the three races.


    Continue reading "PGDay.EU 2010 - Kart racing"

    Joshua D. Drake: R.I.P. PostgreSQL (Mammoth) Replicator


    In 2004, PostgreSQL Core Team member Josh Berkus wrote[1]:

    "Slony-I is undoubtedly our most popular replication tool. It supports Master-Slave High Availability Replication. However, there are a number of other solutions, such as dbMirror, eRServer, pgPool, C-JDBC, and the proprietary Mammoth Replicator, all of which are in wide use because they solve different replication problems than Slony-I does. Replication is not a single solution for a single problem; it is several solutions for a wide array of different problems. That's why no one replication tool will ever be the "default" replication for PostgreSQL."

    I believed well before then it was a fallacy and a bad community decision to not put support behind a single, integrated solution. I also believe it has been one of the single most important policy failures of the .Org community and has lead PostgreSQL to grow much more slowly than it otherwise could have.

    From a business standpoint it opened the door for many other solutions, a lot of learning experiences and a lot of professional services consulting (Thank you: Slony). It also opened the door for the only solution to look at being integrated into Core. Mammoth PostgreSQL Replicator (Replicator).

    Replicator opened the door for CMD in many ways. Although it was never a huge commercial success we did have early customers (Cisco) and there was quite a few deployments over time. Eventually, we Open Sourced Replicator hoping to generate some external interest, and although quite a few hackers talked to me about participating, none of it actually materialized into anything useful.

    To this day, nothing can touch the usability of Replicator for PostgreSQL replication. It is the easiest to configure, the easiest to run, the easiest to manage. Alas that is not enough, as we needed to re-architect to eliminate the one single point of failure (as well as some other issues, crash safety I am looking at you). Further with the emergence of the Hot Standby technology that is in 9.0, the need for Replicator has become even less.

    Although there were design decisions originally made that were incorrect, I still believe that the overall architecture of Replicator was sound. Implementation, perhaps not but architecture yes. I also believe that the in development version (1.9) would have been a game changer as a whole. Unfortunately we were not able to execute for a number of reasons that don't really matter anymore.

    Everyone who worked on replicator should be proud of what they did. It should be looked on as a learning experience. We built something, nobody else did, even now. We built integrated and flexible PostgreSQL replication. Was it perfect? No. Should we have open sourced it sooner and tried harder to integrate it into the community? Probably. Were there design decisions that should have been different? Yes. Were there components we should have changed (MCP SPOF)? Yes.

    Of course, we can say all of that about the current state of PostgreSQL. Let alone Replicator, Slony or any other project. Just read -hackers. We are not perfect, but our team did something good.

    It took until 2010, 7 years? after Replicator first released for the community to figure out they were wrong. We were pioneers for PostgreSQL and our team should be proud of that.

    R.I.P. PostgreSQL (Mammoth) Replicator (12/10/2010)

    1. http://archives.postgresql.org/pgsql-advocacy/2004-11/msg00063.php



    Pavel Stehule: Iteration over record in PL/pgSQL


    Hello

    A iteration over record in PL/pgSQL is old well known problem. When we try to write a general triggers in PL/pgSQL we can find a break, because we are not able to iterate over record. There was a some workarounds, but these solutions are simply slow and complex. So I wrote a set of functions that can helps.

    First functions is record_expand. This function is similar to unnest function, but related object is record:

     
    postgres=# select * from pst.record_expand(row('10',null,'Ahoj', current_date));
    name | value | typ
    ------+------------+---------
    f1 | 10 | unknown
    f2 | | unknown
    f3 | Ahoj | unknown
    f4 | 2010-12-10 | date
    (4 rows)
    Now isn't problem to write general trigger for detecting a changed colums:

    CREATE TABLE foo(a int, b int, c text, d int);

    CREATE OR REPLACE FUNCTIO update_trg_func()
    RETURNS trigger as $$
    DECLARE r record;
    BEGIN
    FOR r IN SELECT n.name, o.value as oldval, n.value as newval
    FROM pst.record_expand(new) n,
    pst.record_expand(old) o
    WHERE n.name = o.name
    AND n.value IS DISTINCT FROM o.value
    LOOP
    RAISE NOTICE '% % %', r.name, r.oldval, r.newval;
    END LOOP;
    RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER update_trg
    AFTER UPDATE ON foo
    FOR EACH ROW EXECUTE PROCEDURE update_trg_func();

    postgres=# UPDATE foo SET c = 'Pavel', a = 40;
    NOTICE: a 30 40
    NOTICE: c omega Pavel
    UPDATE 1
    Next function allows update of any dynamic record. This function is record_set_fields: It's a variadic functions - you can enter a fields and values to change:

    postgres=# SELECT pst.record_set_fields(row(0,0,'',0)::foo, 'd', 100, 'c', 'Hello');
    record_set_fields
    -------------------
    (0,0,Hello,100)
    It can be used for dynamic initialization of wide tables - for some OLAP purposes:
      
    CREATE TABLE t(a0 int, a1 int, a2 int, a3 int, a4 int, a5 int);

    -- set all null fields ax on value -1000

    CREATE OR REPLACE FUNCTION insert_trg_func()
    RETURNS trigger as $$
    DECLARE name text;
    BEGIN
    FOR name IN SELECT x.name
    FROM pst.record_expand(new) x
    WHERE x.value IS NULL AND x.name LIKE 'a%'
    LOOP
    new = pst.record_set_fields(new, name, -1000);
    END LOOP;
    RETURN new;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER insert_trg
    BEFORE INSERT ON t
    FOR EACH ROW EXECUTE PROCEDURE insert_trg_func();

    postgres=# INSERT INTO t(a3,a5) VALUES(100,100);
    INSERT 0 1
    postgres=# SELECT * FROM t;
    a0 | a1 | a2 | a3 | a4 | a5
    -------+-------+-------+-----+-------+-----
    -1000 | -1000 | -1000 | 100 | -1000 | 100
    (1 row)
    Last function is record_get_field. Its returns a value of entered field.

    CREATE OR REPLACE FUNCTION insert_trg_func()
    RETURNS TRIGGER AS $$
    DECLARE name text;
    BEGIN
    FOR i IN 0..5 LOOP
    IF pst.record_get_field(new, 'a'||i) IS NULL THEN
    new := pst.record_set_fields(new, 'a'||i, -1000);
    END IF;
    END LOOP; r
    RETURN new;
    END;
    $$ LANGUAGE plpgsql;
    These package is available from pgfoundry http://pgfoundry.org/frs/shownotes.php?release_id=1749.



    Leo Hsu and Regina Obe: PL/R Part 2: Functions that take arguments and the power of aggregation


    In Intro to PL/R and R, we covered how to enable PL/R language in the database and wrote some PL/R functions that rendered plain text reports using the R environment. What makes combining R and PostgreSQL in PL/R most powerful is when you can start writing SQL summary queries that use R functions like any other SQL function.

    In this next example, we'll be using PostGIS test runs from tests we autogenerated from the Official PostGIS documentation (Documentation Driven Testing (DDT)) as described in the Garden Test section of the PostGIS Developer wiki. We've also updated some of our logging generator and test patterns so future results may not represent what we demonstrated in the last article.

    On a side note: Among the changes in the tests was to introduce more variants of the Empty Geometry now supported by PostGIS 2.0. Our beloved PostGIS 2.0 trunk is at the moment somewhat unstable when working with these new forms of emptiness and stuffing geometries in inappropriate places. At the moment it doesn't survive through the mindless machine gun battery of tests we have mercilessly inflicted. It's been great fun trying to build a better dummy while watching Paul run around patching holes to make the software more dummy proof as the dummy stumbles across questionable but amusing PostGIS use cases not gracefully handled by his new serialization and empty logic.

    On yet another side note, it's nice to see that others are doing similar wonderful things with documentation. Check out Euler's comment on catalog tables where he uses the PostgreSQL SGML documentation to autogenerate PostgreSQL catalog table comments using OpenJade's OSX to convert the SGML to XML and then XSL similar to what we did with PostGIS documentation to autogenerate PostGIS function/type comments and as a platform for our test generator.

    For our next exercises we'll be using the power of aggregation to push data into R instead of pg.spi.execute. This will make our functions far more reusable and versatile.


    Continue reading "PL/R Part 2: Functions that take arguments and the power of aggregation"

    Andrew Dunstan: Learning by example.


    I asked someone today not to top-post on the hackers mailing list. He asked privately what that meant, and as I looked around briefly for links, I discovered this gem, which caused me some amusement:

    Arguments to support bottom-posting...

    A: Because we read from top to bottom, left to right.
    Q: Why should I start my reply below the quoted text?

    A: Because it messes up the order in which people normally read text.
    Q: Why is top-posting such a bad thing?

    A: The lost context.
    Q: What makes top-posted replies harder to read than bottom-posted?

    A: Yes.
    Q: Should I trim down the quoted part of an email to which I'm replying?




    Andreas Scherbaum: Second day of PGDay.Eu 2010


    Author
    Andreas 'ads' Scherbaum

    The second conference day started with stress: the schedule was wrong in the booklet, the first talks in the morning were scheduled with 20 minutes instead of 50. This was the reason why so many visitors showed up early ;-) We decided to cut the next talks 5 minutes short and take 5 or 10 minutes from the breaks. After lunch we were back in schedule again.

    Because of a PGEU board meeting - originally scheduled before the first talk, but then delayed because of the booklet error - I missed the first talk in the morning. Then I attended Stefan "mastermind" Kaltenbrunner's talk about "Benchmarking and performance tests for and with PostgreSQL" - and yes, I learned something new. The next slot in the same room was reserved for my talk about "Porting databases from MySQL to PostgreSQL". Not so many visitors as for my previous talk, but not everyone is interested in this topic anyway. We had some nice discussions and I got the impression that not everybody is aware about the amount of problems which raises such a project.

    After the excellent lunch I was room host for Gianni Ciolli and his talk about "Windows that Work: How to use complex SQL for the greater good" as well as for Dimitri Fontaine and his talk about "PostgreSQL extension's development". I learned that now it's a good time to rewrite my table_log extension and port it to the upcoming extension infrastructure.

    The last slot of the day was reserved for Ed Boyajian and his presentation about "PostgreSQL's time to shine". He included some very interesting numbers, you can find this (and most of the other) slides on the PG wiki.

    Dave and Magnus really got the last word ;-) They told us about the success for this year's conference, showed some numbers and then we thanked all speakers and helpers (especially Harald and the guys from imos).


    Continue reading "Second day of PGDay.Eu 2010"

    Joshua D. Drake: Looking for an Intermediate Sysadmin + Junior DBA


    Command Prompt, Inc. is looking for an intermediate systems administrator and junior PostgreSQL administrator.

    Command Prompt has provided PostgreSQL support, development, and hosting since 1997. We are looking for another person to join our stellar group of PostgreSQL systems experts.

    We seek someone who has a deep knowledge of at least one UNIX-like system, and who knows how to manage heterogeneous systems well. You can demonstrate strong skills in basic systems administration. You have a clear understanding of how to administer several systems that are mostly the same, but that have small local differences. If you do not feel comfortable saying, "I'm not familiar with that in your environment," then reading the man page, and coming back with a test plan for deployment that will reveal problems with your strategy, this is not a job for you. The ability to imagine, propose, test, and deliver perfectly-integrated and easily-maintained solutions to users' problems is the key to success in this position. You work well in a team: you don't have to work 24 hours a day, because other people can always log into any system you maintain and know where everything is, how it got there, and how to add new things if they're needed.

    You must know basic PostgreSQL administration, and understand how to set up, operate, and tune Postgres in elementary ways, such as installing from a package manager. You do not have to be a PostgreSQL guru, but if you are, that is a bonus.

    You have systems administration capability with shell scripts, as well as either Perl or Python. As well as understanding (or quick to pick up) technologies such as DRBD and LVM2.

    You have an interest in working on varied projects that use varied (and sometimes legacy) technologies.

    Command Prompt is a professional services company. You will be interfacing with customers, therefore customer service and professionalism is key.

    Command Prompt employees usually work from their home offices. The position normally requires minimal travel. For this position, the successful candidate will be able to communicate effectively in English. English as a spoken second language is fine as long as the written skills are clear and effective.

    This is a remote position and does not require residence in the United States.

    Applicants should send a resume to Joshua D. Drake, jd(at)commandprompt(dot)com, with the words "intermediate administrator" in the Subject line. Please include any text by way of a cover letter in the body of your email and not as a separate attachment.



    Selena Deckelmann: #PDX11: Conversation about VC, investing and trends in the Portland


    I wanted to continue talking about the perceptions stemming from the “quality of life” versus “financial success” juxtapositon from this slide:

    A friend pointed out that maybe this issue is being framed in the wrong way for “outsiders”. If we’re going to pitch people on Portland being the best place to build a startup, values are certainly part of the equation, but what do investors want to know?

    Maybe problem is similar to the process of gentrification… And as far as Portland’s software scene is concerned, we’re out of the blight, well into the “artists and weirdos make a home” phase and maybe just about to transition into “developers buy up a lot of land and artists start moving on” phase. But capital investment is in companies rather than land.

    Maybe the story we as a city need to tell is that we want early adopter investors and more “artists and weirdos” who are passionate about what they do.

    It’s a tough analogy, because we don’t have the same geographical or physical world indicators. When gentrification occurs, there aren’t always clear signs in the beginning. But as the process unfolds, people physically move in and out of a tight geographical area.

    To put a spotlight on what’s going on with the tech industry, we need for some better indicators! Have a look at the employment graphs like the Oregon tech job employment indicator:

    I don’t think it tells the whole story. There are also indicators about VC, but again, I don’t think it is capturing the nature of what is happening in Portland.

    To start, I’m interested in a finer-grained look at the jobs associated with small software and IT firms. I’m not sure if there’s a way to pull that data out of what is typically tracked, but I’m going to try.

    What indicators do you think we should be tracking?

    Related posts:

    1. #PDX11: Turning things up a notch for Portland’s “software cluster”
    2. #PDX11: It’s alive!
    3. #PDX11: Mayor Sam Adams’ intro to the software summit



    Devrim Gündüz: Attention : 9.0.1 **OR** RHEL 6 RPM users


    I have an important announcement for 9.0 RPM users.

    Continue reading "Attention : 9.0.1 **OR** RHEL 6 RPM users"

    Devrim Gündüz: PostgreSQL Clustering with Red Hat Cluster Suite slides are online


    I presented an updated version of talk at PGDay.EU 2010. You can download it here.

    Andreas Scherbaum: First day of PGDay.Eu 2010


    Author
    Andreas 'ads' Scherbaum

    The first day of PGDay:EU 2010 was cool. All the work done yesterday now paid out: registration for everyone took only seconds and nobody had to wait.

    The conference started with a welcome from our president and then with a keynote given by Simon Phipps.

    After the keynote I was room host for two talks in the german track, and I had the pleasure to announce a PostGIS talk given by Stefan Keller and Andreas Neumann, and a talk about PostgreSQL in Business, given by Marc Balmer.

    After a very good lunch, I was scheduled for the last talk on this day on the german track. The talk was about "Open Source Decisions": I asked several companies about the reasons why they are using open or closed source software (and some other related questions) and compiled an advocacy talk from the answers. Thanks to my audience for a very successfull discussion.

    In the evening, EnterpriseDB hosted the usual social event. There was good food and drinks and we had a room just for us. It was a good time to talk with old friends and join others for good discussions.



    Andreas Scherbaum: Preparing PGDay.EU 2010 in Stuttgart


    Author
    Andreas 'ads' Scherbaum

    It's this time of the year again: the Annual European PostgreSQL Conference (PGDay.EU 2010) is near.

    The last weeks were very busy. We had an online meeting every week, ordered some stuff, told our sponsors where to send the flyers and in general we had a very long todo list to solve. Last saturday and sunday, several team members started arriving in Stuttgart and we met up to drink some beers and actually find some time to prepare the conference ;-)

    This year we made the decision to have PGDay.EU 2010 in a hotel, not in an university like the previous years. The idea was to make the event more attractive for companies and authorities - and based on the registration numbers the visitors accepted the choice. After some search earlier this year we had chosen the SI-Center in Stuttgart for this years conference. The Center
    itself isn't just a building, it's a whole complex. Our rooms for the talks are distributed among two buildings, but luckily the visitors don't have to go outside (it's snowing) to change rooms. In addition the SI-Center is providing food and drinks so we don't have to take care of this part.


    Continue reading "Preparing PGDay.EU 2010 in Stuttgart"

    Selena Deckelmann: #PDX11: Community as competitive advantage


    Here’s an edited version of Thompson Morrison’s presentation about the software industry’s response to a series of surveys. The original presentation is available here. One of the key slides was about what folks here value:

    The point I appreciated about this clip is that Portland’s software community *is* our competitive advantage.

    I edited the video most for audio coherence. Sorry about the audio being a little out of sync.

    Related posts:

    1. #PDX11: The software summit wrapup
    2. #PDX11: Mayor Sam Adams’ intro to the software summit
    3. #PDX11: Conversation about VC, investing and trends in the Portland



    Selena Deckelmann: #PDX11: Mayor Sam Adams intro to the software summit


    I am working on editing out the pithier parts of the 1 hour 45 minute video that the city released of the PDX Software Summit.

    Below is Mayor Sam Adams’ introduction to the process and how the City is engaging the public to make the software strategy happen.

    Related posts:

    1. #PDX11: Turning things up a notch for Portland’s “software cluster”
    2. #PDX11: The software summit wrapup
    3. #PDX11: It’s alive!



    PostgreSQL Weekly News: PostgreSQL Weekly News December 5th 2010


    MicroOLAP Database Designer 1.3.0 for PostgreSQL released. http://microolap.com/products/database/postgresql-designer/

    Selena Deckelmann: Open Data Hackathon Day: ScraperWiki views


    Open data is really only as interesting as what we can do with it!

    One sweet thing about ScraperWiki is that it enables quick creation of visualizations called ‘views’ from inside the wiki. They’ve got templates that use Google Visualization to help the process along.

    I made the following today (from this datasource):

    I don’t have the entire data set, but this graph indicates that the recession had a significant negative impact on the creation rate of new businesses in Oregon.

    I just started a new scraper job to pull more information about people and the places where the businesses are located. When that job is done, I hope to create a few more fun visualizations with this data.

    UPDATE: I’m playing around more, and here’s the embedded version of the graph if you click through (takes a while to load!).

    Related posts:

    1. Open Data Hackathon Day: Oregon Business License Registry
    2. #PDX11: Conversation about VC, investing and trends in the Portland
    3. Looking for a few good OPML files!



    Selena Deckelmann: Open Data Hackathon Day: Oregon Business License Registry


    At the Portland Software Summit on Thursday, a couple people mentioned that it was hard to keep track of new businesses that pop up, and that merger and acquisition activity wasn’t being sufficiently publicised.

    I thought – maybe we could get this information in an automated way!

    I started with the state of Oregon’s business registry search site. Unfortunately, they limit search results for business searches to 1000, and they don’t paginate their results. So, we kicked ScraperWiki into gear, and wrote a very simple scraper with @maxogden: http://scraperwiki.com/scrapers/oregon_business_registry/

    Next, I wanted to find out information about businesses specifically in Portland. The City releases information about this, but in PDF form: http://www.portlandonline.com/omf/index.cfm?c=32192

    I wrote a quick and dirty Python script to scrape out information, and am getting probably 250 of the 300+ businesses in the November release. Next, I want to cross reference this data with what’s in the Oregon site. I’ll be publishing the Python scripts over the weekend. Hopefully ScraperWiki will add pyPDF to their Python repo support and I will be able to publish the transform there so it can be easily linked to the Oregon data.

    Two lessons today:

    • Governments: Please don’t publish data in PDFs. YUCK.
    • Governments: Please paginate results from your site! Hard limits are just kinda lame.

    The alternative to scraping the state of Oregon’s site is to order a CD-ROM for $50. I think this is such a stupid profit center for the state. I’d be interested to know how much money they’re really making off of it, and whether they could take a page out of Metro’s book and find a way to share the data with a different, more useful service.

    Related posts:

    1. Open Data Hackathon Day: ScraperWiki views
    2. Looking for a few good OPML files!
    3. #PDX11: Turning things up a notch for Portland’s “software cluster”



    Josh Berkus: RHEL 6 and the Return of wal_sync_method


    Red Hat Enterprise Linux 6 came out a couple weeks ago, and with it a new annoyance for PostgreSQL users. This story starts with PostgreSQL refusing to start up, and moves to with fiddling with a setting which few PostgreSQL DBAs have thought about in years: wal_sync_method. It ends, hopefully, with a PostgreSQL update release.

    Andrew Dunstan: Foreign Data Wrapper fun


    I've just started playing with the SQL/MED foreign data wrapper patch, and particularly the file_fdw module. First order of the day has been to get it applied against head, fix the bitrot, and get it building and passing regression tests. That's done so now I'm going to dig deeper and see if I can make it do one or two things I need. People interested in playing along can get the stuff from the sqlmed branch on my github repo.

    Hubert Lubaczewski: Auto refreshing password file for pgbouncer


    As you perhaps know I’m fan of pgbouncer – connection pooling solution for PostgreSQL. It can do many really cool things, but has one slight issue. Since it can reuse connections – it has to provide a way to check if user supplied password is correct without consulting database. And it lately (since 9.0 to [...]

    Selena Deckelmann: Looking for a few good OPML files!


    We’re looking for a few good OPML files!

    Tell us where you go to find information about tech in Portland! I’m interested in software companies, non-profits, developer blogs and culture resources that are relevant to tech businesses. Off the top of my head: Silicon Florist, Mark Rogoway’s blog, BikePortland and Sarah Mirk’s twitter feed all are relevant. Urban Airship, Puppet Labs, Bank Simple, Small Society and Cloud Four are all local businesses and startups I’m keeping an eye on.

    And what will I do with your contributions? I’m going to pick the best-of and curate a feed for the pdx11.org site.

    Let me know what’s in your feed reader below!

    Related posts:

    1. Open Data Hackathon Day: Oregon Business License Registry
    2. #PDX11: Turning things up a notch for Portland’s “software cluster”



    Selena Deckelmann: #PDX11: The software summit wrapup


    So, the software summit happened this evening. Mike Rogoway posted his brief overview here: http://blog.oregonlive.com/siliconforest/2010/12/portland_gathers_software_deve.html. The back of my head, Teyo and Amy are featured on the bottom of the photo :)

    I livetweeted the event, so tomorrow there’ll be a cavalcade of tweets here: http://www.chesnok.com/daily/2010/12/03/twittering-on-2010-12-03/

    Big ideas:

    • Portland is about Community. And our competitive advantage is that community.
    • Part of what makes us what we are is the fact that we stick with the same companies over time. What can we do to enable collaboration across company boundaries?
    • We cultivate enduring wealth.

    My takeaway from the Q&A was that we’ve got this huge biz/developer divide. We have management talent – but maybe not the big M&A talent that the VC-oriented folks are after. We have loads of developers. We need to make more effort to get those folks connected to one another in a way that makes sense!

    We have tons of events that are advertised on calagator.org, and the city is making an effort to make itself a hub (hosting Lunch 2.0, and now this software summit in the City Council Chambers). Overall, I really think the City is doing a great job, and I can’t wait to see what the citizen initiatives come up with.

    So if you want to get involved, subscribe to our lists:

    http://lists.pdx11.org/mailman/listinfo/pdx11-announce (announce only!)
    http://lists.pdx11.org/mailman/listinfo/pdx11-financing
    http://lists.pdx11.org/mailman/listinfo/pdx11-mentoring
    http://lists.pdx11.org/mailman/listinfo/pdx11-knowledge

    Find out more at: http://pdx11.org

    UPDATE:

    Portland Business Journal coverage!
    Slides from the City’s presentation
    Video from Software Summit

    Related posts:

    1. #PDX11: It’s alive!
    2. #PDX11: Mayor Sam Adams’ intro to the software summit
    3. #PDX11: Turning things up a notch for Portland’s “software cluster”



    Devrim Gündüz: My picks for PGDay.EU 2010.


    Less than 3 days left for PGDay.EU 2010.
    Continue reading "My picks for PGDay.EU 2010."

    Andrew Dunstan: Down Memory Lane


    My first large contribution to Postgres was a complete rewrite of initdb. Back before version 8.0, this was a shell script. But for the Windows port we needed it to be written in C, and without using all the utility commands which are available on Unix systems, like sed and grep. The original; was committed to the source by Bruce just over 7 years ago. Rereading the current source code briefly yesterday, I was mildly amazed to see how much of the original code was still there. It seems to have stood the test of time much better than I expected. Some of the comment are probably a bit redundant now, like
      I have no idea how to handle this. (Strange they call UNIX an application!)
      So this will need some testing on Windows.
    
    I think 7 years is long enough as a testing period :-)

    Peter Eisentraut: Development time


    Let's say you want to contribute to PostgreSQL development and want to play by the rules (which are actually not rules, but guidelines or encouragements), such as:
    1. While a release is in beta, you work on finalizing the release, not on future projects.
    2. During a commitfest, you work on testing and integrating the proposals submitted for the commitfest, not on new features.
    3. Major features should not be submitted for the first time at the last commitfest.
    The final release of PostgreSQL 9.0.0 was on 2010-09-20. By that time you already missed the first commitfest (2010-07), and the second commitfest (2010-09) was already under way. The second commitfest was actually slightly delayed and ended on 2010-10-26, whereas the third and next-to-last-for-9.1 commitfest (2010-11) started punctually on 2010-11-15.

    That means if, while being a team player on all the community efforts, you wanted to develop a major new feature for PostgreSQL 9.1, you had a total of about 20 days to do it. (That is, if you didn't spend several days in early November at PgWest.) And that is within a one-year release cycle.

    Selena Deckelmann: #PDX11: Its alive!


    Alright, PDX11 is alive!

    Tonight’s the big software summit! If you didn’t already sign up, it’s going to be live streamed, and I’m sure there will be tons of tweets and blog posts after. PDX11 will have followup information, and announcements related to the next steps each group involved will be taking.

    If you want to get a head start — sign up for the mailing lists!

    For announcements: http://lists.pdx11.org/mailman/listinfo/pdx11-announce

    For more information about today’s meeting (4:30pm at City Hall!) check out Mike Rogoway’s coverage: Portland wants to transform its software culture into an industry

    What else should we have on the site? Want to help make it better? Leave your thoughts in the comments.

    Related posts:

    1. #PDX11: The software summit wrapup
    2. #PDX11: Turning things up a notch for Portland’s “software cluster”
    3. #PDX11: Mayor Sam Adams’ intro to the software summit



    Andrew Dunstan: Multi-tenancy done right


    I was slightly surprised today to read the section in PostgreSQL 9.0 Administration Cookbook on setting up multiple servers on RedHat flavoured systems, and find it didn't mention the nice way this is supported by the init script. Here's how I have done it numbers of times in the past. Let's say your new service will be called "pg_myapp". First you set up a file with your new service details, in /etc/sysconfig/pgsql/pg_myapp like this:
    PGDATA=/var/lib/pgsql/myapp_data
    PGPORT=5436
    PGLOG=/var/lib/pgsql/myapp_pg_startup.log
    

    Now you create a link to the init script with the new name, using these commands:
    cd /etc/init.d
    ln -s postgresql pg_myapp
    

    and finally you initialise the data directory, turn the service on, and set it up to start just like your main postgres server:
    service pg_myapp initdb
    service pg_myapp start
    chkconfig pg_myapp on
    

    For good measure, you might want to define a connection to your new server, by adding a section to /etc/sysconfig/pgsql/pg_service.conf like this:
    [myapp_db]
    port=5436
    dbname=myapp
    

    Then, after you set up the myapp database in your new server, you can just connect to it by saying:
    psql service=myapp_db
    

    It all works quite smoothly. It might be nice to have something like Debian's utility for this, but really this is just a few minute's work.