Madhusudan C.S.
April 5, 2009, 5 a.m.
Django
GSoC
GSoC 2009

Title: Restructuring of the existing Serialization format and improvement of APIs

~~~~~~~~~
Abstract
~~~~~~~~~

Greetings!

I wish to give Django better support for serialization by building upon the existing serialization framework. This project extends the format of the serialized output by allowing data to be serialized in any custom format. It also supports automatic serialization of the models related to a given model, and extends the APIs to support these changes. All changes will be made so that the serialized data is also useful for external processing and backward compatibility is not broken.

~~~~~~~~
Why?
~~~~~~~~

- The existing Serializer doesn't allow serializing in a user-defined
custom format needed for external processing.
- It doesn't specify the name of the Primary Key (PK henceforth),
which is a problem for fields that are set as PKs using
primary_key=True in the model definition (Ticket #10295).
- The existing format only specifies the PK of the related field,
which might not be enough to reconstruct the database tables for
those models which have integrity constraints (Ticket #4656).
- There are no APIs for the above requirements.
- The fields of inherited models are not serialized.
- It is not possible to specify extra fields (i.e. non-model
attributes, computed fields etc.) with the current implementation
(Ticket #5711).
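
For reference, the existing JSON serializer emits one record per object
with "pk", "model" and "fields" keys. The sketch below rebuilds that
shape by hand in plain Python (no Django required; the app label
"polls" and the sample values are made up for illustration) to show
that the name of the PK field never appears and that the related Poll
is reduced to its PK:

```python
import json

# Hand-built imitation of one record from Django's existing JSON serializer.
# The app label "polls" and all values are illustrative, not real output.
record = {
    "pk": 1,
    "model": "polls.choice",
    "fields": {
        "poll": 1,        # the related Poll appears only as its PK
        "choice": "Yes",
        "votes": 10,
    },
}

existing_output = json.dumps([record])

# The name of the PK field (for example "id") is absent from the record.
decoded = json.loads(existing_output)[0]
pk_name_present = "id" in decoded or "id" in decoded["fields"]
print(pk_name_present)
```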

~~~~~~~
How?
~~~~~~~

Let us consider the following two models for discussion throughout:

  class Poll(models.Model):
      question = models.CharField(max_length=200)
      pub_date = models.DateTimeField('date published')
      creator = models.CharField(max_length=200)
      valid_for = models.IntegerField()

      def __unicode__(self):
          return self.question


  class Choice(models.Model):
      poll = models.ForeignKey(Poll)
      choice = models.CharField(max_length=200)
      votes = models.IntegerField()

      def __unicode__(self):
          return self.choice

This project begins by providing APIs for Serializers in the style of
the ModelAdmin and Feeds frameworks, where the user will be able to
construct a class for specifying custom serialization formats. I
propose the API based on the following ideas.

The user will first define a class inherited from the Serializer
framework. The parent class is a generic base Serializer class. The
user-defined class is then passed as a parameter to the serialize
method that is called to serialize the models. Within this class
the user specifies the customized serialization format desired for
the output. Since the three main built-in Python data structures are
lists, tuples and dictionaries, this format can contain any of these
data structures nested in any order. Examples:

Example 1:

  class PollSerializer(Serializer):
      custom_format = [("question", "valid_for", "id")]

The output in this case will be a list of tuples containing the values
of question, valid_for and id fields. Here the strings are the names
of the fields in the model.

OR
Example 2:

  class PollSerializer2(Serializer):
      custom_format = (["question", {
          "valid_for_number_of_days": "valid_for",
          "Poll ID": "id"
      }],)

The output in this case will be a tuple of lists, each containing the
value of question and a dictionary which has the valid_for and id
field values as its values and their descriptions as its keys.

The implementation, although not trivial, will work as follows:
(This is not final. The final implementation will be worked out by
discussing with the community.)
- The custom_format will be checked for its type. The top-level
structure will be decided from this type: "{}" if dictionary, "()"
if tuple and "[]" if list. In case of XML, the root tag will be
django-objects. Its children will have the tag name "object"
and include model="Model Name" in the tag. Up to here this is the
same as the existing XML Serializer.

- Next, the type of the only item within the top-level structure
is determined. All the Django objects serialized will be of this
type. In case of XML, the children of the "object" tag will be tags
named "field". These tags will also have name="fieldname"
and type="FieldType" attributes. Additionally, if
these field tags are items of a dictionary, they will have a
description="dictionary_key" attribute.

- Then each item within the inner object ("question", "valid_for"
and "id" in the first example) is checked for its type and the
serialized output will have the corresponding type. This is
implemented recursively from this level. In case of XML, however, the
names of the tags for deeper levels of grouping will have to be chosen
in some consistent way. My suggestion for now is to name the tags
"field1" for the third level in the original custom format structure,
"field2" for the fourth level in the original custom format
structure, and so on.
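
The steps above can be sketched in plain Python as follows. This is
only an illustration under stated assumptions, not the proposed
implementation: FakePoll, render_item and render are hypothetical
names, the objects are simple stand-ins rather than Django model
instances, only list and tuple top-level structures are handled, and
every string is assumed to name a field.

```python
class FakePoll:
    """Stand-in for a model instance; a real implementation would use Django models."""

    def __init__(self, pk, question, valid_for):
        self.id = pk
        self.question = question
        self.valid_for = valid_for


def render_item(template, obj):
    """Recursively replace field names in `template` with values from `obj`."""
    if isinstance(template, dict):
        # Keys are descriptions; values are field names or nested structures.
        return {key: render_item(value, obj) for key, value in template.items()}
    if isinstance(template, (list, tuple)):
        # Preserve the container type chosen by the user.
        return type(template)(render_item(value, obj) for value in template)
    # A string naming an attribute is replaced by that attribute's value.
    return getattr(obj, template)


def render(custom_format, objects):
    """Apply the single inner template to every object, keeping the top-level type."""
    if not isinstance(custom_format, (list, tuple)) or len(custom_format) != 1:
        raise ValueError("this sketch expects a list or tuple holding one template")
    template = custom_format[0]
    return type(custom_format)(render_item(template, obj) for obj in objects)


polls = [FakePoll(1, "What's Up?", 30), FakePoll(2, "Elections 2009", 60)]
example_1 = render([("question", "valid_for", "id")], polls)
example_2 = render((["question", {"Poll ID": "id"}],), polls)
print(example_1)
print(example_2)
```

The result is still a Python structure; turning it into JSON or XML
text would be a separate step for each output format.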

For the second example above, we call the serializer as follows:

  serializers.serialize("json", Poll.objects.all(),
      custom_serializer=PollSerializer2)

The output looks as follows:

(
    ["What's Up?", {
        "valid_for_number_of_days": "30",
        "Poll ID": "1"
        }
    ],
    ["Elections 2009", {
        "valid_for_number_of_days": "60",
        "Poll ID": "2"
        }
    ]
)
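
One caveat worth keeping in mind when the target format is JSON: JSON
has arrays but no tuple type, so an outer "()" can only be written as
"[]" in the encoded text. The following sketch uses Python's standard
json module (not Django's serializer) to illustrate this:

```python
import json

row = ["What's Up?", {"valid_for_number_of_days": "30", "Poll ID": "1"}]
as_tuple = (row,)   # top-level tuple, as in Example 2
as_list = [row]     # top-level list

# Both encode to the same JSON text because JSON only has arrays.
same_text = json.dumps(as_tuple) == json.dumps(as_list)

# Decoding always yields a list, so the tuple does not survive a round trip.
decoded_type = type(json.loads(json.dumps(as_tuple))).__name__
print(same_text, decoded_type)
```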

Also if we use XML,

  serializers.serialize("xml", Poll.objects.all(),
      custom_serializer=PollSerializer2)

The output looks as follows:

<django-objects version="1.0">
    <object pk="1" model="testapp.poll">
        <field type="CharField" name="question">What's Up?</field>
        <field>
            <field1 type="IntegerField" name="valid_for" description="valid_for_number_of_days">
                30
            </field1>
            <field1 type="AutoField" name="id" description="Poll ID">
                1
            </field1>
        </field>
    </object>
    <object pk="2" model="testapp.poll">
        <field type="CharField" name="question">Elections 2009</field>
        <field>
            <field1 type="IntegerField" name="valid_for" description="valid_for_number_of_days">
                60
            </field1>
            <field1 type="AutoField" name="id" description="Poll ID">
                2
            </field1>
        </field>
    </object>
</django-objects>
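
A rough sketch of how such nested tags could be generated, using
Python's standard xml.etree.ElementTree module. The helper add_item,
the plain dictionaries standing in for a model instance and its field
types, and the model label are all assumptions made for illustration;
the existing XML Serializer is built differently.

```python
import xml.etree.ElementTree as ET


def add_item(parent, item, values, types, depth=0, description=None):
    """Append a <field>, <field1>, <field2>, ... element for `item` under `parent`."""
    tag = "field" if depth == 0 else "field%d" % depth
    elem = ET.SubElement(parent, tag)
    if isinstance(item, dict):
        # Dictionary keys become description attributes of the children.
        for key, inner in item.items():
            add_item(elem, inner, values, types, depth + 1, description=key)
    elif isinstance(item, (list, tuple)):
        for inner in item:
            add_item(elem, inner, values, types, depth + 1)
    else:
        # Leaf: a field name, rendered with its type, name and value.
        elem.set("type", types[item])
        elem.set("name", item)
        elem.text = str(values[item])
    if description is not None:
        elem.set("description", description)
    return elem


# Plain dictionaries stand in for a Poll instance and its field types.
poll = {"question": "What's Up?", "valid_for": 30, "id": 1}
types = {"question": "CharField", "valid_for": "IntegerField", "id": "AutoField"}
template = ["question", {"valid_for_number_of_days": "valid_for", "Poll ID": "id"}]

root = ET.Element("django-objects", version="1.0")
obj_elem = ET.SubElement(root, "object", pk=str(poll["id"]), model="testapp.poll")
for item in template:
    add_item(obj_elem, item, poll, types)

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```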

Further, when a user wants to include extra fields in the serialized
data, such as additional non-model fields or computed fields, the
user specifies the method in the class that returns the value
of this field as that item in the format. It must be the method
itself and not a string, so that we can check whether the item is
callable and, if so, call it and use the return value for
serialization. For example:

Example 3:

  class PollSerializer(Serializer):
      def till_date(self):
          # self.pk is assumed to be the PK of the Poll being serialized.
          import datetime
          poll = Poll.objects.get(pk=self.pk)
          return poll.pub_date + datetime.timedelta(days=poll.valid_for)

      # Defined after till_date so that the name exists when this line
      # runs; the callable itself is listed, not a string.
      custom_format = [("question", "valid_for", till_date)]

An important thing to note here is that whenever a string passed as
an item anywhere in the custom_format doesn't match any field in the
model, it is serialized as the same string in the final output,
thereby allowing the addition of non-model static data, such as the
version number of the format, among other things.
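
The three cases for a single item (callable, field name, static
string) could be resolved along these lines. This is a minimal sketch:
resolve and FakePoll are hypothetical names, and passing the object
being serialized to the callable is an assumption of the sketch, not
something the proposal has settled.

```python
import datetime


class FakePoll:
    """Stand-in for a Poll instance, so the sketch runs without Django."""

    def __init__(self, question, pub_date, valid_for):
        self.question = question
        self.pub_date = pub_date
        self.valid_for = valid_for


def till_date(poll):
    """Computed field: the date until which the poll stays valid."""
    return poll.pub_date + datetime.timedelta(days=poll.valid_for)


def resolve(item, obj):
    """Return the serialized value for one item of a custom format."""
    if callable(item):
        # Extra or computed field: call it and use the return value.
        return item(obj)
    if isinstance(item, str) and hasattr(obj, item):
        # The string names a field on the object.
        return getattr(obj, item)
    # Any other string is passed through as static data.
    return item


poll = FakePoll("What's Up?", datetime.datetime(2009, 4, 5), 30)
row = tuple(resolve(item, poll) for item in ("question", "format v1", till_date))
print(row)
```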

Another point to note here is that the strings specified in the
custom format can also name fields from the parent models, thereby
allowing even parent model fields to be serialized.

The docs will make it clear that the user cannot pass arbitrary
Django objects when calling the serialize() method with the
custom_serializer parameter, but only objects of the type for which
the custom_format is defined in the Serializer class. Passing any
other object will be flagged as an error.
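
The proposal does not yet say how a serializer class is tied to its
model. Assuming, purely for illustration, a "model" attribute on the
class, the check could look like this (check_objects and the empty
Poll and Choice classes are hypothetical stand-ins):

```python
class Poll:
    """Stand-in for the Poll model."""


class Choice:
    """Stand-in for the Choice model."""


class PollSerializer:
    model = Poll  # hypothetical attribute naming the supported model
    custom_format = [("question", "valid_for", "id")]


def check_objects(serializer_class, objects):
    """Raise TypeError if any object is not of the serializer's model type."""
    for obj in objects:
        if not isinstance(obj, serializer_class.model):
            raise TypeError(
                "%s cannot serialize %s objects"
                % (serializer_class.__name__, type(obj).__name__)
            )


check_objects(PollSerializer, [Poll(), Poll()])  # accepted silently

try:
    check_objects(PollSerializer, [Poll(), Choice()])
    error_raised = False
except TypeError:
    error_raised = True
print(error_raised)
```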

Last but not least, a select_related parameter will be added to the
serialize method which, when set to True, will automatically
serialize all the models related to this model. Serializing the
related models facilitates the reconstruction of the database tables
for the given model in case any constraints exist. The related
models will be serialized in a default format.

Further, if the user knows which models might be selected when
select_related is True, a parameter like the one below can be
provided:

  related_custom_serializers={
      "Model1" : Model1Serializer,
      "Model2" : Model2Serializer
  }

While serializing the related models, the serializer checks whether
related_custom_serializers has an entry for the selected model and,
if so, serializes it in that format. Example:

  serializers.serialize("json", Poll.objects.all(),
      custom_serializer=PollSerializer2, select_related=True,
      related_custom_serializers={
      "Model1" : Model1Serializer,
      "Model2" : Model2Serializer
      }
  )
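
The lookup described above amounts to a dictionary access with a
fallback to the default format. A minimal sketch, in which
pick_serializer and DefaultSerializer are hypothetical names and the
serializer classes are empty stand-ins:

```python
class DefaultSerializer:
    """Stand-in for the default serialization format."""


class Model1Serializer:
    """Stand-in for a user-defined serializer for Model1."""


def pick_serializer(model_name, related_custom_serializers=None):
    """Return the custom serializer for a related model, or the default one."""
    related_custom_serializers = related_custom_serializers or {}
    return related_custom_serializers.get(model_name, DefaultSerializer)


custom = {"Model1": Model1Serializer}
chosen_for_model1 = pick_serializer("Model1", custom)
chosen_for_model2 = pick_serializer("Model2", custom)
print(chosen_for_model1.__name__, chosen_for_model2.__name__)
```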

(I am very skeptical about the use cases for the above feature, since
select_related is usually needed for round trips and rarely needed for
external applications. Nevertheless I propose it here, pending
further discussion.)

~~~~~~~~~~~~~~~~~~~~~~
Benefits to Django
~~~~~~~~~~~~~~~~~~~~~~

By the end of this project, Django will have better support for
serialization. It serializes the related models, which can be used
for reconstructing the database tables for the given model, thereby
fixing ticket #4656. It also fixes #10295. Serialized data can be used
more conveniently in Django as well as in external applications.
Serializers will have better API support for all the newly
introduced features. Finally, the serialized data is made more generic
by allowing it to be specified in any arbitrary format.

~~~~~~~~~~~~~~~
Deliverables
~~~~~~~~~~~~~~~
1. Internal implementation and code for specifying a custom serializer
class, which includes specifying a custom serialization format.
2. APIs to support the above features.
3. API and internal implementation for the select_related parameter
and related_custom_serializers.
4. Test cases for all the newly introduced features.

Non-code deliverables include testing performed at various phases to
verify correctness and backwards compatibility, and detailed user
and developer documentation for using the new Serializer
implementations.

~~~~~~~~~~
When?
~~~~~~~~~~

The project is planned to be completed in 7 phases. Every phase
includes documenting the progress during that phase. The timeline for
each of these phases is given below:
1. Design Decisions and Initial Preparation (Community Bonding Period:
already started - May 22nd)
Working closely with the Django community to learn more about
Django in depth, reading documentation related to Django
internals, reading and understanding the code base of the ORM and
Serializers in depth, and reading about other systems' Serializers.
Communicating and discussing with the community about the
outstanding issues. The design decisions I propose are discussed
and finalized.

2. Build and Test Phase I (May 22nd – June 4th)
Implement the custom serializer class, adapting the
serializers.serialize() method to accommodate this parameter.
Also implement custom_format parsing and building data for the
basic default scenario (Example 1 type only). Add test cases.

3. Build and Test Phase II (June 5th – June 25th)
Implement custom_format to support multilevel recursive use
of data structures (Example 2 type). Add test cases.

4. Build and Test Phase III (June 26th – July 9th)
Implement support for callables for the inclusion of extra fields
(Example 3). Add test cases.

5. Build and Test Phase IV (July 10th – July 26th)
Implement select_related and related_custom_serializers and the
relevant APIs. Add test cases.

6. Requesting community-wide reviews, testing and evaluation
(July 27th – August 2nd)
Final phase of testing of the overall project, obtaining and
consolidating the results and evaluating them.
Requesting the community to help me in final testing.

7. Scrubbing Code, Wrap-Up, Documentation (August 3rd – August 10th)
Fixing major and minor bugs, if any, and merging the project
with the Django SVN trunk. Writing and finalizing the user and
developer documentation.

~~~~~~~~~~
Where?
~~~~~~~~~~

I am already comfortable with the django-devel mailing list and the
IRC channel #django-dev@freenode.net. I will be able to contact my
mentor in both of these ways and will also be available
through Google Talk (Jabber). I am also comfortable with SVN, Git and
Mercurial, since I was the SVN administrator for 2 academic projects
and the Git administrator for 1 project.

~~~~~~~~~~~~
Why Me?
~~~~~~~~~~~~

I am a 4th year undergraduate student pursuing Information Science
and Engineering as a major at BMSCE, Bangalore, India (IST). I have
been using and advocating Free and Open Source Software for the past
5 years. I have been one of the main coordinators of BMSLUG. I have
given various talks and conducted workshops on FOSS tools:
- Most importantly, I recently conducted a Python and *Django*
workshop for beginners at NIT, Calicut, a premier institution.
- How to contribute to FOSS? - A hands-on hackathon using GNUSim8085
as an example.
http://groups.google.com/group/bms-lug/browse_thread/thread/0c9ca2367966...
- I have been actively participating in various FOSS communities by
reporting bugs to communities like Ubuntu, GNOME, RTEMS and KDE.
- I was a major contributor to and writer of KDE's first-ever
Handbook.
http://img518.imageshack.us/img518/9796/hb1o.png
http://img518.imageshack.us/img518/4296/hb2.png

I have been contributing patches and code to various FOSS communities,
major ones being:
- GNUSim8085 (http://is.gd/p5wZ , http://is.gd/p5xK)
- KDE Step (http://is.gd/oci7)
- RTEMS
- Melange (The GSoC Web App.
http://code.google.com/p/soc/source/browse/trunk/AUTHORS)

My Django Work:
I was interested in contributing to Django even before GSoC occurred
to me. I discussed Ticket #373 with David Crammer on #django-dev. I
read the Django ORM code required for that, but could not write any
code myself because of University coursework. I have had some
discussions about fixing ticket #8161 on the django-devel list
(http://is.gd/obr2), but unfortunately it was fixed before I could
contribute. So I am applying for GSoC as I feel it lowers the barrier
to getting started.
http://groups.google.com/group/django-developers/browse_thread/thread/54...

I have a fair understanding of Python concepts and have one and a
half years of Python experience. I have a fair understanding of the
Django ORM code because of my previous work. I am getting used to the
serialization code as I am writing this proposal and have no problems
with it. I have also been using Django for 1 year for some of my web
apps.

Since I have been working with FOSS communities for some time, I
have a good understanding of FOSS development methodologies:
communicating with people, using the Django ticket tracker, coding
and testing.

Lastly, I want to express my deep commitment to this project and
Django. I'm fully available this summer without any other commitments,
will tune my day/night rhythm as per my mentor's requirements and
assure dedicated work of 35-40 hours/week. I also assure that I
will continue my commitment to Django well after GSoC. If you find
that any part of this proposal is not clear, please contact me.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Important Links and URLs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[1] My Blog: http://madhusudancs.info
[2] My CV : http://www.madhusudancs.info/sites/default/files/madhusudancsCV.pdf
[3] http://groups.google.com/group/django-developers/browse_thread/thread/145e2b7ec53a1996
[4] http://groups.google.com/group/django-gsoc/browse_thread/thread/145e2b7ec53a1996
