A Django application that provides a convenient way to export data from your Django models to Google BigQuery.
- Exports Django model data to BigQuery tables
- Processes data in configurable batch sizes to manage memory usage
- Handles date/time formats and UUID fields automatically
- Allows custom field transformations with a simple decorator
- Validates that model fields match BigQuery table schema
- Provides retry mechanisms for resilient exports
- Supports incremental exports with date filtering
- Handles potential exceptions during data export with detailed error reporting
```bash
pip install django-bigquery-exporter
```

Requirements:
- Python 3.8+
- Django
- google-cloud-bigquery
- google-api-python-client
You need to authenticate with Google Cloud to use BigQuery. There are two main ways:
1. Using environment variables (recommended for production):

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/credentials.json"
```

2. Providing credentials directly in code (useful for development):

```python
exporter = MyExporter(
    project="your-google-cloud-project-id",
    credentials="/path/to/your/credentials.json"
)
```
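When the environment variable is set, the Google client libraries pick up credentials automatically (standard Application Default Credentials behavior), so you can typically omit the `credentials` argument:

```python
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a valid service
# account key; the BigQuery client then authenticates automatically.
exporter = MyExporter(project="your-google-cloud-project-id")
```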
Create a subclass of BigQueryExporter and define the necessary attributes:
```python
from bigquery_exporter.base import BigQueryExporter, custom_field
from myapp.models import Book  # your app's model

class BookExporter(BigQueryExporter):
    model = Book
    fields = ['id', 'title', 'author', 'publication_date', 'genre', 'rating']
    batch = 1000
    table_name = 'your_project.your_dataset.books'
    replace_nulls_with_empty = False

    @custom_field
    def genre(self, instance):
        """Custom field to transform the genre into a structured format"""
        return {
            'code': instance.genre,
            'name': instance.get_genre_display()
        }
```

Then, export the data:
```python
exporter = BookExporter()
exporter.export()
```

| Property | Description | Default |
|---|---|---|
| `model` | Django model to export (required) | `None` |
| `fields` | List of field names to export (required) | `[]` |
| `batch` | Number of records to process in each batch | `1000` |
| `table_name` | Full BigQuery table name (required) | `''` |
| `replace_nulls_with_empty` | Whether to replace `None` values with empty strings | `False` |
| `include_pull_date` | Whether to include the pull date in the export | `False` |
| `pull_date_field_name` | Name of the field that stores the export timestamp | `'pull_date'` |
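For instance, a sketch combining several of these options (the model and table names here are illustrative):

```python
class AuthorExporter(BigQueryExporter):
    model = Author                                    # hypothetical model
    fields = ['id', 'name', 'country']
    batch = 500                                       # smaller batches for wide rows
    table_name = 'your_project.your_dataset.authors'
    replace_nulls_with_empty = True                   # None values become '' in BigQuery
    include_pull_date = True                          # adds an export timestamp column
    pull_date_field_name = 'exported_at'
```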
Define the queryset to export. Override this method to filter or order your data.
```python
import datetime

def define_queryset(self):
    # Only export books published in the last year
    one_year_ago = datetime.date.today() - datetime.timedelta(days=365)
    return self.model.objects.filter(publication_date__gte=one_year_ago).order_by('id')
```

**IMPORTANT**: When using batching (`batch` is not `None`), you MUST define an explicit ordering in your queryset to ensure consistent pagination results.
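If you order by a non-unique column, consider adding the primary key as a tiebreaker so batch pagination stays deterministic; a sketch, following standard Django ordering semantics:

```python
def define_queryset(self):
    # updated_at alone is not unique, so add pk as a tiebreaker to keep
    # slicing stable across batches
    return self.model.objects.order_by('updated_at', 'pk')
```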
Export data to BigQuery.
- `pull_date`: Optional timestamp to record when the data was exported (only included if `include_pull_date=True`)
- `queryset`: Optional queryset to override the default. Useful for backfilling specific data.
```python
# Standard export
exporter = BookExporter()
errors = exporter.export()
if errors:
    print(f"Encountered {len(errors)} errors during export")

# Export with a specific pull_date
from datetime import datetime
exporter.export(pull_date=datetime.now())

# Backfilling specific data
historical_queryset = Book.objects.filter(
    publication_date__year=2020
).order_by('id')
exporter.export(queryset=historical_queryset)
```

`table_has_data()` checks whether the BigQuery table has data. When a `pull_date` is provided and `include_pull_date` is `True`, it checks for rows with that specific pull date; otherwise, it simply checks whether the table contains any data at all.
```python
import datetime

exporter = BookExporter()

# Check with an explicit pull date (only works if include_pull_date=True)
pull_date = datetime.datetime.now()
if not exporter.table_has_data(pull_date):
    exporter.export(pull_date=pull_date)
else:
    print("Data already exported for today")

# Check for any data
if not exporter.table_has_data():
    exporter.export()
else:
    print("Table already has data")
```

Django BigQuery Exporter supports dependency injection for better testability and flexibility:
```python
# Injecting a custom BigQuery client
from google.cloud import bigquery

custom_client = bigquery.Client(project='my-project')
exporter = BookExporter(client=custom_client)
```
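In tests, for example, you might inject a mock so no network calls reach BigQuery. A minimal sketch; it assumes nothing about the exporter's internals beyond the `client` argument shown above:

```python
from unittest.mock import MagicMock

def test_exporter_accepts_injected_client():
    mock_client = MagicMock()
    exporter = BookExporter(client=mock_client)
    # Any BigQuery calls made during an export now hit the mock instead
    # of the real service.
    assert exporter is not None
```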
Use the `@custom_field` decorator to create methods that transform data during export:

```python
@custom_field
def full_name(self, instance):
    return f"{instance.first_name} {instance.last_name}"

@custom_field
def category_details(self, instance):
    # Return complex nested data
    return {
        'id': instance.category_id,
        'name': instance.category.name,
        'parent': instance.category.parent.name if instance.category.parent else None
    }
```
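Custom fields run once per exported row, so if one follows related objects (like `category_details` above), it may help to prefetch the relations in `define_queryset()`. A sketch, assuming the relation names from the example:

```python
def define_queryset(self):
    # Avoid one extra query per row when custom fields follow foreign
    # keys such as category and category.parent
    return self.model.objects.select_related(
        'category', 'category__parent'
    ).order_by('id')
```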
- ALWAYS define an ordering in `define_queryset()` when using batching - this is critical for consistent results
- Set appropriate batch sizes based on your model's complexity
- Use custom fields to preprocess data before export
- Implement idempotency checks with `table_has_data()`
- Use the `queryset` parameter for backfilling historical data rather than modifying your exporter class
- Consider using dependency injection for the BigQuery client for better testability
- Catch and handle `GoogleAPICallError` and `BigQueryExporterError` exceptions (see the sketch below)
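A sketch of the last point, assuming `BigQueryExporterError` is importable from `bigquery_exporter.base` (check your installed version for the exact path):

```python
from google.api_core.exceptions import GoogleAPICallError
from bigquery_exporter.base import BigQueryExporterError  # assumed import path

try:
    errors = BookExporter().export()
except GoogleAPICallError as e:
    # Raised by the BigQuery client on failed API calls
    print(f"BigQuery API error: {e}")
except BigQueryExporterError as e:
    # Raised by the exporter itself
    print(f"Exporter error: {e}")
```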
Here's a complete example with a Book model:
```python
import datetime

from django.db.models import Avg

from bigquery_exporter.base import BigQueryExporter, custom_field
from myapp.models import Book

class BookExporter(BigQueryExporter):
    model = Book
    batch = 1000
    table_name = 'my_project.bookstore.books'
    fields = [
        'id', 'title', 'author', 'publication_date', 'is_bestseller',
        'genre', 'page_count', 'created_at', 'updated_at', 'rating'
    ]

    # Pull date configuration
    include_pull_date = True  # Include pull date in the export
    pull_date_field_name = 'export_date'  # Custom field name

    def define_queryset(self):
        # Only export books updated in the last 30 days
        thirty_days_ago = datetime.date.today() - datetime.timedelta(days=30)
        return Book.objects.filter(updated_at__gte=thirty_days_ago).order_by('id')

    @custom_field
    def genre(self, instance):
        """Return both the code and display name for the genre"""
        GENRES = {
            'SFF': 'Science Fiction & Fantasy',
            'MYS': 'Mystery',
            'ROM': 'Romance',
            # ... other genres
        }
        return {
            'code': instance.genre,
            'name': GENRES.get(instance.genre, 'Unknown')
        }

    @custom_field
    def rating(self, instance):
        """Calculate and return the average rating"""
        avg_rating = instance.reviews.aggregate(avg=Avg('rating'))['avg'] or 0
        return round(avg_rating, 1)

# In a task or management command
def export_books_to_bigquery(force_export=False):
    pull_date = datetime.datetime.now()
    exporter = BookExporter(
        project='my-gcp-project',
        credentials='/path/to/credentials.json'
    )

    # Check if data already exists for today
    if exporter.table_has_data(pull_date) and not force_export:
        print(f"Data already exists for {pull_date.date()}, skipping export")
        return

    # Perform the export
    errors = exporter.export(pull_date=pull_date)
    if errors:
        print(f"Export completed with {len(errors)} errors")
    else:
        print("Successfully exported books to BigQuery")
```

The `export()` method returns a list of error objects for any failed row insertions. Each error includes:
- The row index
- The error message
- The affected data
You can use this information to log errors or retry specific records.
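For example, you might log each error for later inspection. A sketch; the exact shape of each error object depends on what the library returns, so the code below logs the whole object rather than assuming specific attributes:

```python
import logging

logger = logging.getLogger(__name__)

errors = exporter.export()
for error in errors:
    # Each error carries a row index, a message, and the affected data;
    # logging the whole object captures all three.
    logger.error("BigQuery export error: %r", error)
```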
This project is licensed under the MIT License - see the LICENSE file for details.