Updating Data Files: Commits vs. Pull Requests?

August 18, 2021

For once, I'm not sure whether this article will be helpful to anyone else, since my context is pretty specific. Just in case it is, here it goes!

My Jet Train project makes use of GTFS, the General Transit Feed Specification. It models public transportation schedules and their associated geographic information.

GTFS is based on two kinds of data: static and dynamic. Static data, e.g., transit agencies and bus stations, may change but does so rarely. It is available as files that you need to download now and then. Previously, I had to download and overwrite them every time I ran the demo.

As a developer, I'm lazy and wanted to automate this task. I used GitHub Actions for that:

name: Refresh Dataset
on:
  schedule:
    - cron: '12 2 * * 1'                                                     # 1
jobs:
  build:
    name: Refresh Dataset
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2                                            # 2
      - name: Fetch dataset archive
        run: curl -o archive.zip https://api.511.org/transit/datafeeds\?api_key\=${{secrets.FIVEONEONE_API_KEY}}\&operator_id\=RG  # 3
      - name: Extract archive
        run: unzip -o -d ./infrastructure/data/current/ archive.zip          # 4
      - name: Add & commit
        uses: stefanzweifel/git-auto-commit-action@v4                        # 5
        with:
          commit_message: Update to latest data files
          add_options: '-u'

1. Run the action weekly
2. Check out the repository
3. Get the static data files archive
4. Extract files from the archive
5. Use the git-auto-commit action
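
For readers less familiar with cron syntax, here is the schedule trigger from the workflow above with each field spelled out. GitHub Actions evaluates scheduled workflows in UTC. The comments are an added annotation, not part of the original workflow.

```yaml
on:
  schedule:
    # minute  hour  day-of-month  month  day-of-week
    #   12      2        *          *        1
    # i.e., every Monday at 02:12 UTC
    - cron: '12 2 * * 1'
```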

Committing directly is not an issue: it's data, not code. The code should already have built-in safeguards to prevent unexpected data from causing exceptions at runtime. I had a couple of surprises in the past and have applied a lot of defensive programming techniques since.

Yet, I was not happy with the above automation:

Commits happen every week, whether I need to run the demo or not, which creates a lot of unnecessary commits. That's the reason I scheduled the action weekly and not more often.
The action is scheduled on Mondays. If I run the demo on a Friday, I need to update the data files myself anyway.

Hence, I decided to switch to an alternative approach. Instead of committing, I updated the workflow to open a Pull Request. If I need to run the demo, I'll merge it (and pull locally); if not, it will stay open. If an open PR already exists, the action will overwrite it. Now, I can schedule the action more frequently.

name: Refresh Dataset
on:
  schedule:
    - cron: '12 2 * * *'                                                     # 1
jobs:
  build:
    name: Refresh Dataset
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2                                            # 2
      - name: Fetch dataset archive
        run: curl -o archive.zip https://api.511.org/transit/datafeeds\?api_key\=${{secrets.FIVEONEONE_API_KEY}}\&operator_id\=RG  # 3
      - name: Extract files of interest from the archive
        run: unzip -o -j archive.zip agency.txt routes.txt stop_times.txt stops.txt trips.txt -d ./infrastructure/data/current  # 4
      - name: Remove archive
        run: rm archive.zip                                                  # 5
      - name: Create PR
        uses: peter-evans/create-pull-request@v3                             # 6
        with:
          commit-message: Update to latest data files
          branch: data/refresh
          delete-branch: true
          title: Refresh data files to latest version
          body: ""

1. Run the action daily
2. Check out the repository
3. Get the static data files archive
4. Extract only the required files from the archive
5. Remove the archive file for cleanup
6. Use the create-pull-request action. The PR it creates automatically contains all new and updated files, which is why I extract only some files and remove the archive.
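
One possible hardening of the fetch and extract steps, sketched under assumptions that are mine rather than the post's: by default, `curl` exits successfully even when the server answers with an HTTP error, so the error body would be saved as `archive.zip`. Adding `--fail` makes the download step itself fail, and `unzip -t` checks the archive's integrity before any data file is overwritten. The step names are hypothetical.

```yaml
      - name: Fetch dataset archive (fail on HTTP errors)
        # --fail: exit non-zero on HTTP error responses instead of saving the error body
        # --silent --show-error: hide the progress meter but still print errors
        run: curl --fail --silent --show-error -o archive.zip https://api.511.org/transit/datafeeds\?api_key\=${{secrets.FIVEONEONE_API_KEY}}\&operator_id\=RG
      - name: Verify archive before extracting
        # -t: test the archive's integrity, -q: quiet output
        run: unzip -tq archive.zip
```

These two steps would take the place of the single fetch step. The extraction step that follows stays unchanged.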

As I mentioned in the introduction, I'm not sure this post can help many people. If it does, please don't hesitate to comment to let me know about your use case!

The complete source code for this post can be found on GitHub.

Originally published at A Java Geek on August 15th, 2021

DevOps
Tools

Nicolas Frankel

Author

Technologist focusing on cloud-native technologies, DevOps, CI/CD pipelines, and system observability. His focus revolves around creating technical content, delivering talks, and engaging with developer communities to promote the adoption of modern software practices. With a strong background in software, he has worked extensively with the JVM, applying his expertise across various industries. In addition to his technical work, he is the author of several books and regularly shares insights through his blog and open-source contributions.
