deaddabe

Parsing real-world data with Rust: introducing the alias_all attribute in Serde

While data formats should always be unambiguous, real-world data from external providers often comes with its own set of issues. One of them is inconsistent field naming. By introducing a new attribute in Rust’s famous serde library, however, we can handle this case elegantly.

Context

I am trying to create a Rust library to parse the SIRI Lite data format. This format is derived from the SIRI XML format but uses JSON, which makes it easier to consume for end-user software, typically smartphone applications. The SIRI Lite format is not fully standardized yet. However, a specification proposal (PDF, fr) is available on the French specification website Normes Données TC.

The end use case is to query the Île de France Mobilités API Portal, which acts as an aggregator for all public transport operators in the French Île de France region. This portal makes it possible to plan routes and look up departure times regardless of the many operators and their specificities.

The first step is to be able to parse the JSON answers of the API into Rust structures. Next, a query library has to be put in place in order to authenticate with the API and send HTTP requests.

But problems already arise when playing with the API through the Web interface and manually retrieving JSON samples. Comparing the answer for the next stop of a bus with the one for a train station, I noticed that the two JSON answers were not exactly the same.

Bus JSON answer (truncated):

{
  "Order": 24,
  "StopPointName": [
    {
      "value": "GARE D'ORSAY VIL"
    }
  ],
  "VehicleAtStop": false,
  "DestinationDisplay": [
    {
      "value": "ORSAY VILLE RER"
    }
  ],
  "AimedArrivalTime": "2021-01-27T15:11:00.000Z",
  "ExpectedArrivalTime": "2021-01-27T15:13:52.000Z",
  "ArrivalStatus": "DELAYED"
}

Train JSON answer (truncated):

{
  "destinationDisplay": [
    {
      "value": "GARE D'AUSTERLITZ"
    }
  ],
  "callNote": [],
  "facilityConditionElement": [],
  "situationRef": [],
  "aimedArrivalTime": "2021-01-23T21:35:20Z",
  "expectedArrivalTime": "2021-01-23T21:35:02Z",
  "arrivalStatus": "ON_TIME",
  "arrivalPlatformName": {
    "value": "2"
  },
  "arrivalOperatorRefs": [],
  "aimedDepartureTime": "2021-01-23T21:45:20Z",
  "expectedDepartureTime": "2021-01-23T21:45:02Z",
  "departureStatus": "ON_TIME",
  "departurePlatformName": {
    "value": "2"
  },
  "departureBoardingActivity": "BOARDING",
  "departureOperatorRefs": []
}

We observe that the train stop monitoring has far more data than the bus stop. This is expected and supported by the SIRI Lite specification: most of the fields are optional anyway.

However, something that is not allowed by the specification is to use camelCase for field names: all fields should use PascalCase instead. After this manual query, we can see that requests for next trains return fields in camelCase.

The initial Rust structure that I have been using for parsing this JSON answer is the following:

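use chrono::{DateTime, Utc};
use serde::Deserialize;

// Stop monitoring
// Source: Proposition_Profil_SIRI_Lite-initial-v1-3.pdf, p. 20
// (`Ref` and `ArrivalStatus` are types defined elsewhere in the library.)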
#[derive(Deserialize, Debug)]
#[serde(rename_all(deserialize = "PascalCase"))]
pub struct MonitoredCall {
    pub order: Option<i32>,
    pub stop_point_name: Vec<Ref>,
    pub vehicle_at_stop: bool,
    pub destination_display: Vec<Ref>,
    pub aimed_arrival_time: Option<DateTime<Utc>>,
    pub expected_arrival_time: Option<DateTime<Utc>>,
    pub arrival_status: Option<ArrivalStatus>,
    pub aimed_departure_time: Option<DateTime<Utc>>,
    pub expected_departure_time: Option<DateTime<Utc>>,
}

However, this was not able to parse the camelCase fields of the train answers. I was able to parse both formats by adding one #[serde(alias = "<myField>")] attribute for each field, but this is tedious and error-prone. We can do better.

Cloning and patching Serde

Let’s clone the Serde library so that we can try to implement something new.

Using the clone locally is not that easy, because the serde repository contains multiple Rust crates that are linked together as dependencies. After a lot of trial and error, using the patch section of Cargo was the only solution that worked:

[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = { version = "1.0" }

[patch.crates-io]
serde = { path = "/home/user/dev/serde/serde" }
serde_json = { path = "/home/user/dev/serde_json" }

Attempts to use path directly in the dependencies section resulted in strange errors: the library was found, but the structures could not be derived. No search engine could help me out with this.

Implementing global aliasing

The need is to be able to match both PascalCase and camelCase, while keeping the field names in snake_case to comply with established Rust naming conventions. I first tried to add support for multiple deserialize entries with the following syntax:

#[serde(rename_all(deserialize = "PascalCase", deserialize = "camelCase"))]

But this proved to be quite complex to implement: we should keep the first deserialize entry to rename all of the fields, and then use the other ones to add field aliases. This was very error-prone to implement, and the syntax did not make it clear that both renaming and aliasing were performed. However, it allowed me to become familiar with the serde_derive crate's internals, which was very valuable for iterating on another implementation.

After looking through the list of open issues of Serde, #1530 seems to closely match our need for supporting multiple naming conventions:

The rename_all attribute is very handy for cases where you have fields named in ways that don't follow Rust conventions. However, there are cases where you might want to support both ways - for example, loading data with support for both naming conventions, but preferring a different naming convention for saving that data later. I propose that an alias_all container attribute be added that functions similarly to rename_all […]

Thanks to this issue, the implementation idea becomes simpler: instead of extending the rename_all attribute, we introduce a new alias_all one.

I first implemented this attribute so that only one RenameRule could be passed. After this was done and proven correct by a new test case, I added the possibility to pass multiple RenameRule values, because why not, and extended the test case to cover multiple alias_all rules.
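To give an idea of the mechanism, here is a simplified, standard-library-only sketch of how one rename rule plus a list of alias rules can be turned into the set of names accepted for a field. It is not the code of the pull request: `Rule` and `accepted_names` are hypothetical names, and the case conversion only handles the two conventions discussed here.

```rust
/// Simplified stand-in for a rename rule; not serde_derive's actual type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Rule {
    PascalCase,
    CamelCase,
}

impl Rule {
    /// Converts a snake_case Rust field name to this rule's casing.
    pub fn apply_to_field(self, field: &str) -> String {
        let mut out = String::new();
        // PascalCase capitalizes the first letter, camelCase does not.
        let mut capitalize = self == Rule::PascalCase;
        for ch in field.chars() {
            if ch == '_' {
                // Drop the underscore and capitalize the next letter.
                capitalize = true;
            } else if capitalize {
                out.push(ch.to_ascii_uppercase());
                capitalize = false;
            } else {
                out.push(ch);
            }
        }
        out
    }
}

/// Returns every name a field should be matched against: the renamed
/// (primary) name first, then one alias per additional rule, skipping
/// aliases that duplicate a name already in the list.
pub fn accepted_names(field: &str, rename: Rule, aliases: &[Rule]) -> Vec<String> {
    let mut names = vec![rename.apply_to_field(field)];
    for rule in aliases {
        let alias = rule.apply_to_field(field);
        if !names.contains(&alias) {
            names.push(alias);
        }
    }
    names
}
```

With this sketch, calling `accepted_names("destination_display", Rule::PascalCase, &[Rule::CamelCase])` yields `DestinationDisplay` as the primary name and `destinationDisplay` as its alias.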

The final implementation has been proposed back to the serde community as a pull request. The code changes are not that big (69 lines added).

And here is the final solution using this new serde attribute:

// Stop monitoring
// Source: Proposition_Profil_SIRI_Lite-initial-v1-3.pdf, p. 20
#[derive(Deserialize, Debug)]
#[serde(rename_all(deserialize = "PascalCase"), alias_all = "camelCase")]
pub struct MonitoredCall {
    pub order: Option<i32>,
    pub stop_point_name: Vec<Ref>,
    pub vehicle_at_stop: bool,
    pub destination_display: Vec<Ref>,
    pub aimed_arrival_time: Option<DateTime<Utc>>,
    pub expected_arrival_time: Option<DateTime<Utc>>,
    pub arrival_status: Option<ArrivalStatus>,
    pub aimed_departure_time: Option<DateTime<Utc>>,
    pub expected_departure_time: Option<DateTime<Utc>>,
}

This code is able to parse both DestinationDisplay (real name) and destinationDisplay (alias) from the JSON input.

Meanwhile, I have reported the wrong format issue to the API operator. While I did not receive any answer, querying another stop on the same train line returned an answer with all fields in PascalCase, as the specification says. At least the awesome Serde library will be able to cope with this kind of issue in the future.

To be continued.