Millón de Monos

Manuel Aristarán's weblog

Extracting reusable data from the Sitio del Ciudadano

Public budgets are multidimensional datasets that reflect the bureaucratic, accounting, and economic structures of a state administration's spending and revenue. Their complexity poses interesting challenges when building tools to explore them.

Although budgets at every level of government are very similar in structure, almost every country publishes its budgets on the web through ad hoc tools. Jonathan Gray surveyed this landscape in detail in the report Open Budget Data: Mapping the Landscape. The Open Knowledge International foundation, for its part, promotes the Fiscal Data Package project, which takes advantage of that structural similarity to standardize the format in which public fiscal data is published.

Argentina has been doing its part for some years with the Sitio del Ciudadano, run by the Ministerio de Hacienda y Finanzas Públicas. Despite its name, it is hard to imagine an ordinary citizen successfully navigating this tool, given its poor design, complexity, and extremely slow performance.

For my master's thesis I worked on SpendView, a prototype tool for visualizing budget information. One of its design premises is to be flexible enough to store, and allow exploration of, any public budget. Naturally, I wanted to show the Argentine budget. SpendView requires the data to be represented in disaggregated form; that is, each budget line must carry information about every dimension along which it is classified. Taking the budget assigned to CONICET as an example, here is a fairly disaggregated budget line (as of March 2016):

  • Administrative classification (who spends?)
  • Classification by object of expenditure (what is it spent on?)
    • Gastos en Personal (personnel expenses, Inciso level)
    • Personal Permanente (permanent staff, Partida Principal level)
  • Functional classification (what is it spent for?)
    • Servicios Sociales (social services, Finalidad level)
    • Ciencia y Técnica (science and technology, Función level)
  • Classification by funding source
    • Tesoro Nacional (national treasury)
  • Measures
    • Crédito Vigente (current appropriation): 4.65 billion Argentine pesos
    • Devengado (accrued): 1.21 billion Argentine pesos
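
As a rough illustration of what "disaggregated" means here, the line above can be held as one flat record in which every classification dimension and every measure is present. This is only a sketch: the field names are hypothetical (they are not SpendView's actual schema), and the administrative value is assumed to be CONICET based on the preceding paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetLine:
    """One fully disaggregated budget line (hypothetical field names)."""
    administrative: str      # who spends? (assumed here: CONICET)
    object_inciso: str       # what is it spent on? (Inciso level)
    object_principal: str    # Partida Principal level
    function_finalidad: str  # what is it spent for? (Finalidad level)
    function_funcion: str    # Función level
    funding_source: str
    credito_vigente: float   # billions of Argentine pesos
    devengado: float         # billions of Argentine pesos


line = BudgetLine(
    administrative="CONICET",
    object_inciso="Gastos en Personal",
    object_principal="Personal Permanente",
    function_finalidad="Servicios Sociales",
    function_funcion="Ciencia y Técnica",
    funding_source="Tesoro Nacional",
    credito_vigente=4.65,
    devengado=1.21,
)

# Share of the current appropriation that has been accrued so far.
execution_rate = line.devengado / line.credito_vigente
print(f"{execution_rate:.1%}")
```

Because no dimension is missing from a record like this, any pre-aggregated table (by jurisdiction, by function, by funding source) can be rebuilt from it, while the reverse is not possible.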

Unfortunately, the Sitio del Ciudadano only offers pre-aggregated tables. In other words, it is not possible to obtain complete budget lines that refer to every classification criterion. But the data is there, and it can be extracted.

Digging into the Sitio del Ciudadano (SiCi)

The SiCi is implemented on an old version of Oracle Business Intelligence, a system poorly suited to building public websites. For example, to display a table of spending by jurisdiction, the browser makes 388 requests to the server (!), transfers 4.2 MB, and (on my computer and with my internet connection) takes 27 seconds to show the content.

Very slow

But mixed in with all its verbosity, the SiCi emits information that will let us get the data we need. The first clue is a request that occurs when you interact (hover, click, etc.) with a table. The browser requests a resource called /saw.dll?getReportXmlFromSearchID, which contains the definition of the requested report:

Buried inside that aberration are some interesting elements that contain the formulas for each column of the report:

<saw:columnFormula>
  <sawx:expr xsi:type="sawx:sqlExpression">Institucion."Cod. y Desc. Jurisdiccion"</sawx:expr>
</saw:columnFormula>
<!-- ... -->
<saw:columnFormula>
  <sawx:expr xsi:type="sawx:sqlExpression">CAST(Tiempo.Mes as VARCHAR(2))</sawx:expr>
</saw:columnFormula>
<!-- ... -->
<saw:columnFormula>
  <sawx:expr xsi:type="sawx:sqlExpression">"Indicadores Credito"."$ Cred. Vigente"</sawx:expr>
</saw:columnFormula>

One of the main elements, near the top of the file, is <saw:criteria subjectArea="&quot;SITIO DEL CIUDADANO&quot;" >, which is nothing other than the table/cube the report operates on.
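
A minimal sketch of how those pieces could be pulled out of a report definition with Python. It uses regular expressions rather than an XML parser only because the fragment shown here does not include the namespace declarations for saw: and sawx:; for a complete document a real XML parser would be the safer choice. The function names are made up for this example.

```python
import html
import re

# A trimmed copy of the report definition fragment discussed above.
REPORT_XML = '''
<saw:criteria subjectArea="&quot;SITIO DEL CIUDADANO&quot;">
<saw:columnFormula>
  <sawx:expr xsi:type="sawx:sqlExpression">Institucion."Cod. y Desc. Jurisdiccion"</sawx:expr>
</saw:columnFormula>
<saw:columnFormula>
  <sawx:expr xsi:type="sawx:sqlExpression">CAST(Tiempo.Mes as VARCHAR(2))</sawx:expr>
</saw:columnFormula>
<saw:columnFormula>
  <sawx:expr xsi:type="sawx:sqlExpression">"Indicadores Credito"."$ Cred. Vigente"</sawx:expr>
</saw:columnFormula>
'''


def column_formulas(xml_text):
    """Return the text of every <sawx:expr> element, with entities decoded."""
    found = re.findall(r"<sawx:expr[^>]*>(.*?)</sawx:expr>", xml_text, re.DOTALL)
    return [html.unescape(formula.strip()) for formula in found]


def subject_area(xml_text):
    """Return the subjectArea attribute of <saw:criteria>, or None if absent."""
    match = re.search(r'<saw:criteria[^>]*\bsubjectArea="([^"]*)"', xml_text)
    return html.unescape(match.group(1)) if match else None


print(subject_area(REPORT_XML))
for formula in column_formulas(REPORT_XML):
    print(formula)
```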

After an exhaustive research process (I searched Google), I saw that the HTTP resource /saw.dll is the entry point to almost every operation this business intelligence system offers. Great was my joy when I saw in the documentation that saw.dll accepts a parameter called SQL. It turns out that this so-called SQL is an Oracle extension designed for OLAP (analytical) queries. In short, it is a SQL in which the SUM() over the measures and the GROUP BY over the dimensions are implicit (I liked the idea; I wish there were an open source implementation).
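
To make the "implicit SUM() and GROUP BY" idea concrete, here is a toy emulation in Python. It is not how Oracle's engine works; the rows and the aggregate function are invented for illustration. The point is that naming the dimensions and a measure is enough: the grouping and the summing follow from that choice.

```python
from collections import defaultdict

# Toy fact rows: (year, sector, amount). The values are made up.
ROWS = [
    (2015, "A", 10.0),
    (2015, "A", 5.0),
    (2015, "B", 2.0),
    (2016, "A", 7.0),
]


def aggregate(rows, dimension_idx, measure_idx):
    """Sum one measure column, grouping by the selected dimension columns."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in dimension_idx)
        totals[key] += row[measure_idx]
    return dict(totals)


# "SELECT year, amount": grouped by year only.
by_year = aggregate(ROWS, dimension_idx=(0,), measure_idx=2)
# "SELECT year, sector, amount": adding a dimension refines the grouping.
by_year_sector = aggregate(ROWS, dimension_idx=(0, 1), measure_idx=2)

print(by_year)
print(by_year_sector)
```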

Let's try the saw.dll endpoint with a simple query (I found the column names by looking at the report definition):

SELECT "Ejercicio Presupuestario"."Cod. Ejercicio Presupuestario",
       "Sector Institucional"."Desc. Caracter",
       "Indicadores Credito"."$ Comprometido",
       "Indicadores Credito"."$ Devengado",
       "Indicadores Credito"."$ Pagado",
       "Indicadores Credito"."$ Cred. Vigente"
FROM "SITIO DEL CIUDADANO"

The full URL is the following (the username and password are visible in the SiCi's HTML):

http://sitiodelciudadano.mecon.gov.ar/analytics/saw.dll?Go
  &NQUser=usrsici_c
  &NQPassword=usrsici_c
  &SQL=SELECT%20%22Ejercicio%20Presupuestario%22.%22Cod.%20Ejercicio%20Presupuestario%22,%20%22Sector%20Institucional%22.%22Desc.%20Caracter%22,%20%22Indicadores%20Credito%22.%22$%20Comprometido%22,%20%22Indicadores%20Credito%22.%22$%20Devengado%22,%20%22Indicadores%20Credito%22.%22$%20Pagado%22,%20%22Indicadores%20Credito%22.%22$%20Cred.%20Vigente%22%20FROM%20%22SITIO%20DEL%20CIUDADANO%22

Boom. The history of budget execution since 1998, broken down by “Sector Institucional”. To my surprise, it worked perfectly, drawing a table that looks like it was designed in 1999 and whose HTML looks like it was written with Microsoft FrontPage '98:

Simple query

But we want CSV, not an ugly table. Back to the documentation, where we find a Format parameter. We add Format=CSV to the URL, which returns a lovely comma-separated file.
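
The URL can also be assembled programmatically. The sketch below uses only the Python standard library and the parameters mentioned in this post (NQUser, NQPassword, SQL, Format). The function names are mine, the query is shortened, and the response encoding in fetch_csv is an assumption; the server may also no longer answer at this address.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://sitiodelciudadano.mecon.gov.ar/analytics/saw.dll"

# A shortened version of the simple query above.
QUERY = (
    'SELECT "Ejercicio Presupuestario"."Cod. Ejercicio Presupuestario", '
    '"Sector Institucional"."Desc. Caracter", '
    '"Indicadores Credito"."$ Devengado" '
    'FROM "SITIO DEL CIUDADANO"'
)


def build_url(sql, user="usrsici_c", password="usrsici_c", fmt="CSV"):
    """Build a saw.dll?Go URL carrying the query and the output format."""
    params = {"NQUser": user, "NQPassword": password, "SQL": sql, "Format": fmt}
    # quote_via=quote encodes spaces as %20, matching the URL shown above.
    return BASE_URL + "?Go&" + urlencode(params, quote_via=quote)


def fetch_csv(sql, timeout=60):
    """Download the result as text. Performs a network request."""
    with urlopen(build_url(sql), timeout=timeout) as response:
        # Assumption: the body decodes as UTF-8; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")


print(build_url(QUERY))
```

Keeping build_url separate from fetch_csv makes it easy to inspect or log the exact URL before hitting a server this slow.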

We will skip many intermediate steps and go straight to the finished model. The following query retrieves the year-to-date budget execution for 2016, at a higher resolution than we could have hoped for:

SELECT "Ejercicio Presupuestario"."Cod. Ejercicio Presupuestario",
       "Sector Institucional"."Desc. Caracter",
       "Institucion"."Cod. Jurisdiccion",
       "Institucion"."Desc. Jurisdiccion",
       "Institucion"."Cod. Subjurisdiccion",
       "Institucion"."Desc. Subjurisdiccion",
       "Institucion"."Cod. Entidad",
       "Institucion"."Desc. Entidad",
       "Servicio"."Cod. Servicio",
       "Servicio"."Desc. Larga Servicio",
       "Apertura Programatica"."Cod. Programa",
       "Apertura Programatica"."Desc. Programa",
       "Finalidad Funcion"."Cod. Finalidad",
       "Finalidad Funcion"."Desc. Finalidad",
       "Finalidad Funcion"."Cod. Funcion",
       "Finalidad Funcion"."Desc. Funcion",
       "Objeto Gasto"."Cod. Inciso",
       "Objeto Gasto"."Desc. Inciso",
       "Objeto Gasto"."Cod. Principal",
       "Objeto Gasto"."Desc. Principal",
       "Objeto Gasto"."Cod. Parcial",
       "Objeto Gasto"."Desc. Parcial",
       "Objeto Gasto"."Cod. Subparcial",
       "Objeto Gasto"."Desc. Subparcial",
       "Clasificador Economico"."Cod. 2 Digitos",
       "Clasificador Economico"."Desc. 2 Digitos",
       "Clasificador Economico"."Cod. 3 Digitos",
       "Clasificador Economico"."Desc. 3 Digitos",
       "Fuente Financiamiento"."Cod Codigos",
       "Fuente Financiamiento"."Cod y Desc Codigos",
       "Indicadores Credito"."$ Comprometido",
       "Indicadores Credito"."$ Devengado",
       "Indicadores Credito"."$ Pagado",
       "Indicadores Credito"."$ Cred. Vigente"
FROM "SITIO DEL CIUDADANO"
WHERE "Ejercicio Presupuestario"."Cod. Ejercicio Presupuestario"=2016
  AND ("Objeto Gasto"."Cod. Inciso" BETWEEN 1 AND 8)
  AND "Clasificador Economico"."Cod. 2 Digitos" IN (21, 22)

(The filters in the WHERE clause were taken from the reports.)

The SITIO DEL CIUDADANO table/cube contains more dimensions, which would let us obtain budget execution at daily resolution for any date since 1998.

Obtaining that information, and building interesting reports or tools with it, is left as an exercise for the enthusiastic reader.


Digital public services — User experience matters

Originally posted on FOLD.cm


On October 1st, 2013, millions of Americans rushed to their computers and navigated to healthcare.gov, a website released by the federal government that would let them shop for health insurance plans. This digital public service was a critical part of the implementation of President Obama’s most progressive reform, and the first significant modification to the US public health system since the 1960s: the Patient Protection and Affordable Care Act.

But things did not go as planned. As people tried to navigate the site, it crashed, ran slowly, and was difficult to use. Ultimately, it failed to deliver its promise of providing citizens with an easy way of buying health insurance. It seemed that the most banal of causes, a computer glitch, posed a serious threat to a public policy that took Congress many years of political struggle to push forward.

The healthcare.gov debacle is just one example of how governments all over the world often fail to build usable, effective, and modern digital services. Similar stories are abundant: the online enrollment system for Buenos Aires’ schools has routinely failed for the last few years when parents try to secure a place for their children at their neighborhood school. The website of the Argentine tax agency is famously ugly and difficult to use, and most of its services only work with Internet Explorer even when a one-line patch would remedy that issue. As an Argentine web developer, I cringe every time I need to go in and pay my taxes.

Public outrage about these issues is reasonable. Most of us carry in our pockets an incredibly powerful computer rigged with all kinds of sophisticated sensors, which works 99% of the time. We buy products and services from our connected devices. We video-chat, Dick Tracy style, with our friends on the other side of the world while riding the subway. But still, we see how simple digital public services, often as simple as storing the contents of a form in a database, are near-impossible to navigate, and fall under the pressure of traffic typically considered light by a mid-sized website.

Why do governments fail so often at building effective digital public services? The short answer is that building good software is hard. The intangible nature of computer programs makes the complexity of building them difficult to comprehend. We expect our phone to just work, but it took Google 10 years, billions of dollars, and an astronomical amount of combined human-hours to develop the latest version of Android, the operating system that runs on 82% of the smartphones in the world. And they didn’t even start from scratch: Android runs on Java, a programming language whose legacy goes back to the pioneers of computer science. But your phone still freezes from time to time.

Can we expect the understaffed and underfunded technology office of a city’s government to build a reliable online system? Yes, in the same way that we expect buses to run on time and streets to be pothole-free. But in order to do so, administrations should tweak a few aspects of their technology game.

Public procurement, a process burdened with impenetrable bureaucracy, is often cited among the causes that make public technology projects crash and burn. Clay Johnson, former head of the research arm of the open government watchdog Sunlight Foundation, has written extensively about this. In his 2013 New York Times op-ed about healthcare.gov, he says that “large federal information technology purchases have to end. Any methodology with a 94 percent chance of failure or delay, which costs taxpayers billions of dollars, doesn’t belong in a 21st-century government.”

An undesirable side effect of the intricate process of technology acquisition by the public sector is that the vendors awarded contracts are those who are good at the tendering game, but not necessarily good at building user-facing software. Time and time again we see the usual corporate software vendors being awarded multi-million dollar contracts to build user-facing public services. Those companies build the pieces of infrastructure that power many mission-critical systems, but they aren’t the ones designing the user interfaces that we love and use every day. They’re not good at building experiences, and it shows.

Beautiful and engaging user interfaces are built by a very different kind of organization. The teams responsible for the polished user experience of Facebook, Twitter, or Tinder are small, nimble, and unencumbered by the rigid hierarchies and chains of command present in government. They’re breaking new ground, not just “modernizing” a decades-old process. Conway’s Law, for these teams, acts in their favor rather than against them.

Bringing these agile-minded hackers, designers, and product managers to the civil service will be a challenge. Positioning itself as an attractive employer for top technical talent might prove difficult for the public sector: promising engineering students are often hired by Silicon Valley companies before they graduate, with salaries over 120k/year.

Money is not the only problem. To capture great engineers, working for the government needs to be cool. Nicholas Negroponte touched upon this issue in his talk at the 30th Anniversary event of the MIT Media Lab: “[Parents] don’t encourage their kids to become a civil servant. It was never the cool stuff […]. There’s been a swing toward too much startup, too many little apps companies. Suddenly, too few students who graduate are worrying about big, hard problems, because they can do an app”.

Governments are starting to realize that they need to incorporate 21st-century practices: The United Kingdom created the Government Digital Service in 2011, and the US followed suit in 2013 with the creation of 18F, a digital services agency that’s run like a startup. Coincidentally, the failure of both countries’ healthcare website projects prompted the creation of these agencies.

The public sector should not wait for a catastrophe to happen to reform the way it builds and acquires technology. Products out of Silicon Valley have raised the bar; users now expect to have fluid and engaging experiences when interacting with an online system. Technologists working in the public sector must be empowered, and their skills nurtured. Their work shouldn’t be considered a mere implementation detail of a public policy. After all, as the healthcare.gov debacle showed, a poor user experience and bad technical decisions can undermine years of political struggle.


My MIT Media Lab Statement of Objectives

Even if the MIT Media Lab admission process is unique in many aspects, like most graduate programs it requires a Statement of Objectives. Back in 2014, when I decided that I wanted to apply, I didn’t anticipate that writing one was going to be that difficult; I went over dozens of revisions, begged my friends and family for feedback, checked my grammar, and back again.

I now count myself among the lucky ones that were admitted for the Media Arts and Sciences graduate program in 2014. Here, I’m sharing my attempt at sounding smart-but-not-cocky and correct-but-not-boring, hoping that it’ll help those who are thinking of coming to study at this magical place.


One afternoon in 1987 my father came home after work with a present for me: it was a Czerweny 1500, a cheap clone of the venerable Sinclair ZX81. My friends at school played games on their powerful Commodore 64s and MSXs, but my little computer —made in Argentina— would only let me type instructions; I was fascinated by that. I have been a hacker since then and have worked as a professional software developer for more than 15 years.

I was born in the midst of Argentina’s last military dictatorship into a politically involved family. With the return of democracy in 1983, my parents went back to politics and civic activism. I was raised in demonstrations and heated political arguments around the family table. Such an upbringing led me to significant political involvement throughout my high school and university years. As I have grown professionally, I have been applying my practical software development skills to civic problems in Argentina.

As one of the 2013 Knight-Mozilla OpenNews fellows, I had the fortune of visiting the Media Lab, specifically the Center for Civic Media, twice in the past year. I am of course highly aware of the work and projects pursued by the Lab, but it was through conversations with its students that I fully realized its unique, pan-disciplinary approach to problem solving and research. I could feel the atmosphere of creative chaos that fuels the Lab. As I reflect on my recent visits, I suspect that the lab would be a second home to me; my own life and professional history are equally diverse, protean, and a bit chaotic.

My projects focus on the intersection of media, politics, activism, data visualization, and technology. I like to think that they share a common thread: making data useful for people by building tools that can better inform our decisions and foster meaningful discussions and actions. As the market pushes computers towards becoming glorified television sets, networks to be one-way content delivery channels, and data to serve the needs of marketing departments and intelligence agencies, we need to take back the vision of the true pioneers of personal computing, like Alan Kay and Seymour Papert, who imagined and built computers as tools for thought and action, and not as devices for mere consumption and surveillance.

I believe that knowledge of the res publica and effective action in the public sphere is no longer achievable without technology. No one yet knows how the mechanisms and dynamics of the so-called new civics will materialize, but the projects that have come out of Center for Civic Media, MacroConnections, and Social Computing are offering the world a glimpse of how these tools are going to impact our society. I want to help shape the future of civic participation, data analysis, and meaningful social interaction through technology, by doing what I do best: designing and building tools.

My skills as a tool maker, curiosity, and willingness to learn allowed me to be one of the first employees of Satellogic [1], an Argentine aerospace startup conceived in the Singularity University graduate studies program. During my two-year tenure at the company, my portfolio of work included designing and implementing simulations, writing hardware drivers, and developing ground support software for the CUBEBUG-1 nanosatellite mission [2], the first cubesat built in Argentina, grant-funded by the Ministry of Science and Technology.

Throughout those two years working in Bariloche, a beautiful town in Patagonia, I experienced the most intense personal, intellectual, and professional growth of my life. The simulator that I created allowed the team to more deeply understand the coverage patterns of the spacecraft’s instruments. While working on the simulator, I grew my understanding of optics and Kepler’s laws of motion. I needed to become proficient with real-time operating systems, as I was tasked with writing the driver for the onboard communications module. I’ve been a licensed radio amateur for almost 20 years, so I was naturally brought into building the ground station and even climbed its 60-foot mast to install the antenna that I assembled. A nearby volcano erupted [3] while I was living in Bariloche and I broke my wrist the first time I tried my luck at snowboarding down the edge-of-town slopes. I also got married to Luisina, my beautiful wife, who is a very talented documentary filmmaker and screenwriter.

The greatest achievement of the CUBEBUG-1 team was showing that it is possible to bring agility and the hacker spirit to the aerospace industry. The satellite was designed, built, and successfully launched in less than a year by a group of six people who had not worked in aerospace before. The company is going strong: they’ve recently put a more complex cubesat [4] into low earth orbit.

It was only recently that I realized that I could apply my technical skills to social issues as well. Around July 2010, the media in Bahía Blanca, Argentina, where I was born, was reporting on several corruption scandals within the local government. However, they weren’t tapping a rich source of information made available by the city: a daily feed of procurement data. I thought that was partly because it was not easy to consume and interpret, so I set out to build Gasto Público Bahiense [5], a tool that scrapes the municipality’s website and re-publishes that information in a way that is easier to use and understand.

My personal project had an immediate impact on the local media and political landscape. For instance, the local press started using it as a source of information [6], contractors used the site to check on their competitors’ past bids, and citizens became more aware of their government’s procurement decisions. By initiative of the opposition party, the City Council declared it “of municipal interest” [7] by a majority vote. It was also featured by several national and international news outlets [8].

The executive branch of the city’s government didn’t acknowledge the existence of Gasto Público Bahiense until one year after it was launched, when they redesigned their site and put a CAPTCHA in front of the section where the procurement data was published, thus making it inaccessible to web crawlers and scrapers. What was deceitfully justified by the city’s former secretary of finance as a “security and accessibility measure” [9] was a political decision: the government didn’t want that data to be analyzed with tools other than theirs. This measure backfired on them, as it brought more notoriety to my project [10]. Evgeny Morozov devoted a section of his latest book to this case [11], where he uses it to illustrate his views on the relative importance of opening government databases. Personal Democracy Media’s TechPresident recently reported on this and other projects that I developed [12].

After an administration change last year, the government reversed their position towards Gasto Público Bahiense. Responding to the people’s newly acquired awareness of public information resources, the new mayor created an Innovation and Open Government agency [13], whose appointed officer allowed us access to the procurement data feed through a web-service and included GPB in their own data portal. We’re now collaborating on an initiative to help smaller municipalities open their budget and procurement data by using their own technology and my project together.

This year, I was a speaker at the biggest Argentine TEDx event, where I gave a talk to an audience of 1,500 about the story, challenges, and impact of Gasto Público Bahiense and other projects I developed during the last few years [14].

My newfound interest in civic activism, public information, and technology brought me to being selected as one of the 2013 Knight-Mozilla OpenNews fellows. I worked in the newsroom of the leading Argentine daily La Nación, where I participated in journalistic investigations, built interactive visualizations that were published alongside major stories, and trained the newspaper’s design, data journalism, and technology staffs in the use of tools like D3.js and CartoDB.

During my OpenNews fellowship I also developed Tabula [15], an open source tool for extracting tabular information out of PDF files, a pervasive problem in data journalism. After its initial release in April 2013, it garnered a lot of attention within the data journalism [16] and open data communities. It was instrumental in the building of ProPublica’s Dollars for Docs [17]. Both The New York Times and ProPublica now support the development of Tabula by devoting staff time to the project.

In the same spirit of data liberation, inspired by ProPublica’s Free the Files [18] and The Guardian’s MP’s Expenses [19], I’m building a tool [20] for creating crowdsourced document transcription efforts, geared to the needs of media organizations and grassroots activism. La Nación chose this tool to build its own crowdsourced transcription platform, which will be launched at the beginning of 2014.

Hearing from people who have used the tools I’ve built, like the Chicago Public Schools Apples 2 Apples collective, who used Tabula for building SchoolCuts.org [21] and successfully used data to engage a community, taught me that it is possible to improve people’s understanding of their government and enable meaningful action. I want to draw upon those lessons. I want to study how we can use budget, spending, and procurement data for social change and civic activism. Participatory budgeting processes, for instance, could benefit from tools that would let citizens use that information to simulate budgeting and spending scenarios, propose changes, and work together towards reaching an agreement. I agree with Alan Kay in that the computer revolution hasn’t happened yet, as we still haven’t reached the full potential of personal computing for education, dynamic simulations of complex phenomena, and as tools for “new powerful argumentation” [22]. I believe that his vision can be also applied to the public sphere, by tackling the hard problem of the asymmetry of access to analytical tools.

The concepts behind projects like Civic Media’s Action Path, Promise Tracker, and Data Therapy are in perfect synergy with how I imagine the future mechanisms of civic participation. I fantasize about a Latin American version of Media Cloud and how it could help make sense of the very complex relationship between the press and current Latin American governments [23].

I’m inspired by MacroConnections’ DataViva.info, a project that set new standards for government information repositories and showed how to turn complex public data into knowledge that can be used by everyone. I’ve shown Immersion to all of my friends, and witnessed how they changed their minds about what electronic surveillance really means.

I hesitate to call myself a hacker; that is a title that is bestowed by other hackers. But I do recognize in me the main characteristic of the hacker mindset: I learn by building, I think by doing.

My desire is to perpetually grow and be a better learner, builder, thinker, and doer. The MIT Media Lab is the best place I can think of to do that. I hope that you agree.

  1. Satellogic.com 

  2. CUBEBUG-1. Launched in April 2013. Its on-board software was open-sourced; an example of my work can be seen in the communications module driver.

  3. 2011 Puyehue-Cordón Caulle eruption — Wikipedia. Retrieved 12/2/2013 

  4. CUBEBUG-2. Launched in November 2013.

  5. Gasto Público Bahiense: http://gastopublicobahiense.org (source on GitHub)

  6. Early example: Story on advertisement expenditure by the city government on a local news site. 

  7. Text of the declaration (in Spanish)

  8. Global Voices: “Argentina: Hackathons and budget transparency in Bahía Blanca” 

  9. TV interview with the former Secretary of Finance about this matter (in Spanish), July 6th, 2011

  10. Articles in major Argentine newspapers: La Nacion, Perfil (in Spanish)

  11. Morozov, E. Bad for the databases, good for democracy?. In To Save Everything Click Here. 

  12. Chao, R. Buenos Aires, A Pocket of Civic Innovation in Argentina — TechPresident 

  13. Secretaría de Innovación y Gobierno Abierto, Bahía Blanca (in Spanish)

  14. TEDx Rio de La Plata: Cambiando el mundo de a una línea de código por vez (video, in Spanish)

  15. Tabula. Released under the MIT license (source on GitHub)

  16. Weiss, J. How news organizations are using Tabula for data journalism.

  17. Dollars for Docs - ProPublica 

  18. Free The Files - ProPublica 

  19. MP’s Expenses - The Guardian 

  20. CrowData. Released under the MIT license (source on GitHub)

  21. SchoolCuts.org “created to share information about the 129 schools under consideration for closing.” 

  22. Viewpoints Research Institute memo M-2007-007-a — Retrieved 12/1/2013 

  23. Goñi, U. “Argentina’s media empire Clarín told to sell off holdings by supreme court — The Guardian” — Retrieved 5/12/2013 


IPython, matplotlib, and public transit in Bahía Blanca

Some time ago, the city of Bahía Blanca introduced proximity cards for paying public transit fares. Together with the AVL system fitted to the vehicles, the usage records of those cards are a valuable source of information.

As an excuse to learn a bit more about IPython Notebook, matplotlib, and Basemap, I have been playing with approximately 4.3 million records from Bahía Blanca's public transit system, made available by the city's Agencia de Innovación y Gobierno Abierto.

The full notebook can be viewed here: Datos de Automated Fare Collection del sistema de transporte público de Bahía Blanca

The data

The most important fields in the records we worked with are:

  • Fecha y Hora: the moment the transaction was recorded
  • ID tarjeta: a unique numeric identifier that cannot be linked to the user's personal data
  • Línea Ómnibus: bus line (service)
  • Locación: GPS reading of the vehicle at the time of the transaction
  • Tipo Pasaje: normal, frequent, student, etc.

Once the records are processed and stored in a table in a PostgreSQL/PostGIS database, we can start doing some simple analyses.

Average trips per hour

We can look, for example, at the average number of trips made in each hour of the day on weekdays. The midday peak in activity may be due to the split business hours customary in Bahía Blanca and other cities in the interior of the country.

Average number of trips per hour on weekdays
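
A pure-Python sketch of that calculation, for readers without the database at hand. It is not the code from the notebook: the function name and the sample timestamps are made up, and the average is taken over the weekdays that actually appear in the data.

```python
from collections import Counter
from datetime import datetime


def average_trips_per_hour(timestamps):
    """Average number of trips in each hour of the day, over weekdays only."""
    weekday_stamps = [t for t in timestamps if t.weekday() < 5]  # Mon-Fri
    days = {t.date() for t in weekday_stamps}
    if not days:
        return {}
    per_hour = Counter(t.hour for t in weekday_stamps)
    # Divide each hourly total by the number of distinct weekdays observed.
    return {hour: per_hour[hour] / len(days) for hour in range(24)}


# Made-up sample: two weekdays and one Saturday.
sample = [
    datetime(2014, 6, 2, 7, 15),   # Monday
    datetime(2014, 6, 2, 7, 40),
    datetime(2014, 6, 2, 13, 5),
    datetime(2014, 6, 3, 7, 20),   # Tuesday
    datetime(2014, 6, 7, 7, 0),    # Saturday, ignored
]
averages = average_trips_per_hour(sample)
print(averages[7], averages[13])
```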

User “profile”

It is reasonable to consider analyses that require classifying users by their usage patterns. One possible way to build these profiles is the following:

Weekly usage profile for one user

Weekly usage profile for one user

The charts show the number of trips by hour and day of the week over a given period. In the first one we see consistent use around 7 in the morning and 6 in the evening.
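
The data behind a chart like these can be sketched as a 7x24 matrix of counts for one card. The function name and the sample trips below are hypothetical, not taken from the notebook.

```python
from datetime import datetime


def weekly_profile(timestamps):
    """Return a 7x24 matrix: rows are weekdays (0 = Monday), columns are hours."""
    profile = [[0] * 24 for _ in range(7)]
    for t in timestamps:
        profile[t.weekday()][t.hour] += 1
    return profile


# Made-up trips for a single card.
trips = [
    datetime(2014, 6, 2, 7, 10),   # Monday morning
    datetime(2014, 6, 2, 18, 5),   # Monday evening
    datetime(2014, 6, 3, 7, 12),   # Tuesday morning
    datetime(2014, 6, 9, 7, 8),    # the following Monday
]
profile = weekly_profile(trips)
print(profile[0][7], profile[0][18], profile[1][7])
```

A matrix in this shape can be handed directly to a heatmap function such as matplotlib's imshow to get a picture of the user's week.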

Chained trips

We are interested in the most frequent combinations of two bus lines. This statistic can indicate which areas of the city are not well connected by a single line.

We say that trips v1 and v2 of passenger p are chained if:

  • The time difference between v2 and v1 is less than 45 minutes
  • v1 and v2 were made on different lines (that is, we do not consider return trips to the starting point)
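
The two conditions above can be sketched in Python as follows. This is an illustration, not the notebook's code: the tuple layout and the function name are invented, and it assumes that only consecutive trips of the same card are compared, which the definition does not state explicitly.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=45)


def chained_line_pairs(trips):
    """Count (line1, line2) pairs of chained trips.

    `trips` is an iterable of (card_id, timestamp, line) tuples.
    Assumption: only consecutive trips of the same card are compared.
    """
    by_card = defaultdict(list)
    for card_id, when, line in trips:
        by_card[card_id].append((when, line))

    pairs = Counter()
    for card_trips in by_card.values():
        card_trips.sort()  # chronological order within each card
        for (t1, line1), (t2, line2) in zip(card_trips, card_trips[1:]):
            if t2 - t1 < MAX_GAP and line1 != line2:
                pairs[(line1, line2)] += 1
    return pairs


# Made-up sample records.
sample = [
    (1, datetime(2014, 6, 2, 7, 0), "514"),
    (1, datetime(2014, 6, 2, 7, 30), "517"),   # chained with the previous trip
    (1, datetime(2014, 6, 2, 18, 0), "517"),   # too late to be chained
    (2, datetime(2014, 6, 2, 8, 0), "514"),
    (2, datetime(2014, 6, 2, 8, 20), "514"),   # same line, not chained
]
pairs = chained_line_pairs(sample)
print(pairs.most_common())
```

The resulting counter maps each ordered pair of lines to a count, which is exactly the content of a line-by-line matrix.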

We plot the matrix of bus line combinations:

Matrix of trip counts between lines

It is also interesting to see where users who make the 514-517 combination, one of the most frequent, begin their trips:

514-517 trips


Boletín Oficial: the state's log

The metaphor came up at the now-legendary first public information hackathon that we organized at GarageLab almost 4 years ago: just as software usually reports its activity in a log, the state does the same through the official bulletins it publishes daily. The national government's version has historically been divided into three sections: legislation and official notices, companies, and procurement. A few weeks ago a fourth section was added, which publishes the new domain names registered through NIC.ar.

At GarageLab we experimented with the problem of retrieving, structuring, and analyzing the information contained in the first 3 sections of the Boletín. During the 2011 edition of Desarrollando América Latina, the members of Banquito developed a prototype scraper and query interface for the third section, while Damián Janowski and I worked on a prototype scraper and named entity recognizer for the companies section.

In preparation for the pan-American hackathon La Ruta del Dinero, which will take place this Saturday, June 7, 2014, I picked up the ideas we were playing with a few years ago. The composition of boards of directors, news, bankruptcies, and legal notices of the companies registered in the country are an important source of information for following the money trail.

boletinoficial.gov.ar

Besides being published on paper and as PDF, the Boletín Oficial has a website that publishes the information in a more or less structured way. Its usability and performance leave a lot to be desired, but it is fairly easy to scrape because the pages are generated from a service that emits XML documents (see example).

In other words, the system that publishes the Boletín on the web stores the data in structured form but does not offer us the possibility of obtaining it that way.

So we have to scrape

I published on GitHub a Python script that fetches the notices of the second section for a given date and emits a CSV file: https://github.com/jazzido/boscrap.

To use it, download the contents of the repository, install the dependencies with pip install -r requirements.txt, and run:

python boscrap.py 2014-06-04